DDR memory bit errors with SocKit (Cyclone V SoC device)

This post was written by eli on November 12, 2013
Posted Under: Intel FPGA (Altera)

The problem

There seems to be a minor DDR memory reliability issue with the SocKit. The board in question carries the 5CSXFC6D6F31C8NES device, marked “F AAAAU1319A”.

This can be detected by repeatedly copying pseudorandom data from one buffer to another, and then comparing the data in the two buffers. The buffers must be large, so that the cache is flushed all the time and the data really goes through the DDR memory. Typically, a single bit is flipped after a few gigabytes of copied data.

A simple test program demonstrating this is at the bottom of this post. It should be compiled for Linux. The program accepts one single argument, which is the buffer size to use (in bytes).

This is what a typical session looks like:

# time ./memtest 16777216
Initialized lsr_state to 7ffeb059
On byte count 2985973160, position 0x7e3699, memcpy() length 16714658:

Destination:
7e3680 0d 1a 35 6b d6 ad 5a b5 6a d5 aa 55 aa 54 a9 52
7e3690 a4 48 91 22 45 8b 17 2e 5d bb 74 e8 d0 a1 42 85
7e36a0 0b 16 2c 59 b2 64 c8 90 21 43 87 0f 1e 3d 7b f6
7e36b0 ed da b4 68 d1 a2 44 88 10 20 40 80 01 03 07 0e
7e36c0 1d 3a 75 eb d7 af 5f bf 7e fd fb f6 ed db b6 6d
7e36d0 da b5 6a d4 a8 50 a1 43 87 0e 1c 38 70 e1 c3 86
7e36e0 0c 18 30 61 c2 85 0b 16 2d 5b b7 6e dc b9 72 e4
7e36f0 c9 93 27 4f 9f 3f 7f fe fd fa f5 eb d7 ae 5c b9
7e3700 72 e5 cb 96 2c 58 b1 63 c7 8e 1c 39 73 e6 cc 98

Source:
7e3680 0d 1a 35 6b d6 ad 5a b5 6a d5 aa 55 aa 54 a9 52
7e3690 a4 48 91 22 45 8b 17 2e 5d ba 74 e8 d0 a1 42 85
7e36a0 0b 16 2c 59 b2 64 c8 90 21 43 87 0f 1e 3d 7b f6
7e36b0 ed da b4 68 d1 a2 44 88 10 20 40 80 01 03 07 0e
7e36c0 1d 3a 75 eb d7 af 5f bf 7e fd fb f6 ed db b6 6d
7e36d0 da b5 6a d4 a8 50 a1 43 87 0e 1c 38 70 e1 c3 86
7e36e0 0c 18 30 61 c2 85 0b 16 2d 5b b7 6e dc b9 72 e4
7e36f0 c9 93 27 4f 9f 3f 7f fe fd fa f5 eb d7 ae 5c b9
7e3700 72 e5 cb 96 2c 58 b1 63 c7 8e 1c 39 73 e6 cc 98

real    1m0.834s
user    1m0.710s
sys    0m0.070s

In this test run, an error was detected after about 60 seconds and almost 3 GB of data (2985973160 bytes, to be exact). Since we’re dealing with rare events, the time and byte count until an error occurs may vary significantly. The test can also run for several minutes without any error being detected.

It may be important to run this test after the system has been powered up from cold (i.e. after having been unpowered for a few minutes).

As seen above, the program dumps the data around the error in hex, and reports the offset within the failed copy at which the error was detected: 0x7e3699 in the case above. And indeed, the source buffer had the value 0xba at that offset, but the destination buffer had 0xbb. One single bit was flipped. It seems like it’s bits 0 and 1 that tend to turn out ‘1’ instead of ‘0’, but let’s skip the witchcraft.
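The flipped bit can be pinned down by XORing the two bytes: 0xba ^ 0xbb = 0x01, which is bit 0. A minimal sketch of that check follows. The helper names are made up for illustration and aren't part of the test program:

```c
#include <stdio.h>

/* Hypothetical helper (not part of the test program): returns a mask
 * with a '1' at each bit position where the two bytes differ. */
unsigned char flipped_bits(unsigned char src, unsigned char dst) {
    return src ^ dst;
}

/* Print which bits differ, and in which direction each one flipped. */
void report_flip(unsigned char src, unsigned char dst) {
    unsigned char mask = flipped_bits(src, dst);
    int bit;

    for (bit = 0; bit < 8; bit++)
        if (mask & (1u << bit))
            printf("bit %d: %d -> %d\n", bit,
                   (src >> bit) & 1, (dst >> bit) & 1);
}
```

For the error shown above, report_flip(0xba, 0xbb) reports bit 0 going from 0 to 1.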

It seems like the bit flipping occurs on writing to the memory, so the error is recorded in the DDR memory’s memory array, as opposed to a momentary error while reading. This speculation is backed by a test not shown in the program listed below, in which a second test is run when an error is detected. In this second test, the buffers are compared only by reading. The error was found consistently through several runs of this second test, indicating that the error is in fact written in memory, and not read wrong. Since the entire buffer was compared on each read-only comparison, finding the same error consistently cannot be attributed to caching.
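That second test isn’t included in the listing, but its idea can be sketched roughly as follows. The function name, its arguments and the number of passes are assumptions made for illustration; this is not the code that was actually run:

```c
#include <stddef.h>

/* Hypothetical sketch: compare two buffers 'passes' times by reading
 * only, with no further copying. Returns the number of passes in which
 * a mismatch was found at exactly the offset 'bad'. The volatile
 * qualifier forces the compiler to actually perform every read. */
int recheck_readonly(const volatile unsigned char *src,
                     const volatile unsigned char *dst,
                     size_t n, size_t bad, int passes) {
    int hits = 0;
    int pass;
    size_t i;

    for (pass = 0; pass < passes; pass++) {
        /* Scan the whole buffer on each pass, so that (given a large
         * enough buffer) the data at 'bad' is evicted from the cache
         * before it is read again. */
        for (i = 0; i < n; i++)
            if (src[i] != dst[i] && i == bad)
                hits++;
    }

    return hits;
}
```

If the return value equals the number of passes, the wrong value is stored in memory. A momentary read error would be expected to show up in only some of the passes.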

The processor was configured as in the soc_system.qsys file included in soc_system_13_0_0_06252013_90253.tar.gz, which can be downloaded as a reference design for Linaro Desktop at Rocketboards. To be specific, the hps_0 settings for the memory interface and other hardware peripherals were bytewise identical (the bridges to FPGA had different settings, but that isn’t relevant to this issue).

A few words about terminations

It’s possible to eliminate these bit errors by modifying the ODT settings of the DDR memory. But let’s first explain what it’s all about.

As the signals going between the Cyclone chip and the DDR memory switch extremely fast, the short copper wires that connect these two devices carry electromagnetic waves, rather than steady voltages. These wires are analyzed in the same terms as antennas and waveguides, with the goal of reducing back-and-forth reflections, and damping them as fast as possible.

One of the means for reducing reflections is to place resistors, called terminations, at the ends of these wires. In order to achieve a good result and avoid a dense placement of a lot of components on the board, these resistors are often included on the chip’s silicon. In other words, it’s an On Die Termination (ODT). Whether they should be applied, and what resistance they have is programmable, both on the FPGA’s side and on the DDR memory. The choice is usually made in conjunction with running electromagnetic simulations on the PCB’s physical layout, and picking values that produce good waveforms. If this crucial part in the PCB design process is done improperly, memory corruption occurs, sometimes to the level of rendering the system useless, and sometimes causing rare bit flipping, as experienced with the SocKit.
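For reference, the standard transmission-line measure of how much of a wave bounces back from the end of a wire is the reflection coefficient. With Z_0 as the wire’s characteristic impedance and Z_L as the termination resistance at its end:

```latex
\Gamma = \frac{Z_L - Z_0}{Z_L + Z_0}
```

A termination equal to Z_0 gives Γ = 0 (no reflection), and an unterminated end (Z_L going to infinity) gives Γ approaching 1 (full reflection). As an illustration only, assuming a 50 Ohm trace (the actual trace impedance on the SocKit isn’t known here), a 60 Ohm termination would give Γ = 10/110 ≈ 0.09.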

There are four major parameters influencing the signal integrity:

  • The termination on the Cyclone V device: Whether applied, and its resistance
  • The Nominal ODT of the DDR memory: Whether applied, and its resistance. The term “nominal” is just a fancy word to distinguish it from the one listed next.
  • The Dynamic ODT of the DDR memory: Whether applied, and its resistance. This optional feature allows programming a different resistance which is applied only when the data lines are used for a write operation (i.e. the lines are driven by the FPGA). When this feature is disabled (“off”) the Nominal ODT’s setting holds all the time.
  • The Output Drive Strength or Output Impedance of the DDR memory: This controls the current applied when the DDR memory drives the wires either high or low. The magnitude is given in terms of an equivalent resistor, connected either to the power supply or to ground.

Except for the first item above, all parameters are set in Qsys by editing the HPS block: on the SDRAM tab, go to the “Memory Parameters” sub-tab.

When the reference design is followed, the Cyclone device is programmed to apply a 50 Ohm termination on all data wires. This is a result of the reference resistor on the board, R295 connected to D27, which is 100 Ohms.

The DDR is programmed to a nominal termination of RZQ/4 = 60 Ohms. The dynamic termination is enabled and set to RZQ/4 = 60 Ohms as well. The output drive strength is RZQ/7 = 34 Ohms. These figures are derived from the reference resistors to the memories, R288 and R269, both 240 Ohms.
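The arithmetic behind these figures is just a division of the 240 Ohm reference resistance. A small sketch that prints the RZQ fractions mentioned in this post (the function names are made up for illustration):

```c
#include <stdio.h>

/* Reference resistor of the DDR memories (R288 and R269), in Ohms. */
#define RZQ_OHMS 240.0

/* Resistance selected by an "RZQ/n" setting, in Ohms. */
double rzq_div(int n) {
    return RZQ_OHMS / n;
}

/* Print the RZQ fractions that appear in this post. */
void print_rzq_table(void) {
    static const int divisors[] = { 2, 4, 6, 7 };
    unsigned int i;

    for (i = 0; i < sizeof(divisors) / sizeof(divisors[0]); i++)
        printf("RZQ/%d = %.1f Ohms\n", divisors[i], rzq_div(divisors[i]));
}
```

Note that RZQ/7 comes out as roughly 34.3 Ohms, which is quoted as 34 Ohms.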

There’s something peculiar about setting the dynamic ODT to the same value as the nominal, as turning the dynamic termination off altogether should have the same effect, in theory. As seen below, reality has its own say about this.

Tweaking termination settings

By all means, the correct way to set up the parameters related to signal integrity is to apply the values that were chosen when the PCB was designed, based upon proper simulations. Since the performance with the reference design’s settings isn’t satisfactory, there’s no choice but to tweak the parameters until the bit errors vanish, hoping that the new setting will work well on other boards and throughout a reasonable temperature range. This is not a desired solution, but a last resort.

Several settings were tried out. I have to admit that I was surprised how little effect these settings had: The system had no problem booting in any of the experiments I made, and the difference was only sensed while running heavy tests.

The following three settings appeared to result in no errors (each one described with the one change relative to the reference design):

  • Output drive strength set to RZQ/6 = 40 Ohms
  • ODT completely turned off (Nominal and dynamic alike)
  • Dynamic ODT only turned off (nominal ODT left as is)

Things that didn’t reduce errors: Setting the dynamic ODT to RZQ/2, setting the nominal ODT to RZQ/2 or RZQ/6, or disabling the nominal ODT while leaving the dynamic ODT as before.

The only change that could make sense is reducing the output drive strength: Recall that the errors were most likely generated on writes to the memory. One possible reason for bit flipping is that reflections from a previous read operation keep running on the wires when the lines are turned over for writing, so that the voltage levels of the bits intended for writing are disrupted by this noise. Reducing the memory’s driving current obviously reduces this noise as well.

Turning off the ODT should increase the reflections, so it’s not clear why this helped. And disabling the dynamic ODT shouldn’t make any difference at all, since the nominal resistance is the same anyhow. Nevertheless, it was verified that this change made a difference.

When dealing with signal integrity without the proper simulation tools, it’s not rare that one can’t explain why one action helped and another didn’t.

As for my own attempt to solve the problem, I initially chose reducing the drive strength to RZQ/6. At least, this change doesn’t contradict common sense. But extensive tests exposed a bit error for each ~1 TB of data handled (this is a very crude estimate of the error rate). With ODT turned off completely, the same long test ran with no errors detected at all, so this was the setting I chose. This might be specific to my own board, though.
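To put these figures in perspective, they can be turned into a rough per-bit error rate. This is back-of-the-envelope arithmetic only: it counts one flipped bit per amount of data copied, takes 1 TB as 10^12 bytes, uses the single test run shown at the top of this post for the reference settings, and relies on very few observed events:

```latex
\begin{align*}
\text{Reference settings:} \quad & \frac{1}{8 \times 2985973160} \approx 4.2 \times 10^{-11} \\
\text{Drive strength RZQ/6:} \quad & \frac{1}{8 \times 10^{12}} \approx 1.25 \times 10^{-13}
\end{align*}
```

Under these assumptions, reducing the drive strength lowered the error rate by a factor of a few hundred, but didn’t eliminate the errors.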

The program

So here it is, if you want to try it out yourself. It’s a hack made of pieces of code I had around, so it’s not really top-notch software engineering…

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <signal.h>
#include <errno.h>
#include <string.h>

static unsigned int lsr_state;   /* State of the pattern generator */
static long long count = 0;      /* Total number of bytes compared */
static char rand_state[32];      /* State array for random() */

/* Seed random() with data from /dev/urandom */
void randseed() {
  int fd;

  fd = open("/dev/urandom", O_RDONLY);

  if (fd < 0) {
    perror("open");
    exit(1);
  }

  if (!initstate(0, rand_state, sizeof(rand_state))) {
    fprintf(stderr, "Call to initstate() failed.\n");
    exit(1);
  }

  if (read(fd, rand_state, sizeof(rand_state)) != sizeof(rand_state)) {
    fprintf(stderr, "Failed to read from /dev/urandom\n");
    exit(1);
  }

  close(fd);

  if (!setstate(rand_state)) {
    fprintf(stderr, "Call to setstate() failed.\n");
    exit(1);
  }
}

/* Hex dump of buf around offset 'at': one 16-byte row before the row
   containing 'at' and up to eight rows from there, clipped to the
   range 0 to n-1. */
void hexprint(unsigned char *buf, int at, int n) {
  int i, j, from, to;

  from = (at & ~0xf) - 16;
  if (from < 0)
    from = 0;

  to = (at & ~0xf) + 127;
  if (to >= n)
    to = n - 1;

  for (i = from; i <= to; i += 16) {
    printf("%04x", i);

    for (j = i; ((j < (i + 16)) && (j <= to)); j++)
      printf(" %02x", buf[j]);

    printf("\n");
  }
}

/* Signal handler: report the amount of data checked so far and quit */
void exit_program(int sig) {
  fprintf(stderr, "memtest: Checked %lld bytes\n", count);

  exit(0);
}

int main(int argc, char *argv[]) {

  int bufsize, startpos1, startpos2, bytecount, i, bit;
  unsigned char *buf, *destbuf, *p, *b1, *b2;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s buffer-size\n", argv[0]);
    exit(1);
  }

  (void) signal(SIGINT, exit_program);
  (void) signal(SIGQUIT, exit_program);
  (void) signal(SIGTERM, exit_program);
  (void) signal(SIGALRM, exit_program);

  bufsize = atoi(argv[1]);

  if (bufsize < 65536) {
    fprintf(stderr, "Bufsize %d too small (at least 65536)\n", bufsize);
    exit(1);
  }

  if (!(buf = malloc(bufsize))) {
    fprintf(stderr, "Failed to allocate %d bytes for buffer\n", bufsize);
    exit(1);
  }

  if (!(destbuf = malloc(bufsize))) {
    fprintf(stderr, "Failed to allocate %d bytes for buffer\n", bufsize);
    exit(1);
  }

  randseed();

  /* Pick a nonzero initial state for the pattern generator */
  do {
    lsr_state = random();
    fprintf(stderr, "Initialized lsr_state to %08x\n", lsr_state);
  } while (lsr_state == 0);

  /* Fill the source buffer with a pseudorandom pattern */
  for (i = 0; i < bufsize; i++) {
    /* First byte of the state in memory (the least significant one
       on a little-endian processor) */
    p = (unsigned char *) &lsr_state;

    buf[i] = *p;

    bit = ((lsr_state >> 19) ^ (lsr_state >> 2)) & 0x01;

    lsr_state = (lsr_state << 1) | bit;

    if (lsr_state == 0) {
      fprintf(stderr, "Huh? The LSR state is zero!\n");
      exit(1);
    }
  }

  /* Copy and compare forever, with random offsets and lengths */
  while (1) {
    startpos1 = random() & 0x7fff;
    startpos2 = random() & 0x7fff;
    bytecount = bufsize - 32768 - (random() & 0x7fff);

    b1 = destbuf + startpos1;
    b2 = buf + startpos2;

    memcpy(b1, b2, bytecount);

    for (i = 0; i < bytecount; i++, count++)
      if (*b1++ != *b2++) {
        printf("On byte count %lld, position 0x%x, memcpy() length %d:\n",
               count, i, bytecount);

        /* Dump only within the copied region, so that the dump never
           reads beyond the end of the buffers */
        printf("\nDestination:\n");
        hexprint(destbuf + startpos1, i, bytecount);
        printf("\nSource:\n");
        hexprint(buf + startpos2, i, bytecount);

        exit(1);
      }
  }
  return 0;
}

Reader Comments

Hi, nice blog :)
I see that your program uses the atoi function. I am using a Xilinx Virtex-5 and calling “atoi” to convert a string to an integer, but the atoi function doesn’t work.
Can you help me? Thank you.

#1 
Written By bagus on January 6th, 2014 @ 14:39

atoi() is a C function. In what way did you attempt to call it on Virtex-5?

#2 
Written By eli on January 6th, 2014 @ 14:43

Hi,
I am facing a similar problem…
Have you consulted Altera about the problem and the solution ?

#3 
Written By ronen on March 18th, 2015 @ 13:14

Hi,

As mentioned above, I managed to kill the errors on my specific board. So I didn’t have a specific problem to solve, which makes contacting support meaningless.

As for reporting the problem, my experience is that it’s hopeless with large companies.

#4 
Written By eli on March 18th, 2015 @ 15:07
