The Xilinx EDK “update bitstream” process: A closer look

Introduction

The Xilinx Platform Studio (EDK) has this “update bitstream” function, which I wasn’t so clear about, despite its documentation page. Its icon says “BRAM INIT” which turns out to be more accurate than expected. So what happens during this process? When is it necessary?

If you’re into running a Linux kernel, you’re most likely wasting your time reading this, because the Linux kernel is kicked off directly from the external RAM, and hence this mangling isn’t necessary. To set up a Linux bitstream, see another post of mine.

Having that said, let’s look at the problem this function solves: A Microblaze processor starts executing at address 0 unless told otherwise. Its interrupt vectors are at near-zero addresses as well. These addresses are mapped to an FPGA block RAM.

What this block RAM should contain is a jump to the application’s entry point. On an SP605 board, this is most likely the beginning of the DDR memory, 0xc0000000. So when the processor kicks off, this block RAM’s address zero should contain:

00000000 <_start>:
 0:    b000c000     imm    -16384
 4:    b8080000     brai    0

Which is Microblazish for “jump to 0xc0000000” (note the lower 16 bits of both commands).

When a system is booted, there are two phases: First the FPGA is loaded with its bitstream, then the external memory is loaded with the bulk of the execution code, and then the processor is unleashed.

So the block memory’s correct content needs to be included in the bitstream itself. But when the processor is implemented from its logic elements, it isn’t clear what should be written there. It’s only when the software is linked that the addresses of the different segments are known.

But software compilation and linking requires the knowledge of the processor’s memory map, which is generated while the processor is implemented. So there’s a chicken-and-egg situation here.

The egg was first

The solution is that the block RAM’s content is fixed after the software is compiled and linked. The reset and interrupt vectors are included in the ELF file generated by the software linker, and are mapped to the block RAM’s addresses. The “update bitstream” process reads the ELF file, finds the relevant region, and updates the bitstream file, producing the download.bit file. That’s why choosing the ELF file is necessary for this process.

Necessity

The original problem was that the execution starts from address zero. But if the ELF file points at the real starting point, and this is properly communicated to the processor at startup, there’s no need to set up the block RAM at all. Well, assuming that the executable takes care of interrupts and exception vectors soon enough. This is the case with Linux kernel images, for example, for which there is no need to update the bitstream.

Some gory details

The “update bitstream” process launches a command like

bitinit -p xc6slx45tfgg484-3 system.mhs -pe microblaze_0 sdk/peripheral_tests_0/Debug/peripheral_tests_0.elf \
 -bt implementation/system.bit -o implementation/download.bit

which takes place in two phases. In the first phase, the system.mhs file is read and parsed, so that the memory map is known and the block RAM is identified. This program then runs something like

data2mem -bm "implementation/system_bd" -p xc6slx45tfgg484-3 -bt "implementation/system.bit" -bd "sdk/peripheral_tests_0/Debug/peripheral_tests_0.elf" tag microblaze_0 -o b implementation/download.bit

Which is the action itself. Data2mem is a utility for mangling bitstreams so that their block RAMs contain desired data. The -bm flag tells data2mem to get the block RAM map from implementation/system_bd.bmm, which can look like this:

// BMM LOC annotation file.
//
// Release 13.2 - Data2MEM O.61xd, build 2.2 May 20, 2011
// Copyright (c) 1995-2011 Xilinx, Inc.  All rights reserved.

///////////////////////////////////////////////////////////////////////////////
//
// Processor 'microblaze_0', ID 100, memory map.
//
///////////////////////////////////////////////////////////////////////////////

ADDRESS_MAP microblaze_0 MICROBLAZE-LE 100

 ///////////////////////////////////////////////////////////////////////////////
 //
 // Processor 'microblaze_0' address space 'microblaze_0_bram_block_combined' 0x00000000:0x00001FFF (8 KBytes).
 //
 ///////////////////////////////////////////////////////////////////////////////

 ADDRESS_SPACE microblaze_0_bram_block_combined RAMB16 [0x00000000:0x00001FFF]
 BUS_BLOCK
 microblaze_0_bram_block/microblaze_0_bram_block/ramb16bwer_0 [31:24] INPUT = microblaze_0_bram_block_combined_0.mem PLACED = X3Y30;
 microblaze_0_bram_block/microblaze_0_bram_block/ramb16bwer_1 [23:16] INPUT = microblaze_0_bram_block_combined_1.mem PLACED = X2Y30;
 microblaze_0_bram_block/microblaze_0_bram_block/ramb16bwer_2 [15:8] INPUT = microblaze_0_bram_block_combined_2.mem PLACED = X2Y32;
 microblaze_0_bram_block/microblaze_0_bram_block/ramb16bwer_3 [7:0] INPUT = microblaze_0_bram_block_combined_3.mem PLACED = X2Y36;
 END_BUS_BLOCK;
 END_ADDRESS_SPACE;

END_ADDRESS_MAP;

So this file defines the addresses covered as well as the physical positions of these block RAMs in the logic fabric.
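Just to make the byte-lane arrangement concrete, here’s a small sketch in C (not part of any Xilinx tool; the word is the first reset-vector instruction from above) of how data2mem slices one 32-bit word across the four byte-wide block RAMs listed in the BUS_BLOCK:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  uint32_t word = 0xb000c000; /* "imm -16384" */

  /* One byte lane per RAMB16, as declared in the BMM file */
  printf("ramb16bwer_0 [31:24] gets 0x%02x\n", (unsigned)((word >> 24) & 0xff));
  printf("ramb16bwer_1 [23:16] gets 0x%02x\n", (unsigned)((word >> 16) & 0xff));
  printf("ramb16bwer_2 [15:8]  gets 0x%02x\n", (unsigned)((word >> 8) & 0xff));
  printf("ramb16bwer_3 [7:0]   gets 0x%02x\n", (unsigned)(word & 0xff));
  return 0;
}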

The -bd flag points at the ELF file to get the data from, with the “tag microblaze_0” part saying that only the memories tagged microblaze_0 in the .bmm file should be handled, and the rest ignored.


Microblaze ELF: A small look inside

This is a small reverse-engineering of the ELF file, as generated by Xilinx’ SDK for a simple standalone application targeted for the SP605 board.

ELF headers

Looking into the ELF file, we have something like this:

> mb-objdump --headers sdk/peripheral_tests_1/Debug/peripheral_tests_1.elf

sdk/peripheral_tests_1/Debug/peripheral_tests_1.elf:     file format elf32-microblazele

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
 0 .vectors.reset 00000008  00000000  00000000  000000b4  2**2
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 1 .vectors.sw_exception 00000008  00000008  00000008  000000bc  2**2
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 2 .vectors.interrupt 00000008  00000010  00000010  000000c4  2**2
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 3 .vectors.hw_exception 00000008  00000020  00000020  000000cc  2**2
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 4 .text         0000653c  c0000000  c0000000  000000d4  2**2
 CONTENTS, ALLOC, LOAD, CODE
 5 .init         0000003c  c000653c  c000653c  00006610  2**2
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 6 .fini         0000001c  c0006578  c0006578  0000664c  2**2
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 7 .ctors        00000008  c0006594  c0006594  00006668  2**2
 CONTENTS, ALLOC, LOAD, DATA
 8 .dtors        00000008  c000659c  c000659c  00006670  2**2
 CONTENTS, ALLOC, LOAD, DATA
 9 .rodata       00000986  c00065a4  c00065a4  00006678  2**2
 CONTENTS, ALLOC, LOAD, READONLY, DATA
 10 .sdata2       00000006  c0006f2a  c0006f2a  00006ffe  2**0
 ALLOC
 11 .sbss2        00000000  c0006f30  c0006f30  000071d8  2**0
 CONTENTS
 12 .data         000001d0  c0006f30  c0006f30  00007000  2**2
 CONTENTS, ALLOC, LOAD, DATA
 13 .eh_frame     00000004  c0007100  c0007100  000071d0  2**2
 CONTENTS, ALLOC, LOAD, DATA
 14 .jcr          00000004  c0007104  c0007104  000071d4  2**2
 CONTENTS, ALLOC, LOAD, DATA
 15 .sdata        00000000  c0007108  c0007108  000071d8  2**0
 CONTENTS
 16 .sbss         00000000  c0007108  c0007108  000071d8  2**0
 CONTENTS
 17 .tdata        00000000  c0007108  c0007108  000071d8  2**0
 CONTENTS
 18 .tbss         00000000  c0007108  c0007108  000071d8  2**0

 19 .bss          00000d78  c0007108  c0007108  000071d8  2**2
 ALLOC
 20 .heap         00000400  c0007e80  c0007e80  000071d8  2**0
 ALLOC
 21 .stack        00000400  c0008280  c0008280  000071d8  2**0
 ALLOC
 22 .debug_line   0000779f  00000000  00000000  000071d8  2**0
 CONTENTS, READONLY, DEBUGGING
 23 .debug_info   00008b11  00000000  00000000  0000e977  2**0
 CONTENTS, READONLY, DEBUGGING
 24 .debug_abbrev 000028e7  00000000  00000000  00017488  2**0
 CONTENTS, READONLY, DEBUGGING
 25 .debug_aranges 000006c0  00000000  00000000  00019d70  2**3
 CONTENTS, READONLY, DEBUGGING
 26 .debug_macinfo 0007f541  00000000  00000000  0001a430  2**0
 CONTENTS, READONLY, DEBUGGING
 27 .debug_frame  00000f10  00000000  00000000  00099974  2**2
 CONTENTS, READONLY, DEBUGGING
 28 .debug_loc    00003f80  00000000  00000000  0009a884  2**0
 CONTENTS, READONLY, DEBUGGING
 29 .debug_pubnames 00000fbe  00000000  00000000  0009e804  2**0
 CONTENTS, READONLY, DEBUGGING
 30 .debug_str    000018d5  00000000  00000000  0009f7c2  2**0
 CONTENTS, READONLY, DEBUGGING
 31 .debug_ranges 00000078  00000000  00000000  000a1097  2**0
 CONTENTS, READONLY, DEBUGGING

Even though this is a lot of mumbo-jumbo, there are three main parts: the reset and interrupt vectors around address zero, the main parts of the ELF (.text, .data and such) at 0xc0000000 and on, and the debug parts, which have no memory allocation at all.

The reset branch to application

This is interesting to compare with the Microblaze’s memory map. It can be deduced from the .mhs file, but hey, the log file (with .log suffix) has this segment:

Address Map for Processor microblaze_0
 (0x00000000-0x00001fff) microblaze_0_d_bram_ctrl    microblaze_0_dlmb
 (0x00000000-0x00001fff) microblaze_0_i_bram_ctrl    microblaze_0_ilmb
 (0x40000000-0x4000ffff) Push_Buttons_4Bits    axi4lite_0
 (0x40020000-0x4002ffff) LEDs_4Bits    axi4lite_0
 (0x40040000-0x4004ffff) DIP_Switches_4Bits    axi4lite_0
 (0x40600000-0x4060ffff) RS232_Uart_1    axi4lite_0
 (0x40800000-0x4080ffff) IIC_SFP    axi4lite_0
 (0x40820000-0x4082ffff) IIC_EEPROM    axi4lite_0
 (0x40840000-0x4084ffff) IIC_DVI    axi4lite_0
 (0x40a00000-0x40a0ffff) SPI_FLASH    axi4lite_0
 (0x40e00000-0x40e0ffff) Ethernet_Lite    axi4lite_0
 (0x41800000-0x4180ffff) SysACE_CompactFlash    axi4lite_0
 (0x74800000-0x7480ffff) debug_module    axi4lite_0
 (0xc0000000-0xc7ffffff) MCB_DDR3    axi4_0

So obviously all the main ELF parts go directly to the DDR memory (that isn’t much of a surprise), and the reset/interrupt vectors go to the internal block RAM.

A quick disassembly reveals the gory details:

> mb-objdump --disassemble sdk/peripheral_tests_1/Debug/peripheral_tests_1.elf
sdk/peripheral_tests_1/Debug/peripheral_tests_1.elf:     file format elf32-microblazele

Disassembly of section .vectors.reset:

00000000 <_start>:
 0:    b000c000     imm    -16384
 4:    b8080000     brai    0
Disassembly of section .vectors.sw_exception:

00000008 <_vector_sw_exception>:
 8:    b000c000     imm    -16384
 c:    b8081858     brai    6232
Disassembly of section .vectors.interrupt:

00000010 <_vector_interrupt>:
 10:    b000c000     imm    -16384
 14:    b80818a4     brai    6308
Disassembly of section .vectors.hw_exception:

00000020 <_vector_hw_exception>:
 20:    b000c000     imm    -16384
 24:    b8081870     brai    6256
Disassembly of section .text:

c0000000 <_start1>:
c0000000:    b000c000     imm    -16384
c0000004:    31a07108     addik    r13, r0, 28936
c0000008:    b000c000     imm    -16384
c000000c:    30406f30     addik    r2, r0, 28464
(... and it goes on and on ...)

So let’s look at the reset vector at address zero. The first IMM opcode loads 0xc000 as the upper 16 bits for the command following, which is a branch immediate command. Together, they make a jump to 0xc0000000. Likewise, the software exception jumps to 0xc0001858 and so on.
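To make the arithmetic explicit, here’s a tiny sketch in C of how the IMM prefix and the immediate of the following brai combine into a full 32-bit branch target (the values are taken from the software exception vector above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  uint16_t imm_high = (uint16_t)-16384; /* "imm -16384" is 0xc000 as 16 bits */
  uint16_t brai_low = 6232;             /* "brai 6232" is 0x1858 */

  uint32_t target = ((uint32_t)imm_high << 16) | brai_low;
  printf("branch target: 0x%08x\n", (unsigned)target); /* 0xc0001858 */
  return 0;
}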

Since only the block RAM’s content can be included in the download.bit bitfile, only these jump vectors depend on the ELF file during the “update bitstream” process. That’s why one often gets away with not running this process, even when the ELF has been modified by a plain recompilation: these vectors rarely change.

And now to the bootloop ELF

So what is the bootloop code doing? The headers are no more impressive than

> mb-objdump --headers bootloops/microblaze_0.elf

bootloops/microblaze_0.elf:     file format elf32-microblazele

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
 0 .boot         00000004  00000000  00000000  00000074  2**0
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 1 .text         00000000  00000000  00000000  00000074  2**0
 CONTENTS, ALLOC, LOAD, READONLY, CODE
 2 .data         00000000  00000000  00000000  00000074  2**0
 CONTENTS, ALLOC, LOAD, DATA
 3 .bss          00000000  00000000  00000000  00000078  2**0
 ALLOC

Note the Size column: All entries are empty, except for the .boot section, which is four bytes small (one single instruction). That doesn’t leave room for sophisticated software, and the disassembly is indeed

> mb-objdump --disassemble bootloops/microblaze_0.elf

bootloops/microblaze_0.elf:     file format elf32-microblazele

Disassembly of section .boot:

00000000 <_boot>:
 0:    b8000000     bri    0        // 0

Which is simply an endless loop: bri is a PC-relative branch, so a zero offset jumps to itself. So they called it bootloop for a reason.


Booting a Microblaze processor + software using Compact Flash

This is a small guide to loading a standalone application + bitstream to an FPGA using the CompactFlash card. Or put otherwise, how to make the System ACE chip happy.

For loading a Linux kernel in the same way, I suggest referring to a special post in that subject.

Formatting the flash

Rule #1: Don’t format it unless you have to. And if you have to, read the System ACE CompactFlash Solution datasheet (DS080.pdf), in particular “System ACE CF Formatting Requirements”, which basically says that if you format the flash under XP, it won’t work. To summarize briefly:

  • Make it a FAT12 or FAT16, and not a FAT32 (the usual choice)
  • More than one sector per cluster
  • Only one reserved sector (XP may very well allocate more)
  • Maximum 2GB capacity (note that when it says 2GB commercially, it’s usually slightly less, but can be more. Partitioning is recommended)

It’s recommended to rewrite the partition table, as it may arrive messy. With fdisk, this is the desired final format (give or take sizes):

Disk /dev/sdd: 2017 MB, 2017419264 bytes
64 heads, 63 sectors/track, 977 cylinders
Units = cylinders of 4032 * 512 = 2064384 bytes
Disk identifier: 0x00000000

 Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1         977     1969600+   6  FAT16

NOTE: My Flash Disk appeared as /dev/sdd, yours may appear as something else. Don’t forget to fix this when running these commands, or you may wipe your hard disk!

Note the file system ID 6 (FAT16). The card originally arrived with type 4, which is “FAT16 < 32MB”. To format the Compact Flash correctly in Linux, go (replace sdd1 with the correct device, or you may erase something you didn’t want to):

# mkdosfs -R 1 -F 16 /dev/sdd1

And then verify that you got one single reserved sector (it’s quite possible you didn’t):

# hexdump -n 32 -C /dev/sdd1
00000000  eb 3c 90 6d 6b 64 6f 73  66 73 00 00 02 20 01 00  |.<.mkdosfs... ..|
00000010  02 00 02 00 00 f8 f5 00  3f 00 40 00 00 00 00 00  |........?.@.....|

The 16-bit word at offset 0x0e is the reserved sector count, as detailed in Wikipedia. If it isn’t as shown above, System ACE won’t boot. Unfortunately, recent versions of mkdosfs have a new “feature” which silently rounds up the number of reserved sectors to align with clusters. So it gets it wrong. The solution is to downgrade this simple utility, possibly by downloading it from here. Version 3.0.9 is too new, 2.11 is fine.
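For a quick check without eyeballing hexdump output, here’s a small sketch in C which reads the boot sector and prints the reserved sector count. Run it on the partition device, e.g. /dev/sdd1:

#include <stdio.h>
#include <stdint.h>

int main(int argc, char *argv[])
{
  uint8_t sector[512];
  FILE *f;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s /dev/sdXN\n", argv[0]);
    return 1;
  }

  f = fopen(argv[1], "rb");
  if (!f) {
    perror(argv[1]);
    return 1;
  }

  if (fread(sector, 1, sizeof(sector), f) != sizeof(sector)) {
    fprintf(stderr, "Failed to read the boot sector\n");
    return 1;
  }
  fclose(f);

  /* The reserved sector count: little-endian 16-bit word at offset 0x0e.
     System ACE requires it to be exactly 1. */
  printf("Reserved sectors: %u\n",
         (unsigned)(sector[0x0e] | (sector[0x0f] << 8)));
  return 0;
}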

Minimalistic setting

If there’s no xilinx.sys file in the root directory, and there is a file with an .ace extension, System ACE will boot from that file. Make sure there’s only one file with the .ace extension in the flash’s root directory. This setting doesn’t take advantage of the possibility to configure which image to boot from at powerup, but it’s easy to start off with.

Configurable setting

We shall now look at a setting which has only one .ace image to boot from, but is easily expanded to several images, chosen by the levels of three pins of the System ACE chip at powerup.

In the root directory, there should be a xilinx.sys file, saying something like this:

# Any comment goes here
dir = trydir;
cfgaddr0 = cfg0;
cfgaddr1 = cfg0;
cfgaddr2 = cfg0;
cfgaddr3 = cfg0;
cfgaddr4 = cfg0;
cfgaddr5 = cfg0;
cfgaddr6 = cfg0;
cfgaddr7 = cfg0;

The eight different cfgaddr lines tell the (Xilinx) System ACE chip which directory to go to, depending on the state of the three CFGADDR pins of the chip. So different profiles can be chosen from with DIP switches and such. In the case above, all eight configurations point at the same directory, cfg0.

The first line declares the main working directory, which is trydir.

So in the case above, the root directory must have a directory called trydir, and within that directory, there must be a directory called cfg0.

And in cfg0, there must be a single file with .ace suffix, which is the ACE file to be loaded into the FPGA. Or more precisely, the ACE file is a translation of an SVF file, which is a sequence of JTAG instructions.

In order to allow configuration at powerup, create other directories (cfg1, cfg2 etc) and assign them to the desired cfgaddrN in the xilinx.sys file.
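So with a cfg1 directory added and assigned to, say, cfgaddr1, the CompactFlash layout would look something like this (the .ace file names are arbitrary):

/xilinx.sys
/trydir/cfg0/mydesign.ace
/trydir/cfg1/otherdesign.ace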

Generating the ACE file

Everything said here is related to the software arriving with ISE 13.2. It looks like there have been some significant changes from past versions.

In the Xilinx Platform Studio (EDK), pick Hardware > Generate bitstream on the configured processor. Basically, this generates netlists, builds them, and runs map, place and route, and bitgen, which creates a file such as system.bit.

Export the hardware design to the SDK (Project > Export hardware design to SDK…), and then develop with the SDK based upon that hardware. The bundle includes a hardware description as an XML file as well as the bitfile.

Once the project is built, it generates an .elf file, usually in the Debug subfolder. Its name and path are easily found in the Executable tab at the bottom of the SDK. Back in the EDK, pick Project > Select ELF file… and choose the relevant executable (for implementation). Then pick Device Configuration > Update Bitstream. That creates download.bit. Formally, this step is necessary every time the ELF is changed, even though things will most likely work without updating download.bit every time, since the relevant parts stay the same.

Create a directory to gather the relevant files, and copy the following into it:

  • The Tcl script generating the ACE file: ISE_DS/EDK/data/xmd/genace.tcl (relative to the path where Xilinx ISE is installed)
  • The bitstream (download.bit) file
  • The ELF file

Open a command shell (Project > Launch Xilinx Shell if you like), change to this directory and go:

xmd -tcl genace.tcl -hw download.bit -elf myelf.elf -ace myace.ace -board sp605 -target mdm

which generates a lot of junk files (.svf most notably, containing JTAG commands in a portable format), and eventually myace.ace is created (any file name is OK, of course).

In the example above, I assumed that the target is the SP605 board. Looking at the genace.tcl script easily reveals which boards are supported. If yours isn’t, it’s not such a big deal. The only reason the board matters is that System ACE needs to know which device in the JTAG chain to talk with, plus some programming parameters. The -board flag to this script allows setting the options in a “genace option file” (whatever that means). I would hack the script, though. It looks easier. See here for more information.

A test run

At times, the SP605 board’s green LED went on, but nothing happened. Pressing SYS_ACE_RESET (the middle button of the three close to the Compact Flash jack) caused a reload, which was OK. Probably some kind of race condition during powerup.

References

The walkthrough above is based upon this somewhat outdated guide. The BIST sources (rdf0032.zip) are indeed recommended for download, because of other issues of interest:

  • The ready_for_download subdirectory, which shows another example of a Compact Flash layout
  • The bootloader/src subdirectory, which has sources for loading executables from the flash’s filesystem in SREC format (using sysace_fopen and the like).
  • The genace_all.sh file in the ready_for_download subdirectory, showing how to create SREC files from ELFs with mb-objcopy.


Random Microblaze notes to self


A mix of issues not deserving a post of their own.

COM port issues (with Windows XP)

The SDK has its own terminal, which can be set to run with a serial port. It works fine.

As for Hyperterminal, by all means configure a connection with a specified Hyperterminal configuration file. Just setting the properties of the current connection holds the terminal in disconnected mode until some key is pressed on the keyboard, ignoring incoming data. This can make it look as if nothing is sent from the other end.

More importantly, when the card is turned on, the COM port will not appear if Hyperterminal is open. So Hyperterminal has to be closed and reopened every time the card is powered on.

The setting is basically 9600 baud, 8 bits, 1 stop bit, no parity and no flow control. Despite some official guides, it looks like it’s not necessary to go to the Device Manager, right-click CP210x USB to UART Bridge Controller and set up the interface’s Properties at that level. Note that those settings revert to default every time the USB interface disappears and reappears. At least with Hyperterminal, there have been no problems with having wrong values in the Bridge’s own settings.

A simple led toggling application

#include "xparameters.h" /* For XPAR_LEDS_4BITS_DEVICE_ID */
#include "xstatus.h"
#include "xgpio.h"
#include "xil_cache.h"
#include "xil_printf.h"

XGpio GpioOutput;

#define LED_CHANNEL 1

int main()
{
  volatile int Delay;
  int Status;
  int count = 0;

  Xil_ICacheEnable();
  Xil_DCacheEnable();

  print("---Entering main---\n\r");

  Status = XGpio_Initialize(&GpioOutput, XPAR_LEDS_4BITS_DEVICE_ID);
  if (Status != XST_SUCCESS)
    return Status;

  XGpio_SetDataDirection(&GpioOutput, LED_CHANNEL, 0x0); /* All pins outputs */

  while (1) {
    count++;

    /* Put the counter's lower 4 bits on the LEDs */
    XGpio_DiscreteWrite(&GpioOutput, LED_CHANNEL, (count & 0xf));

    for (Delay = 0; Delay < 1000000; Delay++); /* Crude busy-wait delay */
  }

  // Never reached

  Xil_DCacheDisable();
  Xil_ICacheDisable();

  return 0;
}

Making the processor with command line

An example command set for the generation of a Microblaze processor:

platgen -p xc6slx45tfgg484-3 -lang verilog    -msg __xps/ise/xmsgprops.lst system.mhs

If the processor isn’t going to be at top level, go:

platgen -p xc6slx45tfgg484-3 -lang verilog   -toplevel no -ti system_i -msg __xps/ise/xmsgprops.lst system.mhs

or ngdbuild will complain about double IO buffers.

That creates an hdl directory with a toplevel system module, a system_stub.v for instantiation, and several other HDL files. Configuration files for synthesis are written into the “synthesis” directory. The actual cores are in NGC format. Almost all core HDL files are wrappers (in VHDL).

To synthesize, change directory to “synthesis”

cd synthesis

and run the main synthesis script

synthesis.cmd

That’s a quick synthesis, because it’s all wrappers. The script ends with an exit 0, possibly making the command window close in the end.

Anyhow, a system.ngc file (netlist) was just created in the implementation directory.

Implementation with:

xflow -wd implementation -p xc6slx45tfgg484-3 -implement xflow.opt system.ngc

After this, PAR is OK (and a Perl script verifies that). But hey, the xflow.opt file is generated by EDK, so this hardly helps. Still, this looks like a fairly common implementation flow.

Notes for using system.ngc directly

That is, creating a black box within a regular project for the processor. This can also be done by embedding the processor into an ISE project, but sometimes ISE needs to be avoided.

  • Create the netlist files manually with platgen, with the non-toplevel option mentioned above. Or alternatively, include a system.xmp in a plain ISE project, and allow the NGC files to be generated from there.
  • Copy all NGC and NCF files in the “implementation” directory (possibly excluding system.ngc) to somewhere ngdbuild looks for binaries (as specified with the -sd flag). Don’t copy NGC files from its subdirectories.
  • Copy the system.v file from the “hdl” directory. This has black box module declarations for all .ngc files except system.ngc itself.
  • For non-Linux use, copy edkBmmFile.bmm from the main implementation directory to somewhere, and use -bm flag on ngdbuild to point at this file. This helps the data2mem block RAM initialization utility change the right places in the bitstream. This is necessary on standalone applications, for which the start address is zero. Linux systems start directly from external memory.
  • Add the -sd flag in the .xst file used for parameters by the XST synthesizer, so it points at where the Microblaze’s NGC files can be found. This will make XST read the cores at the beginning of Advanced HDL Synthesis; it’s recommended to verify that this indeed happens. This is important, because some of the cores include the I/O buffers. When the cores are read, XST refrains from putting its own I/O buffers where they are already instantiated by the cores. Failing to read these cores will result in ngdbuild complaining about I/O buffers being connected in series: one generated by XST and one by the core.
  • Implementing a bitstream file directly from a system.ngc may fail if too many I/Os are connected. A large number can make sense when they go to logic, but not to actual pins. If the purpose of this bitstream generation is to export it to the SDK for the sake of setting up a BSP (or generating a device tree), the solution is to remove these external ports, implement, and then restore the ports. This is most easily done by editing the MHS file directly. It also seems like running Project Navigator’s “Export Hardware Design To SDK without Bitstream” process, which is available for XMP sources in the design, will work without removing ports.

References

  • Main start-off source: xilinx.wikidot.com
  • Using the genace (TCL) script
  • Linux 2.6 for Microblaze main page
  • Linux on Xilinx devices (PPC) — useful, also has the command line for formatting the Compact Flash
  • A bit about setting up the device tree: In the Linux source, Documentation/devicetree/bindings/xilinx.txt
  • In the Linux source, arch/microblaze/boot/dts/system.dts — A sample DTS file (not the one to use!)

PCIe: Is your card silently struggling with TLP retransmits?

Introduction

The PCI Express standard requires an error detection and retransmit mechanism, which ensures that TLP packets indeed arrive correctly. The need for reliable communication on a system bus is obvious, but this mechanism also sweeps problems under the carpet: If data packets arrive faulty or get lost in the lower layers, practically nobody will notice. While error reporting mechanisms exist at the hardware level, there is no mechanism to inform the end user that something isn’t working so well.

Update, 19.10.15: The Linux kernel nowadays has a mechanism for turning AER messages into kernel messages. In fact, they can easily flood the log, as discussed in this post of mine.

Errors in the low-level packets are not only a performance issue (retransmissions are a waste of bandwidth). With properly designed hardware, there is no reason for them to appear at all, so their very existence indicates that something might be close to stopping working.

When developing hardware or using PCIe extension cables, this issue is even more important. A setting which hasn’t been verified extensively may appear to work, but in fact it’s just barely getting the data through.

The methodology

According to the PCIe spec, correctable (as well as uncorrectable) errors are noted in the PCI Express Capability structure by setting bits matching the type of error. Using a command-line application in Linux, we’ll detect the status of a specific device.

By checking the status register of our specific device, it’s possible to tell if it has detected (and fixed) something wrong in the TLP packets it has received. To detect corrected errors in TLPs going in the other direction, it’s necessary to locate the device’s link partner (a switch, bridge or the root complex). Even then, it will be difficult to say something definite: If the link partner reports an error, there may not be a way to tell which link (and hence device) caused it.

In this example, we’ll check a Xillybus peripheral (custom hardware), because we can control the amount of data flowing from and to it. For example, in order to send 100 MB of zeros in a loop, just go:

$ dd if=/dev/zero of=/dev/xillybus_write_32 bs=1k count=100k &
$ cat /dev/xillybus_read_32 > /dev/null

The Device Status Register

This register is part of the PCI Express Capability structure, at offset 0x0a. This register’s 4 least significant bits can supply information about the device’s health:

  • Bit 0 — Correctable Error Detected. This bit is set if e.g. a TLP packet doesn’t pass the CRC check. This error is correctable with a retransmit, and hence sets this bit.
  • Bit 1 — Non-Fatal Error Detected. A condition which wasn’t expected, but could be recovered from. This may indicate some incompatibility between the link partners, or a physical layer error, which caused a recoverable mishap in the protocol.
  • Bit 2 — Fatal Error Detected. This means that the device should be considered unreliable. Unrecoverable packet loss is one of the reasons for setting this bit.
  • Bit 3 — Unsupported Request Detected. When the device receives a request packet which it doesn’t support, this bit goes high. It may be harmless, in particular if the hosting hardware is significantly newer than the device.

(See section 6.2 for the classification of errors)

Checking status

This requires a fairly recent version of setpci (3.1.7 is enough). Earlier versions may not recognize extended capability registers by their name.

As mentioned earlier, we’ll query a Xillybus peripheral. This allows running a script loop which sends a known amount of data, and then checking if something went wrong.

To read the Device Status Register, become root and go:

# setpci -d 10ee:ebeb CAP_EXP+0xa.w
0000

Despite the command’s name, setpci, it actually reads a word (the “.w” suffix) at offset 0xa of the PCI Express Capability (CAP_EXP) structure. The device is selected by its Vendor/Product IDs, which are 0x10ee and 0xebeb respectively. This works well when there’s a single device with that pair.
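For the curious, this is roughly what setpci does under the hood. The sketch below in C (assuming a conventional 256-byte configuration space; the sysfs path encodes the device’s bus position and is just an example) walks the capability list and reads the Device Status Register. Run it as root:

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
  const char *path = "/sys/bus/pci/devices/0000:01:00.0/config";
  uint8_t cfg[256];
  int fd = open(path, O_RDONLY);
  int pos;

  if (fd < 0) {
    perror(path);
    return 1;
  }

  if (read(fd, cfg, sizeof(cfg)) != sizeof(cfg)) {
    fprintf(stderr, "Failed to read full config space (are you root?)\n");
    return 1;
  }
  close(fd);

  /* Offset 0x34 holds the offset of the first capability. Each capability
     starts with an ID byte, followed by a next-pointer byte. */
  for (pos = cfg[0x34] & ~3; pos; pos = cfg[pos + 1] & ~3)
    if (cfg[pos] == 0x10) { /* The PCI Express capability's ID */
      uint16_t status = cfg[pos + 0x0a] | (cfg[pos + 0x0b] << 8);
      printf("Device Status: %04x\n", status);
      return 0;
    }

  fprintf(stderr, "No PCI Express capability found\n");
  return 1;
}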

Otherwise, it can be singled out by its bus position. For example, check one of the switches:

# lspci
(... some devices ...)
00:1b.0 Audio device: Intel Corporation Ibex Peak High Definition Audio (rev 05)
00:1c.0 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 1 (rev 05)
00:1c.1 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 2 (rev 05)
00:1c.3 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 4 (rev 05)
00:1d.0 USB Controller: Intel Corporation Ibex Peak USB Universal Host Controller (rev 05)
(... more devices ...)
[root@ocho eli]# setpci -s 00:1c.0 CAP_EXP+0xa.w
0010

In both cases the return value was zeros on bits 3-0, indicating that no errors whatsoever were detected. But suppose we got something like this (which is a result of playing nasty games with the PCIe connector):

# setpci -d 10ee:ebeb CAP_EXP+0xa.w
000a

Bits 1 and 3 are set here, indicating a non-fatal error has been detected as well as an unsupported request. Surprisingly enough, playing with the connector didn’t cause a correctable error.
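As a convenience, here’s a trivial sketch in C (the program’s name is made up, of course) which decodes these four bits, given the hex word that setpci printed:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  unsigned long status;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s <hex status word>\n", argv[0]);
    return 1;
  }

  status = strtoul(argv[1], NULL, 16);

  /* The four error bits of the Device Status Register, as listed above */
  printf("Correctable Error Detected:   %s\n", (status & 1) ? "yes" : "no");
  printf("Non-Fatal Error Detected:     %s\n", (status & 2) ? "yes" : "no");
  printf("Fatal Error Detected:         %s\n", (status & 4) ? "yes" : "no");
  printf("Unsupported Request Detected: %s\n", (status & 8) ? "yes" : "no");
  return 0;
}

For example, running it with 000a (as obtained above) reports the non-fatal error and unsupported request bits as set.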

When writing to this register, any bit which is ’1’ in the written word is cleared in the status register. So to clear all four error bits, write the word 0x000f:

# setpci -d 10ee:ebeb CAP_EXP+0xa.w=0x000f
# setpci -d 10ee:ebeb CAP_EXP+0xa.w
0000

Alternatively, the output of lspci -vv can be used to spot an AER condition quickly. For example, a bridge not being happy with some packets sent its way:

# lspci -vv

[ ... ]

00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07) (prog-if 00 [Normal decode])
[ ... ]
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot+

[ ... ]

Identifying what went wrong

AER-capable endpoints are very likely to have related capability registers. These can be polled in order to figure out the nature of the errors. For example, to periodically poll and reset the Correctable Error Status Register, this little bash script can be used (note that the bus positions of the devices it polls are hardcoded):

#!/bin/bash
clear

while [ 1 ] ; do
 echo -en \\033[H # Move the cursor to the top left corner

 # The bus positions of the polled devices are hardcoded below
 for DEVICE in 00:1c.6 02:00.0 04:00.0 05:00.0 ; do
 echo $DEVICE: `setpci -s $DEVICE ECAP_AER+10.l` # Correctable Error Status
 setpci -s $DEVICE ECAP_AER+10.l=31c1 # Write ones to clear the error bits
 done

 usleep 100000 # 100 ms between polls
done

Some general notes

  • setpci writes directly to the PCIe peripheral’s configuration space. Typos may be as harmful as any other conduct as root. Note that almost all peripherals, including disk controllers, are linked to the PCIe bus somehow.
  • The truth is that all these 0x prefixes are redundant. setpci assumes hex values anyhow.
  • When setpci answers “Capability 0010 not found”, it doesn’t necessarily mean that the PCI Express capability structure doesn’t exist on some device. It can also mean that no device was matched, or that you don’t have permissions for the relevant operation.

Embedded PC talking with an FPGA: Make it simple

Why embedded PC

Embedded PC computers are commonly used instead of simple microcontrollers when more than a basic interface with the outer world is needed, e.g.

  • Disk storage (ATA, SATA or ATAPI)
  • USB connection with disk-on-key storage or other peripherals
  • Ethernet connection (TCP/IP in particular)
  • VGA/DVI for display of GUI, possibly based upon a high-level standard widget library

With PC/104 board computers and their derivatives available in the market at modest prices, the adopted solution is often to design the custom peripherals using these interfaces. The non-trivial task is often not designing the custom logic, but rather interfacing with the PC through the dedicated pins. Writing the drivers for the PC can also turn out to be frustrating. Things don’t become easier when high data bandwidths are required, and hence DMA becomes a must.

Using standard peripherals

Dedicated signal processing or data acquisition cards are sometimes used with traditional PCI/PCIe interface when data capture is an integral part of the project. These dedicated cards are not only expensive, but their configuration and adaptation to the dedicated application can sometimes turn out to be as demanding as designing the components from scratch.

A custom, yet painless solution

An elegant shortcut is to design a simple daughterboard based upon a Spartan-6 FPGA with a built-in PCIe component. With an embedded computer supporting the PC/104-Express form factor, the communication with the board is immediate, and requires just 7 wires of connection. True, designing the PCIe interfaces on both sides is by no means a simple task, but Xillybus has already taken care of that. The user application talks with a FIFO on the FPGA, and through a device file on a Linux computer. All the low-level communication is transparent, leaving the application designer with an intuitive channel of data running at up to 200 MBytes/s.

This works with any processor supporting PCIe, of course, but embedded SoC processors with native PCIe support are a new market, and well, fullblown PCs are not really embedded. One way or another, there is no reason to struggle with getting data transported between a PC and a custom peripheral anymore.

“FPGA-printf”: When Chipscope isn’t fast or deep enough

The concept of having a debugging agent within the FPGA design to probe the logic is by no means new. Xilinx’ Chipscope has presented a neat solution as an internal logic analyzer for several years.

Since Chipscope uses the JTAG channel for its data transfer, its sampling depth is effectively limited by the block RAMs allocated for its use. Real-time continuous sampling is out of the question.

When printf-style debugging is more in place, Xillybus can come in handy. Based upon the FPGA’s PCI Express hardware core, it allows for up to 200 MByte/s sustained data transfer. The Xillybus IP core interfaces with user application logic through a standard FIFO: Whatever is written to the FIFO appears at the host side as a data stream represented as a file. In this sense, the debugging method resembles printf-debugging: The designer chooses what data to write to the FIFO, and when. It can be a notification of certain events or a bulk stream of metadata. Either way, Xillybus offers an immediate way to transfer debug information to the host.

On the host side, a simple shell script command (such as “cat”) can store the received data on the disk, or it may be analyzed by a plain application or script for detecting events of interest.

Data can be sent in the other direction as well, typically at up to 100 MByte/s. This feature can be used for behavioral testing of logic in hardware. When exhaustive verification (or validation) of core logic is desired, massive data needs to be sent to the logic as input, and its output is then sent back for comparison with the expected results. The overall data transfer easily reaches gigabytes, so not only is a high-bandwidth channel necessary, but also a convenient interface for simple user-space applications: Many times the reasonable solution is to generate the test data on the fly, as well as to compare the logic’s results in software as they arrive.

Xillybus offers both: With the logic under test connected to FIFOs in the FPGA, the application software on the host merely opens one file for input and another one for output. The entire testing process then consists of writing the test data to the output file, reading the returning data from the input file, and comparing the latter with the expected results.
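As a minimal sketch of such a test in C (the device file names are the Xillybus defaults; the chunk is kept small enough to fit in the FIFOs and DMA buffers, since a single-threaded write-then-read would otherwise block; a real test would run the writer and reader concurrently):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096 /* Small enough to fit in the buffers along the way */

int main(void)
{
  unsigned char out[CHUNK], in[CHUNK];
  int fd_w = open("/dev/xillybus_write_32", O_WRONLY);
  int fd_r = open("/dev/xillybus_read_32", O_RDONLY);
  int i;

  if (fd_w < 0 || fd_r < 0) {
    perror("open");
    return 1;
  }

  for (i = 0; i < CHUNK; i++)
    out[i] = i & 0xff; /* A trivial known test pattern */

  if (write(fd_w, out, CHUNK) != CHUNK) {
    perror("write");
    return 1;
  }

  /* read() may legally return less than requested; a real test should
     loop until the full chunk has arrived. */
  if (read(fd_r, in, CHUNK) != CHUNK) {
    perror("read");
    return 1;
  }

  printf(memcmp(out, in, CHUNK) ? "MISMATCH\n" : "OK\n");
  return 0;
}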

And if this sounds too good to be true, a complete evaluation kit is available for download with no strings attached. The evaluation hardware can be purchased directly from Xilinx at around $500.

Data acquisition with an FPGA: Not necessarily a headache.

The headache…

Data acquisition is one of those tasks, which always seem so easy until they get real. In the block diagram, capturing data is just an arrow to the computer. In reality, getting the data from the FPGA to something that feels like the real world can turn out to be the hardest part in the entire project.

And the striking question is: If getting data on and off an FPGA is such a commonly needed task, how come there isn’t a generic package, which does the job for you? How come getting the data to a computer just has to involve learning how a PCI or PCI express bus works, DMA, bus master issues, TLP, RCB, you name it. Or if the channel is through USB, the engineer needs to become friends with USB endpoints, understand bulk transfer, and how to configure a device to hotplug nicely.

… and its painkiller

And here comes the answer. This is exactly what Xillybus is: A generic solution for transporting data from and to an FPGA with a Windows or Linux computer at the other end.

In order to make things as simple as possible, the interfaces are the most easily understood ones: The FPGA designer works with plain FIFOs and possibly dual-port RAMs, and the programmer with plain userspace (device) files. No drivers to write, no complicated API to follow. As a matter of fact, the host software can even be written as scripts.

And the board design consists of getting 7 wires correctly. That’s all.

Several Xilinx FPGAs with T suffix are supported as well as Altera devices. Their hardware PCIe core is an invitation to connect with a computer, even if it’s just for testing logic components on hardware. Or getting debug information. Or getting a large chunk of samples from analog-to-digital converters (ADC) to check against simulation.

The barrier is gone. There is no need to consider the data transport a project of its own. A FIFO on one side, a file on the other, and the rest is a black box.

Drupal 7 Views: Making a block of links to related pages

Views and SQL

Using Drupal views is basically trying to figure out how to trick the machine into making the SQL query I would have written in five minutes. As a matter of fact, I don’t think I would have had a chance of getting this right, had I not known SQL pretty well.

Or, as one of the help pages was kind enough to clarify (the left side is merely examples):

SELECT n.title, u.name <–> fields
FROM {node} n base table <–> view type
INNER JOIN {users} u ON n.uid = u.uid <–> relationship
WHERE n.status = 1 <–> filter
AND u.uid = arg(1) <–> argument
ORDER BY n.changed DESC <–> sort

and I’ll add:

WHERE n.nid = p.nid <–> Contextual filter
GROUP BY <–> Aggregation

A word about relationships: The relationship entries in the Views page define the “ON” part in the JOIN. To access the values of the fields in the joined tables (e.g. as a displayed field), just select the desired value in the first menu (e.g. “add field”), and then pick the administrative name in the “relationship” drop-down menu. For some reason, I had the initial expectation that the “ON” value would appear in itself in the field list, but it doesn’t, since it’s available just as any value, only picked from the respective table.

Recommended modules

The Development module (“devel”) allows a dump of all SQL queries made while producing the currently shown page. It’s recommended just to get an idea of what a database hog Drupal is. The second, highly recommended module is Views PHP, despite its development release status. The latter allows injecting small pieces of PHP exactly where they are needed. In particular, there’s the PHP:Global pseudofield for both display and sorting, so rather than banging your head on how to twist things around, just write a small and elegant snippet of PHP.

Injecting a view with PHP

Possibly using the PHP text format, just type in any node’s text input (useful for making a view within a book node):

<?php print views_embed_view('doc_outline', $display_id = 'default') ?>

where ‘doc_outline’ is the machine name for the view (as it appears in URLs related to the view) and ‘default’ could be exchanged with ‘page’ or ‘block’, but why bother if the view has a single format? See the API page.

There have also been suggestions about embedding a block in PHP, but I never tried that.

Strategy

The purpose: Making a “relevant pages” block, based upon common tags of the pages. Don’t tell me Drupal has something doing that, because I know. It’s just that I’ve learned that these easy solutions always end up with more work. Besides, I have a special thing I want.

The setting is as follows: I have a special taxonomy named “relevance terms”. Pages with common terms in this taxonomy are considered to have similar content. I didn’t use the original tags, because I may want to use them for something representative.

Also, each content page has an extra field “LinkText”, which contains the text to appear on links to the page. For example, the download page’s title is simply “Download” but the link to this page should say something more imperative.

The immediate (and wrong) way to go is to make a view of content titles. Without any filtering, you get all content pages. So use the contextual filter to get only the current page, and use relationships to list all its taxonomy terms. Now another relationship to expand this to the pages using each taxonomy term? But the contextual filter kills any page other than the one currently displayed. It’s a dead end.

So the right way is to make a view of the page’s taxonomy terms. For each term, list the pages using it, and then squash duplicates. And then make the output nice. Easier said than done.

First steps

Add a new View, showing Taxonomy Terms of type Relevance Taxonomy. Don’t create a page, but a block. Display as an HTML list of fields. Save & Exit, and insert the block somewhere in the page, so it can be tested. Previews won’t work here, because it runs on Taxonomy nodes, not pages. Set title and such.

The number of items should be limited, and I don’t use a pager.

Contextual Filter & Relationship

This is done first, so we don’t mess with aggregation, which is going to be set up pretty soon. Under Advanced, add a contextual filter on “Taxonomy Term ID”. The following window will complain that there’s no source for contextual filter, so a default must be supplied. This is because we’re running in block context. The source is taken from the page.

We want the node ID to be compared with the current page, so pick “Provide default value” and “Taxonomy term ID from URL”. Uncheck “Load default filter from term page”, but check “Load default filter from node page”, and also “Limit terms by vocabulary”, picking the Relevance Taxonomy as the chosen vocabulary. Under “More” check “Allow multiple values”. This is necessary, since we don’t want just the first term to be used. I’m not sure if this item appears without setting up relationships. So if it’s missing, set up a relationship and come back to add this.

That’s it. Save and check up. We should now have a simple list of relevance terms in the view.

Next we add a relationship with the pages having the terms: Check “Taxonomy term: Content using relevance” (note that “relevance” is the name of the vocabulary here), check “Require this relationship” on the next screen (I suppose this makes an INNER JOIN as opposed to a LEFT JOIN), and save this.

Checking where we stand, we have each taxonomy term appearing a number of times. This is the natural behaviour of an inner join: Each combination of a term and a page using it creates a row. Since the pages aren’t listed, we just see each term repeated.

And since we’re at it, let’s eliminate the shown page’s entry in the related pages’ list. We need negative contextual filtering here: So add a new contextual filter, check “Content: Nid” (it wasn’t there until we added the relationship). Provide the default value as “Content ID of URL”, and under “More” check “Exclude”. So if a listed page matches the currently shown page, it’s excluded.

Save and verify than one or a few items have disappeared from the list.

Aggregation

Aggregation is “GROUP BY” in SQL, meaning that several rows with the same value in one of the fields are turned into a single row. Instead of the field’s value we have the count of rows grouped together or the maximum, minimum, average value or whatever the database allows. Aggregation is needed to eliminate the duplicate rows created by the relationship (that is, the inner join).

True, there is a “Distinct” checkbox under “Query settings” but it’s ineffective, since each of these duplicate rows are indeed distinct when the database answers the query. What makes them duplicate is the fact that the taxonomy term is dropped in the display. “Distinct” just adds the DISTINCT word to the SQL query.

So at this point, change “Use aggregation” to Yes. Things will get slightly messier from this point on.

Adding fields

Rule #1 for fields: The order in which they appear matters. In particular when using rewrite rules: Inserting data from other fields in substitution patterns works only for fields above the displayed one (those declared before it).

Remember that the goal is to show the Linktext field as a link to the page, and not just the title.

So the first field to add is the Node’s path (aliased link). We will use it later on. In the list, check “Content: Path”. Under “Aggregation type” pick “Group results together” which is what we pick all the time if not for any special reason. This choice wouldn’t appear without enabling aggregation, of course. On the third and last window, check “Exclude from display” unless you want to see it for debugging.

The second field to add is the link text. In the list, check “Content: LinkText”. Under “Aggregation type” pick “Group results together” and pick the “Entity ID” as group column, and no additional ones.

On the third page uncheck “Create a label” (no “LinkText:” shown to user). Under “Rewrite results” check “Output this field as a link”. Write [path] in the Link path text box. This string can be found in the Replacement Pattern list just below. The path was there because it was defined before the current field.

Check “Use absolute path”, or the links start with a double-slash and are rendered useless.

At this point I’ll mention that it’s possible to insert arbitrary HTML with replacement patterns. So it’s really useful.

Squashing duplicates

At this point it’s pretty evident that we have duplicate entries, and the taxonomy terms should be removed from the display.

So it’s time to edit the first field: The “Taxonomy Term: Name” and check “Exclude from Display”. But even more important, enter “Aggregation settings” and change “Aggregation type” to “Count”. The magic is that instead of a row for each taxonomy term, we get a single row with the number of them, ending up with a single row for each link.

Filter out inaccessible items

As is, users will see links to items they can’t access. So let’s add a simple filter (“Filter Criteria”). Pick “Content: Published or admin” and “Group results together” in the two following menus. And then just apply on the next menu. Done.

Note that unpublished items will still appear for admins, since the criterion is access. Pick “Content: Published” and choose “Yes” to remove unpublished items for anyone.

Sorting

I have to admit that I failed on this one at first. My intention was to sort the output by the number of mutual tags. That would be easy in SQL, since COUNT() can be given a label with the AS keyword. I found nothing to assist this in the Views menus.

As it turned out, values of COUNT() are available, but not through the menu interface. With the Views PHP module, it’s a piece of cake.

Say that there’s already a field saying “COUNT(Taxonomy term: Name)”. Then add a sort criterion of type Global: PHP and set the code to

return ($row2->name - $row1->name);

Since $row1->name is the count of the rows with the same name field, this simple chunk of code does the work.

Using the view in a non-Drupal page

Sometimes the whole framework is just too heavy for a site, and all that’s needed is just a view in a plain PHP file. So if a view block with the name “inject” exists in the system, the following code snippet displays it (and a lot of CSS mumbo-jumbo).

<?php

chdir('trydrupal');

define('DRUPAL_ROOT', getcwd());

require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

print views_embed_view('inject', $display_id = 'default')
?>

Note the chdir(). It allows Drupal to be installed in a completely separate directory (“trydrupal” in this case).


ASPM makes Spartan-6’s PCIe core miss TLP packets

The fatal error

Let’s break the bad news: Spartan-6’s PCIe core may drop TLP packets sporadically when ASPM (Active State Power Management) is enabled. That means that any TLP given to the core for transmission can silently disappear, as if it was never submitted. I also suspect that the problem exists in the opposite direction.

Hardware involved: Spartan xc6slx45t-fgg484-3-es (evaluation sample version) on an SP605 evaluation board. That mounted on a Gigabyte G31M-ES2L motherboard, having the Intel G33 chipset and a E5700 3.0 GHz processor.

The fairly good news is that the core’s cfg_dstatus[2] (= fatal error detected) will go high as a result of dropping TLPs. Or at least so it did in my case. So it looks like monitoring this signal, and doing something loud if it goes to ’1’, is enough to at least know whether the core does its job.

Let me spell it out: If you’re designing with Xilinx’ PCIe core, you should verify that cfg_dstatus[2] stays ’0’, and if it goes high you should treat the PCIe endpoint as completely unreliable.

How to know if ASPM is enabled

On a Linux box, become root and go lspci -vv. The output will include all devices, but the relevant part will be something like

01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
 Subsystem: Xilinx Corporation Generic FPGA core
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
 Latency: 0, Cache Line Size: 4 bytes
 Interrupt: pin ? routed to IRQ 44
 Region 0: Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
 Capabilities: [40] Power Management version 3
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
 Address: 00000000fee0300c  Data: 4181
 Capabilities: [58] Express Endpoint IRQ 0
 Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
 Device: Latency L0s unlimited, L1 unlimited
 Device: AtnBtn- AtnInd- PwrInd-
 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
 Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM L0s Enabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1

There we have it: I set up the core to accept an unlimited L0s latency, hence the BIOS configured the device accordingly, and this ended up with ASPM enabled.

What we really want is the output to end with something like:

Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1

The elegant solution

The really good news is that there is a simple solution: Disable ASPM. In other words, program the link partners to never reach the L0s nor L1 power saving modes. In a Linux kernel driver, it’s pretty simple:

#include <linux/pci-aspm.h>

/* Typically called from the driver's probe method, with the struct
   pci_dev it was handed: */
pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1 |
                       PCIE_LINK_STATE_CLKPM);

This is something I would do without thinking twice for any device based upon Xilinx’ PCIe core. Actually, I would do this for any device for which power saving is irrelevant.

The maybe-working solution

In theory, the kernel can run under different ASPM policies, one of which is “powersave”. If it runs under “performance”, all transitions to L0s are disabled, and all should be well. In practice, it looks like the kernel community is pushing towards allowing L0s even under the performance policy.

The shaky workaround

When some software wants to allow L0s, it must check if the switching latency from L0s to L0 (that is, from napping to awake) is one the device can take. The device announces its maximal allowed latency in the PCI Express Capability Structure. By setting the acceptable L0s latency limit to the shortest latency allowed (64 ns), one can hope that the hardware will not be able to meet this requirement, and hence give up on using ASPM. This trick happened to work on my own motherboard, but another motherboard may be able to meet the 64 ns requirement, and enable ASPM. So this isn’t really a solution.

Anyhow, the success of this method will yield an lspci -vv output with something like

Capabilities: [58] Express Endpoint IRQ 0
 Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
 Device: Latency L0s <64ns, L1 <1us
 Device: AtnBtn- AtnInd- PwrInd-
 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
 Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1

How I know it isn’t my own bug

The transitions from L0 to L0s and back throttle the data flow through the PCIe core, so maybe these on-and-offs exposed a bug in my own HDL code’s data flow? Why do I blame Xilinx?

The answer was found in the dbg_* debug lines supplied from within the PCIe core. These lines go high whenever something bad happens in the core’s lower layers. Running without ASPM these lines stayed zero. When ASPM was enabled, and in conjunction with packet drops, the following lines were asserted:

  • dbg_reg_detected_fatal: Well, I knew this already. A fatal error was detected.
  • dbg_reg_detected_correctable: A correctable error was detected. Nice, but I really don’t care.
  • dbg_rply_timeout_status: The replay timer fired off: A TLP packet was sent, but didn’t receive an acknowledgement. That indicates that things aren’t perfect, but if the packet was retransmitted, this doesn’t indicate a user-visible issue.
  • dbg_dl_protocol_status: Ayeee. This means that an out of range ACK or NAK was received. In other words, the link partners are not on the same page regarding which packets are waiting for acknowledgement.

The last bullet is our smoking gun: It indicates that the PCIe link protocol has been violated. There is nothing the application HDL code can do to make this happen. The last two bullets indicate some problem in the domain of a TLP being lost and retransmitted, and some problem with the acknowledgement. Not a sign saying “a packet was lost”, but as close as one gets to that, I suppose.

Update: My attention was drawn in a comment below to some interesting Xilinx Answer records. Answer record #33871 mentions LL_REPLAY_TIMEOUT as the parameter to fix in order to solve a fatal error condition, but says nothing about packet dropping. It looks like this issue has been fixed in the official PCIe wrapper lately. This leaves me wondering whether people didn’t notice they lost packets, or if Xilinx decided not to admit it too loudly.