PCIe: Is your card silently struggling with TLP retransmits?

This post was written by eli on July 27, 2011
Posted Under: PCI express

Introduction

The PCI Express standard requires an error detection and retransmit mechanism, which ensures that TLPs arrive correctly. The need for reliable communication on a system bus is obvious, but this mechanism also sweeps problems under the carpet: If data packets arrive faulty or are lost in the lower layers, practically nobody will notice. While error reporting mechanisms exist at the hardware level, there is no mechanism to inform the end user that something isn’t working so well.

Update, 19.10.15: The Linux kernel nowadays has a mechanism for turning AER messages into kernel messages. In fact, they can easily flood the log, as discussed in this post of mine.

Errors in the low-level packets are not only a performance issue (retransmissions are a waste of bandwidth). With properly designed hardware, there is no reason for them to appear at all, so their very existence indicates that something might be close to failing.

When developing hardware or using PCIe extension cables, this issue is even more important. A setup which hasn’t been verified extensively may appear to work, while in fact it’s just barely getting the data through.

The methodology

According to the PCIe spec, correctable (as well as uncorrectable) errors are noted in the PCI Express Capability structure by setting bits matching the type of error. Using a command-line application in Linux, we’ll read the status of a specific device.

By checking the status register of our specific device, it’s possible to tell if it has detected (and fixed) something wrong in the TLP packets it has received. To detect corrected errors in TLPs going in the other direction, it’s necessary to locate the device’s link partner (a switch, bridge or the root complex). Even then, it will be difficult to say something definite: If the link partner reports an error, there may not be a way to tell which link (and hence device) caused it.
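One way to locate the link partner on Linux is through sysfs, where the directory of each PCI device is nested inside the directory of the bridge above it. The following is a minimal sketch under that assumption. The function name pci_parent and its optional second argument are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Print the bus position of the bridge directly upstream of a PCI device,
# relying on the nesting of device directories in sysfs.
# pci_parent is a hypothetical helper name. The second argument (the sysfs
# root) exists only so the function can be tried on a fake directory tree.
pci_parent() {
  dev=$1
  root=${2:-/sys}
  [ -e "$root/bus/pci/devices/$dev" ] || return 1
  real=$(readlink -f "$root/bus/pci/devices/$dev") || return 1
  parent=$(basename "$(dirname "$real")")
  case "$parent" in
    pci*) echo "none (device sits on a root bus)" ;;
    *)    echo "$parent" ;;
  esac
}
```

For example, pci_parent 0000:02:00.0 might print 0000:00:1c.0, which is then the device to query for errors in the other direction. Running lspci -t gives the same picture as a tree.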

In this example, we’ll check a Xillybus peripheral (custom hardware), because we can control the amount of data flowing from and to it. For example, in order to send 100 MB of zeros in a loop, just go:

$ dd if=/dev/zero of=/dev/xillybus_write_32 bs=1k count=100k &
$ cat /dev/xillybus_read_32 > /dev/null

The Device Status Register

This register is part of the PCI Express Capability structure, at offset 0x0a. This register’s 4 least significant bits can supply information about the device’s health:

Bit 0: Correctable Error Detected. This bit is set if, for example, a TLP doesn’t pass the CRC check. Such an error is corrected by a retransmit.
Bit 1: Non-Fatal Error Detected. A condition which wasn’t expected, but could be recovered from. This may indicate some incompatibility between the link partners, or a physical layer error, which caused a recoverable mishap in the protocol.
Bit 2: Fatal Error Detected. This means that the device should be considered unreliable. Unrecoverable packet loss is one of the reasons for setting this bit.
Bit 3: Unsupported Request Detected. When the device receives a request packet which it doesn’t support, this bit goes high. It may be harmless, in particular if the hosting hardware is significantly newer than the device.

(See section 6.2 for the classification of errors)
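To avoid decoding the bits by eye, the four flags listed above can be turned into names with a few lines of shell arithmetic. This is just a sketch: decode_devsta is a name I made up, and it expects the hex word in the form a register dump usually prints it (four hex digits, no 0x prefix).

```shell
#!/bin/sh
# Decode the four error bits of a Device Status Register value, given as
# a hex word without prefix (e.g. 000a).
# decode_devsta is a hypothetical helper name, not part of pciutils.
decode_devsta() {
  val=$(( 0x$1 ))
  out=""
  [ $(( val & 0x1 )) -ne 0 ] && out="$out Correctable"
  [ $(( val & 0x2 )) -ne 0 ] && out="$out Non-Fatal"
  [ $(( val & 0x4 )) -ne 0 ] && out="$out Fatal"
  [ $(( val & 0x8 )) -ne 0 ] && out="$out Unsupported-Request"
  out=${out# }
  echo "${out:-none}"
}

decode_devsta 000a   # prints: Non-Fatal Unsupported-Request
decode_devsta 0010   # prints: none (bit 4 is not one of the error bits)
```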

Checking status

This requires a fairly recent version of setpci (3.1.7 is enough). Earlier versions may not recognize extended capability registers by their name.

As mentioned earlier, we’ll query a Xillybus peripheral. This allows running a script loop that sends a known amount of data and then checks if something went wrong.

To read the Device Status Register, become root and go:

# setpci -d 10ee:ebeb CAP_EXP+0xa.w
0000

Despite the command’s name, setpci, it actually reads a word (the “.w” suffix) at offset 0xa in the PCI Express Capability (CAP_EXP) structure. The device is selected by its Vendor/Device IDs, which are 0x10ee and 0xebeb respectively. This works well when there’s a single device with that pair.

Otherwise, it can be singled out by its bus position. For example, to check one of the root ports:

# lspci
(... some devices ...)
00:1b.0 Audio device: Intel Corporation Ibex Peak High Definition Audio (rev 05)
00:1c.0 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 1 (rev 05)
00:1c.1 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 2 (rev 05)
00:1c.3 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 4 (rev 05)
00:1d.0 USB Controller: Intel Corporation Ibex Peak USB Universal Host Controller (rev 05)
(... more devices ...)
# setpci -s 00:1c.0 CAP_EXP+0xa.w
0010

In both cases the return value was zeros on bits 3-0, indicating that no errors whatsoever were detected (bit 4, which is set in the second reading, is AUX Power Detected and is not an error flag). But suppose we got something like this (which is a result of playing nasty games with the PCIe connector):

# setpci -d 10ee:ebeb CAP_EXP+0xa.w
000a

Bits 1 and 3 are set here, indicating a non-fatal error has been detected as well as an unsupported request. Surprisingly enough, playing with the connector didn’t cause a correctable error.
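Since a problem may be reported on either side of a link, it can be handy to sweep all devices at once instead of querying them one by one. The sketch below assumes that the first field of each lspci line is the bus position, and that setpci fails or prints nothing useful for devices without the PCI Express Capability structure. The name scan_devsta is made up.

```shell
#!/bin/sh
# List every device whose Device Status Register has any of bits 3-0 set.
# scan_devsta is a hypothetical helper name. Assumptions: the first field
# of each lspci line is the bus position, and setpci prints a plain hex
# word for devices that have the PCI Express Capability structure.
scan_devsta() {
  lspci | cut -d' ' -f1 | while read -r dev; do
    sta=$(setpci -s "$dev" CAP_EXP+0xa.w 2>/dev/null) || continue
    case "$sta" in
      ''|*[!0-9a-fA-F]*) continue ;;   # no PCIe capability, or no access
    esac
    if [ $(( 0x$sta & 0xf )) -ne 0 ]; then
      echo "$dev $sta"
    fi
  done
}
```

Run scan_devsta as root. An empty output means that none of the devices has an error bit set.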

When writing to this register, any bit which is ‘1’ in the written word is cleared in the status register. So to clear all four error bits, write the word 0x000f:

# setpci -d 10ee:ebeb CAP_EXP+0xa.w=0x000f
# setpci -d 10ee:ebeb CAP_EXP+0xa.w
0000
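Clearing and reading can be combined into a test cycle: clear the bits, push some data, and see if any bit came back on. Here is a rough sketch, assuming that exactly one device matches the Vendor/Device pair. The name check_after_traffic is invented for this example, and the traffic command is whatever exercises the device (the dd command shown earlier, for instance).

```shell
#!/bin/sh
# Clear the error bits, run a command that generates traffic, then report
# whether any of the four error bits came back on.
# check_after_traffic is a hypothetical helper. Its first argument is a
# Vendor:Device pair for setpci -d, the rest is the command to run.
check_after_traffic() {
  id=$1
  shift
  setpci -d "$id" CAP_EXP+0xa.w=000f      # write 1s to clear bits 3-0
  "$@"                                    # generate the traffic
  sta=$(setpci -d "$id" CAP_EXP+0xa.w)
  if [ -z "$sta" ]; then
    echo "no device"
    return 1
  fi
  if [ $(( 0x$sta & 0xf )) -eq 0 ]; then
    echo "clean"
  else
    echo "errors: $sta"
  fi
}
```

Calling this in a loop, as root, gives a crude soak test: any output other than "clean" means the link is not as healthy as it seems.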

Alternatively, the output of lspci -vv can be used to spot an AER condition quickly. For example, a bridge not being happy with some packets sent its way:

# lspci -vv

[ ... ]

00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07) (prog-if 00 [Normal decode])
[ ... ]
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot+

[ ... ]
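Going through the full lspci -vv output by eye is tedious, so the DevSta lines can be filtered with awk. This is only a sketch: it assumes the output format shown above (device header lines starting at the first column, flags suffixed with + or -), and devsta_errors is a made-up name.

```shell
#!/bin/sh
# Read `lspci -vv` output on stdin and print each device whose DevSta line
# has an error flag set (a "+" suffix), together with those flags.
# devsta_errors is a hypothetical helper name. NonFatalErr is matched too,
# on the assumption that some lspci versions use it instead of UncorrErr.
devsta_errors() {
  awk '
    /^[0-9a-f]/ { dev = $1 }     # device header line, e.g. "00:01.0 PCI bridge: ..."
    /DevSta:/ {
      flags = ""
      for (i = 2; i <= NF; i++)
        if ($i ~ /^(CorrErr|UncorrErr|NonFatalErr|FatalErr|UnsuppReq)\+$/)
          flags = flags " " $i
      if (flags != "") print dev ":" flags
    }
  '
}
```

It is used as lspci -vv | devsta_errors, run as root. For the bridge shown above, it would print 00:01.0: UncorrErr+ UnsuppReq+.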

Identifying what went wrong

AER-capable endpoints are very likely to have related capability registers. These can be polled, in order to figure out the nature of the errors. For example, to periodically poll and reset the Correctable Error Status Register, this little bash script can be used (note that the bus positions of the devices it polls are hardcoded in the for loop):

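#!/bin/bash
# Poll and reset the AER Correctable Error Status register (offset 0x10
# in the AER extended capability) of a hardcoded list of devices.
clear

while true ; do
  printf '\033[H'    # Move the cursor to the top left corner

  for DEVICE in 00:1c.6 02:00.0 04:00.0 05:00.0 ; do
    echo "$DEVICE: $(setpci -s "$DEVICE" ECAP_AER+10.l)"
    # Write 1s to clear the correctable error bits
    setpci -s "$DEVICE" ECAP_AER+10.l=31c1
  done

  sleep 0.1
done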
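The values that the polling script prints are bit masks, where each bit stands for a kind of correctable error. The sketch below names the six bits that the 31c1 mask covers, following the Correctable Error Status Register layout in the PCIe spec. The name decode_cesta is made up, and bits defined outside this mask are ignored.

```shell
#!/bin/sh
# Name the bits set in an AER Correctable Error Status value, given as the
# hex dword printed by setpci (e.g. 00001040).
# decode_cesta is a hypothetical helper name. Only the six bits covered by
# the 31c1 mask used in the polling script are decoded.
decode_cesta() {
  val=$(( 0x$1 ))
  out=""
  [ $(( val & 0x0001 )) -ne 0 ] && out="$out, Receiver Error"
  [ $(( val & 0x0040 )) -ne 0 ] && out="$out, Bad TLP"
  [ $(( val & 0x0080 )) -ne 0 ] && out="$out, Bad DLLP"
  [ $(( val & 0x0100 )) -ne 0 ] && out="$out, REPLAY_NUM Rollover"
  [ $(( val & 0x1000 )) -ne 0 ] && out="$out, Replay Timer Timeout"
  [ $(( val & 0x2000 )) -ne 0 ] && out="$out, Advisory Non-Fatal Error"
  out=${out#, }
  echo "${out:-none}"
}

decode_cesta 00001040   # prints: Bad TLP, Replay Timer Timeout
```

Bad TLP and Replay Timer Timeout are the ones most directly related to retransmits, so these are the bits to watch for when a link is suspected of barely working.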

Some general notes

setpci writes directly to the PCIe peripheral’s configuration space. A typo can be as harmful as any other mistake made as root. Note that almost all peripherals, including disk controllers, are connected to the PCIe bus somehow.
The truth is that all these 0x prefixes are redundant: setpci assumes hex values anyhow.
When setpci answers “Capability 0010 not found”, it doesn’t necessarily mean that the PCI Express capability structure doesn’t exist on some device. It can also mean that no device was matched, or that you don’t have permissions for the relevant operation.

Reader Comments

Nice!

I have started to look at PCIe monitoring of my systems and am wondering if you have thoughts on higher level monitoring decisions that can be made? Any publicly available discussions of best practices?

This seems like an area that is under-utilized and under-discussed.

Written By Baruch on August 1st, 2011 @ 05:44

The truth is I stopped caring as soon as I saw there are no errors whatsoever. I suppose you could either try to look for testing equipment (my guess: It won’t be cheap) or try to develop a sniffer with an FPGA (which shouldn’t be all that difficult).

Written By eli on August 1st, 2011 @ 10:59

Questions & Comments

Since the comment section of similar posts tends to turn into a Q & A session, I’ve taken that to a separate place. So if you’d like to discuss something with me, please post questions and comments here instead. No registration is required.

This comment section is closed.

Written By eli on April 25th, 2012 @ 08:47
