ASPM makes Spartan-6's PCIe core miss TLP packets
The fatal error
Let's break the bad news: Spartan-6's PCIe core may drop TLP packets sporadically when ASPM (Active State Power Management) is enabled. That means that any TLP given to the core for transmission can silently disappear, as if it had never been submitted. I also suspect that the problem exists in the opposite direction.
Hardware involved: a Spartan-6 xc6slx45t-fgg484-3-es (evaluation sample version) on an SP605 evaluation board, mounted on a Gigabyte G31M-ES2L motherboard with the Intel G33 chipset and an E5700 3.0 GHz processor.
The fairly good news is that the core's cfg_dstatus[2] ( = fatal error detected) will go high as a result of dropping TLPs. Or at least it did so in my case. So it looks like monitoring this signal, and doing something loud if it goes to '1', is enough to at least know whether the core is doing its job or not.
Let me spell it out: If you're designing with Xilinx' PCIe core, you should verify that cfg_dstatus[2] stays '0', and if it goes high, you should treat the PCIe endpoint as completely unreliable.
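By the way, the same condition is visible from the host side too: cfg_dstatus mirrors the endpoint's PCI Express Device Status register, so a Linux driver can poll the Fatal Error Detected bit there. A minimal sketch, assuming a kernel recent enough to supply pcie_capability_read_word(); the fatal_error_detected() helper name is mine, just for illustration:

#include <linux/pci.h>

/* Illustrative helper: returns nonzero if the endpoint reports a detected
 * fatal error -- the same condition reflected by cfg_dstatus[2] in the core */
static int fatal_error_detected(struct pci_dev *pdev)
{
	u16 devsta = 0;

	pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &devsta);

	if (devsta & PCI_EXP_DEVSTA_FED) {
		dev_err(&pdev->dev,
			"PCIe fatal error detected; treating endpoint as unreliable\n");
		return 1;
	}

	return 0;
}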
How to know if ASPM is enabled
On a Linux box, become root and run lspci -vv. The output will include all devices, but the relevant part will look something like this:
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
	Subsystem: Xilinx Corporation Generic FPGA core
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 4 bytes
	Interrupt: pin ? routed to IRQ 44
	Region 0: Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
		Address: 00000000fee0300c  Data: 4181
	Capabilities: [58] Express Endpoint IRQ 0
		Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
		Device: Latency L0s unlimited, L1 unlimited
		Device: AtnBtn- AtnInd- PwrInd-
		Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
		Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
		Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
		Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
		Link: Latency L0s unlimited, L1 unlimited
		Link: ASPM L0s Enabled RCB 64 bytes CommClk- ExtSynch-
		Link: Speed 2.5Gb/s, Width x1
There we have it: I set up the core with an unlimited acceptable L0s latency, the BIOS configured the device accordingly, and this ended up with ASPM enabled.
What we really want is the output to end with something like:
Link: Latency L0s unlimited, L1 unlimited
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1
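And if you'd rather have the driver itself check whether ASPM was enabled on the link, rather than eyeballing lspci, the relevant bits sit in the Link Control register. Again, a sketch only, with aspm_enabled() being a made-up helper name:

#include <linux/pci.h>

/* Illustrative helper: nonzero if L0s and/or L1 ASPM is enabled on the link */
static int aspm_enabled(struct pci_dev *pdev)
{
	u16 lnkctl = 0;

	pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);

	return lnkctl & PCI_EXP_LNKCTL_ASPMC; /* bits 1:0: L0s and L1 enables */
}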
The elegant solution
The really good news is that there is a simple solution: Disable ASPM. In other words, program the link partners never to enter the L0s or L1 power saving states. In a Linux kernel driver, it's pretty simple:
#include <linux/pci-aspm.h>

pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S |
		       PCIE_LINK_STATE_L1 |
		       PCIE_LINK_STATE_CLKPM);
This is something I would do without thinking twice for any device based upon Xilinx’ PCIe core. Actually, I would do this for any device for which power saving is irrelevant.
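For completeness, this is roughly where the call would go in the driver's probe function. Just a sketch: mydevice_probe() and the skeleton around it are made up, and on newer kernels pci_disable_link_state() is declared in linux/pci.h, so the pci-aspm.h include may not be needed (or even present):

#include <linux/pci.h>
#include <linux/pci-aspm.h>

/* Hypothetical probe function, shown only to place the ASPM call in context */
static int mydevice_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int rc;

	rc = pci_enable_device(pdev);
	if (rc)
		return rc;

	/* Keep the link in L0 at all times: no L0s, no L1, no clock PM */
	pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S |
			       PCIE_LINK_STATE_L1 |
			       PCIE_LINK_STATE_CLKPM);

	/* ... BAR mapping, interrupts and the rest of the probe go here ... */

	return 0;
}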
The maybe-working solution
In theory, the kernel can run with different ASPM policies, one of which is "powersave". If it runs with the "performance" policy, all transitions to L0s are disabled, and all should be well. In practice, it looks like the kernel community is pushing towards allowing L0s even under the performance policy.
The shaky workaround
When some software wants to allow L0s, it must check whether the switching latency from L0s back to L0 (that is, from napping to awake) is one the device can tolerate. The device announces its maximal acceptable latency in the PCI Express Capability Structure. By setting the acceptable L0s latency limit to the shortest value allowed (64 ns), one can hope that the hardware will not be able to meet this requirement, and hence give up on using ASPM. This trick happened to work on my own motherboard, but another motherboard may well meet the 64 ns requirement and enable ASPM anyhow. So this isn't really a solution.
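If you want to see which acceptable L0s latency the endpoint actually ended up advertising, it can be read back from the Device Capabilities register. Another illustrative sketch; report_l0s_latency() is a made-up name, and the decoding table follows the PCIe spec's encoding of bits 8:6:

#include <linux/pci.h>

/* Illustrative helper: print the endpoint's advertised acceptable L0s latency */
static void report_l0s_latency(struct pci_dev *pdev)
{
	/* Encoding of the Endpoint L0s Acceptable Latency field (maximum) */
	static const char * const l0s_latency[] = {
		"64ns", "128ns", "256ns", "512ns",
		"1us", "2us", "4us", "unlimited"
	};
	u32 devcap = 0;

	pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP, &devcap);

	dev_info(&pdev->dev, "Acceptable L0s latency: %s\n",
		 l0s_latency[(devcap & PCI_EXP_DEVCAP_L0S) >> 6]);
}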
Anyhow, the success of this method will yield an lspci -vv output with something like
Capabilities: [58] Express Endpoint IRQ 0
	Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
	Device: Latency L0s <64ns, L1 <1us
	Device: AtnBtn- AtnInd- PwrInd-
	Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
	Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
	Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
	Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
	Link: Latency L0s unlimited, L1 unlimited
	Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
	Link: Speed 2.5Gb/s, Width x1
How I know it isn’t my own bug
The transitions from L0 to L0s and back throttle the data flow through the PCIe core, so maybe these on-and-offs exposed a bug in my own HDL code’s data flow? Why do I blame Xilinx?
The answer was found in the dbg_* debug lines supplied from within the PCIe core. These lines go high whenever something bad happens in the core's lower layers. Running without ASPM, these lines stayed zero. When ASPM was enabled, and in conjunction with packet drops, the following lines were asserted:
- dbg_reg_detected_fatal: Well, I knew this already. A fatal error was detected.
- dbg_reg_detected_correctable: A correctable error was detected. Nice, but I really don’t care.
- dbg_rply_timeout_status: The replay timer expired: A TLP packet was sent, but no acknowledgement was received. That indicates that things aren't perfect, but as long as the packet was retransmitted, this doesn't indicate a user-visible issue.
- dbg_dl_protocol_status: Ayeee. This means that an out of range ACK or NAK was received. In other words, the link partners are not on the same page regarding which packets are waiting for acknowledgement.
The last bullet is our smoking gun: It indicates that the PCIe link protocol has been violated. There is nothing the application HDL code can do to make this happen. The last two bullets point to a problem in the domain of a TLP being lost, retransmitted, and something going wrong with its acknowledgement. Not a sign saying "a packet was lost" explicitly, but as close as one gets to that, I suppose.
Update: My attention was drawn to some interesting Xilinx Answer Records in a comment below. Answer record #33871 mentions LL_REPLAY_TIMEOUT as the parameter to fix in order to solve a fatal error condition, but says nothing about packet dropping. It looks like this issue has been fixed in the official PCIe wrapper lately. This leaves me wondering whether people didn't notice they lost packets, or whether Xilinx decided not to admit it too loudly.
Reader Comments
Eli,
Thanks for your posting; we're experiencing problems with the S6 too, and after reviewing your post it helped to focus in on an issue which may or may not be relevant to yours.
The reason for posting here is that the errors we found were spookily similar to yours, in that the core was reporting EXACTLY the same correctable and fatal errors that you reported, but the cause was different.
Our board previously operated on lots of motherboards with no issues, but on some customer systems deploying large SuperMicro Xeon motherboards, a problem appeared when larger (256) payload sizes were used.
If you look through the Xilinx site, you find that there is an issue with LL_REPLAY_TIMEOUT in the wrapper of their reference design, which, when changed according to AR39548, fixes the problem.
As I said, this may or may not be related to your problem, but it results in exactly the same symptoms!
Note that AR39548 does not mention this, but I believe the LL_ACK_TIMEOUT needs to be changed too.
Regards
John
Eli,
Sorry, a quick follow-up – I did not check the Xilinx website thoroughly enough.
In AR33871, the LL_ACK_TIMEOUT is mentioned, and it also relates this to the PM state, so it could be an alternative way of fixing your problem.
Thanks a lot for your comment. I’ve updated the post above.
This is an issue with the Virtex-5 Endpoint Block Plus for PCI Express core, where turning on ASPM causes the core to cycle in and out of recovery periodically.
This is a known issue with the GTPs in the Virtex-5. The issue is thoroughly documented in UG341 in the "Known Issues" section. The two issues that may come up are titled "Receive Path L0s Exit to L0" and "Transceiver Exit from Rx.L0s".
The only workaround is to disable ASPM.
This issue has been resolved in 6-Series and 7-Series.