PCI Express maximal payload size: Finding it and its impact on bandwidth
Finding the maximal payload manually
The truth is, there is no need to do this manually. lspci does the work for us. But looking into the configuration table once and for all helps demystify the issue. So here we go.
According to the PCIe spec (section 7.8), the max_payload_size the card can take is given in the PCIe Device Capabilities Register (Offset 0x04 in the PCI Express Capability structure), bits 2-0. Basically, take that three-bit field as a number, add 7 to it, and you have the log-2 of the number of bytes allowed.
Let me write this in C for clarity:
max_payload_size_capable = 1 << ( (DevCapReg & 0x07) + 7); // In bytes
The actual value used is set by the host in the Device Control Register (Offset 0x08 in the PCI Express Capability structure). It’s the same drill, but with bits 7-5 instead. So in C it would be:
max_payload_size_in_effect = 1 << ( ( (DevCtrlReg >> 5) & 0x07) + 7); // In bytes
OK, so how can we find these registers? How do we find the structure? Let’s start with dumping the hexadecimal representation of the 256-byte configuration space. Using lspci -xxx on a Linux machine we will get the dump for all devices, but we’ll look at one specific device:
# lspci -xxx
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
00: ee 10 34 12 07 04 10 00 00 00 00 ff 01 00 00 00
10: 04 f0 af fd 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 ee 10 34 12
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 00 00 00
40: 01 48 03 70 08 00 00 00 05 58 81 00 0c 30 e0 fe
50: 00 00 00 00 71 41 00 00 10 00 01 00 c2 8f 28 00
60: 10 28 00 00 11 f4 03 00 00 00 11 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The first important thing to know about lspci -xxx on a little-endian machine (x86 processors included) is that even though PCI and PCIe are big-endian on the wire, the configuration data is shown here as little-endian DWs (or 32-bit unsigned ints). So the way to look at the output is in groups of four bytes each, taking each group as a little-endian unsigned int, whose bit map matches the spec.
For example, according to the spec, bits 15-0 of the word mapped at 00h are the Vendor ID, and bits 31-16 are the Device ID. So we take the first four bytes as a little-endian 32-bit integer, and get 0x123410ee. Bits 15-0 are indeed 0x10ee, the Vendor ID of Xilinx, and bits 31-16 are 0x1234, which is the Device ID I made up for a custom device. So far so good.
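If it helps, here’s a minimal C sketch of that byte grouping. The cfg array is a hypothetical stand-in for the 256-byte dump above (only the first DW is filled in):

#include <stdio.h>
#include <stdint.h>

/* Take four consecutive bytes of configuration space as a little-endian DW */
static uint32_t read_cfg_dw(const uint8_t *cfg, unsigned int offset)
{
  return (uint32_t) cfg[offset] |
    ((uint32_t) cfg[offset + 1] << 8) |
    ((uint32_t) cfg[offset + 2] << 16) |
    ((uint32_t) cfg[offset + 3] << 24);
}

int main(void)
{
  uint8_t cfg[256] = { 0xee, 0x10, 0x34, 0x12 }; /* The rest of the dump goes here */
  uint32_t dw0 = read_cfg_dw(cfg, 0x00);

  printf("Vendor ID: 0x%04x, Device ID: 0x%04x\n",
         (unsigned) (dw0 & 0xffff), (unsigned) (dw0 >> 16)); /* 0x10ee and 0x1234 */

  return 0;
}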
Now we need to find the PCI Express Capability structure. It’s one of the structures in a linked list (would you believe that?), and it’s identified by a Cap ID of 0x10.
The pointer to the list is at bits 7-0 of the configuration word at 0x34. In our little-endian representation above, it’s simply the byte at 0x34, which says 0x40. The capabilities hence start at 0x40.
From here on, we can travel along the list of capability structures. Each starts 32-bit aligned, with the header always having the Capability ID on bits 7-0 (appears as the first byte above), and a pointer to the next structure in bits 15-8 (the second byte).
So we start at offset 0x40, finding that its Cap ID is 0x01, and that the byte at offset 0x41 tells us that the next entry is at offset 0x48. Moving on to offset 0x48 we find Cap ID 0x05 and the next entry at 0x58. The entry at 0x58 has Cap ID 0x10 (!!!), and it’s the last one (pointer to next is zero).
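Continuing the sketch above, this is roughly what that walk looks like in C (the same hypothetical cfg array is assumed):

/* Walk the capability list and return the offset of the structure with the
   requested Capability ID, or 0 if it isn't found */
static unsigned int find_capability(const uint8_t *cfg, uint8_t wanted_id)
{
  unsigned int offset = cfg[0x34]; /* Capabilities Pointer */

  while (offset != 0) {
    if (cfg[offset] == wanted_id)  /* Capability ID is the first byte */
      return offset;
    offset = cfg[offset + 1];      /* Pointer to the next entry is the second byte */
  }
  return 0;
}

/* find_capability(cfg, 0x10) returns 0x58 on the dump shown above */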
So we found our structure at 0x58. The Device Capabilities Register is hence at 0x5c (offset 0x04) and reads 0x00288fc2. The Device Control Register is at 0x60 (offset 0x08), and reads 0x00002810.
So we learn from bits 2-0 of the Device Capabilities Register (having value 2) that the device supports a max_payload_size of 512 bytes. But bits 7-5 (having value 0) of the Device Control Register tell us that the effective maximal payload is only 128 bytes.
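As a sanity check, plugging these two values into the C expressions from the beginning of this post gives the same answer (again, continuing the sketch above):

uint32_t DevCapReg  = 0x00288fc2;  /* The DW at 0x5c */
uint16_t DevCtrlReg = 0x2810;      /* Lower 16 bits of the DW at 0x60 */

int capable   = 1 << ((DevCapReg & 0x07) + 7);          /* 2 + 7 => 512 bytes */
int in_effect = 1 << (((DevCtrlReg >> 5) & 0x07) + 7);  /* 0 + 7 => 128 bytes */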
Getting the info with lspci
As I mentioned above, we didn’t really need to find the addresses by hand. lspci -v gives us, for the specific device:
# lspci -v
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
	Subsystem: Xilinx Corporation Generic FPGA core
	Flags: bus master, fast devsel, latency 0, IRQ 42
	Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
	Capabilities: [40] Power Management version 3
	Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
	Capabilities: [58] Express Endpoint IRQ 0
	Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
So the address of the PCI Express capability structure is given to us, but not its internal details (maybe some newer version of lspci shows them). And by the way, the size=128 above has nothing to do with maximal payload: It’s the size of the memory space allocated to the device by the BIOS (the BAR address space, if we’re into the terminology).
For the details, including the maximal payload, we use the lspci -vv option.
# lspci -vv
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
	Subsystem: Xilinx Corporation Generic FPGA core
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 4 bytes
	Interrupt: pin ? routed to IRQ 42
	Region 0: Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
		Address: 00000000fee0300c  Data: 4171
	Capabilities: [58] Express Endpoint IRQ 0
		Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
		Device: Latency L0s unlimited, L1 unlimited
		Device: AtnBtn- AtnInd- PwrInd-
		Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
		Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
		Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
		Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
		Link: Latency L0s unlimited, L1 unlimited
		Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
		Link: Speed 2.5Gb/s, Width x1
	Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
So there we have it, black on white: The device supports a MaxPayload of 512 bytes, but a few lines below, MaxPayload is given as 128 bytes.
Impact on performance
A 128-byte maximal payload is not good news if one wants to get the most out of the bandwidth. By the way, switches are not permitted to split packets (but the Root Complex is allowed to), so this number actually tells us how much data amortizes the fixed overhead that each TLP (Transaction Layer Packet) carries. I talk about the TLP structure in another post.
Let’s make a quick calculation: Each packet comes with a header of 3 DWs (a DW is a 32-bit word, right?) when using 32 bit addressing, and a header of 4 DWs for 64-bit addressing. Let’s be nice and assume 32-bit addressing, so the header is 3 DWs.
TLPs may optionally carry a one-DW TLP digest (ECRC), which is generally a stupid idea if you trust the switching chipsets not to mess up your data. Otherwise, the Data Link layer’s CRC should be enough. So we’ll assume no TLP digest.
The Data Link layer overhead is a bit more difficult to estimate, because it has its own housekeeping packets. But since most acknowledge and flow control packets go in the opposite direction and hence don’t interfere with a unidirectional bulk data transmission, we’ll focus on the actual data added to each TLP: It consists of a 2-byte header (partially filled with a TLP sequence number) and a 4-byte LCRC.
So the overhead, assuming a 3-DW header, is 12 bytes for the TLP header and another 6 bytes from the Data Link layer. All in all, we have 18 bytes, which takes up ~12% of the traffic when transmitted along a 128-byte payload, but only ~3.4% for a 512-byte payload.
For a 1x configuration, which has 2.5 Gbps on the wires, and an effective 2.0 Gbps (8b/10b encoding), we could dream about 250 MBytes/sec. But when each TLP carries only 128 bytes of payload, our upper limit goes down to some ~219 MBytes/sec. With 512-byte payloads it’s ~241 MBytes/sec. Does it matter at all? I suppose it depends. In benchmark testing, it’s important to know these limits, or you start thinking something is wrong, when it’s actually the packet network limiting the speed.
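Here’s the arithmetic as a small standalone C program, just to make the figures above easy to reproduce (same assumptions as in the text: a 3-DW header, no ECRC, 6 bytes of Data Link layer overhead, and 250 MBytes/sec of raw throughput for a 1x Gen1 link):

#include <stdio.h>

int main(void)
{
  const double raw = 250.0;      /* MBytes/sec: 2.5 Gbps after 8b/10b encoding */
  const int overhead = 12 + 6;   /* 3-DW TLP header + Data Link layer bytes */
  const int payloads[] = { 128, 512 };

  for (int i = 0; i < 2; i++) {
    double efficiency = (double) payloads[i] / (payloads[i] + overhead);
    printf("MaxPayload %3d bytes: %.1f%% overhead, ~%.0f MBytes/sec\n",
           payloads[i], 100.0 * (1.0 - efficiency), raw * efficiency);
  }

  return 0;
}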
Reader Comments
Hi,
In the output of the lspci -vv
There are two entries for the MaxPayload as you explained in your article, one was 512 and the second was 128. Why did you take the second one to be the correct value?
Thanks,
Karim.
Well, lspci puts the word “supported” to indicate that the 512 bytes is as much as the device can take, but 128 is what it was programmed to accept.
But the real reason I’m so sure about this, is the dissection I show in this post of the lspci -xxx output.
Hi,
I appreciate your blog, but I am a newbie and would like a good start with read/write transactions with an ALTERA device.
I did the simulation with BFM
I would really appreciate any help
Regards
Questions & Comments
Since the comment section of similar posts tends to turn into a Q & A session, I’ve taken that to a separate place. So if you’d like to discuss something with me, please post questions and comments here instead. No registration is required.
This comment section is closed.