PCI Express maximal payload size: Finding it and its impact on bandwidth

This post was written by eli on May 21, 2011
Posted Under: Linux,PCI express

Finding the maximal payload manually

The truth is, there is no need to do this manually: lspci does the work for us. But looking into the configuration table once and for all helps demystify the issue. So here we go.

According to the PCIe spec (section 7.8), the max_payload_size the card can take is given in the PCIe Device Capabilities Register (offset 0x04 in the PCI Express Capability structure), bits 2-0. Basically, take that three-bit field as a number, add 7 to it, and you have the base-2 logarithm of the number of bytes allowed.

Let me write this in C for clarity:

max_payload_size_capable = 1 << ( (DevCapReg & 0x07) + 7); // In bytes

The actual value used is set by the host in the Device Control Register (offset 0x08 in the PCI Express Capability structure). It’s the same drill, but with bits 7-5 instead. So in C it would be:

max_payload_size_in_effect = 1 << ( ( (DevCtrlReg >> 5) & 0x07) + 7); // In bytes

OK, so how can we find these registers? How do we find the structure? Let’s start with dumping the hexadecimal representation of the 256-byte configuration space. Using lspci -xxx on a Linux machine we get the dump for all devices, but we’ll look at one specific device:

# lspci -xxx
(...)

01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
00: ee 10 34 12 07 04 10 00 00 00 00 ff 01 00 00 00
10: 04 f0 af fd 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 ee 10 34 12
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 00 00 00
40: 01 48 03 70 08 00 00 00 05 58 81 00 0c 30 e0 fe
50: 00 00 00 00 71 41 00 00 10 00 01 00 c2 8f 28 00
60: 10 28 00 00 11 f4 03 00 00 00 11 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The first important thing to know about lspci -xxx is that it dumps the configuration space bytes in plain address order, and that PCI and PCIe configuration registers are little-endian. So the way to look at the output is in groups of four bytes each, taking each group as a little-endian 32-bit unsigned integer, whose bit map then matches the spec.

For example, according to the spec, bits 15-0 of the word mapped at 00h are the Vendor ID, and bits 31-16 are the Device ID. So we take the first four bytes as a little-endian 32-bit integer, and get 0x123410ee. Bits 15-0 are indeed 0x10ee, Xilinx’ Vendor ID, and bits 31-16 are 0x1234, which is the Device ID I made up for a custom device. So far so good.
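Just to make the byte shuffling completely concrete, here’s a minimal sketch that assembles that first DWord from the four bytes in the dump above (the dump[] array and the printouts are mine, for illustration only):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t dump[4] = { 0xee, 0x10, 0x34, 0x12 }; // Bytes at 0x00 above

	// Assemble a little-endian 32-bit word from the four bytes
	uint32_t dw = dump[0] | (dump[1] << 8) | (dump[2] << 16) |
		((uint32_t) dump[3] << 24);

	printf("DWord at 0x00: 0x%08x\n", dw);              // 0x123410ee
	printf("Vendor ID: 0x%04x\n", dw & 0xffff);         // 0x10ee
	printf("Device ID: 0x%04x\n", (dw >> 16) & 0xffff); // 0x1234
	return 0;
}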

Now we need to find the PCI Express Capability structure. It’s one of the structures in a linked list (would you believe that?), and it’s identified by a Cap ID of 0x10.

The pointer to the head of the list is at bits 7-0 of the configuration word at 0x34. In our little-endian representation above, that’s simply the byte at 0x34, which reads 0x40. The capabilities hence start at 0x40.

From here on, we can travel along the list of capability structures. Each starts 32-bit aligned, with a header that always has the Capability ID in bits 7-0 (the first byte above) and a pointer to the next structure in bits 15-8 (the second byte).

So we start at offset 0x40, finding that it has Cap ID 0x01, and that the byte at offset 0x41 tells us the next entry is at offset 0x48. Moving on to offset 0x48, we find Cap ID 0x05 and the next entry at 0x58. The entry at 0x58 has Cap ID 0x10 (!!!), and it’s the last one (the pointer to the next is zero).
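This walk is easy to automate, by the way. The following is a minimal sketch, assuming cfg points at a 256-byte copy of the configuration space (find_pcie_cap is a name I made up, not anything standard):

#include <stdint.h>

// Walk the capability list and return the offset of the PCI Express
// Capability structure (Cap ID 0x10), or 0 if it isn't found.
static unsigned int find_pcie_cap(const uint8_t *cfg)
{
	unsigned int pos = cfg[0x34] & 0xfc; // Head of the capability list

	while (pos) {
		uint8_t cap_id = cfg[pos];   // Bits 7-0: Capability ID
		uint8_t next = cfg[pos + 1]; // Bits 15-8: next pointer

		if (cap_id == 0x10)  // PCI Express capability
			return pos;  // 0x58 for the dump above
		pos = next & 0xfc;   // Bottom two bits are reserved
	}
	return 0; // Not found
}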

So we found our structure at 0x58. The Device Capabilities Register is hence at 0x5c (offset 0x04) and reads 0x00288fc2. The Device Control Register is at 0x60 (offset 0x08), and reads 0x00002810.

So we learn from bits 2-0 of the Device Capabilities Register (having the value 2) that the device supports a max_payload_size of 512 bytes. But bits 7-5 of the Device Control Register (having the value 0) tell us that the effective maximal payload is only 128 bytes.
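Plugging the two DWords we just read off the dump into the C expressions above confirms this:

uint32_t DevCapReg  = 0x00288fc2; // The DWord at offset 0x5c
uint32_t DevCtrlReg = 0x00002810; // The DWord at offset 0x60

int capable   = 1 << ( (DevCapReg & 0x07) + 7);          // (2+7): 1 << 9 = 512
int in_effect = 1 << ( ( (DevCtrlReg >> 5) & 0x07) + 7); // (0+7): 1 << 7 = 128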

Getting the info with lspci

As I mentioned above, we didn’t really need to find the addresses by hand. lspci -v gives us, for the specific device:

# lspci -v
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
 Subsystem: Xilinx Corporation Generic FPGA core
 Flags: bus master, fast devsel, latency 0, IRQ 42
 Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
 Capabilities: [40] Power Management version 3
 Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
 Capabilities: [58] Express Endpoint IRQ 0
 Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00

So the address of the PCI Express Capability structure is given to us, but not the internal details (maybe some newer version of lspci shows them). And by the way, the size=128 above has nothing to do with the maximal payload: It’s the size of the memory space allocated to the device by the BIOS (BAR address space, if we’re into it).
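By the way, if you’d rather do this from a program than squint at hex dumps, Linux exposes the raw configuration space in sysfs, so the manual walk above is easy to script. A minimal sketch, assuming the device sits at 0000:01:00.0 as in the lspci output (run it as root, since unprivileged reads of the config file are truncated to the first 64 bytes):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const char *path = "/sys/bus/pci/devices/0000:01:00.0/config";
	uint8_t cfg[256];
	FILE *f = fopen(path, "rb");

	if (!f || fread(cfg, 1, sizeof(cfg), f) != sizeof(cfg)) {
		fprintf(stderr, "Failed to read %s (are you root?)\n", path);
		return 1;
	}
	fclose(f);

	// Walk the capability list for Cap ID 0x10, as done by hand above
	unsigned int pos = cfg[0x34] & 0xfc;
	while (pos && cfg[pos] != 0x10)
		pos = cfg[pos + 1] & 0xfc;

	if (!pos) {
		fprintf(stderr, "No PCI Express capability found\n");
		return 1;
	}

	// Device Capabilities (offset 0x04) and Device Control (offset 0x08)
	uint32_t devcap = cfg[pos + 4] | (cfg[pos + 5] << 8) |
		(cfg[pos + 6] << 16) | ((uint32_t) cfg[pos + 7] << 24);
	uint32_t devctl = cfg[pos + 8] | (cfg[pos + 9] << 8) |
		(cfg[pos + 10] << 16) | ((uint32_t) cfg[pos + 11] << 24);

	printf("MaxPayload supported: %d bytes\n",
	       1 << ((devcap & 0x07) + 7));
	printf("MaxPayload in effect: %d bytes\n",
	       1 << (((devctl >> 5) & 0x07) + 7));
	return 0;
}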

For the details, including the maximal payload, we use the lspci -vv option.

# lspci -vv
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
 Subsystem: Xilinx Corporation Generic FPGA core
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
 Latency: 0, Cache Line Size: 4 bytes
 Interrupt: pin ? routed to IRQ 42
 Region 0: Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
 Capabilities: [40] Power Management version 3
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
 Address: 00000000fee0300c  Data: 4171
 Capabilities: [58] Express Endpoint IRQ 0
 Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
 Device: Latency L0s unlimited, L1 unlimited
 Device: AtnBtn- AtnInd- PwrInd-
 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
 Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1
 Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00

So there we have it, in black and white: The device supports a 512-byte MaxPayload, but further down MaxPayload is given as 128 bytes.

Impact on performance

A 128-byte maximal payload is not good news if one wants to get the most out of the bandwidth. By the way, switches are not permitted to split packets (but the Root Complex is allowed to), so this number actually tells us how much overhead each TLP (Transaction Layer Packet) carries relative to its payload. I talk about the TLP structure in another post.

Let’s make a quick calculation: Each packet comes with a header of 3 DWs (a DW is a 32-bit word, right?) when using 32 bit addressing, and a header of 4 DWs for 64-bit addressing. Let’s be nice and assume 32-bit addressing, so the header is 3 DWs.

TLPs may optionally carry a one-DW TLP digest (ECRC), which is generally a stupid idea if you trust the switching chipsets not to mess up your data. Otherwise, the Data Link layer’s CRC should be enough. So we’ll assume no TLP digest.

The Data Link layer overhead is a bit more difficult to estimate, because it has its own housekeeping packets. But since most acknowledge and flow control packets go in the opposite direction and hence don’t interfere with a unidirectional bulk data transmission, we’ll focus on the actual data added to each TLP: It consists of a 2-byte header (partially filled with a TLP sequence number) and a 4-byte LCRC.

So the overhead, assuming a 3-DW header, is 12 bytes for the TLP header plus another 6 bytes added by the Data Link layer. All in all, we have 18 bytes per TLP, which takes up ~12% of the link with 128-byte payloads, but only ~3.4% with 512-byte payloads.

For a 1x configuration, which has 2.5 Gbps on the wires and an effective 2.0 Gbps (8b/10b coding), we could dream about 250 MBytes/sec. But when the TLPs are 128 bytes long each, our upper limit goes down to some ~219 MBytes/sec. With 512-byte TLPs it’s ~241 MBytes/sec. Does it matter at all? I suppose it depends. In benchmark testing, it’s important to know these limits, or you start thinking something is wrong when it’s actually the packet network limiting the speed.
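The whole calculation fits in a few lines of C, in case you want to play with other payload sizes. This is a sketch under the same assumptions as above (3-DW headers, no ECRC digest, 6 bytes of Data Link layer overhead per TLP):

#include <stdio.h>

int main(void)
{
	const double overhead = 12.0 + 6.0; // 3-DW TLP header + Data Link bytes
	const double wire_mbytes = 250.0;   // 1x link: 2.5 Gbps after 8b/10b
	int payloads[] = { 128, 512 };

	for (int i = 0; i < 2; i++) {
		double p = payloads[i];

		// Fraction of the wire spent on overhead, and net data rate
		printf("%d-byte TLPs: %.1f%% overhead, ~%.1f MBytes/sec\n",
		       payloads[i], 100.0 * overhead / (p + overhead),
		       wire_mbytes * p / (p + overhead));
	}
	return 0;
}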


Reader Comments

Hi,
In the output of lspci -vv there are two entries for MaxPayload, as you explained in your article: one was 512 and the second was 128. Why did you take the second one to be the correct value?

Thanks,
Karim.

#1 
Written By karim on May 28th, 2011 @ 01:28

Well, lspci puts the word “supported” to indicate that the 512 bytes is as much as the device can take, but 128 is what it was programmed to accept.

But the real reason I’m so sure about this, is the dissection I show in this post of the lspci -xxx output.

#2 
Written By eli on May 28th, 2011 @ 01:34

Hi,

I appreciate your blog, but I am a newbie and would like a good start with read/write transactions with an Altera device.
I did the simulation with the BFM.
I would really appreciate any help

Regards

#3 
Written By hento on November 14th, 2011 @ 16:29

Questions & Comments

Since the comment section of similar posts tends to turn into a Q & A session, I’ve taken that to a separate place. So if you’d like to discuss something with me, please post questions and comments here instead. No registration is required.

This comment section is closed.

#4 
Written By eli on April 25th, 2012 @ 08:52