PCIe read completion reordering and how it reduces bandwidth efficiency
While the PCI Express standard is impressive in that it actually makes sense (well, most of the time) there is a pretty annoying thing about read requests reordering.
By the way, I talk about TLP packet formation in general in another post.
In section 2.4.1 of the PCIe spec 1.1, it says that read requests may be reordered as they travel across the switching network, and same goes for read completions associated with different read requests. The read-related packets, which must arrive in the same order they were sent, are read completions associated with the same read request, or as the spec puts it: “… must return in address order”.
This would be a good time to mention, that a read request may be larger than the maximal payload size, so obviously the completer must have a means of splitting the completion into several TLPs, which must be sent in rising address order. And if we’re at it, there’s a boundary restriction on how to cut the data in pieces, namely that the cuts are on boundaries of RCB, where RCB can either be 64 bytes or 128 bytes (if you’re not a Root Complex you may cut on 128-byte boundaries only, and if you are a Root Complex, you choose 64 or 128, and configure the endpoints telling then what you chose).
And there’s a restriction on the maximal read request size, but that’s a different story.
So far the specification makes sense: Read completions will be split into several TLPs pretty often, and they have to arrive to the requester in linear address order, so these packets must not be reordered.
But what if the endpoint needs to collect a chunk of data which is larger than the maximal read request size (typically 512 bytes)? It will have to issue several read requests for that. But read requests and completions from different read requests may be reordered.
So if we want to assure that the data arrives in linear order (which is necessary when the data goes into a FIFO-like data sink) each read request can be transmitted only when the last completion TLP from the previous request arrives. Otherwise, a completion TLP from the following request may arrive before that last packet. Hence there’s a time gap of non-utilized bandwidth.
In general there is no problem having several outstanding read requests. So had read requests and read completions been strictly ordered, it would be possible to send the following read request more or less immediately after the first one, and completions would arrive continuously.
Another issue, which is less likely to bother anyone, is that if some software makes assumptions on the order at which data in some buffer is updated, this can cause rare bugs. For example, if the read completions update some dual port RAM, and there’s software reading from this buffer, it may deduce from the update of some high-addressed memory cell that the entire buffer is updated.
And a final word: Since PCIe infrastructure is pretty plain when this post is written, I will be surprised if anyone manages to catch any packet reordering taking place for real. But exactly as write reordering is commonplace in modern CPUs to increase performance, it’s only a matter of time before PCIe switching networks do the same.
Update: I got an email from someone informing me that he spotted reordering taking place on some Intel desktop chipsets. So this isn’t just theory, after all.
Questions & Comments
Since the comment section of similar posts tends to turn into a Q & A session, I’ve taken that to a separate place. So if you’d like to discuss something with me, please post questions and comments here instead. No registration is required. The comment section below is closed.