Down to the TLP: How PCI express devices talk (Part II)

This is part II of my little guide to PCI express. You may want to read Part I first.
Data Link Layer Packets
Aside from wrapping TLPs with its header (2 bytes) and adding a CRC at the end (LCRC actually, 4 bytes), the Data Link layer runs packets of its own for maintaining reliable transmission. These special packets are Data Link Layer Packets (DLLPs). Here they are, in brief:
- Ack DLLP for acknowledging successfully received TLPs.
- Nack DLLP for indicating that a TLP arrived corrupted, and that a retransmit is due. Note that there’s also a timeout mechanism in case nothing that looks like a TLP arrives.
- Flow Control DLLPs: InitFC1, InitFC2 and UpdateFC, used to announce credits, as described below.
- Power Management DLLPs.
Flow control
As mentioned before, the data link layer has a Flow Control (FC) mechanism, which makes sure that a TLP is transmitted only when the link partner has enough buffer space to accept it.
I used the term “link partner” and not “destination” deliberately. For example, when a peripheral is connected to the Root Complex through a switch, it runs its flow control mechanism against the switch and not the final destination. In other words, once the TLP is transmitted from the peripheral, it’s still subject to the flow control mechanism between the switch and the Root Complex. If there are more switches on the way, each leg has its own flow control.
The mechanism is not the simplest, and its description in the spec will give you goosebumps. So I’ll try to put it fairly clearly.
The flow control mechanism runs independent accounting for 6 (six!) distinct buffer consumers:
- Posted Requests TLP’s headers
- Posted Requests TLP’s data
- Non-Posted Requests TLP’s headers
- Non-Posted Requests TLP’s data
- Completion TLP’s headers
- Completion TLP’s data
These are the six credit types.
The accounting is done in flow control units, which correspond to 4 DWs of traffic (16 bytes), always rounded up to the nearest integer. Since headers are always 3 or 4 DWs in length, every TLP transmitted consumes one unit from the respective header credit. When data is transmitted, the number of consumed units is the number of data DWs in the TLP, divided by four, rounded upwards. So we can imagine data buckets at the receiver of 16 bytes each, on which we are not allowed to mix data from different TLPs. Each bucket is a flow control unit.
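The unit arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rounding rule as described here; the function name and constants are made up for the example and don't come from the spec.

```python
DWS_PER_DATA_CREDIT = 4  # one data flow control unit covers 4 DWs (16 bytes)

def credits_for_tlp(payload_dws):
    """Return (header_units, data_units) consumed by a single TLP.

    payload_dws is the number of data DWs in the TLP
    (0 for a TLP without data, e.g. a Memory Read Request).
    """
    if payload_dws < 0:
        raise ValueError("payload_dws must be non-negative")
    header_units = 1  # headers are 3 or 4 DWs, so always exactly one unit
    # Ceiling division: a partially filled 16-byte bucket still costs a unit
    data_units = -(-payload_dws // DWS_PER_DATA_CREDIT)
    return header_units, data_units

print(credits_for_tlp(0))   # read request, no payload
print(credits_for_tlp(1))   # single-DW write
print(credits_for_tlp(32))  # 128-byte write
print(credits_for_tlp(33))  # one DW more spills into a ninth bucket
```

Note that a single-DW write costs a full data unit, just like a four-DW write: buckets are never shared between TLPs.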
Now let’s imagine that there’s a doorkeeper at the transmitter, which counts the total number of flow control units consumed since the link establishment, separately for each credit type. This is six numbers to keep track of. This doorkeeper also has the information about the maximum number each of these credit types is allowed to reach. If a certain TLP for transmission would make any of these counted units exceed its limit, it’s not allowed through. Another TLP may be transmitted instead (subject to reordering rules) or the doorkeeper simply waits for the limit to rise.
This is the way the flow control works. When the link is established, both sides exchange their initial limits. As each receiver processes incoming packets, it updates the limits for its link partner, so it can use the buffer space released. UpdateFC DLLPs are sent periodically to announce the new credit limits.
Well, I overlooked a small detail: Since we’re counting the total number of units since the link started, there’s always a potential for overflow. The PCIe standard allocates a certain number of bits for each credit type counter and its limit (8 bits for header credits, 12 bits for data credits), knowing that they will overflow pretty soon. This overflow is worked around by making the comparison between each counter and its limit with straightforward modulo arithmetic. So given some restrictions on not setting the limit too high above the counter, the flow control mechanism implements the doorkeeper described above.
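Here is a minimal Python sketch of that doorkeeper for a single credit type, including the wraparound. The class and method names are hypothetical. The half-range comparison is my reading of the gating rule in the spec (the limit may be ahead of the counter by at most half the counter's range), so treat it as a sketch rather than a reference implementation.

```python
class CreditGate:
    """Transmitter-side accounting for ONE credit type (hypothetical helper)."""

    def __init__(self, field_bits, initial_limit):
        self.modulus = 1 << field_bits  # 2**8 for header credits, 2**12 for data
        self.consumed = 0               # units consumed since link-up, wraps around
        self.limit = initial_limit % self.modulus

    def may_send(self, units):
        # How far the limit is ahead of the counter after this TLP, modulo the
        # counter range. A result above half the range means "limit exceeded".
        headroom = (self.limit - (self.consumed + units)) % self.modulus
        return headroom <= self.modulus // 2

    def send(self, units):
        """Consume units if allowed; return True if the TLP went through."""
        if not self.may_send(units):
            return False
        self.consumed = (self.consumed + units) % self.modulus
        return True

    def update_limit(self, new_limit):
        """Called when an UpdateFC announces a new limit."""
        self.limit = new_limit % self.modulus


gate = CreditGate(field_bits=8, initial_limit=100)  # header credits
sent = sum(gate.send(1) for _ in range(150))
print(sent, gate.may_send(1))   # only 100 TLPs pass, then the gate closes
gate.update_limit(200)          # receiver released 100 units
print(gate.may_send(1))
```

A full transmitter would hold six such gates, and let a TLP through only if both its header gate and its data gate agree.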
Bus entities are allowed to announce an infinite credit limit for any or all of the six credit types, meaning that flow control for that specific credit type is disabled. As a matter of fact, endpoints (as opposed to switches and the Root Complex) must advertise an infinite credit for completion headers and data. In other words, an endpoint can’t refuse to accept a completion TLP based upon flow control. So the Requester of a non-posted transaction must take responsibility for being able to accept the completion by verifying that it has enough buffer space when making the request. This also applies to root complexes not allowing peer-to-peer transactions.
Virtual channels
In part I of this guide, I marked the TC fields in the example TLPs green, saying that those fields are almost always zero. TC stands for Traffic Class and is an identifier used to create Virtual Channels. These Virtual Channels are merely separate sets of data buffers having separate flow control credits and counters. So by choosing a TC other than zero (and setting up the bus entities accordingly) one can have TLPs being subject to independent flow control systems, preventing TLPs belonging to one channel from blocking the traffic of TLPs belonging to another.
The mapping from TC’s to Virtual Channels is done by software for each bus entity. Anyhow, the real-life PCIe elements I’ve seen so far support only one Virtual Channel, VC0, and hence only TC0 is used, which is the minimum required by spec. So unless some special application requires this, TC will remain zero in all TLPs, and this whole issue can be disregarded.
Packet reordering
One of the issues that comes to mind in a packet network, is to what extent the TLPs may arrive in an order different from how they were sent. The Internet Protocol (IP, as in TCP/IP) for example, allows any packet reshuffling on the way. The PCIe specification allows a certain extent of TLP reordering, and in fact in some cases reordering is mandatory to avoid deadlocks.
Fortunately, the legacy PCI compatibility concern was taken into account in this issue as well, unless the “relaxed ordering” bit is set in the TLP, which it rarely is. This is one of the bits in the Attr field, marked green in the TLP examples in part I of this guide. So all in all, one can trust that things will work as if there was a good old bus we were talking with. Those of us who write to a few registers, and then trigger an event by writing to another one, can go on doing it. I turn off the BAR’s Prefetch bit to be on the safe side, even though there’s nothing to imply that it has anything to do with writes.
The spec defines reordering rules in full detail, but it’s not easy to get the bottom line. So I’ll mention a few results of those rules. Everything here assumes that the relaxed ordering bit is cleared in all transactions. I’m also ignoring I/O space completely (why use it?):
- Posted writes and MSI’s arrive in the order they were sent. Now, all memory writes are posted, and MSIs are in fact (posted) memory writes. So we know for sure that memory writes are executed in order, and that if we issued an MSI after filling a buffer (writes…) it will arrive after the buffer was actually written to.
- A read request will never arrive before a write request or MSI sent before it. As a matter of fact, performing a Read Request is a safe way to wait for a write to complete.
- Write requests may very well come before read requests sent before them. This mechanism prevents deadlock in certain exotic scenarios. Don’t write to a certain memory area while waiting for the read completion to come in.
- Read completions for a certain request (i.e. with the same Tag and Requester ID) arrive in the order they were sent (so they arrive in order with rising addresses). Read completions of different requests may be reordered (but who cares).
Other than that, anything can change order of arrival, including read requests which may be reordered among themselves and with read completions.
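For those who prefer a lookup table, the summary above can be written down as a small Python sketch. It encodes only the cases spelled out in this list (relaxed ordering cleared, no I/O space). The names are made up for the example, and pairs the list doesn't address (writes versus completions) deliberately return None, meaning: go check the ordering table in the spec.

```python
WRITE, READ, COMPLETION = "write", "read", "completion"  # MSIs count as writes

# (later, earlier) -> may the TLP sent later overtake the one sent earlier?
_MAY_PASS = {
    (WRITE, WRITE): False,      # posted writes and MSIs stay in order
    (READ, WRITE): False,       # a read never overtakes an earlier write
    (WRITE, READ): True,        # a write may overtake an earlier read
    (READ, READ): True,         # reads may be reordered among themselves
    (READ, COMPLETION): True,   # ... and with read completions
    (COMPLETION, READ): True,
}

def may_pass(later, earlier, same_request=False):
    """True/False per the summary above, None if the summary doesn't say."""
    if later == COMPLETION and earlier == COMPLETION:
        # Completions of one request stay in order; different requests may not
        return not same_request
    return _MAY_PASS.get((later, earlier))

print(may_pass(READ, WRITE))    # reading back is a safe way to flush writes
print(may_pass(WRITE, READ))    # don't write where a pending read points
print(may_pass(COMPLETION, COMPLETION, same_request=True))
```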
To relieve any paranoia about an interrupt message arriving before the write operations that preceded it, section 2.2.7 in the spec spells it out:
The Request format used for MSI/MSI-X transactions is identical to the Memory Write Request format defined above, and MSI/MSI-X Requests are indistinguishable from memory writes with regard to ordering, Flow Control, and data integrity.
Zero-length read request
As just mentioned, reading from a bus entity after writing to it, is a safe way to wait for the write operation to finish for real. But why read anything, if we’re not interested in the data? So they made up a zero-length request, which reads nothing. All four Byte Enables are assigned zeroes, meaning nothing is read. As for the completion, section 2.2.5 in the spec says:
If a Read Request of 1 DW specifies that no bytes are enabled to be read (1st DW BE[3:0] field = 0000b), the corresponding Completion must specify a Length of 1 DW, and include a data payload of 1 DW
So we have one DW of rubbish data in the completion. That’s fair enough.
Payload sizes and boundaries
Every TLP carrying data must limit the number of payload data DWs to Max_Payload_Size, which is a number allocated during configuration (typically 32, 64 or 128 DWs, i.e. 128, 256 or 512 bytes). This number applies only to payloads, and not to the Length field itself: Memory Read Requests are not restricted in length by Max_Payload_Size (per spec 2.2.2), but are restricted by Max_Read_Request_Size (per spec 2.2.7).
So a Memory Read Request may ask for more data than is allowed in one TLP, and hence multiple TLP completions are inevitable.
Regardless of the Max_Payload_Size restrictions, completions of (memory) read requests may be split into several completion TLPs. The cuts must be in addresses aligned by RCB bytes (Read Completion Boundary, 128 bytes, for Root Complex possibly 64) per spec 2.3.11. If the Request doesn’t cross such an alignment boundary, only a single Completion TLP is allowed. Multiple Memory Read Completions for a single Read Request must return data in increasing address order (which will be kept by the switching network).
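To make the RCB rule concrete, here is a small Python sketch that lists the addresses at which a completer is allowed to cut the completion of a given read request. The function name is hypothetical, and the sketch ignores Max_Payload_Size, which may force additional cuts (still on these boundaries).

```python
def allowed_completion_cuts(address, length_bytes, rcb=128):
    """Addresses strictly inside the request where a completion may be split."""
    end = address + length_bytes
    first = (address // rcb + 1) * rcb  # first RCB boundary above the start
    return list(range(first, end, rcb))

# A 256-byte read starting at 0x1010 crosses two 128-byte boundaries
print([hex(a) for a in allowed_completion_cuts(0x1010, 256)])

# A 64-byte read within one RCB block: no cuts, a single completion TLP
print(allowed_completion_cuts(0x1000, 64))
```

An empty list means the request doesn't cross an RCB boundary, so the data must come back in exactly one completion.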
And a last remark, citing the spec 2.2.7: Requests must not specify an Address/Length combination which causes a Memory Space access to cross a 4-KB boundary.
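In practice this means that whoever issues the requests (typically DMA logic) has to chop a large transfer into pieces. A minimal Python sketch of that chopping follows, assuming byte-granular addresses and a Max_Read_Request_Size of 512 bytes. Both are assumptions for the example, as is the function name; real hardware also has to deal with DW alignment and byte enables.

```python
def split_read_requests(address, length_bytes, max_read_request=512):
    """Split a read into (address, length) requests that respect
    Max_Read_Request_Size and never cross a 4 KB boundary."""
    requests = []
    while length_bytes > 0:
        to_boundary = 4096 - (address % 4096)  # bytes left until the next 4 KB line
        chunk = min(length_bytes, max_read_request, to_boundary)
        requests.append((address, chunk))
        address += chunk
        length_bytes -= chunk
    return requests

# 768 bytes starting 256 bytes below a 4 KB boundary
for addr, size in split_read_requests(0x0F00, 768):
    print(hex(addr), size)
```

The first request is cut short at the boundary, and the remainder goes out as a request of its own.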
That’s it. I hope reading through the PCI Express specification will be easier now. There’s still a lot to read…
Questions & Comments
Since this post’s comment section has turned into a Q & A session, I’ve taken that to a separate place. So if you’d like to discuss something with me, please post questions and comments here instead. No registration is required. The comment section below is closed.
Reader Comments
many thanks for the brief and clear PCIe intro!
“For example, when a peripheral is connected to the Root Complex through a switch, it runs its flow control mechanism against the switch and not the final destination. In other words, once the TLP is transmitted from the peripheral, it’s still subject to the flow control mechanism between the switch and the Root Complex. If there are more switches on the way, each leg has its own flow control.”
Does this imply that the switches in middle will buffer the packets ?
It implies that they may do so, and actually most likely will.
I do not understand about why bus mastering is needed in PCIE. if every PCIE device has a separate collision domain ( just like the network switch ) and the switch is able to buffer the TLP, why does a PCIE device need to hold the bus before issuing a TLP packet ?
Bus mastering is indeed an old-fashioned term, sometimes used interchangeably with DMA. A PCIe device doesn’t “hold the bus” when transferring packets to its peers or root complex.
For PCIe devices, when the host sets the bus master bit, all it says is that the device is allowed to send requests on the bus (as opposed to respond to them only).
Thank you for your prompt reply. PCIE devices can communicate with each other through TLP if they are attached to different switches, or DLP if they are in the same switch. But can PCIE devices communicate with PCI devices directly (i.e. without going through the root complex)?
I wouldn’t be so 100% sure about your first assumption. Even though the spec says a word or two about peer-to-peer packet routing, I wouldn’t be surprised if this or that specific hardware didn’t support it.
As for PCIe-PCI direct communication, I suppose that the bridge between the two couldn’t care less whether it’s a root complex or another device sending it packets. Either way, there’s no reason I can see why an incoming write packet wouldn’t turn into a write burst on the PCI side.
Reads may be a bit trickier, since a not-so-clever bridge might send all completions to the bus ID 0x0000 (root complex) automatically. I haven’t looked into the spec about this (and even if I had, I wouldn’t be sure what’s really implemented).
So the bottom line is that I would put my money on that writing on the PCI bus will work, writes from PCI to PCIe will work if the fabric is gracious enough to support peer-to-peer routing, and read completions will reach destination if the hardware is nicely done.
Or: I wouldn’t put my money on that. ;)
If PCI devices can communicate with PCIE devices, they have to share the same kind of packet structure, don’t they? But since PCI is not a switched protocol, I imagine it does not have TLP headers. How do they communicate then? Doesn’t there have to be a device that unwraps the PCIE TLP packet and repackets it into whatever format PCI devices are able to recognize?
The device you’re talking about exists and it’s called a bridge. PCI is a plain address/data bus, so of course there aren’t any packets to talk about, but the bridge translates PCIe packets into PCI bus transactions (bursts as necessary) and vice versa. The PCIe spec has gone a long way to allow PCIe devices to work on PCI buses through a bridge, and the other way around. As a matter of fact, most recent motherboards’ root bus is PCIe, and the PCI cards are connected to it through a bridge.
If I’m skeptical about peer-to-peer communication, it’s only because I assume that real-life PC hardware is just good enough to support what is needed to make the computer appear to work correctly. If such hardware doesn’t conform to spec on those spots not used by a common computer, nobody really cares.
It would have been better if Interrupts (Legacy Interrupt and MSI interrupts) discussed in detail…..
If possible, could you please post the same on your blog with packet diagram so it would be easy for our understanding on the same…..
Thank you for the valuable information…….
Regards,
Murali Krishna K.
As mentioned in part I, MSI interrupts are simply writes to certain addresses.
As for legacy interrupts, I really wonder why anyone would find interest in them today.
Anyhow, these two issues are covered in a pretty readable manner in the standard.
Excellent tutorial on PCIe. I just want to mention that in Part I you state that PCIe is Big Endian. I am not certain that this correct, since one of the criteria for PCIe was to be backward compatible with PCI, which is Little Endian.
Thanks,
Frank
Thanks.
As for endianness, this is a delicate issue as usual. The configuration space fields are indeed set up in little endian in both PCI and PCIe. But the PCIe spec is generally written in terms of big endian, which is crucial since addresses are passed as packet information.
But this endianness of PCIe is somewhat artificial, because the data is sent serially. It’s actually the receiver which determines how to turn the stream into parallel 32-bit words. Having said that, it’s pretty clear that any sane hardware will deserialize the stream into big endian DWs, or everyone will be confused.
The idea behind the big endianness was that the first bytes of the address would be enough for devices to decide who the packet is for.
How would a UART interface work with read request re-ordering? If there is a FIFO mapped into PCIe space as well as a fifo fill level register, then couldn’t these get out of step with read re-ordering (e.g. if the software did 8 fifo reads and a fill level read, then couldn’t the fill level read actually come after only 7 fifo reads have occurred?)
First, I doubt if the packet reordering issue arises at all with 16550-like UARTs, since the host reads the data byte by byte from the FIFO, forcing the CPU to wait until the arrival of the completion for each byte. Which is horribly slow, but there’s nothing to reorder. Besides, old-school UARTs are very rarely implemented as a PCIe card anyhow. It’s usually a fake PCIe bridge within the motherboard chipset.
But to answer the underlying question: There is no problem with packets arriving from the same read request, as they always arrive in the correct order. The thing is that you can’t issue a new read request until the last packet from the previous one has arrived, if you care about read side effects. This is nothing a CPU would do anyhow in a sequential execution thread.
Uhhm, that was not accurate. Some CPUs may change the order of any bus operations under some conditions. So in the absence of CPU memory barriers, you can have two simple byte read operations in your program being executed in the reverse order, if the CPU chooses to be a @^&%$#$. But this is really diverting from your question.
Excellent article – very useful!
I have a question about EP bus mastering. When an EP issues a memory write to the root complex – can it write to any address of the CPUs address space? Just its BAR memory (i guess that wouldn’t make sense)? Does that mean if bus mastering is enabled for an EP it could potentially trash system memory – or is there some protection built in (perhaps some MMU for the PCIe RC controller?)
Short answer: Yes, a PCIe device on the bus can access any physical memory address, so it can trash the existing memory and read everything as well. Not very secure indeed.
The longer answer is that an IOMMU, which is available in recent PC hardware, will prevent such unauthorized access, if enabled. But since the awareness of this issue is low, and the IOMMU has some performance impact, it’s not commonly enabled (I don’t know about recent Windows versions, though).
Thanks for such a quick reply and for writing such a useful article. I was thinking about something like an Exar XR17V358. I was under the impression that the PCIe bus could re-order reads arising from different lines of code. If it’s the processor that could re-order then that’s really a layer above PCIe.
Thank you for the quick overview of the PCIe. I came across this as I was in search of more information about the Write from “root” to endpoint.
I just want to clarify because it would be significant to my application, when the PC/root writes to a PCIe device, there’s no credit checking involved before the data is sent. And, as you stated, the endpoint needs to accept the TLP regardless.
If this is accurate, and there is a limit on the part of the endpoint, say an async interface is holding off the transmission of data, then how does the application know if the receive buffers at the endpoint are full or not? Polling seems like it could take a lot of bandwidth.
What you seem to describe here is a case where the PCIe subsystem in the peripheral doesn’t get rid of its packets because its internal client doesn’t collect the data. This is not a very good design practice, to say the least. It’s not clear how the processor will respond when it has data to send on the bus and nobody to accept it on the other side.
It’s like trying to write to a DDR memory, and suddenly the memory holds the bus in wait states for a significant amount of time.
So if there is any application level data control, it should be done in the good old ways: Polling or interrupts. Exactly like a PCI card.
I’ve been familiarizing myself with PCIe. I spent most of the day reading the PCIe 2.0 Base Specification about virtual channels (VCs) and bus arbitration, only to find out on your blog that basically no one uses VCs! I guess this is like a lot of those crazy x86 features no one uses.
Can you comment at all on Linux’s support for virtual channels? I’ve been going through the code in linux/drivers/pci, and I cannot find code for configuring VCs. There is a function that can be called to determine if a device supports VCs (though it doesn’t report how many VCs are supported), but there doesn’t appear to be any code for setting up VCs. Since VCs can conflict/contend with each other on the bus, I thought that VC configuration might be centralized in the OS. Also, I see no methods for configuring the VC arbitration algorithms (strict priority, round robin, weighted round robin). Unless I am looking in the wrong places, this would seem to further support the argument that no one uses VCs.
The first thing that comes to my mind is: Why on earth do you need virtual channels? With a regular PC, the setting is such that the bottleneck is the bus’ own bandwidth, and not any issue with the data link credits.
So indeed, this feature is not used. Not in any real scenario I’ve seen so far. Linux included.
Thanks for the info. In defense of virtual channels, I believe the prioritized arbitration mechanisms could be useful in building real-time (aka isochronous) systems, such as those found in professional audio, automotive, and avionic systems. Still, the Intel data sheet for the 5520/5500 chip set implies that Tylersburg only supports round robin arbitration (Section 3.3.5.7). If the widely used Intel chipset won’t support more exotic mechanisms, then it is doubtful that any PCIe device would!
What virtual channels give you is separate data link layer credit accounting for each channel. So as long as the data link layer’s packets are acknowledged fast enough between each pair of link partners, the benefit of virtual channels is absolutely zero.
As PCIe endpoints are always equipped with a fair amount of buffers, data level stalling will occur only if one of the link partners can’t keep up with the speed of data. Since the obvious practice is to read or write data from RAM buffers in each side, I would conclude that virtual channels are relevant only if the PCIe bus runs faster than the RAM buffers.
Talking with the RAM buffers faster than the PCIe bus is almost always an obvious fact. If your PCIe bus is extremely fast (a la graphics card) you may need to invest some efforts on fast DDR memories. Even in this case, that easy, cheap and rather brute force method keeps the system simple and kicking.
So the bottom line is that if the system is designed in a fairly sane way, there’s no need for virtual channels, because the data layer won’t stall any communications anyhow. I can think of several situations for which a poorly designed system would take advantage of virtual channels, by the way. Which is why I first asked why you wanted virtual channels at all.
What you’re saying makes sense to me if there is only one PCIe device, but what if there are several and they share the bus? In a system with strict timing constraints, you may want to prioritize the data stream of one device over another. A full implementation of virtual channels would provide this arbitration support at the hardware level, right?
The truth is I never got down to the details of how to configure virtual channels as they cross switches. But I don’t recall anything about prioritization between the channels, just the fact that if one VC finishes its credits, another VC can go on. So I’m not sure VCs would make any difference. Unless you intend to deliberately stall one VC by not accepting the packets in the other end. Or something.
But if there is a risk of the bus getting overloaded with data, I would put my money on adding a few more lanes, rather than trying to control the congestion with virtual channels. Even if it would turn out that VCs can solve congestion problems.
As a matter of fact, a well designed link would never stall because of running out of credits, but the effective bandwidth should be limited by the raw bus rate only. So again, I can’t see why virtual channels are relevant here.
The arbitration mechanisms I am thinking of are given in section 6.3.3.2 “VC Arbitration – Arbitration Between VCs” in the PCIe Base Specification 2.0 (revision 0.9). I believe even more determinism can be achieved by using isochronous support (described in Appendix A of the PCIe Base Spec). However, Tylersburg doesn’t have isochronous support either.
I have a system with 4 GPUs connected by x16 links to a single Tylersburg IOH (pairs of GPUs are connected by a PCIe switch). I’m trying to work out the available bandwidth to each GPU under different conditions. Ideally, I’d be able to prioritize the traffic of one GPU over another (if required). Otherwise, I can resort to locking protocols implemented in software if absolutely necessary.
Interestingly, I am finding that the independent x16 ports on the single Tylersburg IOH do not appear to be entirely independent. If I send data to two GPUs, one on each port, I do not get the same bandwidth as if each GPU were being used alone. Alone, I get ~5.8GB/s. Together, ~4.3GB/s. This is troubling because the datasheet for the 5520 chipset says that each x16 link can run at 8GB/s independently. I had thought that I might be saturating QPI, but that has a max outbound bandwidth of 12.5GB/s. QPI should be able to keep up with the outbound PCIe traffic. I can’t explain this bandwidth loss if all the documentation I’ve been reading is correct.
And now I finally got the answer to why you want VCs: With those bandwidths, you’re indeed among the few ones who should consider that. Not that I’m sure that will help you, but your options are pretty much running out.
Sending data from the host is a sad story indeed. Based upon DMA transactions, it’s your GPU which requests data from the host with a non-posted TLP. The credits for that request aren’t reclaimed until the host returns with the completion, which can take a substantial amount of time (in terms of a butt-kicking rate like yours). So I can definitely see how completion credits turn into an issue in your specific scenario.
On the other hand, it could have nothing to do with credits. Since the PCIe protocol allows reordering of completions belonging to different requests, the initiator can’t start a new request before the previous one has completed, if it needs the requested data to arrive in linear order. I don’t know how your GPUs behave in this matter, but if they do, it can explain… Forget it. Two different GPUs in parallel would actually make things better in this case.
And another, completely different take on this would be the rate at which the host can supply data. If two GPUs are requesting data from different segments all the time, you might be facing a cache wipe issue. Or something having to do with how the DDR memories need to fetch rows from their data arrays.
So as you can see, all I can do is suggest things that come to my mind. And you know what? Even though it may turn out to be the solution, I would still not put my money on Virtual Channels. But that’s just a hunch.
Ah, thank you for mentioning the DDR memory. I have verified that memory is the bottleneck in my system. A single DIMM in my system can only pump 10666 MB/s (DDR3-1333). This isn’t enough to push 5.8 GB/s to two GPUs concurrently.
It’s too bad that my system cannot use DDR3-1600, which has a max throughput of 12800 MB/s. This rate would equal QPI (in one direction). PCIe would then be the bottleneck.
When performing a transfer from the host PC to a PCIe endpoint via memcpy the majority of the time the payload of the packets are 64-bits. This makes sense since the CPU is 64-bits and the largest mov operation that it can perform to the PCIe controller is 8 bytes. However, occasionally I see a packet being generated with a payload of 128-bits(16 bytes). How come this happens occasionally, what is the mechanism behind it?
I don’t really know. I’m quite surprised to know that a processor joins write operations. Even though it’s an allowed and sensible thing to do, I can’t see why anyone would bother implementing such an optimization. After all, if you have a lot of data to transfer, you should use DMA.
This way or another, if you hope someone else will answer this here (which I doubt), you need to specify the processor and chipset you’re using, and exactly to what you’ve connected your PCIe device. If this really interests you, I think that the right place is a newsgroup related to your very specific hardware.
I’ve seen it on pretty much all of the systems that I’ve used(Core 2, Core I7 Gen 1, Core I7 Gen 2) so I don’t think it’s a unique behavior. The systems that I’ve lately been using have been Dell Optiplex 980 and Dell T3500. The endpoint was just a basic Virtex 6 FPGA with the Xilinx PCIe core.
OK, so it’s not just a specific piece of hardware (and the type of PCIe device has no significance, as you surely understand yourself).
Now we know that the processors do what they’ve been allowed to do anyhow. I haven’t seen anything that says how a couple of writes to adjacent addresses should be translated into TLPs.
Which makes me wonder why this is an issue at all. Your endpoint should be able to accept any TLP length within the maximal length announced by the endpoint.
It’s not a problem I was just curious why this was happening since I wouldn’t expect the processor to perform write combining. At the end of the day the transfers will of course be handled via DMA anyway. I just thought this was an interesting and unexpected behavior, and wanted to know whether anyone else has encountered the behavior and knew why it was happening.
Thank you for your article!
I have a question related to Vendor Define Message (VDM):
- The PC will automatically receive the VDM message from the device and the PC will allocate one its memory space to store the VDM message when this message is sent from the PCIe device.
(The PCIe device sends VDM under route to root complex mode)
Is it correct?
I have no idea. I’ve never worked with these (and odds are you shouldn’t either)
Great site!
I have a question about the 4K boundary and PCIe. We used to have an application where there was CPU –> EP using conventional PCI bus. Then the CPU was replaced with a newer one which had only a PCIe interface. So we had to put a PCIe-to-PCI bridge in between.
Now DMA transfer from CPU memory to EP doesn’t work all the time. It seems that when a 4K boundary is crossed, one DW is corrupted in the middle of the data where the boundary is. Is it so that the conventional PCI does not have this 4K boundary restriction?
I don’t know about PCI, but if you cross a 4k boundary with PCIe, you’re off spec and anything can happen. So this is something you’ll have to fix, this way or another.
Hey, great site,
according to the spec and your description, endpoints must have infinite flow control credits for completion TLPs. Do you have a clue why?
Well, I didn’t write the spec, but I suppose that the logic behind the infinite credits was that if an endpoint could refuse to accept data for completions based upon credits, it could indirectly force another entity to store data in its own buffers (i.e. the completion data itself, which isn’t flowing through). So to avoid this data overflowing buffers, another mechanism would have been needed to make sure there is enough space for the completion data.
All in all, it makes sense: Ask another endpoint for data only if you have space for the answer. Don’t hide behind the flow control in this matter.