USB 3.0 bandwidth efficiency: Looking at real-life DATA bursts
Introduction
This post looks at the DATA and ACK packet exchange between a device and an xHCI USB 3.0 controller, in order to explain the actual, measured bandwidth that is observed on a BULK IN endpoint. Some BULK OUT traffic is examined at the bottom of this post, as Bonus II.
A certain level of knowledge of the Superspeed packet protocol is assumed.
Superspeed data flow
For the sake of bandwidth efficiency, the USB 3.x spec allows (and encourages) bursts of DATA packets. This is implemented by virtue of the NumP field in the ACK packets, which the receiver of DATA packets sends in response to them.
The NumP field is a number saying how many packets the receiver is capable of accepting immediately after sending the ACK packet that carries it. This gives the sender of the DATA packets a go-ahead to send several packets in response to this ACK packet. In fact, an infinite flow of DATA packets is theoretically possible if the receiver keeps sending ACK packets with a sufficiently high NumP, and there’s enough data to send.
The rules that govern the data flow are rather complicated, and involve several factors. For example, due to the inherent delay of the physical bit stream, there’s a chance that when an ACK packet arrives, its NumP field is somewhat outdated, because DATA packets have already been sent against a previous ACK’s NumP. The sender of DATA packets needs to compensate for these packets in flight.
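To make this bookkeeping concrete, here’s a minimal sketch in C of how a sender of DATA packets might track its transmit credit. This is not taken from any real implementation (all names are made up), and it ignores sequence number wrap-around, retransmissions and flow control, but it illustrates the compensation described above:

struct burst_state {
	unsigned int next_seq; /* Sequence number of the next DATA packet to send */
	unsigned int credit;   /* Number of DATA packets we may still send */
};

/* Called when an ACK arrives: ack_seq is the sequence number the ACK
   expects next, nump is its NumP field. */
static void on_ack(struct burst_state *s, unsigned int ack_seq,
		   unsigned int nump)
{
	/* DATA packets already sent against a previous ACK's NumP, which the
	   receiver hadn't seen when it computed this NumP */
	unsigned int in_flight = s->next_seq - ack_seq;

	s->credit = (nump > in_flight) ? (nump - in_flight) : 0;
}

/* Called whenever the transmitter is idle and there's payload waiting */
static int may_send_data(struct burst_state *s, unsigned int *seq_out)
{
	if (s->credit == 0)
		return 0; /* Burst exhausted: wait for the next ACK */

	s->credit--;
	*seq_out = s->next_seq++; /* Send a DATA packet with this sequence number */
	return 1;
}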
USB remains a control freak
Even though USB 3.0 introduces a more relaxed framework for data transport (compared with USB 2.0), the concept that the host has full control over the data flow remains. In particular, any data transfer on the bus is a direct result of an outstanding request to the xHCI controller.
More precisely, any USB data transfer begins with the USB device driver setting up a data buffer and a transfer descriptor (TD), which is a data structure that contains the information on the requested data transfer. The device driver passes on this request to the USB controller (xHCI) driver, which adds it to a queue that is directly accessible by the hardware USB controller (usually after some chewing and chopping, but this isn’t relevant here). The latter performs the necessary operations to fulfill the request, and eventually reports back to the xHCI driver when the request is completed (or has failed). The USB device driver is notified, and takes relevant action, for example consuming the data that arrived from an IN endpoint.
The exchange of TDs and data between the software and hardware is asynchronous. The xHCI controller allows queuing several TDs for each endpoint, and activity on the bus on behalf of each endpoint takes place only in the presence of TDs on its queue. If there are no TDs queued for a specific endpoint, no data transfer occurs on its behalf, whether the device is ready or not.
And this is the important conclusion: For a high-bandwidth application, the software should ensure that a number of TDs are queued for the endpoint all the time. Failing to do so slows down the data flow due to momentary data flow halts while no TDs are queued.
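For example, with libusb this boils down to keeping several asynchronous bulk transfers submitted at all times, and resubmitting each one from its completion callback, so the xHCI driver always has something to queue for the endpoint. The following is a minimal sketch only: the VID/PID pair, the endpoint address 0x81 and the buffer size are placeholders, and error handling is mostly omitted.

#include <stdio.h>
#include <stdlib.h>
#include <libusb-1.0/libusb.h>

#define NUM_TRANSFERS 8     /* Keep several requests queued at all times */
#define BUF_SIZE      4096
#define EP_IN         0x81  /* Placeholder BULK IN endpoint address */

static void cb(struct libusb_transfer *xfer)
{
	if (xfer->status == LIBUSB_TRANSFER_COMPLETED) {
		/* Consume xfer->buffer[0 .. xfer->actual_length - 1] here */
	}

	/* Resubmit immediately, so the endpoint's queue never runs dry */
	if (libusb_submit_transfer(xfer))
		fprintf(stderr, "Resubmission failed\n");
}

int main(void)
{
	libusb_context *ctx;
	libusb_device_handle *dev;
	int i;

	libusb_init(&ctx);

	dev = libusb_open_device_with_vid_pid(ctx, 0xabcd, 0x1234); /* Placeholder IDs */
	if (!dev || libusb_claim_interface(dev, 0))
		return 1;

	for (i = 0; i < NUM_TRANSFERS; i++) {
		struct libusb_transfer *xfer = libusb_alloc_transfer(0);
		unsigned char *buf = malloc(BUF_SIZE);

		libusb_fill_bulk_transfer(xfer, dev, EP_IN, buf, BUF_SIZE,
					  cb, NULL, 1000);
		libusb_submit_transfer(xfer);
	}

	while (1) /* The completion callbacks run from within this event loop */
		libusb_handle_events(ctx);
}

How many TDs each such transfer ends up as on the endpoint’s ring is up to the kernel; the point is merely that the hardware never sees an empty queue while there’s data pending.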
Actual behavior
These are some anecdotal tests on an Intel B150 chipset’s USB 3.0 xHCI controller (8086:a12f) and a Renesas Technology Corp. uPD720202 (1912:0015). In both cases, at least four TDs (BULK IN, 4096 bytes each) were queued for the monitored data flow before the device became ready with its data, so the hardware’s optimal behavior is observed.
This is a typical sequence for the Intel USB controller:
513.048 ACK seq=0 nump=4
0.040 DATA seq=0 len=1024
0.032 DATA seq=1 len=1024
2.080 DATA seq=2 len=1024
0.736 ACK seq=1 nump=3
1.400 DATA seq=3 len=1024
0.720 ACK seq=2 nump=2
2.160 ACK seq=3 nump=1
2.144 ACK seq=4 nump=0
2.008 ACK seq=4 nump=4
0.040 DATA seq=4 len=1024
0.032 DATA seq=5 len=1024
2.080 DATA seq=6 len=1024
0.736 ACK seq=5 nump=3
1.384 DATA seq=7 len=1024
0.736 ACK seq=6 nump=2
2.160 ACK seq=7 nump=1
2.144 ACK seq=8 nump=0
1.736 ACK seq=8 nump=4
0.040 DATA seq=8 len=1024
DATA packets are sent by the device, and ACK packets by the host. The numbers at the beginning of each line are the time difference from the previous line, in microseconds, measured inside the device’s logic. The timing for DATA is that of the internal request for a packet (in the device’s logic), not the actual transmission, and the internal queue for such requests is two entries deep, which is why two DATA packets are fired off right after the ACK packet’s arrival.
A DATA packet with 1024 bytes’ payload consists of a DPH (4 bytes start pattern + 16 bytes) and a DPP (4 bytes start pattern + 1024 bytes payload + 4 bytes CRC + 4 bytes end pattern), all in all 1056 bytes, which take 2.112 μs on wire. The theoretical efficiency limit is hence 1024/1056 ≈ 97%, or ~485 MB/s.
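For reference, this arithmetic is easy to reproduce (a trivial, standalone C snippet; the 500 MB/s figure is the 5 Gb/s line rate after 8b/10b encoding):

#include <stdio.h>

int main(void)
{
	const double us_per_byte = 1.0 / 500.0;  /* 500 MB/s of raw symbols */
	const int payload = 1024;
	const int dph = 4 + 16;                  /* Start pattern + header */
	const int dpp = 4 + payload + 4 + 4;     /* Start + payload + CRC + end */
	const int total = dph + dpp;             /* 1056 bytes */

	printf("Wire time: %.3f us\n", total * us_per_byte);       /* 2.112 us */
	printf("Efficiency: %.1f%% (%.0f MB/s)\n",
	       100.0 * payload / total, 500.0 * payload / total);  /* ~97%, ~485 */
	return 0;
}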
From the log above, it’s evident that there’s a turnaround time of ~2.85 μs from DATA to ACK, which is just ~0.74 μs beyond the time it takes to transmit the packet.
Note that the host separates the bursts for each TD: The NumP starts at 4 and counts down to 0 in the ACK packets, so exactly 4096 bytes (the size of the TD) are transmitted in a burst. The following ACK packet, which starts off a new burst with nump=4, is transmitted only ~2 μs later, indicating that it took the USB controller some time to figure out that it has more to do on the same endpoint. In theory, it could have looked ahead at the next TD and realized that there are enough TDs to continue the burst practically forever, but this optimization isn’t implemented.
It’s interesting to calculate the time during which no DATA was transmitted due to the burst’s stop and restart. The size of the gap isn’t easily calculated, as the times on the DATA packets are when they’re queued. To work around this, one can assume that the last byte of the 4th packet was sent 0.74 μs before the first ACK on its behalf was timed. The gap is hence 0.74 + 2.008 = 2.748 μs (the latter being the difference between the two ACKs for seq=4, the first concluding the burst, and the second starting a new one).
The actual efficiency is hence (4 * 2.112) / ((4 * 2.112) + 2.748) ≈ 75.4% or ~377 MB/s. The actual speed measurement was 358 MB/s. The difference is most likely attributed to momentary shortages of TDs that are observed as occasional longer gaps (seen only in extensive traffic traces).
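The same back-of-the-envelope estimate can be packaged as a small helper, which is reused informally for the other controllers below. It’s a sketch of the model used in this post (full-length DATA packets in bursts, separated by idle gaps), nothing more:

/* Estimated throughput of a repeating pattern: burst_packets DATA packets
   of maximal length (2.112 us each on wire), followed by an idle gap */
static double burst_throughput_mbs(int burst_packets, double gap_us)
{
	const double packet_us = 2.112;  /* 1056 bytes at 500 MB/s */
	double busy = burst_packets * packet_us;

	return 500.0 * busy / (busy + gap_us); /* MB/s */
}

/* burst_throughput_mbs(4, 2.748) gives ~377 MB/s (the Intel case above);
   burst_throughput_mbs(9, 2.264) gives ~446 MB/s (the ASM1142 case below) */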
The same test with the Renesas USB host controller:
4.808 ACK seq=0 nump=4
0.040 DATA seq=0 len=1024
0.032 DATA seq=1 len=1024
2.080 DATA seq=2 len=1024
0.712 ACK seq=1 nump=3
1.416 DATA seq=3 len=1024
0.704 ACK seq=2 nump=2
2.152 ACK seq=3 nump=1
2.144 ACK seq=4 nump=0
5.488 ACK seq=4 nump=4
0.040 DATA seq=4 len=1024
0.032 DATA seq=5 len=1024
2.080 DATA seq=6 len=1024
0.704 ACK seq=5 nump=3
1.448 DATA seq=7 len=1024
0.712 ACK seq=6 nump=2
2.144 ACK seq=7 nump=1
2.152 ACK seq=8 nump=0
5.552 ACK seq=8 nump=4
The DATA-to-ACK turnaround is similarly ~2.82 μs, i.e. ~0.71 μs beyond the time it takes to transmit the packet. Almost the same as the previous result.
However, the gap in the data flow, computed the same way as before, is 0.71 + 5.488 = 6.20 μs, significantly worse than with the Intel chipset.
The actual efficiency is hence (4 * 2.112) / ((4 * 2.112) + 6.20) ≈ 57.7% or ~288 MB/s. The actual speed measurement was 262 MB/s.
Larger TD buffers
Since the bandwidth efficiency took a hit because of the short bursts, the next step is to assign larger buffers to each TD, hoping that the USB controller will produce longer bursts. Hence the software prepared 31 TDs (the maximum that the Linux xHCI driver was ready to accept in advance), each with 512 KiB of data.
The Intel controller’s response:
501.944 ACK seq=0 nump=4
0.040 DATA seq=0 len=1024
0.032 DATA seq=1 len=1024
2.080 DATA seq=2 len=1024
0.736 ACK seq=1 nump=3
1.400 DATA seq=3 len=1024
0.712 ACK seq=2 nump=3
1.432 DATA seq=4 len=1024
0.728 ACK seq=3 nump=3
1.432 DATA seq=5 len=1024
0.728 ACK seq=4 nump=3
1.432 DATA seq=6 len=1024
0.736 ACK seq=5 nump=3
1.416 DATA seq=7 len=1024
0.736 ACK seq=6 nump=3
1.416 DATA seq=8 len=1024
0.736 ACK seq=7 nump=3
1.432 DATA seq=9 len=1024
0.720 ACK seq=8 nump=3
1.432 DATA seq=10 len=1024
0.728 ACK seq=9 nump=3
1.432 DATA seq=11 len=1024
[ ... ]
This goes on like a machine until the TD’s buffer is filled:
0.736 ACK seq=29 nump=3
1.432 DATA seq=31 len=1024
0.728 ACK seq=30 nump=2
2.152 ACK seq=31 nump=1
2.160 ACK seq=0 nump=0
1.736 ACK seq=0 nump=4
0.040 DATA seq=0 len=1024
0.032 DATA seq=1 len=1024
2.080 DATA seq=2 len=1024
0.736 ACK seq=1 nump=3
Even though the device announces a maximal burst length of 8, and each TD can take much more than that, Intel’s controller chooses to limit itself to a NumP of 4. Since the DATA to ACK bus turnaround is ~2.85 μs (see above), which is less than the time for transmitting two DATA packets of maximal length, this limitation has no performance impact. Note that the DATA packets are queued a significant time after the arrival of ACK packets, indicating that the device wasn’t waiting for them. This is quite expected, as there are two DATA packets in flight all the time, and ACKs arriving with nump=3, so all in all there’s always another DATA packet allowed. Intel got their timing correct here.
The measured bandwidth on this run was 471 MB/s.
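As a side note, the burst capability that the device announces lives in the bMaxBurst field of the endpoint’s SuperSpeed endpoint companion descriptor; the field is zero-based, so a burst capability of 8 packets is announced as bMaxBurst=7. It can be inspected from the host with libusb. This is a sketch only, assuming dev is a libusb_device obtained from libusb_get_device_list() and that the endpoint of interest sits on interface 0, altsetting 0:

#include <stdio.h>
#include <libusb-1.0/libusb.h>

static void print_max_burst(libusb_device *dev)
{
	struct libusb_config_descriptor *cfg;
	const struct libusb_interface_descriptor *ifd;
	int i;

	if (libusb_get_active_config_descriptor(dev, &cfg))
		return;

	ifd = &cfg->interface[0].altsetting[0];

	for (i = 0; i < ifd->bNumEndpoints; i++) {
		const struct libusb_endpoint_descriptor *ep = &ifd->endpoint[i];
		struct libusb_ss_endpoint_companion_descriptor *comp;

		if (!libusb_get_ss_endpoint_companion_descriptor(NULL, ep, &comp)) {
			/* bMaxBurst is zero-based: 7 means bursts of up to 8 packets */
			printf("EP 0x%02x: bMaxBurst=%d (bursts of up to %d packets)\n",
			       ep->bEndpointAddress, comp->bMaxBurst,
			       comp->bMaxBurst + 1);
			libusb_free_ss_endpoint_companion_descriptor(comp);
		}
	}

	libusb_free_config_descriptor(cfg);
}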
The Renesas host controller doesn’t behave as elegantly, however:
5.808 ACK seq=2 nump=8
0.040 DATA seq=2 len=1024
0.032 DATA seq=3 len=1024
2.080 DATA seq=4 len=1024
0.704 ACK seq=3 nump=7
1.416 DATA seq=5 len=1024
0.704 ACK seq=4 nump=6
1.464 DATA seq=6 len=1024
0.688 ACK seq=5 nump=5
1.464 DATA seq=7 len=1024
0.688 ACK seq=6 nump=4
1.464 DATA seq=8 len=1024
0.696 ACK seq=7 nump=3
1.464 DATA seq=9 len=1024
0.696 ACK seq=8 nump=2
2.160 ACK seq=9 nump=1
2.168 ACK seq=10 nump=7
0.040 DATA seq=10 len=1024
0.032 DATA seq=11 len=1024
2.080 DATA seq=12 len=1024
0.704 ACK seq=11 nump=6
1.432 DATA seq=13 len=1024
0.688 ACK seq=12 nump=5
1.464 DATA seq=14 len=1024
0.688 ACK seq=13 nump=4
1.464 DATA seq=15 len=1024
0.696 ACK seq=14 nump=3
1.464 DATA seq=16 len=1024
0.688 ACK seq=15 nump=2
2.152 ACK seq=16 nump=1
2.208 ACK seq=17 nump=7
0.040 DATA seq=17 len=1024
0.032 DATA seq=18 len=1024
2.080 DATA seq=19 len=1024
0.712 ACK seq=18 nump=6
1.416 DATA seq=20 len=1024
0.704 ACK seq=19 nump=5
1.448 DATA seq=21 len=1024
0.696 ACK seq=20 nump=4
1.464 DATA seq=22 len=1024
0.696 ACK seq=21 nump=3
1.464 DATA seq=23 len=1024
0.696 ACK seq=22 nump=2
2.160 ACK seq=23 nump=1
2.168 ACK seq=24 nump=6
0.040 DATA seq=24 len=1024
0.032 DATA seq=25 len=1024
2.080 DATA seq=26 len=1024
0.704 ACK seq=25 nump=5
1.432 DATA seq=27 len=1024
0.688 ACK seq=26 nump=4
1.464 DATA seq=28 len=1024
0.688 ACK seq=27 nump=3
1.464 DATA seq=29 len=1024
0.696 ACK seq=28 nump=2
The sequence doesn’t repeat itself, so the short sequence above doesn’t show all that went on. It’s not so clear what this host controller is up to. NumP is decremented, sometimes down to 1, sometimes down to 0, and then returns to a seemingly random number (8 after nump=0, but quite random after nump=1). It seems to be a combination of attempting to make bursts of 8 DATA packets (the maximal burst length announced by the device) and a bandwidth limitation between the USB controller and the host (PCIe Gen2 x 1, which doesn’t leave a lot of spare bandwidth compared with the USB 3.0 link).
The measured bandwidth was 401 MB/s, which seems to confirm that the bottleneck is the PCIe link: a Gen2 x1 link carries 5 GT/s with 8b/10b encoding, i.e. 500 MB/s of raw bandwidth, of which TLP headers and other protocol overhead take their share, leaving roughly 400 MB/s for payload.
Bonus: Trying on ASMedia ASM1142 (1b21:1242)
The ASM1142 USB 3.1 controller was also tested in the large TD buffer scenario described above. The board’s PCIe link ran at its maximum of Speed 5GT/s, Width x2 (despite having an x4 finger). I should mention that there were some occasional problems with detecting (and enumerating) the FPGA device: sometimes it took a few seconds, and sometimes it failed completely. A USB 3.0 hub I have didn’t enumerate at all. This is possibly due to this chipset’s USB 3.1 support, which may cause some confusion during the initial signaling (which is more sensitive to timing inaccuracies).
However, there were no detection problems when the device was connected before the computer’s power-up (and the USB 3.0 hub also enumerated with no problem).
This is the typical sequence:
ACK seq=24 nump=2
0.040 DATA seq=24 len=1024
0.032 DATA seq=25 len=1024
2.808 ACK seq=25 nump=2
0.040 DATA seq=26 len=1024
2.088 ACK seq=26 nump=2
0.040 DATA seq=27 len=1024
2.136 ACK seq=27 nump=2
0.040 DATA seq=28 len=1024
2.112 ACK seq=28 nump=2
0.040 DATA seq=29 len=1024
2.112 ACK seq=29 nump=2
0.040 DATA seq=30 len=1024
2.112 ACK seq=30 nump=2
0.040 DATA seq=31 len=1024
2.112 ACK seq=31 nump=2
0.040 DATA seq=0 len=1024
2.128 ACK seq=0 nump=1
2.144 ACK seq=1 nump=0
0.064 ACK seq=1 nump=2
0.040 DATA seq=1 len=1024
0.032 DATA seq=2 len=1024
2.824 ACK seq=2 nump=2
0.040 DATA seq=3 len=1024
2.080 ACK seq=3 nump=2
0.040 DATA seq=4 len=1024
2.120 ACK seq=4 nump=2
0.040 DATA seq=5 len=1024
2.112 ACK seq=5 nump=2
0.040 DATA seq=6 len=1024
2.112 ACK seq=6 nump=2
0.040 DATA seq=7 len=1024
2.120 ACK seq=7 nump=2
0.040 DATA seq=8 len=1024
2.112 ACK seq=8 nump=2
0.040 DATA seq=9 len=1024
2.120 ACK seq=9 nump=1
2.160 ACK seq=10 nump=0
0.064 ACK seq=10 nump=2
The turnaround time from DATA to ACK is ~2.85 μs, as with the other controllers. No surprise here. But the NumP is set to 2 (despite the device reporting a Max Burst length of 8). As a result, the device stops and waits for the ACK to arrive after two packets. Note that with the two other controllers, the first ACK packet arrives before the device has consumed the initial NumP allocation, so the burst isn’t stopped.
The impact isn’t significant when the transfer is long, since the ACKs come at the rate of DATA packet transmissions either way in the long run.
The less favorable issue is that the bursts are restarted after 9 DATA packets, with NumP going down to zero, and then back to 2. There is no obvious explanation for this (what could be 9 kB long?), as the transfers are significantly longer. The data flow gap is easy to calculate, because both the DATA packets before and after the gap were transmitted in response to an ACK that arrived immediately before them. Hence the gap itself is the time between these two DATA packets minus the time the DATA is transmitted, i.e. 4.376 – 2.112 = 2.264 μs.
Calculating the efficiency based upon this issue alone gives (9 * 2.112) / ((9 * 2.112) + 2.264) ≈ 89.4% or ~446 MB/s, which is pretty close to the measured result, 454 MB/s.
Bonus II: BULK OUT
Now to some similar tests in the opposite direction: Host to device, i.e. BULK OUT. These tests were made with 16 TDs queued, 64 kB each.
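On the host side, the setup is the mirror image of the BULK IN sketch shown earlier: the same loop of asynchronous transfers, just pointed at an OUT endpoint, with each callback refilling its buffer before resubmitting. A sketch of the pieces that change, with 0x01 as a placeholder endpoint address and fill_with_payload() standing in for whatever produces the data:

#define EP_OUT        0x01        /* Placeholder BULK OUT endpoint address */
#define OUT_TRANSFERS 16          /* 16 requests queued, as in these tests */
#define OUT_BUF_SIZE  (64 * 1024) /* 64 kB each */

static void out_cb(struct libusb_transfer *xfer)
{
	if (xfer->status != LIBUSB_TRANSFER_COMPLETED)
		return;

	/* Refill the buffer with new payload, then keep the queue populated */
	fill_with_payload(xfer->buffer, OUT_BUF_SIZE); /* Hypothetical helper */
	libusb_submit_transfer(xfer);
}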
The Intel controller first: Measured speed was 461 MB/s.
The DATA events shown below are timed when the DPH packet (before the payload) has completely arrived at the device, and the ACK events when the ACK packet is queued for transmission. The ~2.096 μs intervals are the DATA payload’s time on wire (the slight fluctuations are due to spread spectrum on the data clock).
Note that the DATA packets are occasionally sent with larger delays (the 7.640 and 8.160 μs entries below), which is probably the reason for the suboptimal bandwidth.
DATA seq=0 len=1024
2.096 ACK seq=1 nump=4
0.040 DATA seq=1 len=1024
2.096 ACK seq=2 nump=4
0.064 DATA seq=2 len=1024
2.096 ACK seq=3 nump=4
0.072 DATA seq=3 len=1024
2.104 ACK seq=4 nump=4
0.064 DATA seq=4 len=1024
2.096 ACK seq=5 nump=4
0.072 DATA seq=5 len=1024
2.104 ACK seq=6 nump=4
0.064 DATA seq=6 len=1024
2.096 ACK seq=7 nump=4
7.640 DATA seq=7 len=1024
2.096 ACK seq=8 nump=4
0.040 DATA seq=8 len=1024
2.088 ACK seq=9 nump=4
0.072 DATA seq=9 len=1024
2.096 ACK seq=10 nump=4
0.064 DATA seq=10 len=1024
2.104 ACK seq=11 nump=4
0.064 DATA seq=11 len=1024
2.088 ACK seq=12 nump=4
0.072 DATA seq=12 len=1024
2.104 ACK seq=13 nump=4
0.064 DATA seq=13 len=1024
2.088 ACK seq=14 nump=4
8.160 DATA seq=14 len=1024
2.096 ACK seq=15 nump=4
0.040 DATA seq=15 len=1024
Here’s an interesting thing. When kicking off a program on the host that sends data to the BULK endpoint, packets are sent even if the device is in a flow control condition due to a previous NRDY:
DATA seq=7 len=1024
2.088 bulkout_nrdy
0.040 DATA seq=8 len=1024
2.160 DATA seq=9 len=1024
2.144 DATA seq=10 len=1024
2.144 DATA seq=11 len=1024
So the controller shoots out five packets for starters. The NRDY surely arrives at the controller while it transmits the second packet, but who cares: it’s not like it has a better use for the bandwidth at that moment.
In the experiment above, the Max Burst Length was set to 8. When reducing it to 3, it’s four packets instead of five (four, not three).
And then, when the device becomes ready to receive data:
ERDY nump=1
507.232 DATA seq=7 len=1024
2.104 ACK seq=8 nump=4
2.736 DATA seq=8 len=1024
2.096 ACK seq=9 nump=4
0.048 DATA seq=9 len=1024
2.104 ACK seq=10 nump=4
0.064 DATA seq=10 len=1024
2.096 ACK seq=11 nump=4
0.072 DATA seq=11 len=1024
2.104 ACK seq=12 nump=4
0.064 DATA seq=12 len=1024
2.096 ACK seq=13 nump=4
0.072 DATA seq=13 len=1024
2.104 ACK seq=14 nump=4
0.064 DATA seq=14 len=1024
2.088 ACK seq=15 nump=4
And now the same story with Renesas’ controller. Measured speed is 360 MB/s.
It starts more or less the same, but note the slightly larger numbers next to the DATA packets: these are extra delays between the packets, i.e. gaps of idle time on the wire.
DATA seq=0 len=1024
2.088 ACK seq=1 nump=4
0.424 DATA seq=1 len=1024
2.088 ACK seq=2 nump=4
0.416 DATA seq=2 len=1024
2.104 ACK seq=3 nump=4
0.416 DATA seq=3 len=1024
2.096 ACK seq=4 nump=4
0.416 DATA seq=4 len=1024
2.096 ACK seq=5 nump=4
0.768 DATA seq=5 len=1024
2.096 ACK seq=6 nump=4
0.080 DATA seq=6 len=1024
2.104 ACK seq=7 nump=4
0.376 DATA seq=7 len=1024
2.104 ACK seq=8 nump=4
0.400 DATA seq=8 len=1024
2.096 ACK seq=9 nump=4
0.776 DATA seq=9 len=1024
This occasionally becomes worse (taken from a later segment of the same flow):
0.760 DATA seq=25 len=1024
2.104 ACK seq=26 nump=4
11.736 DATA seq=26 len=1024
2.096 ACK seq=27 nump=4
0.416 DATA seq=27 len=1024
2.088 ACK seq=28 nump=4
0.432 DATA seq=28 len=1024
2.096 ACK seq=29 nump=4
0.768 DATA seq=29 len=1024
2.088 ACK seq=30 nump=4
0.120 DATA seq=30 len=1024
2.096 ACK seq=31 nump=4
0.344 DATA seq=31 len=1024
2.104 ACK seq=0 nump=4
20.232 DATA seq=0 len=1024
2.088 ACK seq=1 nump=4
0.424 DATA seq=1 len=1024
As for starting off when the device is already in a flow control condition, it’s the same as with Intel, just two packets instead of five:
DATA seq=30 len=1024
2.104 bulkout_nrdy
0.416 DATA seq=31 len=1024
Could it be that the controller stopped sending packets because of the NRDY? Or does it generally send just two packets?
Anyhow, when the device is ready:
ERDY nump=1
9.384 DATA seq=30 len=1024
2.096 ACK seq=31 nump=4
0.744 DATA seq=31 len=1024
2.096 ACK seq=0 nump=4
0.088 DATA seq=0 len=1024
2.096 ACK seq=1 nump=4
Conclusion
It’s quite clear that there’s a difference in behavior and performance between USB 3.0 controllers. While all controllers work perfectly within the requirements of the spec, the handling of TDs and the creation of bursts differ dramatically.
Large TD buffers are definitely required for good bandwidth utilization. This was shown above at the hardware level, but it’s also true with regard to the software’s ability to keep the TD queue populated.