USB 3.0 bandwidth efficiency: Looking at real-life DATA bursts

This post was written by eli on November 9, 2019

Introduction

This post looks at the DATA and ACK packet exchange between a device and an xHCI USB 3.0 controller for the sake of explaining the actual, measured bandwidth that is observed on a BULK IN endpoint. A certain level of knowledge of the Superspeed packet protocol is assumed.

Superspeed data flow

For the sake of bandwidth efficiency, the USB 3.x spec allows (and encourages) bursts of DATA packets. This is implemented by virtue of the NumP field in the ACK packets, which the receiver of DATA packets sends in response to them.

The NumP field is a number saying how many packets the receiver is capable of accepting immediately after sending the ACK packet that carries it. This gives the sender of the DATA packets a go-ahead to send several packets in response to this ACK packet. In fact, an infinite flow of DATA packets is theoretically possible if the receiver keeps sending ACK packets with a sufficiently high NumP, and there’s enough data to send.

The rules that govern the data flow are rather complicated, and involve several factors. For example, due to the inherent delay of the physical bit stream, there’s a chance that when an ACK packet arrives, its NumP field is somewhat outdated, because DATA packets have already been sent against a previous ACK’s NumP. The sender of DATA packets needs to compensate for these gaps.
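
To make this concrete, here’s a minimal model (in C, with made-up names; the real logic lives in the device’s link layer) of how the sender of DATA packets might track its transmit credit, including the adjustment for packets that were already in flight when the ACK’s NumP was computed. Sequence numbers are 5 bits wide (0-31), as the traces further below show.

#include <stdint.h>

/* A minimal model of device-side burst accounting. Names are made up
   for illustration; this is a sketch, not the spec's exact wording. */

struct burst_state {
    uint8_t next_seq; /* Sequence number of the next DATA packet to send */
    int credit;       /* Number of DATA packets we may still send */
};

/* Called when an ACK arrives. ack_seq is the sequence number the host
   expects next, and nump is how many packets it can accept, counted
   from ack_seq. DATA packets already sent beyond ack_seq were in
   flight when the host computed NumP, so they are subtracted. */
static void on_ack(struct burst_state *s, uint8_t ack_seq, uint8_t nump)
{
    uint8_t in_flight = (s->next_seq - ack_seq) & 0x1f; /* 5-bit wrap */

    s->credit = (int)nump - (int)in_flight;
}

/* The transmit logic checks this before firing off a DATA packet ... */
static int may_send(const struct burst_state *s)
{
    return s->credit > 0;
}

/* ... and calls this for each DATA packet actually sent. */
static void on_data_sent(struct burst_state *s)
{
    s->next_seq = (s->next_seq + 1) & 0x1f;
    s->credit--;
}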

USB remains a control freak

Even though USB 3.0 introduces a more relaxed framework for data transport (compared with USB 2.0), the concept that the host has full control over the data flow remains. In particular, any data transfer on the bus is a direct result of an outstanding request to the xHCI controller.

More precisely, any USB data transfer begins with the USB device driver setting up a data buffer and a transfer descriptor (TD), a data structure that contains the information on the requested data transfer. The device driver passes this request on to the USB controller (xHCI) driver, which adds it to a queue that is directly accessible by the hardware USB controller (usually after some chewing and chopping, but that isn’t relevant here). The latter performs the necessary operations to fulfill the request, and eventually reports back to the xHCI driver when the request is completed (or has failed). The USB device driver is notified and takes the relevant action, for example consuming the data that arrived from an IN endpoint.

The exchange of TDs and data between the software and hardware is asynchronous. The xHCI controller allows queuing several TDs for each endpoint, and activity on the bus on behalf of each endpoint takes place only in the presence of TDs on its queue. If there are no TDs queued for a specific endpoint, no data transfer occurs on its behalf, whether the device is ready or not.

And this is the important conclusion: For a high-bandwidth application, the software should ensure that a number of TDs are queued for the endpoint all the time. Failing to do so slows down the data flow due to momentary data flow halts while no TDs are queued.
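
For example, this is roughly how a Linux device driver might keep the endpoint’s queue populated, using the in-kernel URB API (each URB is turned into one or more TDs by the xHCI driver). This is a simplified sketch: error handling, buffer cleanup and DMA considerations are omitted, and NUM_URBS / BUF_SIZE are arbitrary choices for illustration.

#include <linux/usb.h>
#include <linux/slab.h>

#define NUM_URBS 4   /* Arbitrary: how many transfers to keep queued */
#define BUF_SIZE 4096

/* Completion callback: consume the data, then resubmit immediately,
   so the endpoint is never left without a queued transfer. */
static void bulk_in_complete(struct urb *urb)
{
    if (urb->status == 0) {
        /* ... hand urb->transfer_buffer (urb->actual_length bytes)
           over to the consumer ... */
    }
    usb_submit_urb(urb, GFP_ATOMIC); /* Requeue right away */
}

/* Prime the endpoint with NUM_URBS outstanding bulk IN requests.
   pipe is typically usb_rcvbulkpipe(udev, ep_address). */
static int start_streaming(struct usb_device *udev, unsigned int pipe)
{
    int i;

    for (i = 0; i < NUM_URBS; i++) {
        struct urb *urb = usb_alloc_urb(0, GFP_KERNEL);
        void *buf = kmalloc(BUF_SIZE, GFP_KERNEL);

        if (!urb || !buf)
            return -ENOMEM; /* Cleanup omitted for brevity */

        usb_fill_bulk_urb(urb, udev, pipe, buf, BUF_SIZE,
                          bulk_in_complete, NULL);
        usb_submit_urb(urb, GFP_KERNEL); /* Error check omitted */
    }
    return 0;
}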

Actual behavior

These are some anecdotal tests on an Intel B150 chipset’s USB 3.0 xHCI controller (8086:a12f) and a Renesas Technology Corp. uPD720202 (1912:0015). Both were fed with at least four TDs (BULK IN, 4096 bytes each) before the device became ready with its data, so the data flow that was monitored reflects the hardware’s optimal behavior.

This is a typical sequence for the Intel USB controller:

     513.048 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.736 ACK  seq=1 nump=3
       1.400 DATA seq=3 len=1024
       0.720 ACK  seq=2 nump=2
       2.160 ACK  seq=3 nump=1
       2.144 ACK  seq=4 nump=0
       2.008 ACK  seq=4 nump=4
       0.040 DATA seq=4 len=1024
       0.032 DATA seq=5 len=1024
       2.080 DATA seq=6 len=1024
       0.736 ACK  seq=5 nump=3
       1.384 DATA seq=7 len=1024
       0.736 ACK  seq=6 nump=2
       2.160 ACK  seq=7 nump=1
       2.144 ACK  seq=8 nump=0
       1.736 ACK  seq=8 nump=4
       0.040 DATA seq=8 len=1024

DATA packets are sent by the device, and ACKs by the host. The numbers at the beginning of each line are the time difference from the previous line, in microseconds, measured inside the device’s logic. The timing for DATA is of the internal request for a packet (in the device’s logic), not the actual transmission, and the internal queue for such requests is two entries deep, which is why two DATA packets are fired off right after the ACK packet’s arrival.

A DATA packet with a 1024-byte payload consists of a DPH (4 bytes start pattern + 16 bytes) and a DPP (4 bytes start pattern + 1024 bytes payload + 4 bytes CRC + 4 bytes end pattern), all in all 1056 bytes, which take 2.112 μs on the wire at the raw byte rate of 500 MB/s (5 Gb/s after 8b/10b encoding). The theoretical efficiency limit is hence 1024/1056 ≈ 97%, or ~485 MB/s.
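
For reference, a trivial C program that reproduces this arithmetic:

#include <stdio.h>

int main(void)
{
    /* 5 Gb/s with 8b/10b encoding = 500 MB/s, i.e. 2 ns per byte */
    const double ns_per_byte = 2.0;

    const int dph = 4 + 16;           /* Start pattern + header */
    const int dpp = 4 + 1024 + 4 + 4; /* Start + payload + CRC + end */
    const int total = dph + dpp;      /* 1056 bytes */

    printf("Wire time:  %.3f us\n", total * ns_per_byte / 1000.0); /* 2.112 */
    printf("Efficiency: %.1f%%\n", 100.0 * 1024 / total);          /* 97.0  */
    printf("Limit:      %.0f MB/s\n", 500.0 * 1024 / total);       /* 485   */
    return 0;
}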

From the log above, it’s evident that there’s a turnaround time of ~2.85 μs from DATA to ACK, which is just ~0.74 μs beyond the time it takes to transmit the packet.

Note that the host separates the bursts for each TD: The NumP starts at 4 and counts down to 0 in the ACK packets, so exactly 4096 bytes (the size of the TD) are transmitted in a burst. The following ACK packet, which starts off a new burst with nump=4, is transmitted only ~2 μs later, indicating that it took the USB controller some time to figure out that it has more to do with the same endpoint. In theory, it could have looked ahead for the next TD and realized that there are enough TDs to continue the burst practically forever, but this optimization isn’t implemented.

It’s interesting to calculate the time no DATA was transmitted due to the burst stop and restart. The size of the gap isn’t easily calculated, as the times on the DATA packets are when they’re queued. To work around this, one can assume that the last byte of the 4th packet was sent 0.74 μs before the first ACK on its behalf was timed. The gap is hence 0.74 + 2.008 = 2.748 μs (the latter is the difference between the two ACKs for seq=4, the first concluding the burst, and the second starting a new one).

The actual efficiency is hence (4 * 2.112) / ((4 * 2.112) + 2.748) ≈ 75.5% or ~377 MB/s. The actual speed measurement was 358 MB/s. The difference is most likely attributed to momentary shortages of TDs, which are observed as occasional longer gaps (seen only in extensive traffic traces).
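
The same calculation, parameterized on the burst length and the measured gap, reproduces the estimates for both controllers (the Renesas figures are taken from the trace further below):

#include <stdio.h>

/* Efficiency of repeating bursts: n packets of 2.112 us each on the
   wire, followed by a dead gap (in microseconds) until the next burst. */
static double burst_eff(int n, double gap_us)
{
    double busy = n * 2.112;

    return busy / (busy + gap_us);
}

int main(void)
{
    /* Intel: gap = 0.74 + 2.008 us; Renesas: gap = 0.71 + 5.488 us */
    printf("Intel:   %.1f%%\n", 100.0 * burst_eff(4, 2.748)); /* 75.5 */
    printf("Renesas: %.1f%%\n", 100.0 * burst_eff(4, 6.198)); /* 57.7 */
    return 0;
}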

The same test with the Renesas USB host controller:

       4.808 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.712 ACK  seq=1 nump=3
       1.416 DATA seq=3 len=1024
       0.704 ACK  seq=2 nump=2
       2.152 ACK  seq=3 nump=1
       2.144 ACK  seq=4 nump=0
       5.488 ACK  seq=4 nump=4
       0.040 DATA seq=4 len=1024
       0.032 DATA seq=5 len=1024
       2.080 DATA seq=6 len=1024
       0.704 ACK  seq=5 nump=3
       1.448 DATA seq=7 len=1024
       0.712 ACK  seq=6 nump=2
       2.144 ACK  seq=7 nump=1
       2.152 ACK  seq=8 nump=0
       5.552 ACK  seq=8 nump=4

The turnaround for an ACK is similarly ~2.82 μs from DATA to ACK, which is ~0.71 μs beyond the time it takes to transmit the packet. Almost the same as the previous result.

However, the time between the two ACKs that make the gap in the data flow is 0.71 + 5.488 = 6.20 μs, significantly worse than the Intel chipset.

The actual efficiency is hence (4 * 2.112) / ((4 * 2.112) + 6.20) ≈ 57.7% or ~288 MB/s. The actual speed measurement was 262 MB/s.

Larger TD buffers

Since the bandwidth efficiency took a hit because of the short bursts, the next step is to assign larger buffers to each TD, hoping that the USB controller will produce longer bursts. Hence the software prepared 31 TDs (the maximum that the Linux xHCI driver was ready to accept in advance), each with 512 KiB of data.

The Intel controller’s response:

     501.944 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.736 ACK  seq=1 nump=3
       1.400 DATA seq=3 len=1024
       0.712 ACK  seq=2 nump=3
       1.432 DATA seq=4 len=1024
       0.728 ACK  seq=3 nump=3
       1.432 DATA seq=5 len=1024
       0.728 ACK  seq=4 nump=3
       1.432 DATA seq=6 len=1024
       0.736 ACK  seq=5 nump=3
       1.416 DATA seq=7 len=1024
       0.736 ACK  seq=6 nump=3
       1.416 DATA seq=8 len=1024
       0.736 ACK  seq=7 nump=3
       1.432 DATA seq=9 len=1024
       0.720 ACK  seq=8 nump=3
       1.432 DATA seq=10 len=1024
       0.728 ACK  seq=9 nump=3
       1.432 DATA seq=11 len=1024
[ ... ]

This goes on like clockwork until the TD is filled:

       0.736 ACK  seq=29 nump=3
       1.432 DATA seq=31 len=1024
       0.728 ACK  seq=30 nump=2
       2.152 ACK  seq=31 nump=1
       2.160 ACK  seq=0 nump=0
       1.736 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.736 ACK  seq=1 nump=3

Even though the device announces a maximal burst length of 8, and each TD can take much more than that, Intel’s controller chooses to limit itself to a NumP of 4. Since the DATA-to-ACK bus turnaround is ~2.85 μs (see above), which is less than the time for transmitting two DATA packets of maximal length, this limitation has no performance impact. Note that the DATA packets are queued a significant time after the arrival of ACK packets, indicating that the device wasn’t waiting for them. This is quite expected, as there are two DATA packets in flight all the time, and ACKs arrive with nump=3, so all in all another DATA packet is always allowed. Intel got their timing right here.
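
A quick sanity check of that claim, with the numbers measured above: the ACK that replenishes the credit spent on a DATA packet arrives well before the remaining credit runs out.

#include <stdio.h>

int main(void)
{
    const double t_pkt = 2.112; /* Wire time of a 1024-byte DATA packet */
    const double t_turn = 2.85; /* Measured DATA-to-ACK turnaround */

    /* After a DATA packet goes out, its replenishing ACK arrives
       within t_turn, while even two packets' worth of remaining
       credit keeps the wire busy longer than that: */
    printf("Two packets take: %.3f us\n", 2 * t_pkt); /* 4.224 */
    printf("ACK arrives in:   %.3f us\n", t_turn);    /* 2.850 */
    return 0;
}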

The measured bandwidth on this run was 471 MB/s.

The Renesas host controller doesn’t behave as elegantly, however:

       5.808 ACK  seq=2 nump=8
       0.040 DATA seq=2 len=1024
       0.032 DATA seq=3 len=1024
       2.080 DATA seq=4 len=1024
       0.704 ACK  seq=3 nump=7
       1.416 DATA seq=5 len=1024
       0.704 ACK  seq=4 nump=6
       1.464 DATA seq=6 len=1024
       0.688 ACK  seq=5 nump=5
       1.464 DATA seq=7 len=1024
       0.688 ACK  seq=6 nump=4
       1.464 DATA seq=8 len=1024
       0.696 ACK  seq=7 nump=3
       1.464 DATA seq=9 len=1024
       0.696 ACK  seq=8 nump=2
       2.160 ACK  seq=9 nump=1
       2.168 ACK  seq=10 nump=7
       0.040 DATA seq=10 len=1024
       0.032 DATA seq=11 len=1024
       2.080 DATA seq=12 len=1024
       0.704 ACK  seq=11 nump=6
       1.432 DATA seq=13 len=1024
       0.688 ACK  seq=12 nump=5
       1.464 DATA seq=14 len=1024
       0.688 ACK  seq=13 nump=4
       1.464 DATA seq=15 len=1024
       0.696 ACK  seq=14 nump=3
       1.464 DATA seq=16 len=1024
       0.688 ACK  seq=15 nump=2
       2.152 ACK  seq=16 nump=1
       2.208 ACK  seq=17 nump=7
       0.040 DATA seq=17 len=1024
       0.032 DATA seq=18 len=1024
       2.080 DATA seq=19 len=1024
       0.712 ACK  seq=18 nump=6
       1.416 DATA seq=20 len=1024
       0.704 ACK  seq=19 nump=5
       1.448 DATA seq=21 len=1024
       0.696 ACK  seq=20 nump=4
       1.464 DATA seq=22 len=1024
       0.696 ACK  seq=21 nump=3
       1.464 DATA seq=23 len=1024
       0.696 ACK  seq=22 nump=2
       2.160 ACK  seq=23 nump=1
       2.168 ACK  seq=24 nump=6
       0.040 DATA seq=24 len=1024
       0.032 DATA seq=25 len=1024
       2.080 DATA seq=26 len=1024
       0.704 ACK  seq=25 nump=5
       1.432 DATA seq=27 len=1024
       0.688 ACK  seq=26 nump=4
       1.464 DATA seq=28 len=1024
       0.688 ACK  seq=27 nump=3
       1.464 DATA seq=29 len=1024
       0.696 ACK  seq=28 nump=2

The sequence doesn’t repeat itself, so the excerpt above doesn’t show all that went on. It’s not so clear what this host controller is up to. NumP is decremented, sometimes down to 1, sometimes down to 0, and then returns to a seemingly random number (8 after nump=0, but quite random after nump=1). It seems to be a combination of attempting to make bursts of 8 DATA packets (the maximal burst length announced by the device) and a bandwidth limitation between the USB controller and the host (PCIe Gen2 x1, which doesn’t leave a lot of spare bandwidth compared with the USB 3.0 link).

The measured bandwidth was 401 MB/s, which seems to confirm that the bottleneck is the PCIe link.
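
As a rough plausibility check, here’s the usable bandwidth of a PCIe Gen2 x1 link, assuming a 128-byte maximum TLP payload (a common setting; the actual value depends on the controller and chipset):

#include <stdio.h>

int main(void)
{
    /* PCIe Gen2 x1: 5 GT/s with 8b/10b encoding = 500 MB/s raw */
    const double raw_mbs = 500.0;

    /* Per-TLP overhead for a memory write with a 32-bit address:
       framing (2) + sequence number (2) + 3-DW header (12) +
       LCRC (4) = 20 bytes. */
    const int payload = 128; /* Assumed max payload size */
    const int overhead = 2 + 2 + 12 + 4;

    printf("TLP-level limit: ~%.0f MB/s\n",
           raw_mbs * payload / (payload + overhead)); /* ~432 */
    return 0;
}

DLLPs (ACKs and flow control updates) shave off a few percent more, which puts the practical ceiling in the neighborhood of the measured 401 MB/s.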

Not directly related: On termination of the test with the Renesas controller, the following line appeared in the kernel log (Linux v5.3.0):

xhci_hcd 0000:03:00.0: WARN Successful completion on short TX for slot 1 ep 2: needs XHCI_TRUST_TX_LENGTH quirk?

Conclusion

It’s quite clear that there’s a difference in both behavior and performance between USB 3.0 controllers. While both controllers work perfectly within the requirements of the spec, the handling of TDs and the creation of bursts differs dramatically.

Large-buffer TDs are definitely required for good bandwidth utilization. This was shown above at the hardware level, but it’s also true with regard to the software’s ability to keep the TD queue populated.
