Recovering from a BULK IN overflow on USB 3.0

This post was written by eli on December 7, 2019
Posted Under: USB

Introduction

At times, an attempt to get data from a BULK IN endpoint may result in an overflow error. In other words,

rc = libusb_bulk_transfer(dev_handle, (1 | LIBUSB_ENDPOINT_IN),
                          buf, bufsize, &count, (unsigned int) 1000);

may fail with rc having the value of LIBUSB_ERROR_OVERFLOW. Subsequent attempts to access the same endpoint, even after re-initializing the libusb interface result in rc = LIBUSB_ERROR_IO, which is just “the I/O failed”. Terminating and rerunning the program doesn’t help. Apparently, the only thing that gets the endpoint out of this condition is physically plugging it out and back into the computer (or hub).

Why this happens

The term “transfer” in the “libusb_bulk_transfer” function name refers to a USB transfer cycle. In essence, calling this function ends up with a request to the xHCI hardware controller to receive a given number of bytes (“bufsize” above) from the endpoint, and to write it into a buffer (“buf” above). The host controller start requesting DATA packets from the device, and fills the buffer with the data in those packets. Note that the communication isn’t a continuum of data packets. Rather, it’s a session which fulfills a transfer request. Someone who wrote the USB spec probably thought of the data transmission in terms of a function call of the libusb_bulk_transfer sort: A function is called to request some data, communication takes place, the function returns. This principle holds for all USB versions.

Here’s the crux, and it’s related to the hardware protocol, namely with data packets: The host controller can give the device go-ahead to transmit packets, but it can’t control the number of bytes in each packet. This is completely up to the device. The rule is that if the number of bytes in a packet is less than the maximum allowed for the specific USB version (1024 bytes for USB 3.0), it’s considered a “short packet”. When a short packet arrives at the host controller, it should conclude that the transfer is done (including the data in this last packet).

So the device may very well supply less data than requested for the transfer. This is perfectly normal, and this is why it’s mandatory to check how many bytes have actually arrived when the function returns.

But what if the transfer request was for 10 bytes, and the device sent 20?

An overflow on USB 3.0

On USB 3.0, the xHCI controller requests data from the device indirectly by issuing an ACK Transaction Packet to the device. It may or may not acknowledge packets it has already acknowledged, but this isn’t the point at the moment. Right now, it’s more interesting that all ACK packets also carry the number of DATA packets that the host is ready to receive at the moment that the ACK packet was formed, in a field called NumP. This is how the host controls the flow of data.

When there’s no data flowing, it can be because the device has no data to send, but also if the last ACK for the relevant endpoint was a packet with NumP=0. To resume the flow, the host then issues an ACK packet, with a non-zero NumP, giving the go-ahead to transmit more data.

So a packet capture can look like this for a transfer request of 1500 bytes (the first column is time in microseconds):

       0.000 ACK  seq=0 nump=2
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.832 ACK  seq=1 nump=1
       2.104 ACK  seq=2 nump=0

Note that the device sent 1024 bytes twice, exceeding the 1500 bytes requested. This causes an overflow error. So far so good. But what about all those LIBUSB_ERROR_IO afterwards?

So this is the time to pay attention to the “seq” numbers. All DATA packets carry this sequence number, which runs cyclically from 0 to 31. The ACK’s sequence number is the one that the host expects next. Always. For example, if the host requests a retransmission, it repeats the sequence number of the DATA packet it failed to receive properly, basically saying, “I expect this packet again” (and then ignores DATA packet that follow until the retransmission).

Now, this is what appears on the bus as a result of a libusb_bulk_transfer() call after the one that returned with an overflow condition:

21286605.320 ACK  seq=0 nump=2

This is not a mistake: The sequence number is zero. Note that the ACK that closed the previous sequence with a nump=0 had seq=2. Hence the ACK that follows it to re-initiate the data flow should have seq=2 as well. But it’s zero. In section 8.11.1, the USB 3.0 spec says that an ACK is considered invalid if it hasn’t an expected sequence number, and that it should be ignored in this case. So the device ignores this ACK and sends no DATA packet in response. The host times out on tDPResponse (400 ns per USB 3.0 spec) and reports a LIBUSB_ERROR_IO. So the forensic explanation is established. Now to the motive.

Handling a babble error

The resetting of the sequence number has been observed with more than one xHCI controller (an Intel chipset as well as Renesas’ uPD720202), so it’s not a bug.

This brings us to the xHCI spec, revision X, section 4.10.2.4: “When a device transmits more data on the USB than the host controller is expecting for a transaction, it is defined to be babbling. In general, this is called a Babble Error. When a device sends more data than the TD transfer size, … the host controller shall set the Babble Detected Error in the Completion Code field of the TRB, generate an Error Event, and halt the endpoint (refer to section 4.10.2.1).”

So the first thing to note is that the endpoint wasn’t halted after the overflow. In fact, there was no significant traffic at all. Quite interestingly, Linux’ host controller didn’t fulfill its duty in this respect.

But still, why was the sequence number reset? Section 8.12.1.2 in the USB 3.0 sheds some light: “The host expects the first DP to have a sequence number set to zero when it starts the first transfer from an endpoint after the endpoint has been initialized (via a Set Configuration, Set Interface, or a ClearFeature (ENDPOINT_HALT) command”.

So had the endpoint been halted, as it’s required per xHCI spec, it would just have returned STALL packets until it was taken out of this condition. At which point the sequence number should be reset to zero per USB spec.

So apparently, whoever designed the xHCI hardware controller assumed that no meaningful communication would take place after the overflow (babble) error, and that the endpoint must be halted anyhow, so reset the sequence number and have it done with. It’s easier doing it this way than detecting the SETUP sequence that clears the ENDPOINT HALT feature.

Given that the xHCI driver doesn’t halt the endpoint, it’s up to the application software to do it.

Halt and unhalt

This is a sample libusb-based C code for halting BULK IN endpoint 1.

      if (rc == LIBUSB_ERROR_OVERFLOW) {
	rc = libusb_control_transfer(dev_handle,
				     0x02, // bmRequestType, endpoint
				     0x03, // bRequest = SET_FEATURE
				     0x00, // wValue = ENDPOINT_HALT
				     (1 | LIBUSB_ENDPOINT_IN), // wIndex = ep
				     NULL, // Data (no data)
				     0, // wLength = 0
				     100); // Timeout, ms

	if (rc) {
	  print_usberr(rc, "Failed to halt endpoint");
	  break;
	}

	rc = libusb_control_transfer(dev_handle,
				     0x02, // bmRequestType, endpoint
				     0x01, // bRequest = CLEAR_FEATURE
				     0x00, // wValue = ENDPOINT_HALT
				     (1 | LIBUSB_ENDPOINT_IN), // wIndex = ep
				     NULL, // Data (no data)
				     0, // wLength = 0
				     100); // Timeout, ms

	if (rc) {
	  print_usberr(rc, "Failed to unhalt endpoint");
	  break;
	}

	continue;
      }

The second control transfer can be exchanged with

	rc = libusb_clear_halt(dev_handle, (1 | LIBUSB_ENDPOINT_IN));

however there is no API function for halting the endpoint, so one might as well do them both with control transfers.

Resetting the entire device

For those who like the big hammer, it’s possible to reset the device completely. This is one of the conditions for resetting the sequence numbers on all endpoints, so there’s no room for confusion.

      if (rc == LIBUSB_ERROR_OVERFLOW) {
	rc = libusb_reset_device(dev_handle);

	if (rc) {
	  print_usberr(rc, "Failed to reset device");
	  break;
	}
	continue;
      }

This causes a Hot Reset. which is an invocation of Recovery with the Hot Reset bit set, and return to U0, which in itself typically takes ~150 μs. However as a result from this call, the is reconfigured — its descriptors are read and configuration commands are sent to it. It keeps its bus number, and the the entire process takes about 100 ms. All this rather extensive amount of actions is hidden in this simple function call.

Also, a line appears in the kernel log, e.g.:

usb 4-1: reset SuperSpeed USB device number 3 using xhci_hcd

So all in all, this is a noisy overkill, and is not recommended. It’s given here mainly because this is probably how some people eventually resolve this kind of problem.

Add a Comment

required, use real name
required, will not be published
optional, your blog address