xhci_hcd WARN Event TRB for slot x ep y with no TDs queued

This post was written by eli on November 9, 2019
Posted Under: Linux kernel,USB

What’s this?

There’s a chance that you’re reading this because the message in the title appeared (or flooded) your kernel log. This post attempts to clarify what to do about it, depending on how much you want to get involved in the matter.

So the short answer: The said warning message is a bug related to the xHCI USB controller, but a rather harmless one: Except for the message in the kernel log, everything is fine. This was fixed in Linux kernel v4.15, so upgrading the kernel is one way out. Alternatively, patch your kernel with commit e4ec40ec4b from December 2017, which is the fix made on v4.15. Even editing away the said xhci_warn() call is fairly sensible, if the patch doesn’t apply.

Note that the xHCI controller is used for any port that is USB 3.x capable, even when a lower version USB device is connected to it (e.g. USB 2.0). So if your computer has a few USB 2.0-only ports, moving a device to such port might be your quick fix. Look for USB plugs with black plastic instead of blue.

The rest of this post dives deeply into the whereabouts of this accident. It matters if you’re developing a USB device. Note that this post is written in increasing detail level, which makes it a bit disorganized. Apologies if it’s confusing.

Why this?

The explanation is a bit tangled, and is related to the way the xHCI driver organizes the buffers for a USB transfer.

To make a long story short(er), a software transfer request (URB in Linux) is conveyed to the hardware controller as a TD (Transfer Descriptor) which is presented to the hardware in the form of one or more Transfer Request Blocks (TRBs). How many? There’s no advantage in chopping the TD into more than one TRB for a transfer request, however there are certain constraints to an TRB. Among others, each TRB points at a chunk of the buffer in continuous physical memory. The xHCI specification also requires in its section 6.4.1 that the data buffers shall not cross 64 kiB boundaries in physical memory.

So if the transfer request’s data buffer does cross 64 kiB boundaries, the TD is split into several TRBs, each pointing at a another part of the buffer supplied in the software’s transfer request. For example, see xhci_queue_bulk_tx() in the Linux kernel’s drivers/usb/host/xhci-ring.c. This is fine, and should cause no problems.

But as it turns out, if a BULK IN transfer request is terminated by the device with a short packet, and the TD consists of more than one TRB, the xHCI hardware produces more than one notification event on this matter. This is also fine and legal, however it causes the kernel (before the patch) to issue the said warning. As already said, the warning is a bug, but otherwise there’s no problem.

So for this bug to show up, there needs to be a combination of a pre-4.15 kernel, short packets and transfer requests that have data buffers that span 64 kiB boundaries.

Huh? Short packets?

A “short packet” is a USB packet that is shorter than the maximal size that is allowed for the relevant USB version (that is, 512 bytes for USB 2.0 and 1024 bytes on USB 3.x).

The idea is that the software can request a transfer of a data chunk of any size (with some maximum, but surely several kilobytes). On a BULK IN endpoint, the device is in principle expected to send that amount of data, divided into packets, which all have the maximal size, except for the last one.

However it’s perfectly legal and fine that the device sends less than requested by the software in the transfer request. In that case, the device sends a packet that is shorter than the maximal packet size, and by doing so, it marks the end of the transfer. It can even be a packet with zero bytes, if the previous packet was of maximal length. This is called a short packet, and once again, it’s not reason for any alarm. It just means that the software must check how much data it actually got when the transfer is completed.

Since the hardware and its driver often have some dedicated protocol to coordinate their data exchange, this short packet mechanism is often unnecessary and not used.

Avoiding these warnings

Linux kernel v4.15 was released in January 2018, so when writing this it’s quite expected that older kernels are still ubiquitous.

If short packets are expected from the device, and pre-patch kernels are expected to be in use, it’s wise to make sure that the buffers don’t cross the said boundaries. Keeping them small, surely below 4 kiB is a good start, as larger buffers surely span more than one page. Also, different pages can be anywhere in physical memory, causing the need to divide the buffer into several TRBs. And then there’s the 64 kiB boundaries issue. However this isn’t practical in a high-bandwidth application, as a 4 kB buffer is exhausted in 10 μs at a rate of 400 MB/s. The software has no chance to continuously generate TDs fast enough to keep up with their completion, which will result in data transmission stops, waiting for TDs to be queued.

When writing a device driver in the kernel, it’s relatively easy to control the position of the buffer in physical memory. In particular, the kernel’s __get_free_pages() function allows this. This is however not the common practice, in particular as the buffers are typically much smaller than a page, so using __get_free_pages() would have seemed to be a waste of memory. So existing drivers are subject to this problem, and there’s not much to do about it (except for applying the patch, as suggested above).

When libusb is used to access the device (through usbfs), there is a solution, assuming that the kernel is v4.6 (released May 2016) and later + libusb version 1.0.21 and later. The trick is to use the libusb_dev_mem_alloc() function to allocate memory (i.e. the zerocopy feature), which implements a shared memory mapping with the usbfs driver, so that the data buffer that is accessed from user space is directly accessed by the xHCI hardware controller. It also speeds up things slightly by avoiding memory allocation and copying of buffers on x86 platforms, which ensure cache coherency on DMA transfers. Not sure on what happens on ARM platforms, for which cache coherency needs to be maintained explicitly.

Note that physical memory continuity is only ensured within 4 kiB pages, as the mmap() call doesn’t ensure physical memory continuity. Since virtual to physical memory translation never touches the lower 12 bits of the address, staying within 4 kiB alignment in virtual alignment ensures no 4 kiB boundaries are crossed — but this doesn’t help regarding 64 kiB boundaries.

Without libusb_dev_mem_alloc(), the usbfs framework allocates a buffer with kmalloc() for each transfer, and eventually copies the data from the DMA buffer into it. No control whatsoever on how the buffer is aligned. The position of the buffer that is supplied by the user-space software makes no difference.

Zero-copy buffers: Gory details (too much information)

Introduced in Linux kernel v4.6 (commit f7d34b445, February 2016), mmap() on the usbfs file descriptor allows the memory buffer provided by the user-space program to be used directly by the hardware xHCI controller. The patch refers to this as “zerocopy”, as there is no copying of the data, however the more important part is that the buffers don’t need to be allocated and freed each time.

The idea is quite simple: The user-space software allocates a memory buffer by calling mmap() on the file descriptor of the relevant USB device. When a URB is submitted with the data buffer pointer directed to inside the mmap’ed region, the usbfs driver detects this fact, and skips memory allocation and copying of data. Instead, it uses the memory buffer directly.

This is implemented in proc_submiturb() (drivers/usb/core/devio.c) by first checking if the buffer is in an mmap’ed segment:

as->usbm = find_memory_area(ps, uurb);

“uurb” in the context of this code is the user-space copy of the URB. find_memory_area scans the list of mmap’ed regions for one that contains the address uurb->buffer (and checks that uurb->buffer_length doesn’t exceed the region). It returns NULL if no such buffer was found (which is the common case, when mmap() isn’t used at all). Moving on, we have

if (as->usbm) {
  unsigned long uurb_start = (unsigned long)uurb->buffer;

  as->urb->transfer_buffer = as->usbm->mem + (uurb_start - as->usbm->vm_start);
 } else {
  as->urb->transfer_buffer = kmalloc(uurb->buffer_length, GFP_KERNEL);
  if (!as->urb->transfer_buffer) {
    ret = -ENOMEM;
    goto error;
  }
  if (!is_in) {
    if (copy_from_user(as->urb->transfer_buffer,
		       uurb->buffer,
		       uurb->buffer_length)) {
      ret = -EFAULT;
      goto error;
    }

So if a memory mapped buffer exists, the buffer that is used against hardware is calculated from the relative position in the mmap’ed buffer.

Otherwise, it’s kmalloc’ed, and in the case of an OUT transaction, there’s a copy_from_user() call following to populate the buffer with data.

So to use this feature, the user-space software should mmap() a memory segment that is large enough to contain all data buffers, and then manage this segment, so that each transfer URB has its own chunk of memory in this segment.

In order to do this from libusb, there’s the libusb_dev_mem_alloc() API function call, defined in core.c. It calls usbi_backend.dev_mem_alloc(), which for Linux is op_dev_mem_alloc() in os/linux_usbfs.c. It’s short and concise, so here it is:

static unsigned char *op_dev_mem_alloc(struct libusb_device_handle *handle,
	size_t len)
{
	struct linux_device_handle_priv *hpriv = _device_handle_priv(handle);
	unsigned char *buffer = (unsigned char *)mmap(NULL, len,
		PROT_READ | PROT_WRITE, MAP_SHARED, hpriv->fd, 0);
	if (buffer == MAP_FAILED) {
		usbi_err(HANDLE_CTX(handle), "alloc dev mem failed errno %d",
			errno);
		return NULL;
	}
	return buffer;
}

This capability was added to libusb in its version 1.0.21 (as the docs say) with git commit a283c3b5a, also in February 2016, and by coincidence, by Steinar H. Gunderson, who also submitted the Linux kernel patch. Conspiracy at its best.

libusb: From API call to ioctl()

This is some libusb sources dissection notes. Not clear why this should interest anyone.

From the libusb sources: A simple submitted bulk transfer (libusb_bulk_transfer(), defined in sync.c) calls do_sync_bulk_transfer(). The latter function wraps an async transfer, and then calls the API’s function for that purpose, libusb_submit_transfer() (defined in io.c), which in turn calls usbi_backend.submit_transfer(). For Linux, this data structure is populated in os/linux_usbfs.c, with a pointer to the function op_submit_transfer(), which calls submit_bulk_transfer().

submit_bulk_transfer() ends up making a ioctl() call to submit one or more URBs to Linux’ usbfs interface. The code used for this ioctl is IOCTL_USBFS_SUBMITURB, which is a libusb-specific defined as

#define IOCTL_USBFS_SUBMITURB	_IOR('U', 10, struct usbfs_urb)

in linux_usbfs.h, which matches

#define USBDEVFS_SUBMITURB         _IOR('U', 10, struct usbdevfs_urb)

in the Linux kernel source’s include/uapi/linux/usbdevice_fs.h.

The ioctl() call ends up with proc_submiturb() defined in drivers/usb/core/devio.c, which moves the call on to proc_do_submiturb() in the same file. The latter function splits the transfer into scatter-gather buffers if it’s larger than 16 kiB (where did that number come from?).

And then we have the issue with memory mapping, as mentioned above.

Extra kernel gory details: URB handling

This is the traversal of a URB that is submitted via usbfs in the kernel, details and error handling dropped for the sake of the larger picture.

proc_do_submiturb() sets up a usbfs-specific struct async container for housekeeping data and the URB (as a struct urb), puts it on a locally accessible list and then submits the URB into the USB framework with a

ret = usb_submit_urb(as->urb, GFP_KERNEL);

This finishes the usbfs-specific handling of the URB submission request. usb_submit_urb() (defined in drivers/usb/core/urb.c) makes a lot of sanity checks, and eventually calls usb_hcd_submit_urb() (defined in drivers/usb/core/hcd.c). Unless the URB is directed to the root hub itself, the function registered as hcd->driver->urb_enqueue is called with the URB. The hcd is obtained with

hcd = bus_to_hcd(udev->bus);

which merely fetches the usb_hcd structure from the usb_bus structure.

Anyhow for xHCI the list of methods is mapped in drivers/usb/host/xhci.c, where urb_enqueue is assigned with xhci_urb_enqueue() (no surprises here). This function allocates a private data structure with kzalloc() and assigns urb->hcpriv with a pointer to it, and then calls xhci_queue_bulk_tx() (defined in drivers/usb/host/xhci-ring.c) for a bulk transfer URB.

xhci_queue_bulk_tx() calls queue_trb() which actually puts 16 bytes of TRB data into the Transfer Ring (per section 6.4.1 in the xHCI spec), and calls inc_enq() to move the enqueue pointer.

Once the hardware xHCI controller finishes handling the TRB (for better or worse), it queues an entry in the event ring, and issues an interrupt, which is handled by xhci_irq() (also in xhci-ring.c). After several sanity checks, this ISR calls xhci_handle_event() until all events have been handled, and then does the housekeeping tasks for confirming the events with the hardware.

The interesting part in xhci_handle_event() is that it calls handle_tx_event() if the event was a finished transmission URB. This happens to be the function that emits the warning in the title under some conditions. After a lot of complicated stuff, it calls process_bulk_intr_td() for a BULK endpoint TD. Which in turn calls finish_td(), which returns the TD back to the USB subsystem: It calls xhci_td_cleanup(), which checks if all TDs of the relevant URB have been finished. If so, xhci_giveback_urb_in_irq() is called, which in turn calls usb_hcd_giveback_urb() (defined in drivers/usb/core/hcd.c). That function launches a tasklet that completes the URB.

Add a Comment

required, use real name
required, will not be published
optional, your blog address