usbpiper: A single-threaded /dev/cuse and libusb-based endpoint to device file translator

Introduction

Based upon CUSE, libusb and the kernel’s epoll capability, this is a single-threaded utility which generates one /dev/usbpiper_* device file for each bulk / interrupt endpoint on a USB device. For example, /dev/usbpiper_bulk_in_01 and /dev/usbpiper_bulk_out_03.

It’s an unfinished project, that was stopped before a lot of obvious tasks in the TODO list were done. This is why several parameters are hardcoded and some memory allocations aren’t freed. Plus several other implications listed below.

It’s available at Github: https://github.com/billauer/usbpiper

I eventually went for a good old kernel driver instead. This post explains why, and you probably want to read it if you have plans on this utility or want to use FUSE or CUSE otherwise. That post also explains why I went right on to /dev/cuse rather than using libfuse.

Nevertheless, the project may very well be useful for development of USB projects, as a boilerplate or a getting-started utility. It also shows how to implement epoll-based asynchronous USB transfers, as well as implementing a CUSE-based device file driver in userspace, implementing the protocol of /dev/cuse directly (i.e. without relying on libfuse). And all this as a single thread program.

But what was the utility meant to do in the first place?

The underlying idea is simple: With a single-threaded userspace program, create a plain character device for each BULK (or INTERRUPT) endpoint that is found on a selected USB device, and allow data to be sent to each OUT endpoint by opening a device file, and just write data to it. With “cat” for example. And the other way around, read data from each IN endpoint by reading data from another device file. This description is simplistic, however it may work quite well when working on a USB device project. Just be sure to read the details below on setting up usbpiper. Doing that pretty much covers the necessary gory details.

What usbpiper definitely isn’t: It’s NOT a user-space driver for XillyUSB (a generic FPGA IP core for SuperSpeed USB 3.0, based upon the FPGA’s Gigabit transceivers). XillyUSB requires a dedicated driver, which implements a specific protocol with the IP core.

Confusing usbpiper with XillyUSB’s driver is easy, because both share the idea of plain device files for I/O with a USB device. In fact, usbpiper started off as a user-space driver for XillyUSB, but never got to the point of covering XillyUSB’s protocol.

Another possible source of confusion is usbfs. It’s a USB filesystem, so what is there to add? So yes, usbfs is used by libusb to allow a low-level driver for a USB device to be written in user space (usbpiper uses this interface, of course). It doesn’t allow a simple access to the data.

It’s recommended to look on this post on the protocol with /dev/cuse before diving into this one.

What works and in what ways it’s unfinished

usbpiper is executed with no arguments. It takes control of the selected USB device’s interface (which one — see below) and creates a /dev/usbpiper_* device file for each bulk or endpoint endpoint that it finds. The file’s name reflects the endpoint’s number, direction and bulk vs. interrupt.

It has however only been tested on bulk endpoints. Interrupt endpoints may work, but has not been tested, and isochronous endpoints are ignored. Also, usbpiper doesn’t free memory properly, in particular not buffers and other memory consuming stuff that are related to libusb.

Several parameters would normally be set through command-line parameters, but they are hardcoded.

The verbosity level can be set by editing some defines in usbpiper.h. In particular, a lot of messages are reduced by replacing

#define DEBUG(...) { fprintf(stderr, __VA_ARGS__); }

with

#define DEBUG(...)

In usbpiper.c, max_size defines the largest number of bytes that can be handled in a CUSE READ or WRITE request.

In usb.c, the following parameters are hardcoded:

  • FIFOSIZE: The effective number of bytes in the FIFO between the CUSE and USB units. The actual FIFO size for OUT endpoints is larger by max_size, for reasons explained in the “Basic data flow principle” section below.
  • vendorID and prodID define the device to be targeted. Note that the find_device() function in usb.c explicitly finds the device from the list of devices on the bus, so it can be altered to select the device based upon other criteria.
  • int_idx and alt_idx are the Interface and Alternate Setting indexes for selection on the device. More on this issue below.
  • td_bufsize is the size of the buffer that goes which each transfer. Set to 64 kiB, which is probably an overkill for most devices, but reasonable for proper bandwidth utilization with SuperSpeed devices. Also see below why it should be large when working with just some device.
  • numtd: The maximal number of outstanding transfers for each endpoint. A large number is good for high-bandwidth applications (with SuperSpeed) since it gives the hardware controller several transfers in a row before software intervention is required. Make it too big, and libusb_submit_transfer() may fail (the controller got more than it could accept).

Features that were meant to be added, but I never got there:

  • Array size of epoll should be dynamic (number of held file descriptors). Currently it’s ARRAYSIZE in usbpiper.c.
  • A file was supposed to be bidirectional. Makes no sense in this usage scenario, and bidirectional was never tested.
  • Non-blocking OPEN not supported
  • Was intended to support USB hotplugging
  • Adaption to XillyUSB’s protocol

USB Transfers and why you should care about them

There is a good reason why there isn’t any pipe-like plain device file interface for any USB device by default: usbpiper overlooks several details in the communication of a USB device.

The most important issue is that USB communication is divided into transfers, and are generally not treated as a continuous stream of data. The underlying model in the USB spec is that the host software initiates a transfer of a given number of bytes (in or out), the USB framework carries it out against the device, and then informs the software that it has been finished. The USB spec’s authors seem to have thought that the mainline usage of the USB bus would be done with a functional call saying something like “send this packet of data to the device”. Or another function saying “receive X bytes from the device”, which returns with a buffer pointing to the data buffer.

The USB framework supports asynchronous transfers, of course, but that doesn’t change the notion that the host’s software explicitly requests each transfer with a given number of bytes. All communication is cut into packet-like chunks of data with clear, boundaries. The device is allowed to divert from the host’s transfer requests only in one way: On IN endpoints, it’s allowed to terminate a transfer with less bytes than the host expected, and this is not considered an error.

However generally speaking, any software that communicates with a device directly (i.e. a device driver) is expected to know when the device expects transfers and of what size. usbpiper ignores this completely. Therefore, it may very well not work properly with just any device. This is less of an issue if the device is developed along with using usbpiper.

The three points to note are hence:

  • usbpiper sets byte count of OUT transfers according to the momentary buffer fill, up to a certain limit (td_bufsize). If the device expects a certain number of bytes in the transfer (which is legit) or the transfers are longer than in can take — things will break, of course. A device may also be sensitive to transfer boundaries, which usbpiper pays no attention to. If the device expects a fixed length for all transfers, this issue can be worked around by modifying try_queue_bulkout() never send a partially filled transfer, and set the desired length instead of td_bufsize.
  • usbpiper sets td_bufsize as the length of IN transfers, however the host doesn’t inform the device on how long the transfer is expected to be. The device driver is supposed to know the maximal length of an IN transfer that the device will respond with, and prepare a buffer long enough. Otherwise, a babbling error results (libusb returns LIBUSB_ERROR_OVERFLOW). td_bufsize is set to 64 kiB which is unlikely to be exceeded by USB devices — but this isn’t guaranteed.
  • Another issue with IN endpoints is that the information on where the boundaries of the transfers is lost: usbpiper just copies the data into a FIFO, which is read continuously on the other side. If the protocol of an IN endpoint relies on the driver knowing where a transfer started, usbpiper won’t be useful. This can be the case if the transfers are packets with a header, but without a data length field. This makes sense against a driver that receives the transfers directly.

Interfaces and alternate settings

A USB device may present several interfaces, and each interface may have alternate settings. This isn’t a gory technical detail, but can be the difference between getting your device working with usbpiper or not, in particular if it’s not something you designed yourself.

Even though a device is assigned an address on the USB bus, any USB driver claims the control of an interface of that device. In other words, it’s perfectly normal that several, possibly independent drivers control a single physical device. A keyboard / mouse combo device or a sound card with MIDI and joystick interface (not so common today). Or like a scanner / printer, which also acts as a card reader:

$ usb-devices
T:  Bus=01 Lev=03 Prnt=44 Port=03 Cnt=01 Dev#= 45 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=03f0 ProdID=7a11 Rev=01.00
S:  Manufacturer=HP
S:  Product=Photosmart B109a-m
S:  SerialNumber=MY5687428T02D2
C:  #Ifs= 4 Cfg#= 1 Atr=c0 MxPwr=2mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=cc Prot=00 Driver=(none)
I:  If#= 1 Alt= 0 #EPs= 2 Cls=07(print) Sub=01 Prot=02 Driver=usblp
I:  If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
I:  If#= 3 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage

Note that the device effectively behaves like two independent devices: A scanner / printer and a USB disk.

It’s therefore important to not just set the Vendor / Product IDs correctly, but also the interface. usb-devices and lsusb -vv may help making the correct selection.

Alternate setting is less common, but a single interface may have different usage modes. If present, this must be set correctly as well.

Basic data flow principle

The purpose of the utility is to move data from a USB endpoint to a CUSE device file or vice versa. To accomplish this, there is a plain RAM FIFO allocated for each such data stream.

For an IN endpoint, the USB frontend queues asynchronous transfer requests using libusb. For each IN transfer that is finished, the data is copied into the relevant FIFO. On the FIFO’s other side, the read() calls on the device file (i.e. CUSE READ requests) are fulfilled, as necessary, by submitting data that is fetched from the FIFO. Overflow of the FIFO is prevented by queuing IN transfer requests only when there’s enough room in the FIFO to accept the data that all outstanding requests may carry, if they all return with a full buffer. Underflow is not an issue, but the read() call isn’t completed if there is no data to submit, in which case read() blocks.

For an OUT endpoint, a the handler of a write() call (i.e. CUSE WRITE requests) copies the data into the relevant FIFO. As a result of the FIFO containing data, the USB frontend may queue new OUT transfers with the data available — it may also not do so, in particular if the number of already outstanding transfer stands at the maximal available. The FIFO is protected from overflow by blocking the write() call until there is enough room in the FIFO. The exact condition relates to the fact the length of the data buffer of each CUSE WRITE request is limited by a number (max_size in the code) that is set during CUSE initialization. A WRITE request is hence not completed (hence preventing another one) until there is room for max_size additional bytes in the FIFO, after writing the current request’s data to the FIFO. This ensures that the usbpiper process always has where to put the data, and doesn’t need to block — which it’s now allowed to, being a single-threaded utility.

The requirement of always having max_size bytes of data vacant in the FIFO gets slightly trickier when a WRITE request is interrupted (i.e. receives an INTERRUPT request on its behalf). This forces usbpiper to immediately complete the request. In order to ensure the requirement on the FIFO, usbpiper possibly unwinds the FIFO, throwing away data so that the FIFO’s write fill is at most max_size bytes below full. This doesn’t break the data stream’s integrity or continuity, because the write() call returns with the number of bytes actually written (or an -EINTR, if none). If the FIFO was unwound, the number of bytes that were discarded is reduced from write()’s return value, giving the caller of write() the correct picture of how much data was consumed.

Execution flow

Recall from above that usbpiper doesn’t rely on libfuse, but rather communicates with the CUSE framework directly through /dev/cuse.

As the utility’s single thread needs to divide attention between the USB tasks and those related to CUSE, a single epoll() file descriptor is allocated for all open /dev/cuse files as well as those supplied by the libusb framework. A epoll_wait() event loop is implemented in usbpiper.c: Each entry in the epoll_event array contains a pointer a small structure, which contains a function to call and a pointer to a private data pass it to the function.

The communication protocol with /dev/cuse is discussed on another post. For the purpose of the current topic, the CUSE kernel framework creates a device file in /dev/ as a result of each time /dev/cuse being opened and a simple read-write handshake completed. After this, for each operation on the related device file (e.g. open(), read(), write() etc) a request packet is passed to the server (i.e. usbpiper in this case) by virtue of read() calls to the /dev/cuse file handle. The operation blocks until the server responds by writing a buffer to the same file handle, which contains a status header and possibly data. Responses to requests are not necessarily written in the same order as the requests. A unique ID number in the said status header ensures the pairing between requests and their responses.

read() calls from /dev/cuse block when there’s nothing to do, and are therefore subject to epoll in usbpiper. write() calls never block.

However this is not enough: For example, an epoll entry may indicate a new WRITE request on a CUSE file descriptor, which fills one of the FIFOs with data. As a result, there might be a new opportunity to queue new USB transfers. There are many software design approaches for how to make one action trigger others — the one taken in usbpiper is the simplest and messiest: Letting the performer of the action call the functions that may benefit from the opportunity directly. In the given example, this means that process_write() calls try_queue_bulkout() directly. The latter calls try_complete_write() in turn.

The function nomenclature in this utility is consistent in that several functions have a try_*() prefix to mark that they are opportunity oriented. It would have been equally functional, cleaner and more elegant (however less efficient) to call all try_*() functions on behalf of all endpoints and device files. Or alternatively, maintain some queue of try_*() function calls, however this wouldn’t take away the need for awareness of which actions may open what opportunity.

Delays and timeouts

There are a couple of situations where a timer is required. A timerfd is allocated for each device file, serving the following two scenarios:

  • Related to IN endpoints: When a READ request can’t be completed with the full number of bytes that are required, usbpiper waits up to 10 ms for data from the IN endpoint to fill the relevant FIFO. After this timeout, try_complete_read() completes the request as soon as there is any data in the FIFO. The rationale is to avoid a flood of READ request and responses if the data arrives frequently and in small chunks.
  • Related to OUT endpoints: When a RELEASE request arrives, and there is still data in the relevant FIFO, try_complete_release() waits up to 1000 ms for the FIFO to drain by the OUT endpoint. After this, try_complete_release() completes the request, hence closing the related device file (not /dev/cuse) after emptying the FIFO.

A single timer can be used for both tasks, because a RELEASE can’t occur before all outstanding requests have been completed on the related device file (Linux’ device file API ensures that). Besides, each device file can be related only to either an IN or OUT endpoint, so once again, the timer won’t be necessary for both uses at the same time.

A similar 10 ms timeout could have been implemented for OUT endpoints, i.e. generate an OUT transfer only if the FIFO contains enough data for a full transfer buffer. This wouldn’t require another timer, for the first reason given above. However this possibility was dropped in favor of another mechanism for preventing unnecessary I/O: try_queue_bulkout() submits a transfer with less than a full buffer only if there is no other outstanding transfer on the same endpoint. The reason for opting out the 10 ms timer for this purpose has to do with the original purpose of this usbpiper, as a driver for XillyUSB (which didn’t materialize).

FUSE / CUSE kernel driver dissection notes

What this post is about

Before anything: If you’re planning on using FUSE / CUSE for an application, be sure to read this first. It also explains why I bothered looking at the kernel code instead of using libfuse.

So these are some quite random notes I took while trying to figure out how to talk with /dev/cuse directly by reading the sources directly. I’m probably not going to touch CUSE with a five-foot stick again, so maybe this will help someone out there.

Everything said here relates to Linux v5.3. As FUSE a bit of hack-on-demand kind of filesystem, things change all the time.

CUSE vs. FUSE

CUSE is FUSE’s little brother, allowing to generate a single device file in /dev, having the driver implemented in user space. Compared with FUSE’s ability to mount an entire filesystem, CUSE much lighter, and is accordingly implemented as a piggy-back on the FUSE driver.

CUSE and FUSE are reached from user space through different device files: A server (i.e. driver) for FUSE opens /dev/fuse, and a server for CUSE opens /dev/cuse.

Note that the user application program that opens /dev/cuse or /dev/fuse is called the server. It’s actually a driver, but the latter term is saved for the FUSE kernel framework.

The driver for /dev/cuse is implemented entirely in fs/fuse/cuse.c, and it does quite little: All file operation methods for /dev/cuse are redirected to those used for /dev/fuse (by literally copying the list of methods), except for open and release.

The CUSE-specific method for open runs a slightly different initialization procedure against the server (more about this below) and eventually generates a character device file instead of making a filesystem mountable.

This character device file is assigned I/O methods that are handled in cuse.c, however their implementation relies heavily on functions that are defined in the mainline FUSE driver. Effectively, this device file is a FUSE file which is forced to use “direct I/O” methods to present a data pipe abstraction.

It might very well be that it’s possible to obtain the same result by setting up a small mounted filesystem with a file with certain settings, but I haven’t investigated this further. It seems however that the application program will have to open the file with the O_DIRECT flag for this to work. See Documentation/filesystems/fuse-io.txt in the kernel source tree.

The relevant source files

The FUSE filesystem handles I/O requests of two completely different types: Those related to the file system that is mounted in relation to it (or the device file generated on behalf of CUSE), and those related to the character device which the FUSE / CUSE server opens. This might cause a slight confusion, but the kernel code sticks to a naming convention that pretty much avoids it.

The interesting files in the kernel tree:

  • fs/fuse/file.c — Methods for handling I/O requests from the FUSE-mounted file system. The typical function name prefix is fuse_file_*.
  • fs/fuse/dev.c — Methods for handling I/O requests from /dev/fuse. The typical function name prefix is fuse_dev_*.
  • fs/fuse/cuse.c — CUSE-specific driver. Responsible for generating /dev/cuse, and make it behave quite like /dev/fuse. In fact, it routes a lot of function calls to the FUSE driver. The typical function name prefix is cuse_channel_* for methods handling I/O requests from /dev/cuse. Functions named just cuse_* are handlers for the CUSE-generated character device. Note that the /dev/cuse character device is referred to as the “channel” so it’s not confused with the other one.
  • include/uapi/linux/fuse.h — Header file with all structures and constants that are visible in user space
  • fs/fuse/fuse_i.h — Header file with everything that isn’t visible from user space.

FUSE protocol

It’s probably necessary to be acquainted with writing a Linux kernel character device (at least) in order to understand the nuts and bolts of FUSE. It’s actually helpful to have worked with a device driver for Microsoft Windows as well, since flow of I/O requests resembles the IRP concept in Windows’ driver model:

Each I/O request by the user space program goes into the kernel and is translated into a data structure which contains the information, and that data structure is handed over to the server (i.e. the driver in user space). The server queues the request for processing and acknowledges its reception, but not its completion. Rather, the server processes the request in its own free time, and when finished, it turns it back to the I/O system that requested it, along with the status and possibly data. If the user program blocks on the completion of the I/O system call (async I/O is also supported), it does so until the server turns back the request.

So there’s a flow of requests arriving from /dev/fuse (or /dev/cuse, as applicable), and a flow of responses written to the same file descriptor by the driver. The relation between the requests and responses is asynchronous (which is the main resemblance with IRPs), so the responses may arrive in no particular order.

The main difference from Windows’ IRP model is that Windows’ kernel makes calls to I/O operation handlers in the device driver (just like a Linux driver, but with the driver’s hands tied) with a pointer to the IRP. With FUSE, all requests go through a single pipe (good old UNIX design philosophy) and the driver chooses what to do with each. Also, in Windows, there’s a special treatment of requests that can be finished immediately — the driver can return with a status saying so. FUSE’s take on this matter is congratulations, finish the request and submit the response. Now or later makes no essential difference.

This way or another, the FUSE / CUSE server should not block or otherwise delay the reception of requests from /dev/fuse while handling a previous request (a Windows device driver is not allowed to block because it runs in arbitrary thread context, but that’s really irrelevant here). Even if it can’t or isn’t expected to handle another request before the current one is done, it must keep receiving requests while handling previous ones, at least for one reason: Accepting requests to handle a signal (interrupt) for an already queued request. More on that below.

The other side of the coin: A read() call from /dev/fuse or /dev/cuse may block, and will do so until there’s a request available to handle. On the other hand, a write() never blocks (which makes sense, since it merely informs the kernel driver a request has been finished). The poll() system call is implemented, so epoll() and select() can be used on /dev/fuse and /dev/cuse, rather than blocking a thread on waiting for a request (libfuse doesn’t take advantage of this).

I/O requests

The request from /dev/fuse or /dev/cuse is starts with a header of the following form (defined in the kernel’s include/uapi/linux/fuse.h and libfuse’s libfuse/include/fuse_kernel.h):

struct fuse_in_header {
	uint32_t	len;
	uint32_t	opcode;
	uint64_t	unique;
	uint64_t	nodeid;
	uint32_t	uid;
	uint32_t	gid;
	uint32_t	pid;
	uint32_t	padding;
};

The header is then followed by data that is related to the request, if necessary.

@len is the number of bytes in the request, including the fuse_in_header struct itself.

@opcode says what operation is requested, out those listed in enum fuse_opcode in the same header files (the opcodes are also listed and explained on this page).

@unique is the identifier of the request, to be used in the response. Note that if bit 0 is set (i.e. @unique is odd), the request is an interrupt notification to another request (with the @unique after clearing bit 0). This is not true on all kernel versions however.

The rest — nodeid, uid, gid and pid are quite obvious. But it’s noteworthy that the process ID is exposed to the driver in user space.

Reads from /dev/{cuse,fuse} are done in one single read() requests, which dequeues one request from one of the kernel driver’s requests queues: One for INTERRUPT requests, one for FORGET requests, and one for all the others. They are prioritized in this order (i.e. INTERRUPT go before any other etc.).

The read() call is atomic: It must request a number of bytes that is larger or equal to the request’s @len, or the request is discarded and -EIO is returned instead. For this reason, the number of bytes of any read() from /dev/cuse or /dev/fuse must be max_write + fuse_in_header, where @max_write is as submitted on the cuse_init_out structure in response to an INIT request (see below) (max_write is expected to be 4096 at least).

However oddly enough, in libfuse’s fuse_lowlevel.c it says

	se->bufsize = FUSE_MAX_MAX_PAGES * getpagesize() +
		FUSE_BUFFER_HEADER_SIZE;

(the session’s buffer size of arriving requests are se->bufsize) and then libfuse’s fuse_i.h goes

#define FUSE_MAX_MAX_PAGES 256

but how is that an upper limit of something?

I/O responses

Responses are written by the server into the same file descriptor of /dev/fuse or /dev/cuse, starting with a header as follows:

struct fuse_out_header {
	uint32_t	len;
	int32_t		error;
	uint64_t	unique;
};

The meaning of @len and @unique are the same in the request: @len includes the header itself, and @unique is a copy of the identifier of the request (with some extra care when handling interrupt requests).

@error is the status. Zero means success, negative numbers are in the well-known form of -EAGAIN, -EINVAL etc. It’s expected to be zero or negative (but not below -999). If it’s non-zero, the response must consist of a header only, or the write() call that submits the response returns -EINVAL.

A response write() is atomic as well: The number of bytes requested in the call must equal to @len, or the call returns -EINVAL.

How requests are made in the kernel code

For each request to the server, a struct fuse_req is allocated and initialized to contain the information on the request to send and what the answer is about to look like. This begin with calling fuse_get_req() or fuse_get_req_for_background(), which both call __fuse_get_req(struct fuse_conn *fc, unsigned npages, bool for_background).

To make a long story short, this function allocates the memory for the struct fuse_req itself as well a memory array of npages entries of struct page and struct fuse_page_desc. It also initializes several functional fields of the structure, among others the pages, page_descs, max_pages entries, as well as setting the reference count to 1, the FR_PENDING flag and initializing the two list headers and the wait queue. The pid, uid and gid fields in the information for the request are also set.

Then the fuse_req structure is set up specifically for the request. In particular, the @end entry points at the function to call by request_end() following the arrival of a response from the server or the abortion of the request.

The fuse_req has two entries, @in and @out, which are of type fuse_in and fuse_out, respectively. Note that “in” and “out” are from the server’s perspective, so “in” means kernel to server and vice versa.

struct fuse_arg {
	unsigned size;
	void *value;
};

struct fuse_in_arg {
	unsigned size;
	const void *value;
};

struct fuse_in {
	struct fuse_in_header h;
	unsigned argpages:1;
	unsigned numargs;
	struct fuse_in_arg args[3];
};

struct fuse_out {
	struct fuse_out_header h;
	unsigned argvar:1;
	unsigned argpages:1;
	unsigned page_zeroing:1;
	unsigned page_replace:1;
	unsigned numargs;
	struct fuse_arg args[2];
};

Despite the complicated outline, the usage is quite simple. It’s summarized in detail at the rest of this section, but in short: The request consists of a fuse_in_header followed by arguments, which is just a concatenation of strings (there are @in.numargs of them), which are set up when the request is prepared. @value and @size are set up in an array of struct fuse_in_arg.

The response is a concatenation of fuse_out_header and @out.numargs arguments, once again these are concatenated strings. The sizes and buffers are set up when the request is generated. The @argvar flag is possibly set to allow a shorter response at the expense of the last argument. Look at the function pointed by @end for how these arguments are interpreted.

And now the longer version of the two clauses above:

When a request is prepared for transmission to the server by fuse_dev_do_read(), it concatenates the @h entry in the struct fuse_in with @numargs “arguments”. Each “argument” is a string, which is represented as a fuse_in_arg entry in the @args array, by a pointer @value and the number of bytes given as @size. So it’s a plain string concatenation of @numargs + 1 strings, the first with a fixed size (of struct fuse_in_header) and some variable-length strings. What makes it seem complicated is the paging-aware data copying mechanism.

As for handling the arrival of responses from the server: Except for notifications and interrupt replies, fuse_dev_do_write() handles the write() request, which must include everything in the buffer submitted, as follows. The first bytes are copied into the fuse_req’s @out.h, or in other words, the fuse_out’s @h entry. So this consumes the number of bytes in a struct fuse_out_header.

The rest is chopped into arguments (by copy_out_args() ), following the same convention of @numargs concatenated strings, each having the length of @size and written into the buffer pointed by @value. @numargs as well as entries of the struct fuse_arg array are set when preparing the request — when the response arrives, the relevant buffers are filled. And don’t confuse struct fuse_arg with struct fuse_args, which is completely different.

copy_out_args() checks the header’s @error field before copying anything. If it’s non-zero, no copying takes place: The response is supposed to consist of a struct fuse_out_header only.

The last argument of a response from the server may be shorter (possibly zero length) than its respective @size entry if and only if the @argvar entry in the related struct fuse_out struct is set (which is possibly done when preparing the request). If this is the case, the server simply submits less bytes than the sum of the header + all argument’s @size, and the last argument is shortened accordingly. This may sound complicated, but it just means, for example, that a response to READ submits the data that it managed to collect.

Once again, all this sounds a bit scary, but take the relevant snippet from cuse_send_init() defined in the kernel’s fs/fuse/cuse.c:

	req->in.h.opcode = CUSE_INIT;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(struct cuse_init_in);
	req->in.args[0].value = arg;
	req->out.numargs = 2;
	req->out.args[0].size = sizeof(struct cuse_init_out);
	req->out.args[0].value = outarg;
	req->out.args[1].size = CUSE_INIT_INFO_MAX;
	req->out.argvar = 1;
	req->out.argpages = 1;
	req->pages[0] = page;
	req->page_descs[0].length = req->out.args[1].size;
	req->num_pages = 1;
	req->end = cuse_process_init_reply;
	fuse_request_send_background(fc, req);

It’s quite clear: The driver sends one argument (i.e. one string) after the header, and expects two back in the response. And the function that handles the response is cuse_process_init_reply(). So it’s fairly easy to tell what is sent and what is expected in return.

How CUSE implements read()

The CUSE driver (cuse.c) assigns cuse_read_iter() for the read_iter fops method. This function sets the file position to zero, and calls fuse_direct_io(), defined in file.c. Not to be confused with fuse_direct_IO(), defined in the same file.

The latter function retrieves the number of bytes to process as its local variable @count. It then loops on sending requests and retrieving the data as follows (outlined for non-async I/O): fuse_send_read() is called for sending a READ request to the server by calling fuse_read_fill() and fuse_request_send(). The latter is defined in dev.c, and calls __fuse_request_send(), which queues the request for transmission (with queue_request()) and then waits (i.e. blocks, sleeps) until the response with a matching unique ID has arrived (by calling request_wait_answer()). This happens by virtue of the server’s invocation of a write() on its /dev/cuse filehandle, with a matching unique ID.

Back to the loop on @count, fuse_send_read() returns with the number of bytes of the response’s first argument — that is, the length of the data that arrived. The loop hence continues with checking the error status of the response (in the @error field). If there was an error, or if there were less bytes than requested in the response, the loop terminates. Also if @count is zero after deducing the number of arrived bytes from it.

The return value of fuse_direct_io(), which is also the return of the cuse_read_iter(), is the number of bytes that were read (in total), if this number is non-zero, even if the loop quit because of an error. Only no bytes were received, the function returns the @error field in the response (which is zero if there was neither an error nor data).

The rationale behind the loop and the way it handles errors is that a single read() request by the application may be chopped into several READ requests if the read() can’t be fit into a single READ request (i.e. the read()’s @count is larger than max_read, as specified on the INIT response). It’s therefore necessary to iterate.

How CUSE implements write()

The CUSE driver (cuse.c) assigns cuse_write_iter() for the write_iter fops method. This function sets the file position to zero, and like cuse_read_iter(), it calls fuse_direct_io(), defined in file.c. Only with different arguments to tell the latter function that the data goes in the opposite direction.

fuse_direct_io() calls fuse_send_write() instead of fuse_send_read, which calls fuse_write_fill() instead of fuse_read_fill(). And then fuse_request_send() is called, which sends the request and waits for its response. fuse_send_write() returns with the number of bytes that were actually written, as it appears in the @size entry of the struct fuse_write_out in the response.

Note that the kernel driver sends a buffer along with the WRITE call, and the server chooses how much to consume from it, and then tells the kernel about that in the response. This requires a small discussion on partial handling of write().

The tricky thing with a write() is that the application program supplies a buffer to write, along with the number of bytes to write. It’s perfectly fine to point to a huge buffer and set the count to the entire buffer. Any character device driver may write the entire buffer, or just as much as it can at the moment, and return the number of bytes written. The fact that a huge number of bytes were requested makes no difference, because the character device driver treats the request as if it was for the number of bytes it could write. The rest of the buffer is ignored.

So there are two problems, both arising when the buffer of the write() from the application program is large: One is how to make sure that the server has allocated a buffer large enough to receive the data in one go (recall that both requests and responses must be done in a single I/O operation). The second and smaller problem is the wasted I/O of data in a WRITE request that is eventually ignored, because the server chose to consume less than available.

To prevent huge buffers from being transmitted to the server and then ignored, the server supplies a max_write parameter in its response to an INIT request, that sets the maximal number of bytes for transmitted on a WRITE server request (it should be 4096 or larger). So the write() operation is chopped up into smaller buffers by FUSE / CUSE as necessary to meet this restriction.

This parameter is a tradeoff between reducing the number of I/Os with the server and the possibility to waste data transfers. fuselib picks 128 kB.

There is no similar problem with read() calls, because the server submits the number of bytes actually read in the response after the response header that says how many bytes are submitted. Nevertheless, there is a separate max_read limit for CUSE sessions nevertheless (but not for FUSE, which copies it from max_write).

Handling interrupts (signals)

There is a lot of fuss about this topic, which is discussed on a separate post. To make a long story short, a server must be able to process INTERRUPT requests. To the server, such request is just like the others, in the sense that it comprises of a struct fuse_in_header followed by a single argument:

struct fuse_interrupt_in {
	uint64_t	unique;
};

The function that implements this in the kernel is fuse_read_interrupt() in dev.c.

Note that there are two @unique IDs in the request. One is in the header, which is ID of the interrupt request itself. The second is in the argument, which is the unique ID of the request that should be interrupted. The server should not assume any special connection between the two (there is such since kernel v4.20, due to commit c59fd85e4fd07).

When a server receives an INTERRUPT request, it shall immediately send a response (i.e. completion) of the request with the @unique given in the argument. An -EINTR status may be reported, in accordance the common POSIX rules.

Note that even though an INTERRUPT request is guaranteed to be conveyed to the server after the request it relates to, it may arrive after the server’s response has been submitted if a race condition occurs. As a result, the server may receive INTERRUPT requests with a @unique ID that it doesn’t recognize (because it has removed its records while responding). Therefore, the server should ignore such requests.

On the other hand, if multiple threads fetch requests from the same file descriptor (of /dev/cuse or /dev/fuse), one thread may decode the INTERRUPT request before the original request has been recorder. This possibility is present in the libfuse implementation, and is the reason behind the complication discussed in that other post.

POLL requests

Poll is different from many other requests in that it requires two (or even more) responses from the server:

  • An immediate response, with the bitmap informing which operations are possible right away
  • Possibly additional notifications, when one or more of the selected operations have become possible.

fuse_file_poll in file.c handles poll() is calls on a file. It queues a FUSE_POLL request, with one argument, consisting of a fuse_poll_in struct:

struct fuse_poll_in {
	uint64_t	fh;
	uint64_t	kh;
	uint32_t	flags;
	uint32_t	events;
};

The @events entry is set with

inarg.events = mangle_poll(poll_requested_events(wait));

which supplies a bitmap of the events that are waited for in POSIX style (mangle_poll() is defined in the kernel’s poll.h, which does the conversion).

@flags may have one flag set, FUSE_POLL_SCHEDULE_NOTIFY, saying that there’s a process actually waiting. If it’s set, the server is required to send a notification when the file becomes ready. If cleared, the server may send such notification, but it will be ignored.

@fh and @kh are the file’s file handle, in userspace and kernel space respectively (the latter is systemwide unique).

If there is a process waiting, the file is then registered in a dedicated data structure (an RB tree), and will be kept there until the file is released. The underlying idea is that if a file descriptor has been polled once, it’s likely happen a lot of times to follow.

Either way, the POLL request is submitted, and the server is expected to submit a response with a poll bitmap, which is deconverted into kernel format, and used as the poll() return value. Consequently, poll() blocks until the response arrives.

Should the server respond with an -ENOSYS status, no more POLL requests are sent to the server at the rest of the session, and DEFAULT_POLLMASK is returned on this and all subsequent poll() calls. Defined in poll.h:

#define DEFAULT_POLLMASK (EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM)

So there’s the poll response:

struct fuse_poll_out {
	uint32_t	revents;
	uint32_t	padding;
};

Rather trivial — just the events that are active.

More interesting, is the notifications. The server may send a notification anytime by setting @unique to zero and the @error field to the code of the notification request (FUSE_NOTIFY_POLL == 1). The @opcode field is ignored in this case (there is no opcode for notifications).

There’s one argument in a poll notification:

struct fuse_notify_poll_wakeup_out {
	uint64_t	kh;
};

where @kh echoes back the value in the poll request.

In dev.c, fuse_notify() calls fuse_notify_poll(), which in turn calls fuse_notify_poll_wakeup() (in file.c) after a few sanity checks.

fuse_notify_poll_wakeup() looks up the value of @kh entry in the dedicated data structure. If it’s not found, the notification is silently ignored. This is considered OK, since the server is allowed to send notifications even if FUSE_POLL_SCHEDULE_NOTIFY wasn’t set.

If the entry is found, wake_up_interruptible_sync() is called on the file’s wait queue that is used only in relation to poll (which is known from the entry in the data structure). That’s it.

poll() is supported by FUSE since kernel v2.6.29 (Git commit 95668a69a4bb8, Nov 2008)

CUSE INIT requests

The bringup of the device file is initiated by the kernel driver, which sends an CUSE_INIT request. The server sets up the connection and device file’s attributes by responding to this request.

In cuse.c, cuse_channel_open(), implements /dev/cuse’s method for open(). Aside from allocating and initializing a struct cuse_conn for containing the private data of this connection, it calls cuse_send_init() for queuing an CUSE_INIT (opcode 4096) request to the new file handle. Note that this is different from the FUSE_INIT (opcode 26) that arrives from /dev/fuse.

The request consists of a struct fuse_in_header concatenated with a struct cuse_init_in:

struct cuse_init_in {
	uint32_t	major;
	uint32_t	minor;
	uint32_t	unused;
	uint32_t	flags;
};

The major and minor fields are the FUSE_KERNEL_VERSION and FUSE_KERNEL_MINOR_VERSION, telling the server which FUSE version the kernel offers. flags is set to 0x01, which is CUSE_UNRESTRICTED_IOCTL.

The pid, uid and gid in the header are those of the process that opened /dev/cuse — not really interesting. @unique is typically 1 (but don’t rely on it — it can be anything in future versions). On fairly recent kernels, it continues with 2 and increments by 2 for each request to follow. On older kernels, it just counts upwards with steps of 1. The unique ID mechanism was changed in kernel commit c59fd85e4fd07 (September 2018, v4.20) for the purpose of allowing a hash of unique IDs in the future.

The response is a string concatenation of the following three elements (header + two arguments):

  • A struct fuse_out_header, with the header for the response (with @unique typically set to 1)
  • A struct cuse_init_out with some information (more on that below)
  • A null-terminated string that reads e.g. “DEVNAME=mydevice” (without the quotes, of course) for generating the device file /dev/mydevice. Don’t forget to actually write the null byte in the end, or the device generation fails with a “CUSE: info not properly terminated” in the kernel log.

struct cuse_init_out is defined as

struct cuse_init_out {
	uint32_t	major;
	uint32_t	minor;
	uint32_t	unused;
	uint32_t	flags;
	uint32_t	max_read;
	uint32_t	max_write;
	uint32_t	dev_major;
	uint32_t	dev_minor;
	uint32_t	spare[10];
};

The fields of cuse_init_out are as follows:

  • @major and @minor have the same meaning as these fields in struct cuse_init_in, but they reflect the version that the server is designed for, and hence rules the session. As of kernel v5.3 (which implement FUSE version 7.26), @major must be 7 and @minor at least 11, or the initialization fails. FUSE 7.11 was introduced in kernel v2.6.29 in 2008. See include kernel sources’ uapi/linux/fuse.h for revision history.
  • @max_read and @max_write are the maximal number of bytes in the payload of a READ and WRITE request, respectively. Note that @max_write forces read() requests from /dev/cuse to supply a @count parameter of at least @max_write + the size of struct fuse_out_header + the size of struct fuse_write_out, or WRITE requests may fail. Same goes for @max_read and struct fuse_in_header and struct fuse_read_in. What counts is the length of the requests and their possible responses, which includes the lengths of the non-data parts.
  • @flags: If bit 0 (CUSE_UNRESTRICTED_IOCTL) is set, unrestricted ioctls is enabled.
  • @dev_major and @dev_minor are the created device file’s major and minor numbers. This means that the server needs to make sure that the aren’t already allocated.

FORGET requests

These requests inform a FUSE server that there’s no need to retain information on a specific inode. This request will never appear on /dev/cuse.

FUSE / CUSE signal handling: The very gory details

First: If you’re planning on using FUSE / CUSE for an application, be sure to read this first. It also explains why I didn’t just take what libfuse offered.

Overview

This is a detour from another post of mine, which dissects the FUSE / CUSE kernel driver. I wrote this separate post on signal handling because of some confusion on the matter, which ended up with little to phone home about.

To understand why signals is a tricky issue, suppose that an application program is blocking on a read() from a /dev file that is generated by CUSE. The server (i.e. the driver of this device file in userspace) has collected some of the data, and is waiting for more, which is why it doesn’t complete the request. And then a “harmless” signal (say, SIGCHLD) is sent to the application program.

Even though that program is definitely not supposed to terminate on that signal, the read() should return ASAP. And because it has already collected some data (and possibly consumed it from its source), it should return with the number of bytes already read, and not with an -EINTR (which is the response if it has no data when returning on an interrupt).

So the FUSE / CUSE must notify the server that an interrupt has arrived, so that the relevant request is finished quickly, this way or another. To make things even trickier, it might very well be, that while notification on the interrupt is being prepared and sent to the server, the server has already finished the request, and is in the middle of returning the response.

Luckily, the FUSE / CUSE kernel interface offers a simple solution to this: An INTERRUPT request is sent to the server in response to an interrupt to the application program, with a unique ID number that matches a previously sent request. The server responds with normally returning a response for the said request, possibly with -EINTR status, exactly like a kernel character driver’s response to a signal.

The only significant race condition is when the server has already finished handling the request, for which the INTERRUPT request arrives, and has therefore forgotten the unique ID that comes with it. In this case, the server can simply ignore the INTERRUPT request — it has done the right thing anyhow.

So why this long post? Because I wanted to be sure, and because the little documentation there is on this topic, as well as the implementation in libfuse are somewhat misleading. Anyhow, the bottom line has already been said, if you’d like to TL;DR this post.

The official version

There is very little documentation on FUSE in general, however there is a section in the kernel source tree’s Documentation/filesystems/fuse.txt:

If a process issuing a FUSE filesystem request is interrupted, the following will happen:

  1. If the request is not yet sent to userspace AND the signal is fatal (SIGKILL or unhandled fatal signal), then the request is dequeued and returns immediately.
  2. If the request is not yet sent to userspace AND the signal is not fatal, then an “interrupted” flag is set for the request. When the request has been successfully transferred to userspace and this flag is set, an INTERRUPT request is queued.
  3. If the request is already sent to userspace, then an INTERRUPT request is queued.

INTERRUPT requests take precedence over other requests, so the userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely, or may honor them by sending a reply to the original request, with the error set to EINTR.

It is also possible that there’s a race between processing the original request and its INTERRUPT request. There are two possibilities:

  1. The INTERRUPT request is processed before the original request is processed
  2. The INTERRUPT request is processed after the original request has been answered

If the filesystem cannot find the original request, it should wait for some timeout and/or a number of new requests to arrive, after which it should reply to the INTERRUPT request with an EAGAIN error. In case 1 the INTERRUPT request will be requeued. In case 2 the INTERRUPT reply will be ignored.

The description above is correct (see detailed dissection of kernel code below) however beginning from the “race condition” part it gets somewhat confusing.

Race condition?

In the rest of this post, there’s a detailed walkthrough of the involved functions in the v5.3.0 kernel, and there’s apparently no chance for the race condition mentioned fuse.txt. It’s not even an old bug that was relevant when interrupt handling was introduced with Git commit a4d27e75ffb7b (where the text cited above in fuse.txt was added as well): Even looking at the original commit, there’s a clear locking mechanism that prevents any race condition in the kernel code. This was later replaced with memory barriers, which should work just the same.

All in all: An INTERRUPT request is queued, if at all, only after the related request has been submitted as the answer to a read() by the server.

So what is this all about, then? A multi-threaded server, which spreads requests randomly among work threads, might indeed handle requests in a random order. It seems like this is what the “race condition” comment refers to.

The solution to the non-existing problem

Had there been a possibility that INTERRUPT request may arrive before the request it relates to, the straightforward solution would be to maintain an orphan list of Unique IDs of INTERRUPT requests that didn’t have a request processed when the INTERRUPT request arrived. This list would then be filled with INTERRUPT requests that arrived too early (before the related request) or too late (after the request was processed).

Then, for each non-INTERRUPT request that arrives, see if it’s in the list, and if so, remove the Unique ID from the list, and treat the request as interrupted.

But the requests that were added into the list because of the “too late” scenario will never get off the list this way. So some garbage collection mechanism is necessary.

The FUSE driver facilitates this by allowing a response with an -EAGAIN status to INTERRUPT requests. Even though no response is needed to INTERRUPT requests, an -EAGAIN response will cause the repeated queuing of the INTERRUPT request by the kernel if the related request is still pending, and otherwise do nothing.

So occasionally, the server may go through its list of orphans, and send an -EAGAIN response to each entry, and delete this entry as the response is sent. If the deleted entry is still relevant, it will be re-sent by the kernel, so it’s re-listed (or possibly handled directly if the related request has arrived in the meantime). Entries from the “too late” scenario won’t be re-listed, because the kernel will do nothing in reaction to the -EAGAIN response.

This is the solution suggested in fuse.txt on the race conditions issue. The reason this solution is suggested in the kernel’s documentation, even though it relates to a problem in a specific software implementation, is probably to explain the motivation to the -EAGAIN feature. But boy, was it confusing.

How libfuse handles INTERRUPT requests

Spoiler: The solution to the non-existent problem is implemented in libfuse 3.9.0 (and way back) as described above. The related comment was written based upon a problem that arose with libfuse. Which is multithreaded, of course.

The said garbage collection mechanism is run on the first entry in the list of orphaned INTERRUPT requests each time a non-INTERRUPT request arrives and has no match against any of the list’s members. This ensures that the list is emptied quite quickly, and without risk of an endless loop circulation of INTERRUPT requests, because the arrival of a non-INTERRUPT request means that the queue for INTERRUPT requests in the kernel was empty at that moment. A quirky solution to a quirky problem.

Note that even when libfuse is run with debug output, it’s difficult to say anything about ordering, as the debug output shows processing, not arrival. And the log messages come from different threads.

The problem of unordered processing of INTERRUPT requests could have been solved much more elegantly of course, but libfuse is a patch on patch, so they made another one.

And for the interested, this is the which-function-calls-what in libfuse.

So in libfuse’s fuse_lowlevel.c, the method for handling interrupts, do_interrupt(), first attempts to find the related request, and if it fails, it adds an entry to a session-specific list, se->interrupts. Then there’s check_interrupt(), which is called by fuse_session_process_buf_int() for each arriving request that isn’t an INTERRUPT itself. This function looks up the list for the request, and if it’s found, it sets that request’s “interrupted” flag, and removes it from the list. Otherwise, if the list is non-empty, it removes the first entry of se->interrupts and returns it to the caller, which initiates an EAGAIN for that.

Read the source

Since this is an important topic, let’s look on how this is implemented. So from this point until the end of this post, these are dissection notes of the v5.3.0 kernel source. There are commits applied all the time in this region, but in essence it seems to be the same for a long time.

Generally speaking, all I/O operations that are initiated by the application program (read(), write(), etc.) end up with the setup of a fuse_req structure containing the request information in file.c, and its submission to the server front-end with a call to fuse_request_send(), which is defined in dev.c. If the I/O is asynchronous, fuse_async_req_send() is called instead, but that’s irrelevant for the flow discussed now. fuse_request_send() calls __fuse_request_send(), which in turn calls queue_request() which puts the request in the queue, and more importantly, request_wait_answer(), which puts the process to sleep until the request is completed (or something else happens…).

And now details…

So what does request_wait_answer() do? First, let’s get acquainted with some of the flags that are related to each request (i.e. in struct fuse_req’s flags entry), see also fuse_i.h:

  • FR_FINISHED: request_end() has been called for this request, which happens when the response for this request has arrived (but not processed yet — when that is done the request is freed). Or when it has been aborted for whatever reason (and once again, the error has not been processed yet).
  • FR_PENDING: The request is on the list of requests for transmission to the server. The flag is set when the fuse_req structure of a new request is initialized, and cleared when fuse_dev_do_read() has completed a server’s read() request. Or alternatively, failed for some reason, in which case request_end() has been called to complete the request with an error. So when it’s set, the request has not been sent to the server, but when cleared, it doesn’t necessarily mean it has.
  • FR_SENT: The request has been sent to the server. This is set by fuse_dev_do_read() when nothing can fail anymore. It differs from !FR_PENDING in that FR_PENDING is cleared when there’s an error as well.
  • FR_INTERRUPTED: This flag is set if an interrupt arrived while waiting for a response from the server.
  • FR_FORCE: Force sending of the request even if interrupted
  • FR_BACKGROUND: This is a background request. Irrelevant for the blocking scenario discussed here.
  • FR_LOCKED: Never mind this: It only matters when tearing down the FUSE connection and aborting all requests, and it determines the order in which this is done. It means that data is being copied to or from the request.

request_wait_answer()

With this at hand, let’s follow request_wait_answer() step by step:

  • Wait (with wait_event_interruptible(), sleeping) for FR_FINISHED to be set. Simply put, wait until a response or any interrupt has arrived.
  • If FR_FINISHED is set, the function returns (it’s a void function, and has no return value).
  • If any interrupt occurred while waiting, set FR_INTERRUPTED and check FR_SENT. If the latter was set, call queue_interrupt() to queue the request on the list of pending interrupt requests (unless it is already queued, as fixed in commit 8f7bb368dbdda. The same struct fuse_req is likely to be listed in two lists; one for the pending request and the second for the interrupt).

Note that these three bullets above are skipped if the FUSE connection had the “no_interrupt” flag on invocation to request_wait_answer(). This flag is set if the server answered to any interrupt request in the current session’s past with an -ENOSYS.

  • Wait again for FR_FINISHED to be set, now with wait_event_killable(). This puts the process in the TASK_KILLABLE state, so it returns only when the condition is met or on a fatal signal. If wait_event_interruptible() was awaken by a fatal signal to begin with, there will be no waiting at all on this stage (because the signal is still pending).
  • If FR_FINISHED is set, the function returns. This means that a response has been received for the request itself. As explained below, this is unrelated to the interrupt request’s fate.
  • Otherwise, there’s a fatal signal pending. If FR_PENDING is set (the request has not been sent to server yet), the request is removed from the queue for transmission to the server (with due locking). It’s status is set to -EINTR, and the function returns.

Note that these three bullets are skipped if the FR_FORCE flag is set for this request. And then, there’s the final step if none of the above got the function to return:

  • Once again, wait for FR_FINISHED to be set, but this time with the dreaded, non-interruptible wait_event(). In simple words, if the server doesn’t return a response for the request, the application that is blocking on the I/O call is sleeping and non-killable. This is not so bad, because if the server is killed (and hence closes /dev/fuse or /dev/cuse), all its requests are marked with FR_FINISHED.

To see the whole picture, a close look is needed on fuse_dev_do_read() and fuse_dev_do_write(), which are the functions that handle the request and response communication (respectively) with the driver.

fuse_dev_do_write()

Starting with fuse_dev_do_write(), which handles responses: After a few sanity checks (e.g. that the data lengths are in order), it looks up the request based upon the @unique field (for responses to interrupt requests, the original request is looked for). If the request isn’t found, the function returns with -ENOENT.

If the response has an odd @unique field, it’s an interrupt request response. If the @error field is -ENOSYS, the “no_interrupt” flag is set for the current connection (see above). If it’s -EAGAIN, another interrupt request is queued immediately. Otherwise the interrupt request response is ignored and the function returns. In other words, except for the two error codes just mentioned, it’s pointless to send them. The desired response to an interrupt request is to complete the original request, not responding to the interrupt request.

So now to handling regular responses: The first step is to clear FR_SENT, which sort-of breaks the common sense meaning of this flag, but it’s probably a small hack to reduce the chance of an unnecessary interrupt request, as the original request is just about to finish.

The response’s content is then copied into kernel memory, and request_end() is called, which sets FR_FINISHED, then removes the request from the queue of pending interrupts (if it’s queued there), and after that it returns with due error code (typically success).

So not much interesting here.

fuse_dev_do_read() step by step

The function returns with -EAGAIN if /dev/fuse or /dev/cuse was opened in non-blocking mode, and there’s no data to supply. Otherwise, it waits with wait_event_interruptible_exclusive_locked() until there’s a request to send to the server in any of the three queues (INTERRUPT, FORGET or regular requests queues). If the server process got an interrupt, the wait function returns with -ERESTARTSYS, and so does this function (this is bug? It should be -EINTR).

First, the queue of pending interrupts is checked. If there’s any entry there, fuse_read_interrupt() is called, which generates a FUSE_INTERRUPT request with the @unique field set to the original request’s @unique, ORed with FUSE_INT_REQ_BIT (which equals 1). The request is copied into the user-space buffer, and fuse_dev_do_read() returns with the size of this request.

Second, FORGET requests are submitted, if such are queued.

If none of the INTERRUPT and FORGET were sent, the first entry in the request queue is dequeued, and its FR_PENDING flag is cleared. The I/O data handling then takes place.

Just before returning, the FR_SENT flag is set, and then FR_INTERRUPTED is checked. If the latter is set, queue_interrupt() is called to queue the request on the list of pending interrupt requests (unless it is already queued. Once again, the same struct fuse_req is likely to be listed in two lists; one for the pending request and the second for the interrupt). Together with request_wait_answer(), this ensures that an interrupt is queued as soon as FR_SENT is set: If the waiting function returned before FR_SENT is set, FR_INTERRUPTED is set by request_wait_answer() before checking FR_SENT, so fuse_dev_do_read will call queue_interrupt() after setting FR_SENT. If the waiting function returned after FR_SENT is set, request_wait_answer() will call queue_interrupt(). And in case of a race condition, both will call this function; note that each of the two racers sets one flag and checks opposite in reverse order with respect to each other. And calling queue_interrupt() twice results in queuing the interrupt request only once.

Linux CUSE (and FUSE): Why I ditched two months of work with it

Introduction

If you’re planning to use CUSE (or FUSE) for an application you care about, this post is for you. That includes future self. I’m summarizing my not-so-pleasant journey with this framework here, with focus on how I gradually realized that I should start from the scratch with an old-school kernel module instead.

Most important, if you run CUSE on a v5.0 to v5.3 Linux kernel, you’re in for an imminent OOPS that requires an immediate reboot of the computer. This was the final straw for me (more like a huge log). Even if the user-space driver detected the kernel version and refused to run on kernels that would crash, that would mean it wouldn’t run on the most common distributions at the time of release. And I asked if I want to depend on a subsystem that is maintained this way.

Maybe I should have listened to what Linus had to say about FUSE back in 2011:

People who think that userspace filesystems are realistic for anything but toys are just misguided.

Unfortunately, it seems like the overall attiude towards FUSE is more or less in that spirit, hence nobody gets alarmed when the relevant code gets messier than is usually allowed: FUSE is nice for that nifty GUI that allows me to copy some files from my smartphone to the computer over a USB cable. It fails when there are many files, but what did I expect. Maybe it’s a problem with FUSE, maybe with the MTP/PTP protocol, but the real problem is that it’s treated as a toy.

As for myself, I was tempted to offer a user-space device driver for a USB product I’ve designed. A simple installation, possibly from binaries, running on virtually any computer. CUSE is around for many years, and opens a file in /dev with my name of choice. It makes the device file behave as if it was backed by a driver in the kernel (more or less). What could possibly go wrong?

And a final note before the storytelling: This post was written in the beginning of 2020. Sometimes things change after a while. Not that they usually do, but who knows?

Phase I: Why I opted out libfuse

The natural and immediate choice for working with FUSE is to use its ubiquitous library, libfuse. OK, how does the API go? How does it work?

libfuse’s git commits date back to 2001, and the project is alive by all means, with several commits and version updates every month. As for documentation, the doc/ subdirectory doesn’t help much, and its mainpage.dox says it straight out:

The authoritative source of information about libfuse internals (including the protocol used for communication with the FUSE kernel module) is the source code.

Simply put, nothing is really documented, read the source and figure it out yourself. There’s also an example/ directory with example code, showing how to get it done. Including a couple of examples for CUSE. But no API at all. Nothing on the fine details that make the difference between “look it works, oops, now it doesn’t” and something you can rely upon.

As for the self-documenting code, it isn’t a very pleasant experience, as it’s clearly written in “hack now, clean up later (that is, never)” style.

There are however scattered pieces of documentation, for example:

So with the notion that messy code is likely to bite back, I decided to skip libfuse and talk with /dev/cuse directly. I mean, kernel code can’t be that messy, can it?

It took me quite some time to reverse-engineer the CUSE protocol, and I’ve written a couple of posts on this matter: This and this.

Phase II: Accessing /dev/cuse causing a major OOPS

After nearly finishing my CUSE-based (plus libusb and epoll) driver on a Linux v4.15 machine , I gave it a test run on a different computer, running kernel v5.3. And that went boooom.

Namely, just when trying to close /dev/cuse, an OOPS message as follows appeared, leaving Linux limping, requiring an immediate reboot:

kernel: BUG: spinlock bad magic on CPU#0, cat/951
kernel: general protection fault: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 0 PID: 951 Comm: cat Tainted: G           O      5.3.0-USBTEST1 #1
kernel: RIP: 0010:spin_bug+0x6a/0x96
kernel: Code: 04 00 00 48 8d 88 88 06 00 00 48 c7 c7 90 ef d5 81 e8 8c af 00 00 41 83 c8 ff 48 85 db 44 8b 4d 08 48 c7 c1 85 ab d9 81 74 0e <44> 8b 83 c8 04 00 00 48 8d 8b 88 06 00 00 8b 55 04 48 89 ee 48 c7
kernel: RSP: 0018:ffffc900008abe18 EFLAGS: 00010202
kernel: RAX: 0000000000000029 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffff81d9ab85
kernel: RDX: 0000000000000000 RSI: ffff88816da16478 RDI: 00000000ffffffff
kernel: RBP: ffff88815a109248 R08: 00000000ffffffff R09: 000000006b6b6b6b
kernel: R10: ffff888159b58c50 R11: ffffffff81c5cd00 R12: ffff88816ae00010
kernel: R13: ffff88816a165e78 R14: 0000000000000012 R15: 0000000000008000
kernel: FS:  00007ff8be539700(0000) GS:ffff88816da00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007ffe4fa2fee8 CR3: 000000016b5d0002 CR4: 00000000003606f0
kernel: Call Trace:
kernel: do_raw_spin_lock+0x19/0x84
kernel: fuse_prepare_release+0x3b/0xe7 [fuse]
kernel: fuse_sync_release+0x37/0x49 [fuse]
kernel: cuse_release+0x16/0x22 [cuse]
kernel: __fput+0xf0/0x1c2
kernel: task_work_run+0x73/0x86
kernel: exit_to_usermode_loop+0x4e/0x92
kernel: do_syscall_64+0xc9/0xf4
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9

OK, so why did this happen?

To make a long story short, because a change was made in the FUSE kernel code without testing it on CUSE. I mean, no test at all.

The gory details: A spinlock was added to struct fuse_inode, but someone forgot that the CUSE doesn’t have such struct related to it, because it’s on /dev and not on a FUSE mounted filesystem. Small mistake, no test, big oops.

Even more gory details: Linux kernel commit f15ecfef058d94d03bdb35dcdfda041b3de9d543 adds a spinlock check in fuse_prepare_release() (among others), saying

	if (likely(fi)) {
		spin_lock(&fi->lock);
		list_del(&ff->write_entry);
		spin_unlock(&fi->lock);
	}

For this to even be possible, an earlier commt (ebf84d0c7220c7c9b904c405e61175d2a50cfb39) adds a struct fuse_inode *fi argument to fuse_prepare_release(), and also makes sure that it’s populated correctly. In particular, in cuse.c, it goes:

struct fuse_inode *fi = get_fuse_inode(inode);

(what would I do without git blame?).

But wait. What inode? Apparently, the idea was to get the inode’s struct fuse_inode, which is allocated and initialized by fuse_alloc_inode() in fs/fuse/inode.c. However this function is called only as a superblock operation — in other words, when the kernel wants to create a new inode on a mounted FUSE filesystem. A CUSE device file doesn’t have such entry allocated at all!

get_fuse_inode() is just a container_of(). In other words, it assumes that @inode is an entry inside a struct fuse_inode, and returns the address of the struct. But it’s not. In the CUSE case, the inode belongs to a devfs or something. get_fuse_inode() returns just some random address, and no wonder do_raw_spin_lock() whines that it’s called on something that isn’t a spinlock at all.

The relevant patches were submitted by Kirill Tkhai and committed by Miklos Szeredi. None of whom made the simplest go-no-go test on CUSE after this change, of course, or they would have spotted the problem right away. What damage could a simple change in cuse.c make, after all?

The patch that fixed it

This issue is fixed in kernel commit 56d250ef9650edce600c96e2f918b9b9bafda85e (effective in kernel v5.4) by Miklos Szeredi, saying “It’s a small wonder it didn’t blow up until now”. Irony at its best. He should have written “the FUSE didn’t blow”.

So this bug lived from v5.0 to v5.3 (inclusive), something like 8 months. 8 months without a single minimal regression test by the maintainer or anyone else.

The patch removes the get_fuse_inode() call in cuse.c, and calls fuse_prepare_release() with a NULL instead. Meaning there is no inode, like it should.

Firefox: Stop that nagging on available updates

After upgrading to Firefox 72.0.1 on Linux, I got a tab encouraging me to upgrade Firefox (and that it couldn’t be done automatically). To make things worse, the option to turn off automatic upgrades checks has been removed from the GUI interface. Why? Call it what you want, at the bottom there’s a “let’s force people to do it” attitude behind it. Seen it so many times before, and also seen disasters resulting from it.

Upgrading is important for security, yes, but thanks, I don’t need to be reminded every time a microversion upgrade is available. Some people enjoy spending quality time with their computer, updating all software and then fixing things that broke because of that. I’m not one of them.

The solution for now (taken from here) is to add the following file as /usr/lib/firefox/distribution/policies.json :

{
 "policies": {
 "AppUpdateURL": "http://127.0.0.1/",
 "DisableAppUpdate": true,
 "DisableFirefoxStudies": true,
 "DisableSystemAddonUpdate": true,
 "DisableTelemetry": true,
 "ExtensionUpdate": false
 }
}

The location may vary of course, and the “distribution” directory may need to be created. In my system, Firefox is run from /usr/bin/firefox which is a symlink to /usr/lib/firefox/firefox (so /usr/lib/firefox is the installation directory for this purpose).

Octave: Empty plots (after “figure”)

Running Octave 4.2.2 on Linux Mint 19, I got plots with nothing in them occasionally. Solution: Change the graphics toolkit to GNU Plot.

Simply put, add ~/.octaverc reading

graphics_toolkit("gnuplot")

and rerun Octave.

By the way, for zooming in, right-click the mouse on the first point, and left-click on the second.

Recovering from a BULK IN overflow on USB 3.0

Introduction

At times, an attempt to get data from a BULK IN endpoint may result in an overflow error. In other words,

rc = libusb_bulk_transfer(dev_handle, (1 | LIBUSB_ENDPOINT_IN),
                          buf, bufsize, &count, (unsigned int) 1000);

may fail with rc having the value of LIBUSB_ERROR_OVERFLOW. Subsequent attempts to access the same endpoint, even after re-initializing the libusb interface result in rc = LIBUSB_ERROR_IO, which is just “the I/O failed”. Terminating and rerunning the program doesn’t help. Apparently, the only thing that gets the endpoint out of this condition is physically plugging it out and back into the computer (or hub).

Why this happens

The term “transfer” in the “libusb_bulk_transfer” function name refers to a USB transfer cycle. In essence, calling this function ends up with a request to the xHCI hardware controller to receive a given number of bytes (“bufsize” above) from the endpoint, and to write it into a buffer (“buf” above). The host controller start requesting DATA packets from the device, and fills the buffer with the data in those packets. Note that the communication isn’t a continuum of data packets. Rather, it’s a session which fulfills a transfer request. Someone who wrote the USB spec probably thought of the data transmission in terms of a function call of the libusb_bulk_transfer sort: A function is called to request some data, communication takes place, the function returns. This principle holds for all USB versions.

Here’s the crux, and it’s related to the hardware protocol, namely with data packets: The host controller can give the device go-ahead to transmit packets, but it can’t control the number of bytes in each packet. This is completely up to the device. The rule is that if the number of bytes in a packet is less than the maximum allowed for the specific USB version (1024 bytes for USB 3.0), it’s considered a “short packet”. When a short packet arrives at the host controller, it should conclude that the transfer is done (including the data in this last packet).

So the device may very well supply less data than requested for the transfer. This is perfectly normal, and this is why it’s mandatory to check how many bytes have actually arrived when the function returns.

But what if the transfer request was for 10 bytes, and the device sent 20?

An overflow on USB 3.0

On USB 3.0, the xHCI controller requests data from the device indirectly by issuing an ACK Transaction Packet to the device. It may or may not acknowledge packets it has already acknowledged, but this isn’t the point at the moment. Right now, it’s more interesting that all ACK packets also carry the number of DATA packets that the host is ready to receive at the moment that the ACK packet was formed, in a field called NumP. This is how the host controls the flow of data.

When there’s no data flowing, it can be because the device has no data to send, but also if the last ACK for the relevant endpoint was a packet with NumP=0. To resume the flow, the host then issues an ACK packet, with a non-zero NumP, giving the go-ahead to transmit more data.

So a packet capture can look like this for a transfer request of 1500 bytes (the first column is time in microseconds):

       0.000 ACK  seq=0 nump=2
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.832 ACK  seq=1 nump=1
       2.104 ACK  seq=2 nump=0

Note that the device sent 1024 bytes twice, exceeding the 1500 bytes requested. This causes an overflow error. So far so good. But what about all those LIBUSB_ERROR_IO afterwards?

So this is the time to pay attention to the “seq” numbers. All DATA packets carry this sequence number, which runs cyclically from 0 to 31. The ACK’s sequence number is the one that the host expects next. Always. For example, if the host requests a retransmission, it repeats the sequence number of the DATA packet it failed to receive properly, basically saying, “I expect this packet again” (and then ignores DATA packet that follow until the retransmission).

Now, this is what appears on the bus as a result of a libusb_bulk_transfer() call after the one that returned with an overflow condition:

21286605.320 ACK  seq=0 nump=2

This is not a mistake: The sequence number is zero. Note that the ACK that closed the previous sequence with a nump=0 had seq=2. Hence the ACK that follows it to re-initiate the data flow should have seq=2 as well. But it’s zero. In section 8.11.1, the USB 3.0 spec says that an ACK is considered invalid if it hasn’t an expected sequence number, and that it should be ignored in this case. So the device ignores this ACK and sends no DATA packet in response. The host times out on tDPResponse (400 ns per USB 3.0 spec) and reports a LIBUSB_ERROR_IO. So the forensic explanation is established. Now to the motive.

Handling a babble error

The resetting of the sequence number has been observed with more than one xHCI controller (an Intel chipset as well as Renesas’ uPD720202), so it’s not a bug.

This brings us to the xHCI spec, revision X, section 4.10.2.4: “When a device transmits more data on the USB than the host controller is expecting for a transaction, it is defined to be babbling. In general, this is called a Babble Error. When a device sends more data than the TD transfer size, … the host controller shall set the Babble Detected Error in the Completion Code field of the TRB, generate an Error Event, and halt the endpoint (refer to section 4.10.2.1).”

So the first thing to note is that the endpoint wasn’t halted after the overflow. In fact, there was no significant traffic at all. Quite interestingly, Linux’ host controller didn’t fulfill its duty in this respect.

But still, why was the sequence number reset? Section 8.12.1.2 in the USB 3.0 sheds some light: “The host expects the first DP to have a sequence number set to zero when it starts the first transfer from an endpoint after the endpoint has been initialized (via a Set Configuration, Set Interface, or a ClearFeature (ENDPOINT_HALT) command”.

So had the endpoint been halted, as it’s required per xHCI spec, it would just have returned STALL packets until it was taken out of this condition. At which point the sequence number should be reset to zero per USB spec.

So apparently, whoever designed the xHCI hardware controller assumed that no meaningful communication would take place after the overflow (babble) error, and that the endpoint must be halted anyhow, so reset the sequence number and have it done with. It’s easier doing it this way than detecting the SETUP sequence that clears the ENDPOINT HALT feature.

Given that the xHCI driver doesn’t halt the endpoint, it’s up to the application software to do it.

Halt and unhalt

This is a sample libusb-based C code for halting BULK IN endpoint 1.

      if (rc == LIBUSB_ERROR_OVERFLOW) {
	rc = libusb_control_transfer(dev_handle,
				     0x02, // bmRequestType, endpoint
				     0x03, // bRequest = SET_FEATURE
				     0x00, // wValue = ENDPOINT_HALT
				     (1 | LIBUSB_ENDPOINT_IN), // wIndex = ep
				     NULL, // Data (no data)
				     0, // wLength = 0
				     100); // Timeout, ms

	if (rc) {
	  print_usberr(rc, "Failed to halt endpoint");
	  break;
	}

	rc = libusb_control_transfer(dev_handle,
				     0x02, // bmRequestType, endpoint
				     0x01, // bRequest = CLEAR_FEATURE
				     0x00, // wValue = ENDPOINT_HALT
				     (1 | LIBUSB_ENDPOINT_IN), // wIndex = ep
				     NULL, // Data (no data)
				     0, // wLength = 0
				     100); // Timeout, ms

	if (rc) {
	  print_usberr(rc, "Failed to unhalt endpoint");
	  break;
	}

	continue;
      }

The second control transfer can be exchanged with

	rc = libusb_clear_halt(dev_handle, (1 | LIBUSB_ENDPOINT_IN));

however there is no API function for halting the endpoint, so one might as well do them both with control transfers.

Resetting the entire device

For those who like the big hammer, it’s possible to reset the device completely. This is one of the conditions for resetting the sequence numbers on all endpoints, so there’s no room for confusion.

      if (rc == LIBUSB_ERROR_OVERFLOW) {
	rc = libusb_reset_device(dev_handle);

	if (rc) {
	  print_usberr(rc, "Failed to reset device");
	  break;
	}
	continue;
      }

This causes a Hot Reset. which is an invocation of Recovery with the Hot Reset bit set, and return to U0, which in itself typically takes ~150 μs. However as a result from this call, the is reconfigured — its descriptors are read and configuration commands are sent to it. It keeps its bus number, and the the entire process takes about 100 ms. All this rather extensive amount of actions is hidden in this simple function call.

Also, a line appears in the kernel log, e.g.:

usb 4-1: reset SuperSpeed USB device number 3 using xhci_hcd

So all in all, this is a noisy overkill, and is not recommended. It’s given here mainly because this is probably how some people eventually resolve this kind of problem.

systemd: Reacting to USB NIC hotplugging (post-up scripting)

The problem

Using Linux Mint 19, I have a network device that needs DHCP address allocation connected to a USB network dongle. When I plug it in, the device appears, but the DHCP daemon ignored eth2 (the assigned network device name) and didn’t respond to its DHCP discovery packets. But restarting the DHCP server well after plugging in the USB network card solved the issue.

I should mention that I use a vintage DHCP server for this or other reason (not necessarily a good one). There’s a good chance that a systemd-aware DHCP daemon will resynchronize itself following a network hotplug event. It’s evident that avahi-daemon, hostapd, systemd-timesyncd and vmnet-natd trigger some activity as a result of the new network device.

Most notable is systemd-timesyncd, which goes

Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.

twice, once when the new device appears, and a second time when it is configured. See sample kernel log below.

It’s not clear to me how these daemons get their notification on the new network device. I could have dug deeper into this, but ended up with a rather ugly solution. I’m sure this can be done better, but I’ve wasted enough time on this — please comment below if you know how.

Setting up a systemd service

The very systemd way to run a script when a networking device appears is to add a service. Namely, add this file as /etc/systemd/system/eth2-up.service:

[Unit]
Description=Restart dhcp when eth2 is up

[Service]
ExecStart=/bin/sleep 10 ; /bin/systemctl restart my-dhcpd
Type=oneshot

[Install]
WantedBy=sys-subsystem-net-devices-eth2.device

And then activate the service:

# systemctl daemon-reload
# systemctl enable eth2-up

The concept is simple: A on-shot service depends on the relevant device. When it’s up, what’s on ExecStart is run, the DHCP server is restarted, end of story.

I promised ugly, didn’t I: Note the 10 second sleep before kicking off the daemon restart. This is required because the service is launched when the networking device appears, and not when it’s fully configured. So starting the DHCP daemon right away misses the point (or simply put: It doesn’t work).

I guess the DHCP daemon will be restarted one time extra on boot due to this extra service. In that sense, the 10 seconds delay is possible better than restarting it soon after or while it being started by systemd in general.

So with the service activated, this is what the log looks like (the restarting of the DHCP server not included):

Nov 11 11:25:54 kernel: usb 1-12: new high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:54 kernel: usb 1-12: New USB device found, idVendor=0bda, idProduct=8153
Nov 11 11:25:54 kernel: usb 1-12: New USB device strings: Mfr=1, Product=2, SerialNumber=6
Nov 11 11:25:54 kernel: usb 1-12: Product: USB 10/100/1000 LAN
Nov 11 11:25:54 kernel: usb 1-12: Manufacturer: Realtek
Nov 11 11:25:54 kernel: usb 1-12: SerialNumber: 001000001
Nov 11 11:25:55 kernel: usb 1-12: reset high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 kernel: r8152 1-12:1.0 eth2: v1.09.9
Nov 11 11:25:55 mtp-probe[59372]: checking bus 1, device 125: "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-12"
Nov 11 11:25:55 mtp-probe[59372]: bus: 1, device: 125 was not an MTP device
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12
Nov 11 11:25:55 systemd-networkd[65515]: ppp0: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet8: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet1: Link is not managed by us
Nov 11 11:25:55 networkd-dispatcher[1140]: WARNING:Unknown index 848 seen, reloading interface list
Nov 11 11:25:55 systemd-networkd[65515]: lo: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: eth2: IPv6 successfully enabled
Nov 11 11:25:55 systemd[1]: Starting Restart dhcp when eth2 is up...
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 vmnetBridge[1620]: Removing interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12/1-12:1.0
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:55 kernel: userif-3: sent link down event.
Nov 11 11:25:55 kernel: userif-3: sent link up event.
Nov 11 11:25:57 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:57 systemd-networkd[65515]: eth2: Gained carrier
Nov 11 11:25:57 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:57 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv4 with address 10.20.30.1.
Nov 11 11:25:57 avahi-daemon[1115]: New relevant interface eth2.IPv4 for mDNS.
Nov 11 11:25:57 avahi-daemon[1115]: Registering new address record for 10.20.30.1 on eth2.IPv4.
Nov 11 11:25:57 kernel: r8152 1-12:1.0 eth2: carrier on
Nov 11 11:25:57 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWADDR: index:848, addr:10.20.30.1
Nov 11 11:25:57 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:58 kernel: userif-3: sent link down event.
Nov 11 11:25:58 kernel: userif-3: sent link up event.
Nov 11 11:25:59 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv6 with address fe80::2e0:4cff:fe68:71d.
Nov 11 11:25:59 avahi-daemon[1115]: New relevant interface eth2.IPv6 for mDNS.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Gained IPv6LL
Nov 11 11:25:59 avahi-daemon[1115]: Registering new address record for fe80::2e0:4cff:fe68:71d on eth2.*.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Configured
Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:59 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).

As emphasized in bold above, there are 4 seconds between the activation of the script and systemd-networkd’s declaration that it’s finished with it.

It would have been much nicer to kick off the script where systemd-timesyncd detects the change for the second time. It would have been much wiser had WantedBy=sys-subsystem-net-devices-eth2.device meant that the target is reached when it’s actually configured. Once again, if someone has an idea, please comment below.

A udev rule instead

The truth is that I started off with a udev rule first, ran into the problem with the DHCP server being restarted too early, and tried to solve it with systemd as shown above, hoping that it would work better. The bottom line is that it’s effectively the same. So here’s the udev rule, which I kept as /etc/udev/rules.d/99-network-dongle.rules:

SUBSYSTEM=="net", ACTION=="add", KERNEL=="eth2", RUN+="/bin/sleep 10 ; /bin/systemctl restart my-dhcpd"

Note that I nail down the device by its name (eth2). It would have been nicer to do it based upon the USB device’s Vendor / Product IDs, however that failed for me. Somehow, it didn’t match for me when using SUBSYSTEM==”usb”. How I ensure the repeatable eth2 name is explained on this post.

Also worth noting that these two commands for getting the udev rule together:

# udevadm test -a add /sys/class/net/eth2
# udevadm info -a /sys/class/net/eth2

A VoIP phone at home: The tech details on leaving your phone company

Introduction

This is some information and hard-learned wisdom I collected while setting up an Israeli phone number for use with a VoIP phone, so I can accept calls with a regular Israeli phone number and also make outbound calls. Important to note is that I did this without the local ISPs that provide this service. In particular, I did this for the sake of leaving Israeli Netvision’s service, which required a certain arrangement between the phone adapter and the ADSL line.

First and foremost: Setting up a VoIP line for a regular phone number is a time consuming task, which requires quite a bit of technical understanding in computer networks. Both service providers and hardware vendors behave as if only phone experts deal with them, and that’s the kind of support and documentation to expect. Unless you’re familiar with some internet protocols and know how to configure a firewall and work with a sniffer, you’re in for a big time frustration.

The biggest, ehm, alternative truth, related to VoIP is “phone line within minutes”. Expect it to take a few days at best. On a good day, this will be because of your own learning curve. And if you’re not the computer geek type to enjoy exploring a technical topic, odds are that it’s going to be extremely annoying.

The VoIP trinity

There are basically three components that need to be set up for this to work:

  • A phone number must be allocated or ported. This is referred to as a Direct Inward Dial or Direct Dial In service (DID / DDI). This means that some phone company knows it should relay phone calls of that number to a VoIP service provider, rather than to its own internal phone network.
  • An SIP trunk for connecting calls between yourself and the VoIP supplier over the internet. Or something of a similar nature.
  • A VoIP client: A program running on your computer / smartphone or a physical VoIP phone. Just remember that those innocent-looking handsets are actually small Linux computers with a web interface for configuring the network connection (SIP details and credentials, connection method to the Internet and whatnot). Most of the disappointed reviews on these phones are from people who expected a plug and play experience.

Typically (or always?), the first two items are a package deal, supplied by a single service provider. So you’ll sign up for the phone number, and once it has been set up, you’ll set up the VoIP connection with the same company’s servers.

Note that the DID phone number doesn’t have to be in your own country. As a matter of fact, your geographic location doesn’t matter. So if you want to supply local number to dial into for customers worldwide, this is the way. However if you want to be in contact with friends and family abroad, this is way too much work to set up.

And since I’m at it — a DID provider doesn’t necessarily relay to a VoIP network. The service might very well be plain forwarding to another regular phone number. This might be the optimal solution just to create an international presence, but without the VoIP headache.

SIP trunks…?

This is the most confusing issue. All I want is a single phone line, and suddenly I get the word “SIP trunk” everywhere. Do I really need it? I want a glass of milk, and I get the whole cow? The answer is, well, yes. But no fear, this cow supplies exactly the glass of milk you’ll need.

So first, let’s understand why it’s called an SIP trunk: For the sake of argument, say that you have an office with a number of phone lines connected to an in-office phone relay, and all internal phones can now connect with each other. For an outside world connection, a pool of voice circuits is set up, traditionally (that is, 1990-ish) through a single high-speed (~2 Mb/s) digital link to the local telephone company. This digital link is called a “trunk”, and phone calls in and out are allocated in TDM-multiplexed slots on the digital link as these calls are initiated. The telephony company allocates phone numbers, and when someone calls one of these, the trunk is used to carry both the signaling (which phone number is called) and voice into the office.

The “SIP trunk” uses an existing internet connection for the same thing. Instead of a fixed wire, UDP packets carry both signaling and voice. Instead of the regular phone company, an SIP server makes the connection between the VoIP link and real phone numbers. The in-office telephone relay registers itself on the SIP server, tells it what phone number it covers (so incoming calls are relayed to it) and proves it’s authenticity with some passphrase.

After registration, the in-office phone relay can initiate outgoing calls as required by someone in the office calling out, or accept inbound calls, in which case it will ring one of the phones in the office.

And here’s the point: In the end of the day, that VoIP handset does the same as that in-office phone relay: Requests outgoing calls and accepts incoming calls, with the same SIP protocol. That’s why some kind of trunk is set up (a SIP trunk or some other type) with the capacity of typically one phone number.

That also explains why the whole thing gets complicated: Setting up that little IP phone, you get into the shoes of a small business’ phone technician (of the rather high-end type, actually). So there are a few technical details to understand, and the service providers are somewhat adapted to work with people that do this for a living. You’re supposed to know what you’re doing.

Setting up a simple software SIP client

The protocol for maintaining the telephone signalling and session is called SIP. Hence either a piece of software running on the computer or some dedicated hardware phone can do the job. Or a combination of both.

After some looking around, I went for Linphone for this purpose. It’s simple and to the point.

# apt install linphone

How to set up an SIP connection on Linphone: Dismiss the setup assistant. Instead, go for Options > Preferences. Leave the Network settings with their default values: SIP (UDP) at port 5600, Direct connection to the Internet (even though I have iptables doing both firewall and NAT, however it allow established and related connections). In Multimedia settings, leave the echo canceling on (not that it helps much).

Then go to the “Manage SIP Accounts” tab, and go for the ” + Add” button (not the Wizard). Set Your SIP Identity to e.g. “sip:123456@core-sip-qts.avoxi.com” and SIP Proxy address to “sip:core-sip-qts.avoxi.com”. The “123456″ should be replaced with the DID phone number (which is also the first part of the credentials given by the service provider). Leave “Route” empty and registration duration is set automatically to 3600 secs, which is fine. I use Avoxi’s SIP server in this example, but see below on my choice of service provider.

Why the proxy address is used twice is beyond me (once in the identity and then as the proxy address), but this is commonly seen.

Immediately on clicking OK, Linphone prompts for the password. The password, also given by the service provider, is prompted for only on the first attempt to connect. A couple of seconds after supplying the password, the status line at the bottom of Linphone’s main windows reads “Registration on <core-sip-qts.avoxi.com> succesful”. It better be.

A debug log is available with Help > Show debug window. It shows, among a lot of other stuff, the SIP protocol exchange.

Keep in mind that linphone keeps running even after closing the main window. To really quit it, do so in the icon on the desktop’s toolbar.

A phone adapter

If you want to keep your regular phone, a SIP adapter will do the job. Even though I didn’t go this path, I considered a PAP2T from AliExpress at $20 or something. No experience with it however.

A real IP phone: Grandstream GXP1610

First, some general words: There are several SIP phones out there, and the reviews on this specific one are mixed, and there’s a good reason for that. It’s generally OK, with a lot of specified featured, but at the same time it misses on the small details. For example, it has a wall mount option, but the handset will fall off if you really try that. It can’t be flat on the table either, because the plugs are in the back. So only the table upright position is an option.

The ring tone options are rather poor for a machine that is effectively a computer. Setting it up to allow simple dialing of local phone numbers is a riddle. And the documentation is pretty lacking. There’s a lot of detail on esoteric issues, much less how to get started with the obvious stuff. Once can imagine the computer geeks adding more and more software features, but with nobody looking at the overall usage experience.

That said, it’s fine once set up, in particular if it’s intended for sporadic use, and it’s low-cost. It makes sense in an office, where there’s an IT person to handle the installation and setup.

So now to how to set it up for simple use.

First, some documentation: Download the User Guide from Grandstream’s resource page for the simple use. For configuration, download the Administration Guide. There’s also a Security Manual, which goes through a few security options with the phone.

Plug in the power supply and wait for a few seconds. The phone says “Booting” and then it boots for a while (it takes a minute, like almost exactly 60 seconds). Plug a the phone’s “LAN” Ethernet jack to the local LAN.

This is the time to mention, that the phone’s web configuration interface is on the same port as the voice communication. It’s quite common to do the configuration on a separate Ethernet port, but this is not the case. This means that anyone with access to phone from the LAN (or web?) can fiddle with its configuration. Maybe a good idea in an office with an IT department handling the phones. So setting up a firewall to prevent intrusion from outside is a good idea (if possible). There’s a separate “PC” Ethernet port, but it’s not clear what it is for.

The phone functions as a DHCP client by default, so it gets its address and displays it on the LCD (or press NextScr). Address 0.0.0.0 means that no address has been obtained with DHCP and static IP is disabled.

If DHCP isn’t enabled, press the button in the middle of the four arrows, and navigate: System > Network > IPv4 Settings, select DHCP. The menu returns to IPv4 Settings, meaning it has accepted the selection. Pressing “back” makes the phone ask if we want to reboot, so yes.

To get started, open a browser and type the IP address of the phone. The web interface asks for username and password, and it’s admin/admin, not surprisingly. The web app forces a password change if these are used. Note to self: Look for the phone-login-password.txt file.

To set up a SIP account, go to Accounts > Account 1 > General Settings and fill in the Account Name, SIP Server, SIP User ID (without the “@” and proxy server) and Authenticate Password. Click “Save and Apply”. Then check with Status > Account Status. SIP Registration should say Yes, meaning that the phone is functional (and there’s an icon on the LCD screen, see below).

Then enter Accounts > Account 1 > Audio Settings and set the Preferred Vocoder – choice 1 to G.722, then PCMA, then PCMU. These sounded best in my tests.

There’s also the Accounts > Account 1 > Call Settings which allows setting up local area codes and restrictions, but I didn’t bother — it looks like a riddle in regular expressions. I don’t expect to call out a lot from this phone, so I’ll use the full international number when necessary.

The ring tone: There are four ring tones to choose from, available on the LCD menu under Preferences > Ring Tone. Aside from the “default ring tone” there are three not-so-impressive choices.

The “default ring tone” can be configured through the web interface, but only as a composition of two frequencies. This is set as the “System Ring Tone” on Settings > Preferences > Ring Tone. The default is a plain dual tone going on and off (defined with the string “f1=440,f2=480,c=200/400;”). I changed it to something lighter with “f1=440,f2=480,c=10/30-30/170;”, which is one 100 ms tone, 300 ms pause, then 300 ms tone and 1700 ms pause. The ring volume can be adjusted with the arrow keys: Preferences > Ring Volume.

Finally, General Settings > Preferences > Date and Time (in the web GUI) for setting the correct time zone. Don’t expect it to get the daylight saving time correctly (at least not in Israel).

The registration status is indicated on the icon at the LCD screen’s upper left:

  • A filled T shape: Properly registered ready for phone calls
  • Same, but hollow icon (the “T” is absent): A LAN connection is present but no valid registration is in effect (possibly because of a rejection by remote SIP server)
  • No icon: No connection to the LAN port.

A DID / DDI provider

A Direct Inward Dial or Direct Dial In service is required to relay incoming phone calls to over an VoIP link. There are a lot of providers to choose from. I looked for one that could port an Israeli number to their service, and with a low monthly price.

To make a (very) long story short, I had two finalists: DIDWW and Avoxi. Spoiler: I went for DIDWW. But the way there was interesting.

I first started checking with DIDWW, mainly because their monthly fee was lowest (setup fee of $2.50 and then a monthly fee of $2.50 for all phone numbers, with $0.01 per minute fee, including incoming calls). But there were some worrying signs. First, I had to create an account and log in, just to see their prices. And even worse, at the very first order, one is required to top up a balance of $50, just to begin. One gets aware of this only after registering and at the last stage of making the order. So it looked like a bit of an ugly sales trick. Their “Terms and Conditions” explicitly says that this isn’t reimbursable.

So do they charge $50 upfront to make me a hostage, or is it a sign of serious intentions? I decided to check up Avoxi. $4.50 monthly at lowest plan, $0.04 / minute. More expensive, but no upfront payments. Actually, they offered a free phone line to try for a while.

This is where I’ve deleted quite a few lines describing a lot of good intentions but not so much competence by Avoxi’s support, that eventually pushed me back to DIDWW. I put the $50 on the table, and soon enough I had a working phone line to test. From there I went through the porting procedure with them. It was no fraud, it turned out. They have a pdf document with prices — it’s not something they fiddle with. And their support it quick and to the point, at least so far.

Setting up a DIDWW account

DIDWW has a proper web interface for helping yourself, but this is VoIP, and you’re supposed to behave like this is what you’ve been doing all your life. So this is a short survival guide to follow after having a phone number allocated on your account.

So here it comes: The DID number is linked with a “Phone Systems” trunk, which is apparently a powerful tool for routing phone calls between queues, DTMF menus, voice mail, fax and well, human response (see user manual). It’s a bit of a cow when needing a glass of milk (a recurring motive in this post), but this is how DIDWW offers a SIP phone connection to their numbers. Plus a lot of features to add on later.

Sign up for the “Phone Systems” product at the bottom of the dashboard to the left, “Lifetime Free” plan, which covers exactly one circuit, and launch the service in web interface. Follow their tutorial on setting up an SIP account with Phone Systems.

After this, there’s a “Phone Systems” trunk in the list of Voice IN trunks (or add one). The CLI possibilities merely allow selecting how Calling ID is displayed, so it’s not that important. Check “Map all DIDs” so the phone numbers are related to this trunk.

Then to the DIDWW Phone System’s interface: Click the menu icon at the top right, pick Add a New Contact. Add the name and then pick “Add a New Contact Method”. Pick “SIP Account”. Enable “Enable outbound calls” and select the External caller ID to use. Don’t enable “Allowed IPs”.

After clicking “Save”, pick the SIP details drop-down to see the SIP access information. And then pick “Finish”.

Now some graph games to make this phone line live. Exit the right-side settings, and drag-drop “Phone number” into the canvas. Select the desired phone number, and a name to appear on the icon on the canvas. Click Save.

Add a “Ring Group” the same way, and add a single ring destination, namely the contact you set up before. And connect a wire between the two. And that’s it! It should look something like this:

DIDWW Phone Systems GUI screenshot(click to enlarge)

At “My DID Numbers”, the relevant number should appear, with the Trunk set to the Phone Systems trunk set up for this phone number. Capacity should be 100 with a green dot. If it isn’t, click the dot and set up “Pay Per Minute”.

RTP packets came from 46.19.210.34, which is located in Ireland with a 84 ms ping from Israel, so it’s relatively fine.

The parameters are something like:

Username: h8rn7gkb0p
Password: ymqhn2f50k
Domain: sip.phone.systems

Note that if you’re using Linphone, the actual user name given to the SIP phone is sip:h8rn7gkb0p@sip.phone.systems. But when setting the same thing on a Grandstream IP phone, it’s just h8rn7gkb0p. You should know this. You’ve been doing this all your life.

Bonus with using Phone Systems: It’s possible to turn the phone number temporarily into a receiving fax (yes, we’re in 2019, and the bank wants to send me one). Just drag “Fax” into the canvas and configure it for delivery through email (or some other way you prefer). The configuration is somewhat tangled, but quite straightforward. Incoming faxes arrive as a pdf file. Leave the “Ring Group” icon in place, and make the connection to the Fax, as long as this service is desired instead.

Porting a phone number to DIDWW

It’s a matter of filling in web form with the desired number. Then upload a Letter of Authorization (short thing), some kind of ID (passport in my case, so it’s in English) and the latest invoice from the previous supplier. On the next business day I got an email confirming that the phone number is portable, asking me to confirm the porting fee and the operation in general on the web interface (on the “porting” tab). Clicking on the link that came with the email, I got the message “You have no Portable items”, and instead the phone number was under “In Progress”. Which makes sense, because the porting fee is zero (in Israel). So the email was somewhat misleading — there was no action necessary from my side.

Actually, there was no more to do from my side. 10 days after I submitted the application (which was on a Saturday) I got an email saying that the porting was finished. That was one day after the date given as the target during the process, but well within the official lead time. So it was really quick and painless. Phone number is up and running.

Troubleshooting

So what if it just doesn’t work? You go to the sniffer. But what should it look like? So here’s an example of packets on wire.

First, registration. Rule number one is that if the SIP server doesn’t like the user ID and/or domain of the REGISTER request, it responds with a 403 Forbidden response. It doesn’t ignore the request. So if there’s no response at all, it’s not a matter or user identity or anything of that sort. Odds are you’re talking with the wrong server.

Now to a session between the Grandstream phone and the SIP server that works with DIDWW. Only packet content is shown below. The user name shown below is not valid anymore. The phone’s IP address on the LAN it works on is supposedly 10.11.12.13. Not clear why this is exposed in the SIP session.

So the phone says hello with

REGISTER sip:sip.phone.systems SIP/2.0
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK1124928411;rport
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2221 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>"
Authorization: Digest username="h8rn7gkb0p", realm="sip.phone.systems", nonce="XchKwF3ISZRG4VSl3lYx9geIizu9DZUGlS27c4A=", uri="sip:sip.phone.systems", response="fb33a3645a58de7ae1fab14d23196de8", algorithm=MD5
Max-Forwards: 70
User-Agent: Grandstream GXP1610 1.0.4.128
Supported: path
Expires: 3600
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

to which the server says “nice try, but you have to prove me your love first”:

SIP/2.0 401 Unauthorized
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK1124928411;rport=5060;received=109.186.90.35
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>;tag=3cbf29d5022d29bd5eb970c4fa286be5.83a2
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2221 REGISTER
WWW-Authenticate: Digest realm="sip.phone.systems", nonce="XchOhl3ITVpwj9Uu+eGasPntV1gjDNiVlT6+ZYA="
Server: hedgehog v7p0
Content-Length: 0

Scary, huh? It says Unauthorized. It makes it look like an error. It isn’t. Some just say “no” to begin with.

So the phone says “I know the answer to your challenge” (using the password to produce a digest):

REGISTER sip:sip.phone.systems SIP/2.0
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK489633282;rport
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2222 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>"
Authorization: Digest username="h8rn7gkb0p", realm="sip.phone.systems", nonce="XchOhl3ITVpwj9Uu+eGasPntV1gjDNiVlT6+ZYA=", uri="sip:sip.phone.systems", response="9ff65eb2e0c784af08cd11cc1a7a489f", algorithm=MD5
Max-Forwards: 70
User-Agent: Grandstream GXP1610 1.0.4.128
Supported: path
Expires: 3600
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

The server is impressed by the persistence, and opens its doors:

SIP/2.0 200 OK
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK489633282;rport=5060;received=109.186.90.35
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>;tag=9ac8b8c8d68bc095abf326021301853f-0b5b
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2222 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;expires=1800;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>";reg-id=1
Server: hedgehog v7p0
Content-Length: 0

This concludes the registration.

And then, every 30 seconds, the server asks (this is specific to Phone Systems):

OPTIONS sip:h8rn7gkb0p@10.11.12.13:5060 SIP/2.0
Via: SIP/2.0/UDP 46.19.209.28:5060;branch=z9hG4bK5106755
From: sip:keepalive@sip.phone.systems;tag=uloc-18-5dbada3b-30f7-092741-1997ebd9-80e8b223
To: sip:h8rn7gkb0p@10.11.12.13:5060
Call-ID: 51e402d-38b1e157-4dd7e13@46.19.209.28
CSeq: 1 OPTIONS
Content-Length: 0

and the phone responds with

SIP/2.0 200 OK
Via: SIP/2.0/UDP 46.19.209.28:5060;branch=z9hG4bK5106755
From: <sip:keepalive@sip.phone.systems>;tag=uloc-18-5dbada3b-30f7-092741-1997ebd9-80e8b223
To: <sip:h8rn7gkb0p@10.11.12.13:5060>;tag=394229760
Call-ID: 51e402d-38b1e157-4dd7e13@46.19.209.28
CSeq: 1 OPTIONS
Supported: replaces, path, timer
User-Agent: Grandstream GXP1610 1.0.4.128
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

The purpose of this eternal nagging is most likely to refresh any firewall’s memory on the existence of a UDP link, in particular if there’s NAT involved (more on this below). Maybe also for checking that there’s still a phone on the other end (not sure if it’s so important, from a server’s perspective).

With Linphone connected to Avoxi, it was the client that kept the UDP link alive with some short dummy UDP packets. Looks like this is down to each phone service.

Does it work with NAT?

The short answer: It does for me, out of the box. No need for any special firewall rule or something. With plain Linux iptables NAT, that is. But if your phone is behind NAT or a firewall, be sure to check that some kind of keepalive UDP packets are exchanged every minute or so. Otherwise, the firewall might forget the UDP connection and not let through an incoming call.

This page explains a typical handshake over a NAT router.

I found interest in this after failing to receive inbound calls with Linphone despite having no issues whatsoever with outbound calls. In other words, no problem dialing from the computer, but unavailable when trying to dial to it.

This had nothing to do with NAT and firewalls. The INVITE requests that start off a phone conversation are sent through the UDP link that is constantly maintained with keepalive packets. Hence the server knows at which IP address it should find the SIP client, and the NAT / Firewall remembers the UDP link. So the rule is that if the registration went through fine, there are no excuses. If sound doesn’t come through after the phone is picked up, that’s another story, however iptables should handle this well if it set to allow related connections (and it should).

One thing that surprised me was that the audio UDP (RTP) packets start streaming as soon as the phone starts ringing on the other side. This is the common practice with cellular phones, and still. Even more surprising was that even though they came from a completely different server, using a UDP port that is unrelated to anything before. How did the NAT know how to forward this?

The answer lies in a UDP packet sent from the “regular” SIP host, saying (example with Avoxi server, some numbers xxx’ed):

183 Session Progress
Via: SIP/2.0/UDP 10.1.1.22:5060;received=109.186.xx.xx;branch=z9hG4bK278239419;rport=5060
Record-Route: <sip:199.244.96.39:5060;transport=udp;lr>
Contact: sip:199.244.96.46:5070
To: <sip:9724xxxxxx@core-sip-qts.avoxi.com>;tag=rvguhtm5fjvevrnd.i
From: <sip:9723xxxxxxx@core-sip-qts.avoxi.com>;tag=680205676
Call-ID: 1519032647
CSeq: 21 INVITE
Allow: INVITE, ACK, BYE, CANCEL, INFO, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS, UPDATE
Content-Type: application/sdp
Server: Sippy
Content-Length: 240

v=0
o=Sippy 219713410032301436 1 IN IP4 199.244.96.46
s=SIP Media Capabilities
t=0 0
m=audio 49610 RTP/AVP 0 101
c=IN IP4 199.244.96.46
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-15
a=sendrecv
a=ptime:20

This is the last stage in the Session Description Protocol (SIP/SDP) session, which started with the INVITE request.

And then RTP/UDP packets started arrive from IP address 199.244.96.46′s port 49610 (to destination port 7078, but that doesn’t matter). So obviously this is how the NAT got prepared to let through the related link.

xhci_hcd WARN Event TRB for slot x ep y with no TDs queued

What’s this?

There’s a chance that you’re reading this because the message in the title appeared (or flooded) your kernel log. This post attempts to clarify what to do about it, depending on how much you want to get involved in the matter.

So the short answer: The said warning message is a bug related to the xHCI USB controller, but a rather harmless one: Except for the message in the kernel log, everything is fine. This was fixed in Linux kernel v4.15, so upgrading the kernel is one way out. Alternatively, patch your kernel with commit e4ec40ec4b from December 2017, which is the fix made on v4.15. Even editing away the said xhci_warn() call is fairly sensible, if the patch doesn’t apply.

Note that the xHCI controller is used for any port that is USB 3.x capable, even when a lower version USB device is connected to it (e.g. USB 2.0). So if your computer has a few USB 2.0-only ports, moving a device to such port might be your quick fix. Look for USB plugs with black plastic instead of blue.

The rest of this post dives deeply into the whereabouts of this accident. It matters if you’re developing a USB device. Note that this post is written in increasing detail level, which makes it a bit disorganized. Apologies if it’s confusing.

Why this?

The explanation is a bit tangled, and is related to the way the xHCI driver organizes the buffers for a USB transfer.

To make a long story short(er), a software transfer request (URB in Linux) is conveyed to the hardware controller as a TD (Transfer Descriptor) which is presented to the hardware in the form of one or more Transfer Request Blocks (TRBs). How many? There’s no advantage in chopping the TD into more than one TRB for a transfer request, however there are certain constraints to an TRB. Among others, each TRB points at a chunk of the buffer in continuous physical memory. The xHCI specification also requires in its section 6.4.1 that the data buffers shall not cross 64 kiB boundaries in physical memory.

So if the transfer request’s data buffer does cross 64 kiB boundaries, the TD is split into several TRBs, each pointing at a another part of the buffer supplied in the software’s transfer request. For example, see xhci_queue_bulk_tx() in the Linux kernel’s drivers/usb/host/xhci-ring.c. This is fine, and should cause no problems.

But as it turns out, if a BULK IN transfer request is terminated by the device with a short packet, and the TD consists of more than one TRB, the xHCI hardware produces more than one notification event on this matter. This is also fine and legal, however it causes the kernel (before the patch) to issue the said warning. As already said, the warning is a bug, but otherwise there’s no problem.

So for this bug to show up, there needs to be a combination of a pre-4.15 kernel, short packets and transfer requests that have data buffers that span 64 kiB boundaries.

Huh? Short packets?

A “short packet” is a USB packet that is shorter than the maximal size that is allowed for the relevant USB version (that is, 512 bytes for USB 2.0 and 1024 bytes on USB 3.x).

The idea is that the software can request a transfer of a data chunk of any size (with some maximum, but surely several kilobytes). On a BULK IN endpoint, the device is in principle expected to send that amount of data, divided into packets, which all have the maximal size, except for the last one.

However it’s perfectly legal and fine that the device sends less than requested by the software in the transfer request. In that case, the device sends a packet that is shorter than the maximal packet size, and by doing so, it marks the end of the transfer. It can even be a packet with zero bytes, if the previous packet was of maximal length. This is called a short packet, and once again, it’s not reason for any alarm. It just means that the software must check how much data it actually got when the transfer is completed.

Since the hardware and its driver often have some dedicated protocol to coordinate their data exchange, this short packet mechanism is often unnecessary and not used.

Avoiding these warnings

Linux kernel v4.15 was released in January 2018, so when writing this it’s quite expected that older kernels are still ubiquitous.

If short packets are expected from the device, and pre-patch kernels are expected to be in use, it’s wise to make sure that the buffers don’t cross the said boundaries. Keeping them small, surely below 4 kiB is a good start, as larger buffers surely span more than one page. Also, different pages can be anywhere in physical memory, causing the need to divide the buffer into several TRBs. And then there’s the 64 kiB boundaries issue. However this isn’t practical in a high-bandwidth application, as a 4 kB buffer is exhausted in 10 μs at a rate of 400 MB/s. The software has no chance to continuously generate TDs fast enough to keep up with their completion, which will result in data transmission stops, waiting for TDs to be queued.

When writing a device driver in the kernel, it’s relatively easy to control the position of the buffer in physical memory. In particular, the kernel’s __get_free_pages() function allows this. This is however not the common practice, in particular as the buffers are typically much smaller than a page, so using __get_free_pages() would have seemed to be a waste of memory. So existing drivers are subject to this problem, and there’s not much to do about it (except for applying the patch, as suggested above).

When libusb is used to access the device (through usbfs), there is a solution, assuming that the kernel is v4.6 (released May 2016) and later + libusb version 1.0.21 and later. The trick is to use the libusb_dev_mem_alloc() function to allocate memory (i.e. the zerocopy feature), which implements a shared memory mapping with the usbfs driver, so that the data buffer that is accessed from user space is directly accessed by the xHCI hardware controller. It also speeds up things slightly by avoiding memory allocation and copying of buffers on x86 platforms, which ensure cache coherency on DMA transfers. Not sure on what happens on ARM platforms, for which cache coherency needs to be maintained explicitly.

Note that physical memory continuity is only ensured within 4 kiB pages, as the mmap() call doesn’t ensure physical memory continuity. Since virtual to physical memory translation never touches the lower 12 bits of the address, staying within 4 kiB alignment in virtual alignment ensures no 4 kiB boundaries are crossed — but this doesn’t help regarding 64 kiB boundaries.

Without libusb_dev_mem_alloc(), the usbfs framework allocates a buffer with kmalloc() for each transfer, and eventually copies the data from the DMA buffer into it. No control whatsoever on how the buffer is aligned. The position of the buffer that is supplied by the user-space software makes no difference.

Zero-copy buffers: Gory details (too much information)

Introduced in Linux kernel v4.6 (commit f7d34b445, February 2016), mmap() on the usbfs file descriptor allows the memory buffer provided by the user-space program to be used directly by the hardware xHCI controller. The patch refers to this as “zerocopy”, as there is no copying of the data, however the more important part is that the buffers don’t need to be allocated and freed each time.

The idea is quite simple: The user-space software allocates a memory buffer by calling mmap() on the file descriptor of the relevant USB device. When a URB is submitted with the data buffer pointer directed to inside the mmap’ed region, the usbfs driver detects this fact, and skips memory allocation and copying of data. Instead, it uses the memory buffer directly.

This is implemented in proc_submiturb() (drivers/usb/core/devio.c) by first checking if the buffer is in an mmap’ed segment:

as->usbm = find_memory_area(ps, uurb);

“uurb” in the context of this code is the user-space copy of the URB. find_memory_area scans the list of mmap’ed regions for one that contains the address uurb->buffer (and checks that uurb->buffer_length doesn’t exceed the region). It returns NULL if no such buffer was found (which is the common case, when mmap() isn’t used at all). Moving on, we have

if (as->usbm) {
  unsigned long uurb_start = (unsigned long)uurb->buffer;

  as->urb->transfer_buffer = as->usbm->mem + (uurb_start - as->usbm->vm_start);
 } else {
  as->urb->transfer_buffer = kmalloc(uurb->buffer_length, GFP_KERNEL);
  if (!as->urb->transfer_buffer) {
    ret = -ENOMEM;
    goto error;
  }
  if (!is_in) {
    if (copy_from_user(as->urb->transfer_buffer,
		       uurb->buffer,
		       uurb->buffer_length)) {
      ret = -EFAULT;
      goto error;
    }

So if a memory mapped buffer exists, the buffer that is used against hardware is calculated from the relative position in the mmap’ed buffer.

Otherwise, it’s kmalloc’ed, and in the case of an OUT transaction, there’s a copy_from_user() call following to populate the buffer with data.

So to use this feature, the user-space software should mmap() a memory segment that is large enough to contain all data buffers, and then manage this segment, so that each transfer URB has its own chunk of memory in this segment.

In order to do this from libusb, there’s the libusb_dev_mem_alloc() API function call, defined in core.c. It calls usbi_backend.dev_mem_alloc(), which for Linux is op_dev_mem_alloc() in os/linux_usbfs.c. It’s short and concise, so here it is:

static unsigned char *op_dev_mem_alloc(struct libusb_device_handle *handle,
	size_t len)
{
	struct linux_device_handle_priv *hpriv = _device_handle_priv(handle);
	unsigned char *buffer = (unsigned char *)mmap(NULL, len,
		PROT_READ | PROT_WRITE, MAP_SHARED, hpriv->fd, 0);
	if (buffer == MAP_FAILED) {
		usbi_err(HANDLE_CTX(handle), "alloc dev mem failed errno %d",
			errno);
		return NULL;
	}
	return buffer;
}

This capability was added to libusb in its version 1.0.21 (as the docs say) with git commit a283c3b5a, also in February 2016, and by coincidence, by Steinar H. Gunderson, who also submitted the Linux kernel patch. Conspiracy at its best.

libusb: From API call to ioctl()

This is some libusb sources dissection notes. Not clear why this should interest anyone.

From the libusb sources: A simple submitted bulk transfer (libusb_bulk_transfer(), defined in sync.c) calls do_sync_bulk_transfer(). The latter function wraps an async transfer, and then calls the API’s function for that purpose, libusb_submit_transfer() (defined in io.c), which in turn calls usbi_backend.submit_transfer(). For Linux, this data structure is populated in os/linux_usbfs.c, with a pointer to the function op_submit_transfer(), which calls submit_bulk_transfer().

submit_bulk_transfer() ends up making a ioctl() call to submit one or more URBs to Linux’ usbfs interface. The code used for this ioctl is IOCTL_USBFS_SUBMITURB, which is a libusb-specific defined as

#define IOCTL_USBFS_SUBMITURB	_IOR('U', 10, struct usbfs_urb)

in linux_usbfs.h, which matches

#define USBDEVFS_SUBMITURB         _IOR('U', 10, struct usbdevfs_urb)

in the Linux kernel source’s include/uapi/linux/usbdevice_fs.h.

The ioctl() call ends up with proc_submiturb() defined in drivers/usb/core/devio.c, which moves the call on to proc_do_submiturb() in the same file. The latter function splits the transfer into scatter-gather buffers if it’s larger than 16 kiB (where did that number come from?).

And then we have the issue with memory mapping, as mentioned above.

Extra kernel gory details: URB handling

This is the traversal of a URB that is submitted via usbfs in the kernel, details and error handling dropped for the sake of the larger picture.

proc_do_submiturb() sets up a usbfs-specific struct async container for housekeeping data and the URB (as a struct urb), puts it on a locally accessible list and then submits the URB into the USB framework with a

ret = usb_submit_urb(as->urb, GFP_KERNEL);

This finishes the usbfs-specific handling of the URB submission request. usb_submit_urb() (defined in drivers/usb/core/urb.c) makes a lot of sanity checks, and eventually calls usb_hcd_submit_urb() (defined in drivers/usb/core/hcd.c). Unless the URB is directed to the root hub itself, the function registered as hcd->driver->urb_enqueue is called with the URB. The hcd is obtained with

hcd = bus_to_hcd(udev->bus);

which merely fetches the usb_hcd structure from the usb_bus structure.

Anyhow for xHCI the list of methods is mapped in drivers/usb/host/xhci.c, where urb_enqueue is assigned with xhci_urb_enqueue() (no surprises here). This function allocates a private data structure with kzalloc() and assigns urb->hcpriv with a pointer to it, and then calls xhci_queue_bulk_tx() (defined in drivers/usb/host/xhci-ring.c) for a bulk transfer URB.

xhci_queue_bulk_tx() calls queue_trb() which actually puts 16 bytes of TRB data into the Transfer Ring (per section 6.4.1 in the xHCI spec), and calls inc_enq() to move the enqueue pointer.

Once the hardware xHCI controller finishes handling the TRB (for better or worse), it queues an entry in the event ring, and issues an interrupt, which is handled by xhci_irq() (also in xhci-ring.c). After several sanity checks, this ISR calls xhci_handle_event() until all events have been handled, and then does the housekeeping tasks for confirming the events with the hardware.

The interesting part in xhci_handle_event() is that it calls handle_tx_event() if the event was a finished transmission URB. This happens to be the function that emits the warning in the title under some conditions. After a lot of complicated stuff, it calls process_bulk_intr_td() for a BULK endpoint TD. Which in turn calls finish_td(), which returns the TD back to the USB subsystem: It calls xhci_td_cleanup(), which checks if all TDs of the relevant URB have been finished. If so, xhci_giveback_urb_in_irq() is called, which in turn calls usb_hcd_giveback_urb() (defined in drivers/usb/core/hcd.c). That function launches a tasklet that completes the URB.