FUSE / CUSE signal handling: The very gory details

This post was written by eli on February 28, 2020
Posted Under: Linux,Linux kernel

First: If you’re planning on using FUSE / CUSE for an application, be sure to read this first. It also explains why I didn’t just take what libfuse offered.

Overview

This is a detour from another post of mine, which dissects the FUSE / CUSE kernel driver. I wrote this separate post on signal handling because of some confusion on the matter, which ended up with little to write home about.

To understand why signals are a tricky issue, suppose that an application program is blocking on a read() from a /dev file that is generated by CUSE. The server (i.e. the driver of this device file in userspace) has collected some of the data, and is waiting for more, which is why it doesn’t complete the request. And then a “harmless” signal (say, SIGCHLD) is sent to the application program.

Even though that program is definitely not supposed to terminate on that signal, the read() should return ASAP. And because the server has already collected some data (and possibly consumed it from its source), the read() should return with the number of bytes already read, and not with an -EINTR (which is the response if there is no data when returning on an interrupt).

So the FUSE / CUSE driver must notify the server that an interrupt has arrived, so that the relevant request is finished quickly, one way or another. To make things even trickier, it may very well be that while the notification on the interrupt is being prepared and sent to the server, the server has already finished the request, and is in the middle of returning the response.

Luckily, the FUSE / CUSE kernel interface offers a simple solution to this: An INTERRUPT request is sent to the server in response to an interrupt to the application program, with a unique ID number that matches a previously sent request. The server responds by returning a regular response for the said request, possibly with an -EINTR status, exactly like a kernel character driver’s response to a signal.

The only significant race condition is when the server has already finished handling the request, for which the INTERRUPT request arrives, and has therefore forgotten the unique ID that comes with it. In this case, the server can simply ignore the INTERRUPT request — it has done the right thing anyhow.
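To make this concrete, here is a minimal sketch of how a single-threaded server might keep track of blocked read() requests and react to an INTERRUPT request. Everything in it (the table, the function names, the fixed table size) is made up for illustration and is not taken from libfuse or from any real server. It only shows the rule described above: answer with the byte count if some data was collected, with -EINTR if not, and ignore the INTERRUPT if the request is no longer known.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16   /* arbitrary table size for this sketch */

/* Hypothetical bookkeeping for one read() request that is still blocking. */
struct pending_read {
    uint64_t unique;     /* unique ID that arrived with the request */
    size_t collected;    /* bytes gathered so far for this request */
    int in_use;
};

static struct pending_read table[MAX_PENDING];

/* Remember a request that cannot be completed yet. Returns 0 on success. */
int remember_read(uint64_t unique, size_t collected)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!table[i].in_use) {
            table[i] = (struct pending_read){ unique, collected, 1 };
            return 0;
        }
    }
    return -ENOMEM;
}

/*
 * Handle an INTERRUPT request aimed at the request with ID 'target'.
 * Returns 1 and fills *status with the value to send as the response to
 * the ORIGINAL request (byte count, or -EINTR if nothing was collected).
 * Returns 0 if the request is unknown, so the INTERRUPT is just ignored.
 */
int on_interrupt(uint64_t target, long *status)
{
    assert(status != NULL);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (table[i].in_use && table[i].unique == target) {
            *status = table[i].collected ? (long)table[i].collected
                                         : -EINTR;
            table[i].in_use = 0;   /* the request is finished now */
            return 1;
        }
    }
    return 0;   /* already answered and forgotten: nothing to do */
}
```

Note that nothing is ever sent as a response to the INTERRUPT request itself in this sketch. The only thing that goes back to the kernel is the response to the original request.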

So why this long post? Because I wanted to be sure, and because the little documentation there is on this topic, as well as the implementation in libfuse, are somewhat misleading. Anyhow, the bottom line has already been said, if you’d like to TL;DR this post.

The official version

There is very little documentation on FUSE in general, however there is a section in the kernel source tree’s Documentation/filesystems/fuse.txt:

If a process issuing a FUSE filesystem request is interrupted, the following will happen:

If the request is not yet sent to userspace AND the signal is fatal (SIGKILL or unhandled fatal signal), then the request is dequeued and returns immediately.

If the request is not yet sent to userspace AND the signal is not fatal, then an “interrupted” flag is set for the request. When the request has been successfully transferred to userspace and this flag is set, an INTERRUPT request is queued.

If the request is already sent to userspace, then an INTERRUPT request is queued.

INTERRUPT requests take precedence over other requests, so the userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely, or may honor them by sending a reply to the original request, with the error set to EINTR.

It is also possible that there’s a race between processing the original request and its INTERRUPT request. There are two possibilities:

1. The INTERRUPT request is processed before the original request is processed

2. The INTERRUPT request is processed after the original request has been answered

If the filesystem cannot find the original request, it should wait for some timeout and/or a number of new requests to arrive, after which it should reply to the INTERRUPT request with an EAGAIN error. In case 1 the INTERRUPT request will be requeued. In case 2 the INTERRUPT reply will be ignored.

The description above is correct (see the detailed dissection of the kernel code below). However, beginning from the “race condition” part, it gets somewhat confusing.

Race condition?

In the rest of this post, there’s a detailed walkthrough of the involved functions in the v5.3.0 kernel, and there’s apparently no chance for the race condition mentioned in fuse.txt. It’s not even an old bug that was relevant when interrupt handling was introduced with Git commit a4d27e75ffb7b (where the text cited above in fuse.txt was added as well): Even looking at the original commit, there’s a clear locking mechanism that prevents any race condition in the kernel code. This was later replaced with memory barriers, which should work just the same.

All in all: An INTERRUPT request is queued, if at all, only after the related request has been submitted as the answer to a read() by the server.

So what is this all about, then? A multi-threaded server, which spreads requests randomly among work threads, might indeed handle requests in a random order. It seems like this is what the “race condition” comment refers to.

The solution to the non-existing problem

Had there been a possibility that an INTERRUPT request might arrive before the request it relates to, the straightforward solution would be to maintain an orphan list of Unique IDs of INTERRUPT requests that didn’t have a request processed when the INTERRUPT request arrived. This list would then be filled with INTERRUPT requests that arrived too early (before the related request) or too late (after the request was processed).

Then, for each non-INTERRUPT request that arrives, see if it’s in the list, and if so, remove the Unique ID from the list, and treat the request as interrupted.

But the requests that were added into the list because of the “too late” scenario will never get off the list this way. So some garbage collection mechanism is necessary.

The FUSE driver facilitates this by allowing a response with an -EAGAIN status to INTERRUPT requests. Even though no response is needed to INTERRUPT requests, an -EAGAIN response will cause the repeated queuing of the INTERRUPT request by the kernel if the related request is still pending, and otherwise do nothing.

So occasionally, the server may go through its list of orphans, and send an -EAGAIN response to each entry, and delete this entry as the response is sent. If the deleted entry is still relevant, it will be re-sent by the kernel, so it’s re-listed (or possibly handled directly if the related request has arrived in the meantime). Entries from the “too late” scenario won’t be re-listed, because the kernel will do nothing in reaction to the -EAGAIN response.

This is the solution suggested in fuse.txt on the race conditions issue. The reason this solution is suggested in the kernel’s documentation, even though it relates to a problem in a specific software implementation, is probably to explain the motivation for the -EAGAIN feature. But boy, was it confusing.

How libfuse handles INTERRUPT requests

Spoiler: The solution to the non-existent problem is implemented in libfuse 3.9.0 (and way back) as described above. The related comment was written based upon a problem that arose with libfuse. Which is multithreaded, of course.

The said garbage collection mechanism is run on the first entry in the list of orphaned INTERRUPT requests each time a non-INTERRUPT request arrives and has no match against any of the list’s members. This ensures that the list is emptied quite quickly, and without risk of an endless circulation of INTERRUPT requests, because the arrival of a non-INTERRUPT request means that the queue for INTERRUPT requests in the kernel was empty at that moment. A quirky solution to a quirky problem.

Note that even when libfuse is run with debug output, it’s difficult to say anything about ordering, as the debug output shows processing, not arrival. And the log messages come from different threads.

The problem of unordered processing of INTERRUPT requests could have been solved much more elegantly of course, but libfuse is a patch on patch, so they made another one.

And for the interested, this is the which-function-calls-what in libfuse.

So in libfuse’s fuse_lowlevel.c, the method for handling interrupts, do_interrupt(), first attempts to find the related request, and if it fails, it adds an entry to a session-specific list, se->interrupts. Then there’s check_interrupt(), which is called by fuse_session_process_buf_int() for each arriving request that isn’t an INTERRUPT itself. This function looks up the request in the list, and if it’s found, it sets that request’s “interrupted” flag, and removes the matching entry from the list. Otherwise, if the list is non-empty, it removes the first entry of se->interrupts and returns it to the caller, which initiates an EAGAIN for that entry.
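The orphan list policy described above can be modeled in a few lines. This is a simplified sketch of the idea only, not libfuse’s code: the names orphan_add() and orphan_check() and the fixed-size array are my own inventions, and libfuse’s locking is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ORPHANS 32   /* arbitrary bound for this sketch */

/* FIFO of unique IDs from INTERRUPT requests that found no target. */
struct orphan_list {
    uint64_t id[MAX_ORPHANS];
    int count;
};

/* Called when an INTERRUPT request matches no known request. */
int orphan_add(struct orphan_list *l, uint64_t unique)
{
    if (l->count == MAX_ORPHANS)
        return -1;
    l->id[l->count++] = unique;
    return 0;
}

static void orphan_remove_at(struct orphan_list *l, int idx)
{
    assert(idx >= 0 && idx < l->count);
    for (int i = idx; i < l->count - 1; i++)
        l->id[i] = l->id[i + 1];
    l->count--;
}

/*
 * Called for each arriving request that is not an INTERRUPT.
 * Returns 1 if the request was interrupted before it arrived (the
 * matching orphan is removed). Otherwise returns 0, and if the list
 * was non-empty, the oldest orphan is removed and handed back through
 * *eagain_for with *send_eagain set to 1, so that the caller can
 * answer that INTERRUPT request with -EAGAIN.
 */
int orphan_check(struct orphan_list *l, uint64_t unique,
                 int *send_eagain, uint64_t *eagain_for)
{
    *send_eagain = 0;
    for (int i = 0; i < l->count; i++) {
        if (l->id[i] == unique) {
            orphan_remove_at(l, i);
            return 1;
        }
    }
    if (l->count > 0) {
        *eagain_for = l->id[0];
        orphan_remove_at(l, 0);
        *send_eagain = 1;
    }
    return 0;
}
```

In this model, each non-matching request removes at most one orphan, so “too late” entries drain at the pace of incoming requests rather than all at once.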

Read the source

Since this is an important topic, let’s look at how this is implemented. So from this point until the end of this post, these are dissection notes of the v5.3.0 kernel source. There are commits applied all the time in this region, but in essence it seems to have been the same for a long time.

Generally speaking, all I/O operations that are initiated by the application program (read(), write(), etc.) end up with the setup of a fuse_req structure containing the request information in file.c, and its submission to the server front-end with a call to fuse_request_send(), which is defined in dev.c. If the I/O is asynchronous, fuse_async_req_send() is called instead, but that’s irrelevant for the flow discussed now. fuse_request_send() calls __fuse_request_send(), which in turn calls queue_request() which puts the request in the queue, and more importantly, request_wait_answer(), which puts the process to sleep until the request is completed (or something else happens…).

And now details…

So what does request_wait_answer() do? First, let’s get acquainted with some of the flags that are related to each request (i.e. in struct fuse_req’s flags entry), see also fuse_i.h:

FR_FINISHED: request_end() has been called for this request, which happens when the response for this request has arrived (but not processed yet — when that is done the request is freed). Or when it has been aborted for whatever reason (and once again, the error has not been processed yet).
FR_PENDING: The request is on the list of requests for transmission to the server. The flag is set when the fuse_req structure of a new request is initialized, and cleared when fuse_dev_do_read() has completed a server’s read() request. Or alternatively, failed for some reason, in which case request_end() has been called to complete the request with an error. So when it’s set, the request has not been sent to the server, but when cleared, it doesn’t necessarily mean it has.
FR_SENT: The request has been sent to the server. This is set by fuse_dev_do_read() when nothing can fail anymore. It differs from !FR_PENDING in that FR_PENDING is cleared when there’s an error as well.
FR_INTERRUPTED: This flag is set if an interrupt arrived while waiting for a response from the server.
FR_FORCE: Force sending of the request even if interrupted
FR_BACKGROUND: This is a background request. Irrelevant for the blocking scenario discussed here.
FR_LOCKED: Never mind this: It only matters when tearing down the FUSE connection and aborting all requests, and it determines the order in which this is done. It means that data is being copied to or from the request.

request_wait_answer()

With this at hand, let’s follow request_wait_answer() step by step:

Wait (with wait_event_interruptible(), sleeping) for FR_FINISHED to be set. Simply put, wait until a response or any interrupt has arrived.
If FR_FINISHED is set, the function returns (it’s a void function, and has no return value).
If any interrupt occurred while waiting, set FR_INTERRUPTED and check FR_SENT. If the latter was set, call queue_interrupt() to queue the request on the list of pending interrupt requests, unless it is already queued (as fixed in commit 8f7bb368dbdda). The same struct fuse_req is likely to be listed in two lists: one for the pending request and the second for the interrupt.

Note that these three bullets above are skipped if the FUSE connection had the “no_interrupt” flag set on invocation of request_wait_answer(). This flag is set if the server has answered any interrupt request earlier in the current session with an -ENOSYS.

Wait again for FR_FINISHED to be set, now with wait_event_killable(). This puts the process in the TASK_KILLABLE state, so it returns only when the condition is met or on a fatal signal. If wait_event_interruptible() was awakened by a fatal signal to begin with, there will be no waiting at all on this stage (because the signal is still pending).
If FR_FINISHED is set, the function returns. This means that a response has been received for the request itself. As explained below, this is unrelated to the interrupt request’s fate.
Otherwise, there’s a fatal signal pending. If FR_PENDING is set (the request has not been sent to server yet), the request is removed from the queue for transmission to the server (with due locking). Its status is set to -EINTR, and the function returns.

Note that these three bullets are skipped if the FR_FORCE flag is set for this request. And then, there’s the final step if none of the above got the function to return:

Once again, wait for FR_FINISHED to be set, but this time with the dreaded, non-interruptible wait_event(). In simple words, if the server doesn’t return a response for the request, the application that is blocking on the I/O call is sleeping and non-killable. This is not so bad, because if the server is killed (and hence closes /dev/fuse or /dev/cuse), all its requests are marked with FR_FINISHED.
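The three stages of request_wait_answer() can be condensed into a small decision model. This is my own simplification and not kernel code: it assumes one wake-up cause for the whole call and a frozen snapshot of the flags, while in the real thing the flags change under the function’s feet. All names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What ends the first wait in this model. */
enum wake { WAKE_ANSWER, WAKE_SIGNAL, WAKE_FATAL };

/* The stage at which the function returns (or keeps waiting). */
enum stage {
    STAGE1_ANSWERED,    /* wait_event_interruptible() saw FR_FINISHED */
    STAGE2_ANSWERED,    /* wait_event_killable() runs until FR_FINISHED */
    STAGE2_EINTR,       /* fatal signal, request dequeued with -EINTR */
    STAGE3_UNKILLABLE   /* final wait_event(): not even SIGKILL helps */
};

struct req_model {
    bool no_interrupt;      /* connection flag, set after an -ENOSYS */
    bool fr_force;
    bool fr_sent;
    bool fr_pending;
    bool fr_interrupted;    /* output: FR_INTERRUPTED was set */
    bool interrupt_queued;  /* output: queue_interrupt() was called */
};

enum stage model_wait_answer(struct req_model *r, enum wake w)
{
    assert(r != NULL);

    if (!r->no_interrupt) {
        /* Stage 1: any signal ends this wait */
        if (w == WAKE_ANSWER)
            return STAGE1_ANSWERED;
        r->fr_interrupted = true;
        if (r->fr_sent)
            r->interrupt_queued = true;
    }

    if (!r->fr_force) {
        /* Stage 2: only a fatal signal ends this wait early */
        if (w != WAKE_FATAL)
            return STAGE2_ANSWERED;
        if (r->fr_pending)
            return STAGE2_EINTR;
    }

    /* Stage 3: the request is with the server, or it was forced */
    return STAGE3_UNKILLABLE;
}
```

For example, a request that has already been sent to the server and is then hit by a fatal signal ends up in STAGE3_UNKILLABLE with the interrupt queued. So it’s up to the server to react to the INTERRUPT request quickly.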

To see the whole picture, a close look is needed at fuse_dev_do_read() and fuse_dev_do_write(), which are the functions that handle the request and response communication (respectively) with the driver.

fuse_dev_do_write()

Starting with fuse_dev_do_write(), which handles responses: After a few sanity checks (e.g. that the data lengths are in order), it looks up the request based upon the @unique field (for responses to interrupt requests, the original request is looked for). If the request isn’t found, the function returns with -ENOENT.

If the response has an odd @unique field, it’s an interrupt request response. If the @error field is -ENOSYS, the “no_interrupt” flag is set for the current connection (see above). If it’s -EAGAIN, another interrupt request is queued immediately. Otherwise the interrupt request response is ignored and the function returns. In other words, except for the two error codes just mentioned, it’s pointless to send them. The desired response to an interrupt request is to complete the original request, not responding to the interrupt request.
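The way a response is classified by its @unique and @error fields can be summarized as follows. This is a sketch of the logic as described here, not the kernel’s code: classify_reply() and the enum are made-up names, and the lookup of the original request (which may fail with -ENOENT) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FUSE_INT_REQ_BIT 1ULL   /* bit 0 of @unique marks an INTERRUPT */

enum reply_action {
    REPLY_REGULAR,        /* even @unique: response to a regular request */
    REPLY_DISABLE_INTR,   /* -ENOSYS: set no_interrupt for the connection */
    REPLY_REQUEUE_INTR,   /* -EAGAIN: queue the INTERRUPT request again */
    REPLY_IGNORED         /* any other status of an INTERRUPT response */
};

/*
 * Classify a response written by the server. *orig receives the unique
 * ID of the original request, which is what the lookup is based upon
 * for regular responses and INTERRUPT responses alike.
 */
enum reply_action classify_reply(uint64_t unique, int error, uint64_t *orig)
{
    assert(orig != NULL);
    *orig = unique & ~FUSE_INT_REQ_BIT;

    if (!(unique & FUSE_INT_REQ_BIT))
        return REPLY_REGULAR;
    if (error == -ENOSYS)
        return REPLY_DISABLE_INTR;
    if (error == -EAGAIN)
        return REPLY_REQUEUE_INTR;
    return REPLY_IGNORED;
}
```

So a response with @unique 43 and a zero @error lands on REPLY_IGNORED: the kernel looks up request 42 and then does nothing about it.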

So now to handling regular responses: The first step is to clear FR_SENT, which sort-of breaks the common sense meaning of this flag, but it’s probably a small hack to reduce the chance of an unnecessary interrupt request, as the original request is just about to finish.

The response’s content is then copied into kernel memory, and request_end() is called, which sets FR_FINISHED, then removes the request from the queue of pending interrupts (if it’s queued there), and after that it returns with due error code (typically success).

So not much interesting here.

fuse_dev_do_read() step by step

The function returns with -EAGAIN if /dev/fuse or /dev/cuse was opened in non-blocking mode, and there’s no data to supply. Otherwise, it waits with wait_event_interruptible_exclusive_locked() until there’s a request to send to the server in any of the three queues (INTERRUPT, FORGET or regular requests queues). If the server process got an interrupt, the wait function returns with -ERESTARTSYS, and so does this function. That may look like a bug, as one would expect -EINTR, but -ERESTARTSYS is the usual in-kernel return value for an interrupted wait: the kernel’s signal handling code either restarts the system call or converts the value to -EINTR before returning to user space.

First, the queue of pending interrupts is checked. If there’s any entry there, fuse_read_interrupt() is called, which generates a FUSE_INTERRUPT request with the @unique field set to the original request’s @unique, ORed with FUSE_INT_REQ_BIT (which equals 1). The request is copied into the user-space buffer, and fuse_dev_do_read() returns with the size of this request.
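From the server’s side, an INTERRUPT request is just a header followed by a single 64-bit field. Here is a sketch of how a server might recognize one in the buffer it got from read(). The two structs are local copies of struct fuse_in_header and struct fuse_interrupt_in as I know them from <linux/fuse.h> (renamed, so that the example stands on its own), and parse_interrupt() is a made-up helper. A real server should include the kernel header instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct fuse_in_header (40 bytes) */
struct in_header {
    uint32_t len;
    uint32_t opcode;
    uint64_t unique;    /* for INTERRUPT: original unique ID, ORed with 1 */
    uint64_t nodeid;
    uint32_t uid;
    uint32_t gid;
    uint32_t pid;
    uint32_t padding;
};

/* Local mirror of struct fuse_interrupt_in */
struct interrupt_in {
    uint64_t unique;    /* unique ID of the request to interrupt */
};

#define OP_INTERRUPT 36  /* the value of FUSE_INTERRUPT in <linux/fuse.h> */

/*
 * Examine one message read from /dev/cuse or /dev/fuse.
 * Returns 1 and stores the interrupted request's ID in *target if the
 * message is an INTERRUPT request, 0 if it is some other request, and
 * -1 if the buffer is too short to tell.
 */
int parse_interrupt(const void *buf, size_t n, uint64_t *target)
{
    struct in_header h;
    struct interrupt_in arg;

    assert(buf != NULL && target != NULL);
    if (n < sizeof(h))
        return -1;
    memcpy(&h, buf, sizeof(h));
    if (h.opcode != OP_INTERRUPT)
        return 0;
    if (n < sizeof(h) + sizeof(arg))
        return -1;
    memcpy(&arg, (const char *)buf + sizeof(h), sizeof(arg));
    *target = arg.unique;
    return 1;
}
```

The ID to look up is taken from the request’s body rather than from the header’s @unique field, so there’s no need to mask away the lowest bit.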

Second, FORGET requests are submitted, if such are queued.

If none of the INTERRUPT and FORGET were sent, the first entry in the request queue is dequeued, and its FR_PENDING flag is cleared. The I/O data handling then takes place.

Just before returning, the FR_SENT flag is set, and then FR_INTERRUPTED is checked. If the latter is set, queue_interrupt() is called to queue the request on the list of pending interrupt requests, unless it is already queued. Once again, the same struct fuse_req is likely to be listed in two lists: one for the pending request and the second for the interrupt. Together with request_wait_answer(), this ensures that an interrupt is queued as soon as FR_SENT is set: If the waiting function returned before FR_SENT is set, FR_INTERRUPTED is set by request_wait_answer() before checking FR_SENT, so fuse_dev_do_read() will call queue_interrupt() after setting FR_SENT. If the waiting function returned after FR_SENT is set, request_wait_answer() will call queue_interrupt(). And in case of a race condition, both will call this function; note that each of the two racers sets one flag and then checks the other one, in reverse order with respect to each other. And calling queue_interrupt() twice results in queuing the interrupt request only once.
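The claim that the interrupt can’t get lost is easy to check by brute force: each side performs two steps (set its own flag, then check the other side’s), so there are only six possible orderings. The following sketch runs through all of them. It is a toy model with made-up names, and it assumes that the steps are seen in the same order by both sides, which is what the memory barriers in the kernel code are there for.

```c
#include <assert.h>
#include <stdbool.h>

struct state {
    bool interrupted;   /* models FR_INTERRUPTED */
    bool sent;          /* models FR_SENT */
    bool queued;        /* the request is on the interrupt list */
    int queue_calls;    /* number of calls to queue_interrupt() */
};

static void queue_interrupt_model(struct state *s)
{
    s->queue_calls++;
    s->queued = true;   /* a second call changes nothing */
}

/* request_wait_answer() after a signal: set flag, then check FR_SENT */
static void waiter_step(struct state *s, int step)
{
    assert(step == 0 || step == 1);
    if (step == 0)
        s->interrupted = true;
    else if (s->sent)
        queue_interrupt_model(s);
}

/* fuse_dev_do_read(): set FR_SENT, then check FR_INTERRUPTED */
static void reader_step(struct state *s, int step)
{
    assert(step == 0 || step == 1);
    if (step == 0)
        s->sent = true;
    else if (s->interrupted)
        queue_interrupt_model(s);
}

/*
 * Run one ordering: bit i of 'order' tells who performs the i-th step
 * (0 = waiter, 1 = reader). Returns false if 'order' doesn't consist of
 * exactly two steps for each side.
 */
bool run_interleaving(unsigned order, struct state *out)
{
    struct state s = { false, false, false, 0 };
    int w = 0, r = 0;

    for (int i = 0; i < 4; i++) {
        if (order & (1u << i)) {
            if (r == 2)
                return false;
            reader_step(&s, r++);
        } else {
            if (w == 2)
                return false;
            waiter_step(&s, w++);
        }
    }
    *out = s;
    return true;
}

/* Count the valid orderings that leave the interrupt unqueued. */
int count_missed_interrupts(void)
{
    int missed = 0;
    struct state s;

    for (unsigned order = 0; order < 16; order++)
        if (run_interleaving(order, &s) && !s.queued)
            missed++;
    return missed;
}
```

In this model, count_missed_interrupts() returns zero: the two orderings without any overlap call queue_interrupt_model() once, and the other four call it twice.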
