FUSE / CUSE kernel driver dissection notes

This post was written by eli on February 28, 2020
Posted Under: Linux,Linux kernel

What this post is about

Before anything: If you’re planning on using FUSE / CUSE for an application, be sure to read this first. It also explains why I bothered looking at the kernel code instead of using libfuse.

So these are some quite random notes I took while trying to figure out how to talk with /dev/cuse directly by reading the kernel sources. I’m probably not going to touch CUSE with a five-foot stick again, so maybe this will help someone out there.

Everything said here relates to Linux v5.3. As FUSE is a bit of a hack-on-demand kind of filesystem, things change all the time.

CUSE vs. FUSE

CUSE is FUSE’s little brother, allowing the generation of a single device file in /dev, with the driver implemented in user space. Compared with FUSE’s ability to mount an entire filesystem, CUSE is much lighter, and is accordingly implemented as a piggy-back on the FUSE driver.

CUSE and FUSE are reached from user space through different device files: A server (i.e. driver) for FUSE opens /dev/fuse, and a server for CUSE opens /dev/cuse.

Note that the user application program that opens /dev/cuse or /dev/fuse is called the server. It’s actually a driver, but the latter term is saved for the FUSE kernel framework.

The driver for /dev/cuse is implemented entirely in fs/fuse/cuse.c, and it does quite little: All file operation methods for /dev/cuse are redirected to those used for /dev/fuse (by literally copying the list of methods), except for open and release.

The CUSE-specific method for open runs a slightly different initialization procedure against the server (more about this below) and eventually generates a character device file instead of making a filesystem mountable.

This character device file is assigned I/O methods that are handled in cuse.c, however their implementation relies heavily on functions that are defined in the mainline FUSE driver. Effectively, this device file is a FUSE file which is forced to use “direct I/O” methods to present a data pipe abstraction.

It might very well be that it’s possible to obtain the same result by setting up a small mounted filesystem with a file with certain settings, but I haven’t investigated this further. It seems however that the application program will have to open the file with the O_DIRECT flag for this to work. See Documentation/filesystems/fuse-io.txt in the kernel source tree.

The relevant source files

The FUSE filesystem handles I/O requests of two completely different types: Those related to the file system that is mounted in relation to it (or the device file generated on behalf of CUSE), and those related to the character device which the FUSE / CUSE server opens. This might cause a slight confusion, but the kernel code sticks to a naming convention that pretty much avoids it.

The interesting files in the kernel tree:

  • fs/fuse/file.c — Methods for handling I/O requests from the FUSE-mounted file system. The typical function name prefix is fuse_file_*.
  • fs/fuse/dev.c — Methods for handling I/O requests from /dev/fuse. The typical function name prefix is fuse_dev_*.
  • fs/fuse/cuse.c — CUSE-specific driver. Responsible for generating /dev/cuse, and for making it behave quite like /dev/fuse. In fact, it routes a lot of function calls to the FUSE driver. The typical function name prefix is cuse_channel_* for methods handling I/O requests from /dev/cuse. Functions named just cuse_* are handlers for the CUSE-generated character device. Note that the /dev/cuse character device is referred to as the “channel” so it’s not confused with the other one.
  • include/uapi/linux/fuse.h — Header file with all structures and constants that are visible in user space
  • fs/fuse/fuse_i.h — Header file with everything that isn’t visible from user space.

FUSE protocol

It’s probably necessary to be acquainted with writing a Linux kernel character device (at least) in order to understand the nuts and bolts of FUSE. It’s actually helpful to have worked with a device driver for Microsoft Windows as well, since the flow of I/O requests resembles the IRP concept in Windows’ driver model:

Each I/O request by the user space program goes into the kernel and is translated into a data structure which contains the information, and that data structure is handed over to the server (i.e. the driver in user space). The server queues the request for processing and acknowledges its reception, but not its completion. Rather, the server processes the request in its own free time, and when finished, it turns it back to the I/O system that requested it, along with the status and possibly data. If the user program blocks on the completion of the I/O system call (async I/O is also supported), it does so until the server turns back the request.

So there’s a flow of requests arriving from /dev/fuse (or /dev/cuse, as applicable), and a flow of responses written to the same file descriptor by the driver. The relation between the requests and responses is asynchronous (which is the main resemblance with IRPs), so the responses may arrive in no particular order.

The main difference from Windows’ IRP model is that Windows’ kernel makes calls to I/O operation handlers in the device driver (just like a Linux driver, but with the driver’s hands tied) with a pointer to the IRP. With FUSE, all requests go through a single pipe (good old UNIX design philosophy) and the driver chooses what to do with each. Also, in Windows, there’s a special treatment of requests that can be finished immediately — the driver can return with a status saying so. FUSE’s take on this matter is congratulations, finish the request and submit the response. Now or later makes no essential difference.

This way or another, the FUSE / CUSE server should not block or otherwise delay the reception of requests from /dev/fuse while handling a previous request (a Windows device driver is not allowed to block because it runs in arbitrary thread context, but that’s really irrelevant here). Even if it can’t or isn’t expected to handle another request before the current one is done, it must keep receiving requests while handling previous ones, at least for one reason: Accepting requests to handle a signal (interrupt) for an already queued request. More on that below.

The other side of the coin: A read() call from /dev/fuse or /dev/cuse may block, and will do so until there’s a request available to handle. On the other hand, a write() never blocks (which makes sense, since it merely informs the kernel driver a request has been finished). The poll() system call is implemented, so epoll() and select() can be used on /dev/fuse and /dev/cuse, rather than blocking a thread on waiting for a request (libfuse doesn’t take advantage of this).
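
As a rough illustration of the poll() option, here is a minimal sketch of waiting for a request without parking a thread inside read(). wait_for_request() is a name I made up for this example, and fd is assumed to be the already opened /dev/cuse or /dev/fuse descriptor (any pollable descriptor behaves the same way).

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>	/* read(), for the caller that follows up on a positive result */

/*
 * Wait until fd has a request to read, or until timeout_ms has elapsed.
 * Returns 1 if a read() is expected to succeed without blocking,
 * 0 on timeout and -1 on error (including a hangup or error condition).
 * Hypothetical helper, not part of libfuse or the kernel API.
 */
static int wait_for_request(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	assert(fd >= 0);

	rc = poll(&pfd, 1, timeout_ms);
	if (rc < 0)
		return -1;	/* errno is set by poll(), possibly to EINTR */
	if (rc == 0)
		return 0;	/* nothing arrived in time */

	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A server would typically call this in its main loop, and issue the read() only when 1 is returned, leaving it free to do other things in the meantime.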

I/O requests

The request from /dev/fuse or /dev/cuse starts with a header of the following form (defined in the kernel’s include/uapi/linux/fuse.h and libfuse’s libfuse/include/fuse_kernel.h):

struct fuse_in_header {
	uint32_t	len;
	uint32_t	opcode;
	uint64_t	unique;
	uint64_t	nodeid;
	uint32_t	uid;
	uint32_t	gid;
	uint32_t	pid;
	uint32_t	padding;
};

The header is then followed by data that is related to the request, if necessary.

@len is the number of bytes in the request, including the fuse_in_header struct itself.

@opcode says what operation is requested, out of those listed in enum fuse_opcode in the same header files (the opcodes are also listed and explained on this page).

@unique is the identifier of the request, to be used in the response. Note that if bit 0 is set (i.e. @unique is odd), the request is an interrupt notification to another request (with the @unique after clearing bit 0). This is not true on all kernel versions however.

The rest — nodeid, uid, gid and pid are quite obvious. But it’s noteworthy that the process ID is exposed to the driver in user space.

Each read from /dev/{cuse,fuse} is done with a single read() call, which dequeues one request from one of the kernel driver’s request queues: One for INTERRUPT requests, one for FORGET requests, and one for all the others. They are prioritized in this order (i.e. INTERRUPT requests go before any other etc.).

The read() call is atomic: It must request a number of bytes that is larger than or equal to the request’s @len, or the request is discarded and -EIO is returned instead. For this reason, the number of bytes of any read() from /dev/cuse or /dev/fuse must be at least @max_write plus the sizes of struct fuse_in_header and struct fuse_write_in (which precede the data in a WRITE request), where @max_write is as submitted in the cuse_init_out structure in response to an INIT request (see below). @max_write is expected to be 4096 at least.
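
To make the atomic read() rule a bit more tangible, this is a minimal sketch of how a server might split the buffer returned by one read() into header and payload. split_request() is a hypothetical helper of mine, and the struct is repeated only to keep the example self-contained (normally it comes from the header file).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as shown above; normally taken from include/uapi/linux/fuse.h */
struct fuse_in_header {
	uint32_t	len;
	uint32_t	opcode;
	uint64_t	unique;
	uint64_t	nodeid;
	uint32_t	uid;
	uint32_t	gid;
	uint32_t	pid;
	uint32_t	padding;
};

_Static_assert(sizeof(struct fuse_in_header) == 40, "unexpected header size");

/*
 * Split the result of a single read() into header and payload.
 * nread is the value returned by read(). Returns 0 on success, or -1 if
 * the buffer can't be one complete request.
 * Hypothetical helper, not a libfuse function.
 */
static int split_request(const void *buf, size_t nread,
			 struct fuse_in_header *hdr,
			 const void **payload, size_t *payload_len)
{
	assert(buf && hdr && payload && payload_len);

	if (nread < sizeof(*hdr))
		return -1;

	memcpy(hdr, buf, sizeof(*hdr));	/* no alignment assumptions on buf */

	if (hdr->len != nread)
		return -1;	/* one read() is expected to carry exactly one request */

	*payload = (const char *)buf + sizeof(*hdr);
	*payload_len = nread - sizeof(*hdr);
	return 0;
}
```

The payload is then interpreted according to @opcode, which is also the reliable way to recognize an INTERRUPT request.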

However oddly enough, in libfuse’s fuse_lowlevel.c it says

	se->bufsize = FUSE_MAX_MAX_PAGES * getpagesize() +
		FUSE_BUFFER_HEADER_SIZE;

(the session’s buffer size for arriving requests is se->bufsize) and then libfuse’s fuse_i.h goes

#define FUSE_MAX_MAX_PAGES 256

but how is that an upper limit of something?

I/O responses

Responses are written by the server into the same file descriptor of /dev/fuse or /dev/cuse, starting with a header as follows:

struct fuse_out_header {
	uint32_t	len;
	int32_t		error;
	uint64_t	unique;
};

The meanings of @len and @unique are the same as in the request: @len includes the header itself, and @unique is a copy of the identifier of the request (with some extra care when handling interrupt requests).

@error is the status. Zero means success, negative numbers are in the well-known form of -EAGAIN, -EINVAL etc. It’s expected to be zero or negative (but not below -999). If it’s non-zero, the response must consist of a header only, or the write() call that submits the response returns -EINVAL.

A response write() is atomic as well: The number of bytes requested in the call must equal @len, or the call returns -EINVAL.
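
The rules for a regular response can be put together in a few lines. The following is only a sketch under the constraints listed above; build_response() is a name I invented, and it deliberately rejects positive @error values, so it isn't suitable for notifications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as shown above; normally taken from include/uapi/linux/fuse.h */
struct fuse_out_header {
	uint32_t	len;
	int32_t		error;
	uint64_t	unique;
};

/*
 * Assemble a response in buf, ready to be submitted with one write().
 * Returns the number of bytes to write (same as the header's len), or 0
 * if the response doesn't fit in cap bytes or breaks the rules.
 * Hypothetical helper, not a libfuse function.
 */
static size_t build_response(void *buf, size_t cap, uint64_t unique,
			     int32_t error,
			     const void *payload, size_t payload_len)
{
	struct fuse_out_header hdr;
	size_t total;

	assert(buf);

	if (error > 0 || error < -999)
		return 0;	/* zero or a negative errno only */
	if (error != 0 && payload_len != 0)
		return 0;	/* an error response is header only */
	if (cap < sizeof(hdr) || payload_len > cap - sizeof(hdr))
		return 0;

	total = sizeof(hdr) + payload_len;
	if (total > UINT32_MAX)
		return 0;

	hdr.len = (uint32_t)total;
	hdr.error = error;
	hdr.unique = unique;

	memcpy(buf, &hdr, sizeof(hdr));
	if (payload_len)
		memcpy((char *)buf + sizeof(hdr), payload, payload_len);

	return total;
}
```

The return value is what goes into write() as the byte count, so that it matches @len as required.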

How requests are made in the kernel code

For each request to the server, a struct fuse_req is allocated and initialized to contain the information on the request to send and what the answer is about to look like. This begins with calling fuse_get_req() or fuse_get_req_for_background(), which both call __fuse_get_req(struct fuse_conn *fc, unsigned npages, bool for_background).

To make a long story short, this function allocates the memory for the struct fuse_req itself as well as a memory array of npages entries of struct page and struct fuse_page_desc. It also initializes several functional fields of the structure, among others the pages, page_descs, max_pages entries, as well as setting the reference count to 1, the FR_PENDING flag and initializing the two list headers and the wait queue. The pid, uid and gid fields in the information for the request are also set.

Then the fuse_req structure is set up specifically for the request. In particular, the @end entry points at the function to call by request_end() following the arrival of a response from the server or the abortion of the request.

The fuse_req has two entries, @in and @out, which are of type fuse_in and fuse_out, respectively. Note that “in” and “out” are from the server’s perspective, so “in” means kernel to server and vice versa.

struct fuse_arg {
	unsigned size;
	void *value;
};

struct fuse_in_arg {
	unsigned size;
	const void *value;
};

struct fuse_in {
	struct fuse_in_header h;
	unsigned argpages:1;
	unsigned numargs;
	struct fuse_in_arg args[3];
};

struct fuse_out {
	struct fuse_out_header h;
	unsigned argvar:1;
	unsigned argpages:1;
	unsigned page_zeroing:1;
	unsigned page_replace:1;
	unsigned numargs;
	struct fuse_arg args[2];
};

Despite the complicated outline, the usage is quite simple. It’s described in detail in the rest of this section, but in short: The request consists of a fuse_in_header followed by arguments, which are just a concatenation of strings (there are @in.numargs of them) that are set up when the request is prepared. @value and @size are set up in an array of struct fuse_in_arg.

The response is a concatenation of fuse_out_header and @out.numargs arguments, which are once again concatenated strings. The sizes and buffers are set up when the request is generated. The @argvar flag is possibly set to allow a shorter response at the expense of the last argument. Look at the function pointed to by @end for how these arguments are interpreted.

And now the longer version of the two clauses above:

When a request is prepared for transmission to the server by fuse_dev_do_read(), it concatenates the @h entry in the struct fuse_in with @numargs “arguments”. Each “argument” is a string, which is represented as a fuse_in_arg entry in the @args array, by a pointer @value and the number of bytes given as @size. So it’s a plain string concatenation of @numargs + 1 strings, the first with a fixed size (of struct fuse_in_header) and some variable-length strings. What makes it seem complicated is the paging-aware data copying mechanism.

As for handling the arrival of responses from the server: Except for notifications and interrupt replies, fuse_dev_do_write() handles the write() request, which must include everything in the buffer submitted, as follows. The first bytes are copied into the fuse_req’s @out.h, or in other words, the fuse_out’s @h entry. So this consumes the number of bytes in a struct fuse_out_header.

The rest is chopped into arguments (by copy_out_args() ), following the same convention of @numargs concatenated strings, each having the length of @size and written into the buffer pointed by @value. @numargs as well as entries of the struct fuse_arg array are set when preparing the request — when the response arrives, the relevant buffers are filled. And don’t confuse struct fuse_arg with struct fuse_args, which is completely different.

copy_out_args() checks the header’s @error field before copying anything. If it’s non-zero, no copying takes place: The response is supposed to consist of a struct fuse_out_header only.

The last argument of a response from the server may be shorter (possibly zero length) than its respective @size entry if and only if the @argvar entry in the related struct fuse_out is set (which is possibly done when preparing the request). If this is the case, the server simply submits fewer bytes than the sum of the header + all arguments’ @size, and the last argument is shortened accordingly. This may sound complicated, but it just means, for example, that a response to READ submits the data that it managed to collect.
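
The length rules described in the last few paragraphs boil down to a small check. What follows is a user-space model of these rules as I understand them, not the kernel’s code: check_response_len() and HDR_SIZE are names made up for the example, and the argument sizes are passed as a plain array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define HDR_SIZE 16u	/* sizeof(struct fuse_out_header) */

/*
 * Model of the length rules for a response of nbytes bytes (header included).
 * sizes[] holds the @size of each expected argument. If the response is
 * short and argvar is set, the last entry is reduced accordingly.
 * Returns 0 if the length is acceptable, -EINVAL otherwise.
 */
static int check_response_len(int error, unsigned *sizes, unsigned numargs,
			      int argvar, unsigned nbytes)
{
	unsigned expected = HDR_SIZE;
	unsigned missing, i;

	assert(sizes || numargs == 0);

	if (error)	/* error responses consist of a header only */
		return nbytes == HDR_SIZE ? 0 : -EINVAL;

	for (i = 0; i < numargs; i++)
		expected += sizes[i];

	if (nbytes > expected)
		return -EINVAL;
	if (nbytes == expected)
		return 0;
	if (!argvar || numargs == 0)
		return -EINVAL;	/* short response, but not allowed to be */

	missing = expected - nbytes;
	if (missing > sizes[numargs - 1])
		return -EINVAL;	/* only the last argument may shrink */

	sizes[numargs - 1] -= missing;
	return 0;
}
```

For example, with two arguments of 72 and 4096 bytes and @argvar set, a response of 105 bytes is accepted, and the last argument ends up with 17 bytes.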

Once again, all this sounds a bit scary, but take the relevant snippet from cuse_send_init() defined in the kernel’s fs/fuse/cuse.c:

	req->in.h.opcode = CUSE_INIT;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(struct cuse_init_in);
	req->in.args[0].value = arg;
	req->out.numargs = 2;
	req->out.args[0].size = sizeof(struct cuse_init_out);
	req->out.args[0].value = outarg;
	req->out.args[1].size = CUSE_INIT_INFO_MAX;
	req->out.argvar = 1;
	req->out.argpages = 1;
	req->pages[0] = page;
	req->page_descs[0].length = req->out.args[1].size;
	req->num_pages = 1;
	req->end = cuse_process_init_reply;
	fuse_request_send_background(fc, req);

It’s quite clear: The driver sends one argument (i.e. one string) after the header, and expects two back in the response. And the function that handles the response is cuse_process_init_reply(). So it’s fairly easy to tell what is sent and what is expected in return.

How CUSE implements read()

The CUSE driver (cuse.c) assigns cuse_read_iter() for the read_iter fops method. This function sets the file position to zero, and calls fuse_direct_io(), defined in file.c. Not to be confused with fuse_direct_IO(), defined in the same file.

The latter function retrieves the number of bytes to process as its local variable @count. It then loops on sending requests and retrieving the data as follows (outlined for non-async I/O): fuse_send_read() is called for sending a READ request to the server by calling fuse_read_fill() and fuse_request_send(). The latter is defined in dev.c, and calls __fuse_request_send(), which queues the request for transmission (with queue_request()) and then waits (i.e. blocks, sleeps) until the response with a matching unique ID has arrived (by calling request_wait_answer()). This happens by virtue of the server’s invocation of a write() on its /dev/cuse filehandle, with a matching unique ID.

Back to the loop on @count, fuse_send_read() returns with the number of bytes of the response’s first argument, that is, the length of the data that arrived. The loop hence continues with checking the error status of the response (in the @error field). If there was an error, or if there were fewer bytes than requested in the response, the loop terminates. It also terminates if @count is zero after deducting the number of arrived bytes from it.

The return value of fuse_direct_io(), which is also the return value of cuse_read_iter(), is the number of bytes that were read (in total), if this number is non-zero, even if the loop quit because of an error. Only if no bytes were received, the function returns the @error field in the response (which is zero if there was neither an error nor data).

The rationale behind the loop and the way it handles errors is that a single read() request by the application may be chopped into several READ requests if the read() can’t be fit into a single READ request (i.e. the read()’s @count is larger than max_read, as specified on the INIT response). It’s therefore necessary to iterate.
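
The loop’s behavior, as described above, can be modeled in user space. This is a simplified sketch and not the kernel’s fuse_direct_io(): chunked_read(), read_chunk_fn and the toy server are all made up for the illustration, and the guard against a response longer than requested is my own addition to keep the sketch safe.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for one READ round trip: returns bytes delivered, or -errno */
typedef long (*read_chunk_fn)(size_t nbytes, void *ctx);

static long chunked_read(size_t count, size_t max_read,
			 read_chunk_fn send_read, void *ctx)
{
	long total = 0;
	long err = 0;

	assert(max_read > 0 && send_read);

	while (count) {
		size_t nbytes = count < max_read ? count : max_read;
		long got = send_read(nbytes, ctx);

		if (got < 0) {	/* error status in the response */
			err = got;
			break;
		}
		if ((size_t)got > nbytes) {	/* more than requested: sketch's own guard */
			total = 0;
			err = -EIO;
			break;
		}

		count -= (size_t)got;
		total += got;

		if ((size_t)got != nbytes)	/* short response ends the loop */
			break;
	}

	/* Data wins over the error status if anything was read */
	return total > 0 ? total : err;
}

/* Toy server: hands out up to avail bytes, then fails with fail_with (if nonzero) */
struct toy_source {
	size_t avail;
	long fail_with;
};

static long toy_send_read(size_t nbytes, void *ctx)
{
	struct toy_source *s = ctx;
	size_t n;

	if (s->avail == 0 && s->fail_with)
		return s->fail_with;

	n = nbytes < s->avail ? nbytes : s->avail;
	s->avail -= n;
	return (long)n;
}
```

With max_read set to 4096, a read() of 10000 bytes turns into chunks of 4096, 4096 and 1808 bytes, and an error on the second chunk still yields 4096 as the return value.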

How CUSE implements write()

The CUSE driver (cuse.c) assigns cuse_write_iter() for the write_iter fops method. This function sets the file position to zero, and like cuse_read_iter(), it calls fuse_direct_io(), defined in file.c. Only with different arguments to tell the latter function that the data goes in the opposite direction.

fuse_direct_io() calls fuse_send_write() instead of fuse_send_read(), which calls fuse_write_fill() instead of fuse_read_fill(). And then fuse_request_send() is called, which sends the request and waits for its response. fuse_send_write() returns with the number of bytes that were actually written, as it appears in the @size entry of the struct fuse_write_out in the response.

Note that the kernel driver sends a buffer along with the WRITE call, and the server chooses how much to consume from it, and then tells the kernel about that in the response. This requires a small discussion on partial handling of write().

The tricky thing with a write() is that the application program supplies a buffer to write, along with the number of bytes to write. It’s perfectly fine to point to a huge buffer and set the count to the entire buffer. Any character device driver may write the entire buffer, or just as much as it can at the moment, and return the number of bytes written. The fact that a huge number of bytes were requested makes no difference, because the character device driver treats the request as if it was for the number of bytes it could write. The rest of the buffer is ignored.

So there are two problems, both arising when the buffer of the write() from the application program is large: One is how to make sure that the server has allocated a buffer large enough to receive the data in one go (recall that both requests and responses must be done in a single I/O operation). The second and smaller problem is the wasted I/O of data in a WRITE request that is eventually ignored, because the server chose to consume less than available.

To prevent huge buffers from being transmitted to the server and then ignored, the server supplies a max_write parameter in its response to an INIT request, which sets the maximal number of bytes transmitted in a WRITE request to the server (it should be 4096 or larger). So the write() operation is chopped up into smaller buffers by FUSE / CUSE as necessary to meet this restriction.

This parameter is a tradeoff between reducing the number of I/Os with the server and the risk of wasted data transfers. libfuse picks 128 kB.
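
On the server’s side, partial handling of a WRITE request comes down to reporting how much was consumed. Here is a minimal sketch of that response, assuming the server somehow knows how much room it has; build_write_response() is an invented name, and struct fuse_write_out is repeated from include/uapi/linux/fuse.h for the sake of a self-contained example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fuse_out_header {
	uint32_t	len;
	int32_t		error;
	uint64_t	unique;
};

/* The single argument of a WRITE response */
struct fuse_write_out {
	uint32_t	size;
	uint32_t	padding;
};

/*
 * Fill buf (24 bytes at least) with the response to a WRITE request that
 * offered `offered` bytes, when the server has room for `room` bytes.
 * Returns the number of bytes to write().
 * Hypothetical helper, not a libfuse function.
 */
static size_t build_write_response(void *buf, uint64_t unique,
				   uint32_t offered, uint32_t room)
{
	struct fuse_out_header hdr;
	struct fuse_write_out out;

	assert(buf);

	out.size = offered < room ? offered : room;	/* bytes consumed */
	out.padding = 0;

	hdr.len = sizeof(hdr) + sizeof(out);
	hdr.error = 0;
	hdr.unique = unique;

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + sizeof(hdr), &out, sizeof(out));
	return hdr.len;
}
```

Whatever the server didn’t consume is simply dropped, which is exactly the wasted transfer that max_write is there to limit.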

There is no similar problem with read() calls, because the server submits only the bytes actually read, following a response header whose @len says how many bytes are submitted. Nevertheless, there is a separate max_read limit for CUSE sessions (but not for FUSE, which copies it from max_write).

Handling interrupts (signals)

There is a lot of fuss about this topic, which is discussed on a separate post. To make a long story short, a server must be able to process INTERRUPT requests. To the server, such a request is just like the others, in the sense that it consists of a struct fuse_in_header followed by a single argument:

struct fuse_interrupt_in {
	uint64_t	unique;
};

The function that implements this in the kernel is fuse_read_interrupt() in dev.c.

Note that there are two @unique IDs in the request. One is in the header, which is the ID of the interrupt request itself. The second is in the argument, which is the unique ID of the request that should be interrupted. The server should not assume any special connection between the two (there is such since kernel v4.20, due to commit c59fd85e4fd07).

When a server receives an INTERRUPT request, it shall immediately send a response (i.e. completion) of the request with the @unique given in the argument. An -EINTR status may be reported, in accordance with the common POSIX rules.

Note that even though an INTERRUPT request is guaranteed to be conveyed to the server after the request it relates to, it may arrive after the server’s response has been submitted if a race condition occurs. As a result, the server may receive INTERRUPT requests with a @unique ID that it doesn’t recognize (because it has removed its records while responding). Therefore, the server should ignore such requests.
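
A server that honors these rules needs some record of the requests it hasn’t answered yet. The following is a single-threaded sketch with a tiny fixed-size table; all names (inflight_table, inflight_add() and so on) are invented for the example, and a real server would probably want something less naive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 16	/* arbitrary limit for this sketch */

struct inflight_table {
	uint64_t unique[MAX_INFLIGHT];
	int used[MAX_INFLIGHT];
};

/* Record a request that was received but not answered yet. 0 on success, -1 if full. */
static int inflight_add(struct inflight_table *t, uint64_t unique)
{
	assert(t);

	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		if (!t->used[i]) {
			t->used[i] = 1;
			t->unique[i] = unique;
			return 0;
		}
	}
	return -1;
}

/* Drop a record, typically just before responding. Returns 1 if it was pending. */
static int inflight_remove(struct inflight_table *t, uint64_t unique)
{
	assert(t);

	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		if (t->used[i] && t->unique[i] == unique) {
			t->used[i] = 0;
			return 1;
		}
	}
	return 0;
}

/*
 * Handle an INTERRUPT request whose argument names `target`.
 * Returns 1 if the caller should now complete `target` (e.g. with -EINTR),
 * or 0 if the ID is unknown, in which case the INTERRUPT is just ignored.
 */
static int handle_interrupt(struct inflight_table *t, uint64_t target)
{
	return inflight_remove(t, target);
}
```

Note that a second INTERRUPT for the same ID, or one that arrives after the response was sent, falls into the "unknown ID" case and does no harm.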

On the other hand, if multiple threads fetch requests from the same file descriptor (of /dev/cuse or /dev/fuse), one thread may decode the INTERRUPT request before the original request has been recorded. This possibility is present in the libfuse implementation, and is the reason behind the complication discussed in that other post.

POLL requests

Poll is different from many other requests in that it requires two (or even more) responses from the server:

  • An immediate response, with the bitmap informing which operations are possible right away
  • Possibly additional notifications, when one or more of the selected operations have become possible.

fuse_file_poll() in file.c handles poll() calls on a file. It queues a FUSE_POLL request, with one argument, consisting of a fuse_poll_in struct:

struct fuse_poll_in {
	uint64_t	fh;
	uint64_t	kh;
	uint32_t	flags;
	uint32_t	events;
};

The @events entry is set with

inarg.events = mangle_poll(poll_requested_events(wait));

which supplies a bitmap of the events that are waited for in POSIX style (mangle_poll() is defined in the kernel’s poll.h, which does the conversion).

@flags may have one flag set, FUSE_POLL_SCHEDULE_NOTIFY, saying that there’s a process actually waiting. If it’s set, the server is required to send a notification when the file becomes ready. If cleared, the server may send such notification, but it will be ignored.

@fh and @kh are the file’s file handle, in userspace and kernel space respectively (the latter is systemwide unique).

If there is a process waiting, the file is then registered in a dedicated data structure (an RB tree), and will be kept there until the file is released. The underlying idea is that if a file descriptor has been polled once, it’s likely to be polled many more times.

Either way, the POLL request is submitted, and the server is expected to submit a response with a poll bitmap, which is deconverted into kernel format, and used as the poll() return value. Consequently, poll() blocks until the response arrives.

Should the server respond with an -ENOSYS status, no more POLL requests are sent to the server for the rest of the session, and DEFAULT_POLLMASK is returned on this and all subsequent poll() calls. It’s defined in poll.h:

#define DEFAULT_POLLMASK (EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM)

So there’s the poll response:

struct fuse_poll_out {
	uint32_t	revents;
	uint32_t	padding;
};

Rather trivial — just the events that are active.
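
Putting the request and its response together, a server’s POLL handler might look something like the sketch below. answer_poll() and struct poll_watch are made-up names for the example, and the way the server figures out which events are active right now (ready_now) is left out completely. FUSE_POLL_SCHEDULE_NOTIFY is bit 0 of @flags, as defined in include/uapi/linux/fuse.h.

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_POLL_SCHEDULE_NOTIFY (1u << 0)

struct fuse_poll_in {
	uint64_t	fh;
	uint64_t	kh;
	uint32_t	flags;
	uint32_t	events;
};

struct fuse_poll_out {
	uint32_t	revents;
	uint32_t	padding;
};

/* Hypothetical per-file bookkeeping kept by the server */
struct poll_watch {
	int armed;	/* nonzero if a notification is owed */
	uint64_t kh;	/* value to echo back in that notification */
};

/*
 * Produce the argument of the POLL response. ready_now is the bitmap of
 * events that are active at the moment, in the same format as @events.
 * If the kernel asked for a notification, @kh is remembered in w.
 */
static struct fuse_poll_out answer_poll(const struct fuse_poll_in *in,
					uint32_t ready_now,
					struct poll_watch *w)
{
	struct fuse_poll_out out = { ready_now, 0 };

	assert(in && w);

	if (in->flags & FUSE_POLL_SCHEDULE_NOTIFY) {
		w->armed = 1;
		w->kh = in->kh;
	}

	return out;
}
```

The returned struct goes right after the response header, and the stored @kh is what the server needs later on if it wants to wake up the waiting process.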

More interesting are the notifications. The server may send a notification anytime by setting @unique to zero and the @error field to the code of the notification request (FUSE_NOTIFY_POLL == 1). The @opcode field is ignored in this case (there is no opcode for notifications).

There’s one argument in a poll notification:

struct fuse_notify_poll_wakeup_out {
	uint64_t	kh;
};

where @kh echoes back the value in the poll request.

In dev.c, fuse_notify() calls fuse_notify_poll(), which in turn calls fuse_notify_poll_wakeup() (in file.c) after a few sanity checks.

fuse_notify_poll_wakeup() looks up the value of @kh entry in the dedicated data structure. If it’s not found, the notification is silently ignored. This is considered OK, since the server is allowed to send notifications even if FUSE_POLL_SCHEDULE_NOTIFY wasn’t set.

If the entry is found, wake_up_interruptible_sync() is called on the file’s wait queue that is used only in relation to poll (which is known from the entry in the data structure). That’s it.
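
For completeness, this is roughly what the server writes in order to send a poll notification. It’s a minimal sketch based on the description above: build_poll_wakeup() is a name I made up, and the structs are repeated to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FUSE_NOTIFY_POLL 1	/* notification code, goes into @error */

struct fuse_out_header {
	uint32_t	len;
	int32_t		error;
	uint64_t	unique;
};

struct fuse_notify_poll_wakeup_out {
	uint64_t	kh;
};

/*
 * Fill buf (24 bytes at least) with a poll wakeup notification for the
 * file identified by kh. Returns the number of bytes to write().
 * Hypothetical helper, not a libfuse function.
 */
static size_t build_poll_wakeup(void *buf, uint64_t kh)
{
	struct fuse_out_header hdr;
	struct fuse_notify_poll_wakeup_out arg;

	assert(buf);

	hdr.len = sizeof(hdr) + sizeof(arg);
	hdr.error = FUSE_NOTIFY_POLL;	/* the code of the notification */
	hdr.unique = 0;			/* zero marks a notification */
	arg.kh = kh;			/* as received in the POLL request */

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + sizeof(hdr), &arg, sizeof(arg));
	return hdr.len;
}
```

As with regular responses, the whole thing is submitted with a single write() to the /dev/cuse or /dev/fuse file descriptor.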

poll() is supported by FUSE since kernel v2.6.29 (Git commit 95668a69a4bb8, Nov 2008)

CUSE INIT requests

The bringup of the device file is initiated by the kernel driver, which sends a CUSE_INIT request. The server sets up the connection and device file’s attributes by responding to this request.

In cuse.c, cuse_channel_open() implements /dev/cuse’s method for open(). Aside from allocating and initializing a struct cuse_conn for containing the private data of this connection, it calls cuse_send_init() for queuing a CUSE_INIT (opcode 4096) request to the new file handle. Note that this is different from the FUSE_INIT (opcode 26) that arrives from /dev/fuse.

The request consists of a struct fuse_in_header concatenated with a struct cuse_init_in:

struct cuse_init_in {
	uint32_t	major;
	uint32_t	minor;
	uint32_t	unused;
	uint32_t	flags;
};

The major and minor fields are the FUSE_KERNEL_VERSION and FUSE_KERNEL_MINOR_VERSION, telling the server which FUSE version the kernel offers. flags is set to 0x01, which is CUSE_UNRESTRICTED_IOCTL.

The pid, uid and gid in the header are those of the process that opened /dev/cuse — not really interesting. @unique is typically 1 (but don’t rely on it — it can be anything in future versions). On fairly recent kernels, it continues with 2 and increments by 2 for each request to follow. On older kernels, it just counts upwards with steps of 1. The unique ID mechanism was changed in kernel commit c59fd85e4fd07 (September 2018, v4.20) for the purpose of allowing a hash of unique IDs in the future.

The response is a string concatenation of the following three elements (header + two arguments):

  • A struct fuse_out_header, with the header for the response (with @unique typically set to 1)
  • A struct cuse_init_out with some information (more on that below)
  • A null-terminated string that reads e.g. “DEVNAME=mydevice” (without the quotes, of course) for generating the device file /dev/mydevice. Don’t forget to actually write the null byte at the end, or the device generation fails with a “CUSE: info not properly terminated” in the kernel log.

struct cuse_init_out is defined as

struct cuse_init_out {
	uint32_t	major;
	uint32_t	minor;
	uint32_t	unused;
	uint32_t	flags;
	uint32_t	max_read;
	uint32_t	max_write;
	uint32_t	dev_major;
	uint32_t	dev_minor;
	uint32_t	spare[10];
};

The fields of cuse_init_out are as follows:

  • @major and @minor have the same meaning as these fields in struct cuse_init_in, but they reflect the version that the server is designed for, and hence rules the session. As of kernel v5.3 (which implements FUSE version 7.26), @major must be 7 and @minor at least 11, or the initialization fails. FUSE 7.11 was introduced in kernel v2.6.29 in 2008. See include/uapi/linux/fuse.h in the kernel sources for the revision history.
  • @max_read and @max_write are the maximal number of bytes in the payload of a READ and WRITE request, respectively. Note that @max_write forces read() calls on /dev/cuse to supply a @count parameter of at least @max_write + the size of struct fuse_in_header + the size of struct fuse_write_in, or WRITE requests may fail. Likewise, the response to a READ request may be as long as @max_read + the size of struct fuse_out_header. What counts is the length of the requests and their possible responses, which includes the lengths of the non-data parts.
  • @flags: If bit 0 (CUSE_UNRESTRICTED_IOCTL) is set, unrestricted ioctls are enabled.
  • @dev_major and @dev_minor are the created device file’s major and minor numbers. This means that the server needs to make sure that they aren’t already allocated.
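
To tie the pieces of the INIT response together, here is a minimal sketch that assembles the three elements into one buffer. build_cuse_init_response() is an invented name, the structs are repeated for the sake of a self-contained example, and all values in struct cuse_init_out are left for the caller to choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fuse_out_header {
	uint32_t	len;
	int32_t		error;
	uint64_t	unique;
};

struct cuse_init_out {
	uint32_t	major;
	uint32_t	minor;
	uint32_t	unused;
	uint32_t	flags;
	uint32_t	max_read;
	uint32_t	max_write;
	uint32_t	dev_major;
	uint32_t	dev_minor;
	uint32_t	spare[10];
};

/*
 * Assemble the CUSE_INIT response: header, struct cuse_init_out and then
 * "DEVNAME=<devname>" including its terminating null byte.
 * Returns the number of bytes to write(), or 0 if buf is too small.
 * Hypothetical helper, not a libfuse function.
 */
static size_t build_cuse_init_response(void *buf, size_t cap, uint64_t unique,
				       const struct cuse_init_out *init,
				       const char *devname)
{
	static const char key[] = "DEVNAME=";
	struct fuse_out_header hdr;
	size_t key_len = sizeof(key) - 1;
	size_t name_len, total;
	char *p = buf;

	assert(buf && init && devname);

	name_len = strlen(devname) + 1;	/* null byte included */
	total = sizeof(hdr) + sizeof(*init) + key_len + name_len;
	if (total > cap)
		return 0;

	hdr.len = (uint32_t)total;
	hdr.error = 0;
	hdr.unique = unique;	/* copied from the CUSE_INIT request */

	memcpy(p, &hdr, sizeof(hdr));
	p += sizeof(hdr);
	memcpy(p, init, sizeof(*init));
	p += sizeof(*init);
	memcpy(p, key, key_len);
	p += key_len;
	memcpy(p, devname, name_len);	/* copies the null byte as well */

	return total;
}
```

For the device name "mydevice", this adds up to 16 + 72 + 17 = 105 bytes, which is also the byte count for the write() call.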

FORGET requests

These requests inform a FUSE server that there’s no need to retain information on a specific inode. This request will never appear on /dev/cuse.
