A dissection of WDK’s PCIDRV sample driver’s IRP juggling
Scope
The WDK arrives with sample code for a PCI driver, known as PCIDRV. It demonstrates the recommended old-school methods for avoiding mishaps when maintaining your own IRP queue.
Frankly speaking, I wrote this post as I learned the different corners of IRP juggling, so pretty much like certain operating systems, it's a bit of a patchwork with possible inconsistencies. What is written below should be taken as hints, not as a reference manual.
So this post consists of my own understanding of the example’s underlying logic. It’s definitely not an introduction to anything. This blob won’t make any sense unless you’re pretty familiar with the PCIDRV.C source and understand the API for handling (and canceling) IRPs.
To be fair, I’ll mention that Cancel-Safe Queues exist (but are not used in PCIDRV), and that the framework suggested by Microsoft for new drivers is KMDF (again, PCIDRV follows the older WDM). The choice is yours.
If I missed something, comments are welcome.
The possible bads
Above anything else, the goal is to avoid a nasty BSOD (Blue Screen of Death, bug check, you name it). In this context, this would happen because some code segment attempts to access a memory region which has either been freed (unmapped virtual memory = bug check on the spot) or has taken on a new function (freed and reallocated, resulting in memory corruption). The following situations can cause this to happen:
- Attempting to access the device's I/O regions after the resources have been released (so nothing is mapped at that virtual address)
- Attempting to access device object data (the device extension region in particular, a.k.a. FdoData) after that has been freed
- Attempting to access an IRP data structure which has been freed
- Writing to the process' data buffer (in direct I/O mode) after the process has shut down and its memory has been freed.
Slightly less catastrophic is not responding to cancel requests. Or more accurately put: not responding with an IoCompleteRequest() to the fact that the Cancel entry of the IRP has gone TRUE (and the cancel routine, if registered, has been called) within a time period not perceptible to humans. This is bad because the userspace application behind the IRP will not terminate until all its IRPs are gone, no matter how many times the user attempts to kill it with various weapons. And then the system will not be able to shut down, because the driver will refuse to unload while it has outstanding IRP requests.
As for the possibility of trying to interact with hardware that isn't physically there anymore, that's a driver-dependent matter. For example, trying to read from a PCIe device that doesn't respond (possibly because it has been unplugged) should initiate a timeout mechanism, and finish the transaction with a peaceful all-1's for data. What will actually happen depends on several factors (how the hardware actually behaves, and how Windows responds to that).
Carrying out the essential task of an IRP that has been canceled isn't a problem in itself, as long as no illegal memory accesses take place.
Things guaranteed
To tackle the issues above, the Windows kernel promises a few things:
- The IRP's memory will remain allocated at least until IoCompleteRequest() has been called, whether by the cancel routine or by normal completion (but IoFreeIrp() can be called any time afterwards).
- The calling application will not terminate until all its outstanding IRPs have undergone IoCompleteRequest().
- The device object will not be freed until the dispatch call for IRP_MN_REMOVE_DEVICE has returned (as a matter of fact, the dispatch routine for this IRP does most of the release explicitly, e.g. with IoDeleteDevice).
As for freeing the I/O resources, that’s done by the driver itself, so it’s up to the driver not to release them (responding to STOP or REMOVE requests and/or queries) while they are being used.
Stop and Remove IRPs
To make things even trickier, every driver is required to hold any incoming IRP requests after receiving an IRP_MN_QUERY_STOP_DEVICE request, thereby confirming its readiness for an IRP_MN_STOP_DEVICE (actually, the real requirement is to pause the device after either of these two, and PCIDRV chose to pause on the earlier request). This is because some other driver in the stack may refuse the request, leading to a cancellation of the intention to stop the device, in which case everything should go on as usual. The API assures that IRP_MN_STOP_DEVICE will not be issued if IRP_MN_QUERY_STOP_DEVICE was failed by any of the targets. The IRP_MN_STOP_DEVICE request, on the other hand, must not fail.
By the way, IRP_MN_REMOVE_DEVICEs don’t fall out of the blue either: They are preceded either by an IRP_MN_QUERY_REMOVE_DEVICE (when the removal is optional) or by an IRP_MN_SURPRISE_REMOVAL (when the removal is imminent, like an unplugged USB device).
So all in all, the trick for avoiding blue screens boils down to holding the return from those “game over” kind of calls just long enough to let the currently running operations finish, and make sure new ones don’t start.
Accessing I/O resources or the device object’s data
So the goal is not to release the I/O resources or the device object structure while some other thread expects them to be there. This is done by calling PciDrvReleaseAndWait() before releasing any precious stuff. In particular, this function is called with the STOP flag on receipt of an IRP_MN_QUERY_STOP_DEVICE. Since this IRP always precedes IRP_MN_STOP_DEVICE, the handler of the latter safely calls PciDrvReturnResources(), which in turn gets rid of the I/O resources with no worries. The exact same call to PciDrvReleaseAndWait() is issued when IRP_MN_QUERY_REMOVE_DEVICE is received. If an IRP_MN_REMOVE_DEVICE arrives, on the other hand, the call is made with the REMOVE flag.
So what is this PciDrvReleaseAndWait() function about? It merely waits for a situation in which no IRPs are queued for processing and no such processing is ongoing. To skip the gory details, it waits for FdoData->OutstandingIO, which functions as a sort-of reference counter, to hit a value saying nothing is going on and won’t go on. This counter is incremented every time an IRP is queued in the receive or send queue. It’s also incremented when an IRP is handled directly by the dispatch routine, e.g. on PciDrvCreate(). It’s decremented in opposite situations: When the IRPs are dequeued and have finished processing, or when they have gone “off the hook” (that is, canceled or moved to an offline queue).
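To make the mechanism concrete, here's a minimal sketch of how such a counter-plus-event scheme can be wired up. The names loosely follow PCIDRV's, but the bodies are my own simplification: the actual sample distinguishes a stop event from a remove event and uses different idle values for each, whereas this sketch assumes a single StopEvent field and an idle value of 0.

```c
/* A simplified sketch of the OutstandingIO idea, not PCIDRV's actual
   code. Assumes <ntddk.h> and the sample's FDO_DATA device extension,
   with the counter biased by 1 while the device runs, so 0 means
   "nothing is going on and nothing will". */

LONG
IoIncrementSketch(PFDO_DATA FdoData)
{
    return InterlockedIncrement(&FdoData->OutstandingIO);
}

LONG
IoDecrementSketch(PFDO_DATA FdoData)
{
    LONG result = InterlockedDecrement(&FdoData->OutstandingIO);

    /* Hitting the idle value wakes up whoever is blocked in the
       release-and-wait routine. */
    if (result == 0)
        KeSetEvent(&FdoData->StopEvent, IO_NO_INCREMENT, FALSE);

    return result;
}

VOID
ReleaseAndWaitSketch(PFDO_DATA FdoData)
{
    IoDecrementSketch(FdoData);   /* drop the bias held while running */

    KeWaitForSingleObject(&FdoData->StopEvent, Executive,
                          KernelMode, FALSE, NULL);
}
```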
But waiting for the reference counter to hit a no-IRP value is not enough. IRPs can continue arriving after an IRP_MN_QUERY_STOP_DEVICE, so these IRPs must be prevented from execution while the device's future is unknown. To cope with this, there's a state variable, FdoData->QueueState, which reflects what to do with incoming IRPs. It can take three values: AllowRequests, HoldRequests and FailRequests.
Before calling PciDrvReleaseAndWait(), the handler of the stop/remove IRPs sets QueueState to HoldRequests or FailRequests, depending on whether there is hope to resume normal operation (power management routines change QueueState as well, by the way).
The HoldRequests state causes PciDrvDispatchIO() (the dispatch routine for read, write and ioctl IRPs) not to handle them in the normal manner, which would be, for example, to queue them on the RecvQueue. Instead, it queues the IRP on the NewRequestsQueue (the "offline queue") by calling PciDrvQueueRequest(). This queue is just for storage, so OutstandingIO is not incremented. The IRPs are just kept there in case the queuing state changes back to AllowRequests.
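In rough strokes, the dispatch routine's decision might look like this sketch. The QueueLock spinlock and the QueueRequestSketch() / HandleNormallySketch() helpers are placeholders of mine, not the sample's names:

```c
/* A sketch, not PCIDRV's verbatim code: route a freshly arrived
   read/write/ioctl IRP according to the queue state. QueueLock is
   assumed to guard QueueState and the queues. */
NTSTATUS
DispatchIoSketch(PFDO_DATA FdoData, PIRP Irp)
{
    KIRQL oldIrql;

    KeAcquireSpinLock(&FdoData->QueueLock, &oldIrql);

    if (FdoData->QueueState == HoldRequests) {
        KeReleaseSpinLock(&FdoData->QueueLock, oldIrql);
        /* Park it on NewRequestsQueue for storage only: the IRP is
           marked pending, but OutstandingIO is NOT incremented. */
        return QueueRequestSketch(FdoData, Irp);  /* STATUS_PENDING */
    }

    if (FdoData->QueueState == FailRequests) {
        KeReleaseSpinLock(&FdoData->QueueLock, oldIrql);
        Irp->IoStatus.Status = STATUS_NO_SUCH_DEVICE;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_NO_SUCH_DEVICE;
    }

    KeReleaseSpinLock(&FdoData->QueueLock, oldIrql);
    return HandleNormallySketch(FdoData, Irp);    /* AllowRequests */
}
```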
When the device is restarted, or when an IRP_MN_CANCEL_STOP_DEVICE arrives, the handler calls PciDrvProcessQueuedRequests(), which literally replays the IRPs in NewRequestsQueue by calling PciDrvDispatchIO() with each. An exception is made for IRPs that have been canceled (or are in the middle of being canceled). Also, if the queue state happens to be FailRequests, all IRPs are completed with a juicy STATUS_NO_SUCH_DEVICE. As a matter of fact, if the queue state turns back to HoldRequests while this replaying is going on, the IRP which was just about to be replayed is queued back to NewRequestsQueue, and the replay process is halted.
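Here's a loose sketch of the replay loop's logic. The locking in the real PciDrvProcessQueuedRequests() is considerably more careful, and Self is assumed to be the device extension's back pointer to the device object:

```c
/* A loose sketch of the replay loop, not the sample's actual code. */
VOID
ProcessQueuedRequestsSketch(PFDO_DATA FdoData)
{
    KIRQL oldIrql;

    for (;;) {
        PLIST_ENTRY entry;
        PIRP        nextIrp;

        KeAcquireSpinLock(&FdoData->QueueLock, &oldIrql);

        if (FdoData->QueueState == HoldRequests ||
            IsListEmpty(&FdoData->NewRequestsQueue)) {
            /* Either the state flipped back mid-replay, or there's
               nothing left to do. Either way, stop here. */
            KeReleaseSpinLock(&FdoData->QueueLock, oldIrql);
            break;
        }

        entry = RemoveHeadList(&FdoData->NewRequestsQueue);
        nextIrp = CONTAINING_RECORD(entry, IRP, Tail.Overlay.ListEntry);

        /* Atomically fetch and clear the cancel routine. NULL means
           cancellation is underway: make the list entry point at
           itself so the cancel routine's RemoveEntryList() becomes
           a no-op, and leave the IRP to the cancel routine. */
        if (IoSetCancelRoutine(nextIrp, NULL) == NULL) {
            InitializeListHead(&nextIrp->Tail.Overlay.ListEntry);
            KeReleaseSpinLock(&FdoData->QueueLock, oldIrql);
            continue;
        }

        KeReleaseSpinLock(&FdoData->QueueLock, oldIrql);

        if (FdoData->QueueState == FailRequests) {
            nextIrp->IoStatus.Status = STATUS_NO_SUCH_DEVICE;
            nextIrp->IoStatus.Information = 0;
            IoCompleteRequest(nextIrp, IO_NO_INCREMENT);
        } else {
            PciDrvDispatchIO(FdoData->Self, nextIrp);  /* replay */
        }
    }
}
```

Note that this sketch checks the state before dequeuing, so it quietly sidesteps the re-queue quirk: the actual sample notices the state change after picking an IRP, and pushes it back to the queue, which is what causes the reordering oddity discussed next.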
In fact, the way the IRP is queued back in this rare condition of the state flipping back to HoldRequests may cause an odd situation. The thing is that the IRP which was next to be processed, and hence first in the queue, is pushed back to the last position in the same queue, so the IRPs' order changes as a result of this wizardry. This is not a bugcheck kind of problem, but if any application relies on its IRPs being handled in a certain order (are there any?), this could cause a very rare bug. Whether applications should rely on that order at all is another question.
So much for the IRPs arriving in the future. What about those already queued? They are counted by OutstandingIO, but since these IRPs can remain pending for an indefinite time, their existence in the queues can cause PciDrvReleaseAndWait() to block forever.
To solve this, there's yet another thing the handler of those stop/remove IRPs does before calling PciDrvReleaseAndWait(). That's a call to PciDrvWithdrawIrps(), which does what its name implies: it moves all IRPs in the read and ioctl queues to the NewRequestsQueue, decrementing OutstandingIO for each one moved. Well, if you look at the code carefully, it's PciDrvQueueRequest() which does the actual decrementing. But the job gets done, anyhow.
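Putting the pieces together, the stop/remove path boils down to something like the following. The argument spelling of PciDrvReleaseAndWait() is guessed from the flow described above, and the locking around QueueState is omitted:

```c
/* A paraphrase of the query-stop flow described above; not the
   sample's actual handler. */
NTSTATUS
QueryStopSketch(PFDO_DATA FdoData)
{
    FdoData->QueueState = HoldRequests;   /* 1. park future IRPs        */
    PciDrvWithdrawIrps(FdoData);          /* 2. queued IRPs -> offline
                                                queue, OutstandingIO
                                                drops accordingly       */
    PciDrvReleaseAndWait(FdoData, STOP);  /* 3. block until nothing is
                                                running anymore         */

    /* 4. Only now is it safe to succeed the query; the I/O resources
          themselves are released later, by the STOP/REMOVE handler. */
    return STATUS_SUCCESS;
}
```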
And yet again, we have an IRP reordering issue: Since the state switches to HoldRequests before the call to PciDrvWithdrawIrps(), new IRPs will be queued before those in the ready-to-run queues, so they will be replayed in the wrong order. And again, I’m not sure if this matters.
Finally, just to have this mentioned: The initial state of the queue is HoldRequests, given by PciDrvAddDevice(). It’s only when the device is started, as an indirect result of an IRP_MN_START_DEVICE, that the state changes to AllowRequests. So the NewRequestsQueue isn’t just for helping the OutstandingIO reach a no-IRP value.
Handling canceled IRPs
In this respect, PCIDRV follows a pretty established paradigm. The cancel routine releases the global cancel lock, takes the dedicated queue's lock, removes the IRP from the queue, and releases the queue lock. Then it marks the IRP's status as STATUS_CANCELLED, zeroes the Information field and calls IoCompleteRequest(). If the queue affects the OutstandingIO reference counter, it's decremented as well.
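For reference, here's an illustrative cancel routine following that paradigm. The queue and lock names are made up for the sketch, and IoDecrementSketch() is the counter helper from earlier:

```c
/* An illustrative cancel routine, not copied from PCIDRV. */
VOID
CancelRoutineSketch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PFDO_DATA fdoData = (PFDO_DATA)DeviceObject->DeviceExtension;
    KIRQL oldIrql;

    /* The kernel acquired the global cancel lock before calling us;
       we don't need it for our own queue, so let it go first. */
    IoReleaseCancelSpinLock(Irp->CancelIrql);

    KeAcquireSpinLock(&fdoData->QueueLock, &oldIrql);
    /* Harmless even if the service side already pulled the IRP off
       the queue, thanks to the self-pointing entry trick above. */
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);
    KeReleaseSpinLock(&fdoData->QueueLock, oldIrql);

    Irp->IoStatus.Status = STATUS_CANCELLED;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);

    IoDecrementSketch(fdoData);  /* if this queue counts toward OutstandingIO */
}
```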
This is a good point to remind ourselves that prior to calling the cancel routine, the Windows kernel first acquires the global cancel spin lock, then sets the IRP's Cancel entry to TRUE, after which it calls IoSetCancelRoutine() to fetch the address of the cancel routine from the IRP's structure and nullify the stored value in one atomic operation. This use of IoSetCancelRoutine() makes the cancel routine's pointer a safe indicator of the IRP's state: if it's NULL, the IRP isn't cancelable, so it's either in the process of being carried out or in the middle of cancellation.
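In other words, the contract implied by IoCancelIrp() can be pictured like this. This is definitely not the actual implementation, just a conceptual sketch of the sequence described above:

```c
/* What IoCancelIrp() amounts to, conceptually. NOT the real code. */
BOOLEAN
ConceptualIoCancelIrp(PIRP Irp)
{
    KIRQL irql;
    PDRIVER_CANCEL cancelRoutine;

    IoAcquireCancelSpinLock(&irql);

    Irp->Cancel = TRUE;                            /* mark first...      */
    cancelRoutine = IoSetCancelRoutine(Irp, NULL); /* ...then atomically
                                                      fetch and clear    */
    if (cancelRoutine == NULL) {
        /* No routine registered, or the driver grabbed it first:
           the cancel routine will not run. */
        IoReleaseCancelSpinLock(irql);
        return FALSE;
    }

    /* The cancel routine is called with the cancel lock held, and is
       responsible for releasing it (via Irp->CancelIrql). */
    Irp->CancelIrql = irql;
    cancelRoutine(IoGetCurrentIrpStackLocation(Irp)->DeviceObject, Irp);
    return TRUE;
}
```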
The inline comments in PciDrvProcessQueuedRequests() explain how this is taken advantage of when removing an entry from a queue. The core is the call to IoSetCancelRoutine(nextIrp, NULL), by which the cancel routine entry is read out and nullified before the Cancel entry is even checked.
As for carrying out the IRP, NICServiceReadIrps() in nic_recv.c demonstrates a similar process. It relies solely on IoSetCancelRoutine(irp, NULL) returning NULL to indicate that the cancel routine will run, is running or has run. And if that's not the case, the atomic nullification makes sure it won't run, so the IRP is executed and completed normally. It's interesting to note that the IRP's Cancel flag isn't even checked. In other words, if IoCancelIrp() has just marked this flag, but was a nanosecond too late in grabbing and nullifying the cancel routine, the cancellation will fail, and the IRP will be executed anyhow. In particular, any layers above the function driver will have their completion routines called with the Cancel flag set. Which they should be able to handle, of course.
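Boiled down, the service side of this race looks something like the following sketch, loosely modeled on the pattern just described rather than copied from nic_recv.c:

```c
/* Decide whether it's safe to carry out and complete a dequeued IRP. */
static VOID
ServiceOneIrpSketch(PIRP Irp)
{
    if (IoSetCancelRoutine(Irp, NULL) == NULL) {
        /* Cancellation got there first: the cancel routine owns the
           IRP's dequeuing and completion. Hands off. */
        return;
    }

    /* From this point the cancel routine can no longer run, even if
       Irp->Cancel is already TRUE, so completing normally is safe. */
    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;  /* the byte count would go here */
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}
```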
One way or another, the IRP's data structure remains allocated and valid when IoCancelIrp() fails this way, so there's no memory corruption risk here.
Writing to a data buffer of a cleaned up process
In its NICServiceReadIrps() function (see nic_recv.c), the driver calls MmGetSystemAddressForMdlSafe() to get a pointer to the application's buffer memory, after making sure the IRP isn't already canceled. It then releases the spinlock.
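The mapping step itself goes roughly like this. MmGetSystemAddressForMdlSafe() is the real API here; the wrapper function is illustrative only:

```c
/* Map the IRP's user buffer (direct I/O) into kernel virtual space. */
static PVOID
MapReadBufferSketch(PIRP Irp)
{
    PVOID buffer = MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                                NormalPagePriority);
    if (buffer == NULL) {
        /* No system VA available: fail the IRP rather than touch it. */
        Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return NULL;
    }

    /* 'buffer' is a kernel-space alias of the user buffer's pages,
       valid until the IRP is completed. */
    return buffer;
}
```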
If the requesting application is in the middle of croaking, it's possible that a cancel request will be issued for this IRP. Even worse, the dying process' memory, including the buffer, is about to be released. But on the other hand, before releasing the spinlock, the driver code nullified the pointer to the cancel routine, so any cancel request will fail.
At first, I wasn't sure whether Microsoft promises that the process will go on living as long as it has outstanding IRPs. Or maybe it's OK to rely on MmGetSystemAddressForMdlSafe() returning a non-NULL value, indicating that some kernel virtual space was allocated? After all, the mapping is made in kernel virtual space, so even if the process dies and frees its own virtual memory resources, the driver's pointer goes on pointing at real, nonpaged memory. The question remaining is whether the Windows kernel handles this race condition gracefully.
So I wasn't 100% sure why the reference driver is so confident, until I found this page, which finally said it in black and white: no user space application will terminate before all its IRP requests have been completed. As a matter of fact, the underlying assumption is that an uncompleted IRP may have a user application buffer mapped for DMA, so unless the driver confirms that the IRP is done, hardware could write directly to the physical memory, for all Windows knows. Quite amusingly, the page's purpose was to urge driver programmers to allow for quick cancellation, not to assure me that it's safe to access the memory until completion is performed.
Summary
Not surprisingly, the PCIDRV sample, which has been inspected and imitated by quite a few programmers, seems to have it all covered. Why all this plumbing should be done by each and every WDM driver is a different question. I know, there's a new API for this. Let's hope it's better. For those brave enough to use it, that is.