Linux 2.6.35 hanging/oopsing on large memory allocations

Short summary

Kernel 2.6.35.4 kernel on x86_64 doesn’t seem to handle large memory allocations well. In particular,

  • Running malloc() with Gigabyte chunks can cause a kernel oops
  • Quickly allocating all memory will make the system hang (but not oops)

If anyone has an idea why this is, some helpful clues or whatever, please comment below.

I should also mention that I applied the patch, which fixes the general system freeze on rapid disk I/O, mentioned in another post of mine. I didn’t feel like going on crashing my system over and over again, so I skipped the test without this patch.

Allocating a huge chunk

Running a vanilla 2.6.35.4 kernel on a x86_64 machine (as 64-bit Linux), I wanted to see how well my 16 GB of RAM worked. So I decided to allocate a lot of memory and see what happens. More precisely, I wrote the following somewhat dirty program, and ran it (calling it memeater):

#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

int main() {
 long int size = 1024*1024*1024;
 long int i;

 size *= 4;

 char *p = malloc(size);

 printf("Size is %ld, Pointer is %08lx\n", size, (unsigned long int) p);

 if (p)
   for (i=0; i<size/1024; i++) {
     *p = 0;
   p += 1024;
 }

 getc(stdin);

 return 0;
}

A short explanation: This program merely requires 4 Gigabyte of memory in a single malloc(). The loop writes something on each memory page (a 4096 jump rather that 1024 would be OK as well, I believe). This is necessary, since memory allocation doesn’t really consume memory until used.

It then expects the user to press RETURN on console, so that the memory is held until deliberately released.

In theory, this shouldn’t be a problem. Allocating 4 Gigs of memory should either return a pointer or a NULL. Definitely not oops as follows:

May 17 18:11:14 kernel: general protection fault: 0000 [#3] SMP
May 17 18:11:14 kernel: last sysfs file: /sys/devices/virtual/sound/timer/uevent
May 17 18:11:14 kernel: CPU 4
May 17 18:11:14 kernel: Modules linked in: nfsd exportfs it87 hwmon_vid vmnet vmblock vmci vmmon cpufreq_ondemand
 acpi_cpufreq freq_table mperf ipv6 dm_multipath kvm_intel kvm uinput snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_
intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd iTCO_wdt ppdev 8139too soundcore
 8139cp tulip parport_pc r8169 iTCO_vendor_support pcspkr snd_page_alloc parport i2c_i801 mii sha256_generic
 cryptd aes_x86_64 aes
_generic dm_crypt raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ata_generic
 pata_acpi pata_jmicron radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: microcode]
May 17 18:11:14 kernel:
May 17 18:11:14 kernel: Pid: 3056, comm: memeater Tainted: G      D     2.6.35.4-OCHO3 #1 P55-UD3R/P55-UD3R
May 17 18:11:14 kernel: RIP: 0010:[<ffffffff811024ae>]  [<ffffffff811024ae>] mem_cgroup_charge_statistics+0x9/0x50
May 17 18:11:14 kernel: RSP: 0018:ffff88040a85fa48  EFLAGS: 00010246
May 17 18:11:14 kernel: RAX: 00000000ffffff01 RBX: ffffea000a918e00 RCX: 0000000000000060
May 17 18:11:14 kernel: RDX: 0000000000000000 RSI: ffff8804142c8a00 RDI: ffbfc90001817000
May 17 18:11:14 kernel: RBP: ffff88040a85fa48 R08: ffff880409eba958 R09: 00000000ffffffc0
May 17 18:11:14 kernel: R10: 0000000000400000 R11: ffffea000a9d0df0 R12: 0000000000000001
May 17 18:11:14 kernel: R13: ffff8804142c8a00 R14: ffbfc90001817000 R15: ffff880409fadcc0
May 17 18:11:14 kernel: FS:  00007f36017db700(0000) GS:ffff880002100000(0000) knlGS:0000000000000000
May 17 18:11:14 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 17 18:11:14 kernel: CR2: 0000000001861000 CR3: 0000000001a42000 CR4: 00000000000006e0
May 17 18:11:14 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 17 18:11:14 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 17 18:11:14 kernel: Process memeater (pid: 3056, threadinfo ffff88040a85e000, task ffff880409fadcc0)
May 17 18:11:14 kernel: Stack:
May 17 18:11:14 kernel: ffff88040a85fa98 ffffffff8110579c ffff88040a85faa8 ffffffff810ec2d1
May 17 18:11:14 kernel: <0> ffff88040a85fa88 ffffea000a918e00 00007f35e72e1000 8000000305040067
May 17 18:11:14 kernel: <0> 0000000000099000 ffff880409eba958 ffff88040a85faa8 ffffffff81105826
May 17 18:11:14 kernel: Call Trace:
May 17 18:11:14 kernel: [<ffffffff8110579c>] __mem_cgroup_uncharge_common+0x194/0x1e5
May 17 18:11:14 kernel: [<ffffffff810ec2d1>] ? free_pages_and_swap_cache+0x63/0x80
May 17 18:11:14 kernel: [<ffffffff81105826>] mem_cgroup_uncharge_page+0x27/0x29
May 17 18:11:14 kernel: [<ffffffff810e7267>] page_remove_rmap+0x28/0x50
May 17 18:11:14 kernel: [<ffffffff810dd7b3>] unmap_vmas+0x5c5/0x928
May 17 18:11:14 kernel: [<ffffffff810e2ec1>] exit_mmap+0xce/0x132
May 17 18:11:14 kernel: [<ffffffff8104ad9b>] mmput+0x5e/0xca
May 17 18:11:14 kernel: [<ffffffff8104f1cf>] exit_mm+0x114/0x121
May 17 18:11:14 kernel: [<ffffffff81050b7b>] do_exit+0x226/0x726
May 17 18:11:14 kernel: [<ffffffff8105a08e>] ? try_to_del_timer_sync+0x7b/0x89
May 17 18:11:14 kernel: [<ffffffff810510f8>] do_group_exit+0x7d/0xa5
May 17 18:11:14 kernel: [<ffffffff8105ee27>] get_signal_to_deliver+0x373/0x395
May 17 18:11:14 kernel: [<ffffffff812c4dce>] ? n_tty_read+0x6b3/0x786
May 17 18:11:14 kernel: [<ffffffff81009010>] do_signal+0x72/0x68d
May 17 18:11:14 kernel: [<ffffffff812c758d>] ? tty_ldisc_deref+0xe/0x10
May 17 18:11:14 kernel: [<ffffffff812c025f>] ? tty_read+0x8c/0xc5
May 17 18:11:14 kernel: [<ffffffff81009657>] do_notify_resume+0x2c/0x6e
May 17 18:11:14 kernel: [<ffffffff81009f00>] int_signal+0x12/0x17
May 17 18:11:14 kernel: Code: ff 4c 89 e3 4d 8b 24 24 4c 39 eb 75 de 48 c7 c7 60 29 a6 81 e8 b5 eb 39 00 5b 41 5c 41 5d 41 5e c9
 c3 55 48 89 e5 0f 1f 44 00 00 <48> 8b 87 10 11 00 00 80 fa 01 19 c9 83 c9 01 f6 06 02 48 63 c9
May 17 18:11:14 kernel: RIP  [<ffffffff811024ae>] mem_cgroup_charge_statistics+0x9/0x50
May 17 18:11:14 kernel: RSP <ffff88040a85fa48>
May 17 18:11:14 kernel: ---[ end trace 4d26f08f6051ed51 ]---
May 17 18:11:14 kernel: Fixing recursive fault but reboot is needed

And I should mention that when I tried this for 16 GB, the system just hung. But that’s explained next.

Allocating all RAM

So I said, OK, that must be because nobody is really expected to allocate those huge chunks in one go. So what happens when the memory just ends? I have some swap space, so this should work…

#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

int main() {
 long int size = 1024*1024;
 long int i,k;

 size *= 16;

 for (k=0; k<1024; k++) {

   char *p = malloc(size);

   printf("Size is %ld, Pointer is %08lx\n", size, (unsigned long int) p);

   if (p)
     for (i=0; i<size/1024; i++) {
       *p = 0;
        p += 1024;
     }
 }
 getc(stdin);

 return 0;
}

The idea here is simple: Loop 1024 times, allocating 16MB at a time. This is sane and eventually takes 16 GB.

But no, the system just hung. No oops, no drama. Just nothing happened. Processes were stalled, typing and switching consoles with Shift-Alt-Fx worked. Ctrl-Alt-Delete is ignored. No reboot. Only a reset got me out of this.

For the record, running the program with chunks of 14 MB each, so almost all memory was allocated (14 GB out of 16 GB), worked cleanly, and the system remained stable.

Of course I have swap

And this is the proof: This is my /proc/meminfo with a system barely doing anything:

MemTotal:       16463436 kB
MemFree:        14909616 kB
Buffers:           75428 kB
Cached:           664284 kB
SwapCached:            0 kB
Active:           459044 kB
Inactive:         604272 kB
Active(anon):     324088 kB
Inactive(anon):   115208 kB
Active(file):     134956 kB
Inactive(file):   489064 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:              1348 kB
Writeback:             0 kB
AnonPages:        323592 kB
Mapped:           116232 kB
Shmem:            115704 kB
Slab:              95920 kB
SReclaimable:      46340 kB
SUnreclaim:        49580 kB
KernelStack:        4008 kB
PageTables:        36940 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12426016 kB
Committed_AS:    1576024 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      123416 kB
VmallocChunk:   34359587772 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8064 kB
DirectMap2M:    16764928 kB

Sometimes I’ve sees some of the swap actually used. It’s not like it doesn’t work.

 

Thunderbird: Making a folder with new messages stand out

On Thunderbird 3.0.7, this little extra file turned the folder names with new mail red and bold:

(the “chrome” directory needs to be created)

$ cat ~/.thunderbird/sdf2k45i.default/chrome/userChrome.css
@namespace
url("http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"); /* set default namespace to XUL */

/* Setting the color of folders containing new messages to red */

treechildren::-moz-tree-cell-text(folderNameCol, newMessages-true) {
 font-weight: bold;
 color: red !important;
}

 

 

Gnome: An icon in the system tray (new mail)

This should have been obvious: Have a “new mail” icon in the system tray when new mail is fetched by Thunderbird on a FC12 Linux machine. I ended up writing my own script.

My starting point was the “Gnome Integration” add-on. Not that I liked that it forced me to have another notification popup on new mails (aside from Thunderbird’s native one) if I wanted the icon. But when an icon was added to the system tray for each new mail arriving (that is, a lot of icons in the tray), I knew I had to do a thing of my own.

So I wrote a small, and not so sophisticated script, which checks if an icon is there (more precisely: If a process owning the icon is running) and gives up if it already exists.

The lucky things about the “Gnome Integration” add-on is that it allows choosing the command for the mail notification. So instead of running whatever puts that unnecessary popup, it runs my script, which puts the icon in place if necessary. Popup away, single icon instead. Could I ask for more?

Here’s the script. I saved it as /usr/local/bin/new-mail-icon (you can put it anywhere you want, but make sure the executable has the same name):

#!/bin/bash

THEDIR=$HOME/.thunderbird
ICON=$THEDIR/green-mail-unread.png
LOCKFILE=$THEDIR/new-mail-icon-process

NOW=`date "+%F %T"`

if [ -e $LOCKFILE ] && grep new-mail-icon /proc/`cat $LOCKFILE`/cmdline > /dev/null 2>/dev/null ;
 then exit 0;
else
 ( echo $BASHPID > $LOCKFILE
 zenity --notification --window-icon=$ICON --text="New mail on $NOW"
 rm -f $LOCKFILE
 ) &
fi

Important notes about the script:

  • It relies on that the script’s name is “new-mail-icon”. This is how it detects another icon exists. If the script has another name, several icons will appear. Or change the argument to the grep command.
  • There is a slight race condition if two of these scripts are run in parallel (which is an unlikely situation, and still). It’s possible that more than one icon will appear, if two (or more) scripts race on the little time gap between checking for the locking file and writing to it.

The script above doesn’t just check if the lockfile exists, but also verifies that the process holding it exists and has the correct command line. This makes it behave correctly if the system shuts down or crashes, leaving the lock file in place.

Then set the Gnome Integration add-on preferences as follows. Note that the “Show icon” is not checked. The icon is created by running my script instead of the notification application. It’s a hack indeed.

Controlling desktop effects: Compiz notes for FC12

Using Fedora Core 12, I wanted some control over my wobbling windows, and in particular which corner is the hot spot for displaying all open windows.

The first thing I needed to find out, is the name of the toolkit doing this: So it’s Compiz.

First, install the Compiz Configuration Setting Manager

# yum install ccsm

It now appears in System > Preferences (as Compiz Configuration Setting Manager).

The next issue is that the settings are ignored. This is because compiz is executed with the gconf plugin by /usr/bin/compiz-gtk, which is what the Gnome environment runs.

So following one of the comments in this bug report, edit /usr/bin/compiz-gtk so that the runCompiz function in the file’s beginning says:

exec compiz --ignore-desktop-hints glib ccp $@

And to get this launched, go to System > Preferences > Desktop effects, turn off the effects and turn them on again (choose “standard” and then “compiz” again). Note that the preferences in the Desktop effect are now ignored, and the ones chosen in Compiz Configuration Setting Manager are used instead. But hey, this is what I wanted in the first place.

In particular, I moved the “hot corner” to the bottom left, which I hit by accident much less.  This is done in Compiz Configuration Setting Manager > Window Management > Application Switcher > Scale > Initiate Window Picker. Maybe I should exchange this with a keystroke instead.

When compiz crashes

That’s an ugly one. The windows lose their window manager frames, so they can’t be moved. Navigating between windows is impossible or difficult. The trick is to get compiz up and running without messing things up too much.

Quick solution. Use CTRL-ALT-F3 to get a plain textual console, log in, and type:

$ compiz --display :0 --ignore-desktop-hints glib ccp --replace &

Note that ampersand at the end. It’s a bad idea to have compiz running in the foreground (even worse to stop it).

This should be an alias, as a matter of fact. The –display flag is redundant if the command is issued from a shell window (as opposed to the text console).

The –replace flag is most likely not necessary (since compiz has crashed anyhow), but it doesn’t hurt.

Radeon: reserve failed for wait in /var/log/messages

Looking at my system log (kernel 2.6.35.4), this is a common thing:

Apr 24 12:04:49 kernel: radeon 0000:01:00.0: ffff8802d0aa1a00 reserve failed for wait
Apr 24 12:04:49 kernel: radeon 0000:01:00.0: ffff8802d0aa1a00 reserve failed for wait
Apr 24 12:04:51 kernel: radeon 0000:01:00.0: ffff8802d0aa1a00 reserve failed for wait
Apr 24 12:04:51 kernel: radeon 0000:01:00.0: ffff8802d0aa1a00 reserve failed for wait
Apr 24 12:04:57 kernel: radeon 0000:01:00.0: ffff8802d0aa1a00 reserve failed for wait
Apr 24 12:05:08 kernel: radeon 0000:01:00.0: ffff8802d0aa1a00 reserve failed for wait
Apr 24 12:05:14 kernel: radeon 0000:01:00.0: ffff8802d0aa1a00 reserve failed for wait

This doesn’t sound too good, but it looks like it’s harmless, according to this and this. It also looks like the decision was to remove the error message, rather than to fix something else. I suppose this issue is old news pretty soon.

It has been suggested that this error message appears only when compiz (that cool OpenGL toy making all kinds of effects on the desktop).

So just let it be.

No sound on Firefox on x86_64 Linux with Pulseaudio / ALSA

This is yet another suggestion on what to do if the Firefox doesn’t play sound (with flash videos in particular) but all other applications do (mplayer, for example). The video in Firefox runs smoothly, but no sound at all. This is a popular subject on the web, it turns out.

My platform: Fedora Core 12, 64 bit, hda sound card. I should also mention this all was fine on my computer until I upgraded to alsa-1.0.24 kernel drivers from sources. With “make install-modules” (as opposed to “make install”). Even after this installation, I don’t have an ALSA init script on my computer (and neither should I have one, as the drivers are autoloaded anyhow).

The first thing to do is to go to System > Preferences > Sound and pick the “Applications” tab. Now do something that does play sound. The application should be listed there while the sound is played. Stop playing, and the application vanishes from the list.

Now try to play a flash video. If you’re lucky, the Firefox applications appears there (as “ALSA plug-in [firefox-bin]“), only with the sound muted or volume very low. Just fix that, and you’re good.

In my case, nothing appeared on this list when playing a video on Firefox. This means that Firefox doesn’t connect to Pulseaudio at all. This has in fact nothing to do with the flash player: This happens when the video is played in HTML5 mode as well (a common situation in Youtube, for example).

Like many others who have addressed this issue, I know how I solved it, but not why things broke in the first place.

The problem, as it turned out, was that Firefox didn’t find the /usr/lib/alsa-lib/libasound_module_pcm_pulse.so shared library. Note that it’s “lib” and not “lib64″ it was looking for, because the part producing sound runs in 32-bit mode. Or something like that.

So I went

$ yum provides '/usr/lib/alsa-lib/libasound_module_pcm_pulse.so'

Which told me I needed alsa-plugins-pulseaudio-1.0.21-2.fc12.i686. I then went (as root):

# yum install alsa-plugins-pulseaudio-1.0.21-2.fc12.i686

just to be told by yum that “package alsa-plugins-pulseaudio-1.0.22-1.fc12.x86_64 (which is newer than alsa-plugins-pulseaudio-1.0.21-2.fc12.i686) is already installed”. This is when I really love this RPM magic.

Googled for the exact RPM file, downloaded it, and basically went (nothing like the –force flag)

# rpm -i --force alsa-plugins-pulseaudio-1.0.21-2.fc12.i686.rpm

And the sound is back. Hurray! Not. That problem shouldn’t have started in the first place.

Windows: Tracking process launches and ends

I’m sure there’s a saner way to do this than the Perl script below, but jotting it down was so easy, that it wasn’t worth looking for another tool to do it.

The idea is the simplest ever: Grab the list of processes 10 times a second or so, and write a line to standard output when a change is detected. If a process lives longer than ~100 ms, it’s caught.

The correct way to do this would be to use some kind of hook of the Windows operating system. Or more precisely, look for a program that already does that. And then hope that program isn’t some kind of spyware.

So here’s the script. You may need to install Win32::Process::List from somewhere, assuming you have Perl on board anyhow.

Perl code:

#!/usr/bin/perl

use strict;
use warnings;
use Win32::Process::List;

my %old = ();

while (1) { 
  my $P = Win32::Process::List->new();  # constructor

  select(undef, undef, undef, 0.1);

  my $now = scalar localtime time();

  my %list = $P->GetProcesses();

  %old = %list
    unless (%old);

  my %left = %old;

  foreach my $key ( keys %list ) {
    print "START $now: PID $key, $list{$key}\n"
      if (not defined $old{$key});

   delete $left{$key};
  }

  foreach my $key ( keys %left ) {
    print "STOP  $now: PID $key, $left{$key}\n";
  }
  %old = %list;
}

This is what the output can look like (screensaver, then FacebookUpdate and then what happens when I open a new window on Firefox).

> perl processtrace.pl
START Mon Apr 16 22:04:07 2012: PID 1380, sstext3d.scr
STOP  Mon Apr 16 22:04:50 2012: PID 1380, sstext3d.scr
START Mon Apr 16 22:27:00 2012: PID 2144, FacebookUpdate.
STOP  Mon Apr 16 22:27:03 2012: PID 2144, FacebookUpdate.
START Mon Apr 16 22:28:05 2012: PID 2268, jqsnotify.exe
STOP  Mon Apr 16 22:28:07 2012: PID 2268, jqsnotify.exe

 

Windows device drivers: Notes to self

Yet another messy post with just things I wrote down while developing a device driver for Windows. Not necessary coherent, not necessarily accurate.

General notes while playing around

  • To start Windows 10 allowing installation of unsigned drivers, hold Shift while selecting “Restart” in the Power menu. Windows will appear to restart normally, but enters a menu before shutting down. Go Troubleshoot > Startup Settings The Windows 7-like warning appears during installation, and the driver installation survives the reboot it regular mode. However text saying “Test Mode” appears at the bottom right of the desktop on following boots. Windows may then randomly refuse to use the driver on the basis that it isn’t signed on following reboots.
  • It seems like Windows restarts immediately after installing a driver if the installation fails. Message in event log: “The process C:\WINDOWS\system32\mmc.exe (DESKTOP-xxxxx) has initiated the restart of computer DESKTOP-xxxxx on behalf of user DESKTOP-xxxxx\theuser for the following reason: Hardware: Installation (Planned)
  • “Read the source”, WDK-style: Read the inc\ddk\wdm.h include file, which supplies a lot of information and hints. Sometimes it’s much better than the docs. There are also sources for the implementation of basic functions such as open(), printf() etc. in Visual Studio’s installation directory.
  • A list of status codes can be found here. Possibly relevant in particular: STATUS_UNSUCCESSFUL,  STATUS_NOT_IMPLEMENTED, STATUS_INVALID_PARAMETER, STATUS_NO_MEMORY, STATUS_INVALID_DEVICE_REQUEST, STATUS_DEVICE_BUSY, STATUS_END_OF_FILE, STATUS_ACCESS_DENIED, STATUS_DATA_ERROR, STATUS_NO_SUCH_DEVICE, STATUS_NO_SUCH_FILE, STATUS_OBJECT_NAME_NOT_FOUND,  STATUS_INSUFFICIENT_RESOURCES, STATUS_IO_TIMEOUT, STATUS_FILE_FORCED_CLOSED,  STATUS_CANCELLED, STATUS_NOT_FOUND, STATUS_RETRY, STATUS_RECEIVE_PARTIAL
  • IRP’s major determine the dispatch routine launched (as set by DriverEntry). The minor is interpreted by the dispatch routine itself.
  • A list of safe string functions: In MSDN’s page
  • How to use symbolic links to create a name space: here.
  • To see the internal object structure (and symbolic links), download WinObj from somewhere.
  • Sniff IRPs in the system: Download IrpTracker.
  • MSDN’s explanation on resource rebalancing, why, how and what to do about it.
  • On a typical open for write, DesiredAccess=0x00120196 (FILE_GENERIC_WRITE | READ_CONTROL). On an open for read, DesiredAccess=0x00120089 (FILE_GENERIC_READ | READ_CONTROL). This is based upon command line > and <. Trying to “dir” the file gives DesiredAccess=0x00000080.
  • The difference between IRP_MJ_CLEANUP and IRP_MJ_CLOSE: IRP_MJ_CLEANUP is called when the file is officially closed (all owners of the file handle have closed it). IRP_MJ_CLOSE is called when the file handle can be removed, which is when all outstanding IRPs have been completed as well. Note that IRP_MJ_CLEANUP dispatch routine must walk through the queue and cancel all requests (Why doesn’t the system do this?). Note that non-I/O IRPs for the file handler may still arrive for the file handler.
  • Do implement a handler for IRP_MJ_SET_INFORMATION, so that such an IRP doesn’t fail. Even a yes-yes handler doing nothing but returning success. In particular, Cygwin sends this IRP with FileEndOfFileInformation when writing to a regular file. Go figure.
  • Using MmGetMdlByteCount(irp->MdlAddress) to get the number of bytes to read or write is bad: It causes a bugcheck if the number of bytes to read or write is zero. Which is bad behavior from the application’s side, and still.
  • Not directly relevant, but user space applications with POSIX functions such as open(), read() close() and friends should include <io.h> rather than <unistd.h> which doesn’t exist on VC++.
  • Setting the DeviceType parameter is utterly important, since the native fopen() and open() calls will fail with an “Invalid Argument” error otherwise. The parameter in the call to IoCreateDevice() is ignored since the actual type is taken from the PDO. Hence a line in the INF file for setting up a registry value is necessary.
  • When seeking a file, be either with POSIX-style _seek() and friends, or with native SetFilePointer(), this merely updates CurrentByteOffset, but no IRP is issued, so the device has no idea it happened at all. In particular, no IRP_MJ_SET_INFORMATION IRP with class FilePositionInformation is issued, despite what one could expect. At least so it works in Windows 7.
  • pnpdtest (which is part of the WDK, and tge wtllog.dll is also somewhere in its directories) should be run with the verifier (standard Windows utility) configured for toughest tests specifically on the driver. Just type “verifier” at command prompt, reboot, and run pnpdtest. To run verifier with a good set of tests, go
    > verifier /volatile /flags 0xbfb /adddriver mydriver.sys
  • Unlike Linux, the allocation of PCI resources (BAR addresses and interrupts) is not exclusive. As a result, if a device driver doesn’t deallocate the interrupt resource at unload, and hence the pointer to the ISR remains, attempting to reload the driver will cause a jump to the unloaded ISR pointer and a bugcheck. In Linux this wouldn’t happen, because the newly loaded instance of the driver wouldn’t pass the stage of getting the resources.
  • devcon is the old tool for installing and looking at driver information, with source code in the Windows Driver Samples git repo. pnputil should be used instead, according to Microsoft.

Runlevels (IRQLs)

A very short summary (see Microsoft’s page for more):

  • PASSIVE_LEVEL: The level of user space applications, system threads and work items specially generated to run in this level. May block, but in most cases it’s not allowed because of arbitrary thread context.
  • APC_LEVEL: The special level for delivering data to user applications. Rarely seen, but often considered.
  • DISPATCH_LEVEL: Used in DPC (Deferred Procedure calls) queued by ISR or by a custom call. Also, when acquiring a spinlock, the runlevel is raised to DISPATCH_LEVEL.
  • Higher levels: When serving interrupts. No kind of mutex can be acquired (neither spinlocks). Some simple memory operations and queuing DPCs is more or less what’s left to do.

Notes:

  • Paged code runs below DISPATCH_LEVEL. So when PAGED_CODE appears in the beginning of a function, it’s at most at APC_LEVEL.
  • DPCs run at DISPATCH_LEVEL. When requested, a single instance is queued and executed when possible. The DPC instance is dequeued just before execution, so if another DPC request occurs immediately after launching a DPC, another run will take place, possibly simultaneously on another CPU. On the other hand, if several requests are made before execution is possible, the DPC is run only once on behalf of these requests.
  • Custom DPCs can be queued from any runlevel. If the “background routine” needs to be queued from DISPATCH_LEVEL and down, use a work item instead.
  • Confusingly enough, most (almost all) IRP dispatch calls are made in passive level. To make things more difficult, read, write and ioctl IRPs can go up to DISPATCH_LEVEL.

Bug check handling

What to do when the Blue Screen (BSOD) pays a visit. Or more precisely, how to catch exactly where your code wanted to use a zero pointer or something.

It’s important to save the entire binaries directory, including .obj files, for the driver distributed, since those files are necessary for bug check dissection.

The files are typically something like:
C:\Windows\Minidump\030312-17503-01.dmp
C:\Users\theuser\AppData\Local\Temp\WER-33571-0.sysdata.xml

However the XML file isn’t necessary for the bugcheck analysys, only the dump file.

To enable generation of a dump file, go to “System” from the Start menu, pick “Advanced system settings”  and click “Settings…” in the “Startup and Recovery” section. In the “System Failure” section, select “Small memory dump (128 kB)” in the drop-down menu. This option is possible only if there is swap space enabled. An alternative directory for the dump file can be selected there as well (but why?).

Analyze a bugcheck dump (change last file, of course). Be patient. Each step takes time.

> C:\WinDDK\7600.16385.1\Debuggers\kd.exe -n -y srv*c:\symbols*http://msdl.microsoft.com/download/symbols -i C:\devel\objchk_win7_x86\i386 -z C:\copies\022512-24414-01.dmp

The symbols which are downloaded are stored in C:\symbols (as required). These relate to a given version of the Windows kernel (and loaded drivers), so keep a different directory for each platform (?), or delete the directory altogether before invoking the debugger if unsure. It’s a cache, after all.

When those downloads are finished, go

0: kd> !analyze -v

Quit with “q”

To get a disassembly of the relevant code, go something like:

1: kd> u nt!PnpCallDriverEntry nt!PnpCallDriverEntry+0x4c

This prints out the disassembly from the beginning of PnpCallDriverEntry() to offset 0x4c. If the latter offset is the one appearing in the stack trace, the disassembly goes until the command after the call to the function above in the trace (because the address in the stack is the one to return to, not the caller’s address). The interesting disassemblies are of course those on the driver itself, not the system code as shown in this example.

On a 32-bit architecture, it’s also possible to disassemble the object file (but why?). Linux’ objdump -d works on Windows’ object files, surprisingly enough, and then there’s Microsoft’s counterpart:

> dumpbin.exe /disasm \path\to\objchk_win7_x86\i386\project.obj > project.asm

Note that we disassemble the object file to which the code belongs. That alone is a good reason to maintain an exact snapshot of released versions. The .sys file can be disassembled to verify that nothing has changed, but that disassembly doesn’t contain symbols names.

Unfortunately, the build for 64 bit (amd64) creates .obj files that can’t be disassembled, so only the .sys files can be used, and once again, they don’t contain symbols.

dumpbin.exe is part of the Visual C++ package. The necessary file bunlde are in my resumption directory, containing dumpbin.exe, link.exe, msdis140.dll and mspdb71.dll


April 2021 update: After another round of work on Windows drivers (still WDM), I’ve added a few more notes:

Another look at IRP handling

While the do’s and don’ts of IRP processing with the WDM model are well-documented, somehow I’ve never seen the idea behind the API explained. And it’s a fascinating one (not clear what kind of fascinating, though): It mimics a subroutine call, and implements the call stack as an array going with the IRP.

This is easiest explained by comparing with Linux’ driver API: For example, a read() system calls ends up as a call to the related function presented by the driver. That driver may call other functions in the kernel to fulfill its request. It may sleep waiting for data, causing the user-space process to sleep. It may call functions belonging to drivers at a lower level, i.e. closer to the hardware. Once it’s done, it returns, and the user space program gets its data.

Microsoft took an asynchronous approach, meaning that there is no direct connection between the user space program that initiated the read() call and the threads in the kernel that handles it. Rather, a data structure (the IRP) containing details on the read() request is set up, and then the execution flow goes event-driven style. Sometimes, the IRP is fulfilled by a single call to a handler functions, but in general, it’s juggled among several such. The fulfillment is cut into fragments, each running in a function of its own, and each needing to discover the overall state of the progress, not to mention tackle race conditions with the possibility that the IRP is canceled.

But the interesting point is the stack. And it has two meanings, actually: One is that drivers are stacked up, so there’s a low-level driver (say, the USB hub driver, which processes URBs) and higher-level drivers (e.g. for a sound card). Obviously, the intended usage was that if the user-space process makes a read() call for X bytes of data, an IRP is set up for that request, and handed over to the sound card’s driver for fetching X bytes of sound data. This driver goes on by setting up a request which is tied to the same IRP, for a data read URB of X bytes, and passing the IRP to the driver below it. And so it goes down the stack until some driver does the actual I/O. So the IRP represents the task to be fulfilled, and each driver in the stack presents its interpretation of the request.

And this brings me to the second meaning of “stack”: An execution stack. If the same thing was to run Linux kernel style, this would take the form of the upper driver literally calling a function belonging to a lower level driver, which could go on deeper and deeper until the ultimate driver is reached.

Apparently, Microsoft decided to mimic this execution stack by attaching an array of IO_STACK_LOCATION structures to the IRP’s main data structure, which is, well, the IRP stack . One way to look at these structures is that they contain the arguments of the function call to the driver beneath. So each of these IO_STACK_LOCATION structs contain the stuff that would have been pushed into the stack, had this lower-driver call been done Linux style. Which would have been the case, had Microsoft decided to give each IRP an execution thread that could sleep.

But wait, you say. There’s a return address too in the stack as well when a function is called. Well, I’m just getting to that.

So the idea is like this: The handling of an IRP consists of three parts: Things to do before calling the driver beneath, calling the driver beneath, and things doing after that. In Linux style it would have been something rhyming with

int stuff_handler( ... arguments ...)
{
   do_some_stuff();

   call_lower_driver( ... arguments ... );

   do_stuff_afterwards();
}

But this doesn’t work with Window’s model, because in most interesting cases, the call to the lower driver can be pended. The do_stuff_afterwards() needs to be done after the lower drivers have finished their business. That’s where the completion routine comes in: It contains the code for do_stuff_afterwards().

So effectively, the call to IoSetCompletionRoutine() takes the role of pushing the address to return to into the execution stack, in a regular function call. The completion routine is sort-of the return-from-function address for the IoCallDriver() call.

The completion routine can return one of two statuses:

  • STATUS_CONTINUE_COMPLETION, which has the effect of returning from stuff_handler(), i.e. giving control to the function that called stuff_handler(). In Window’s stack model, this results in calling the completer function of some driver above. That is, the do-stuff-afterwards part of the upper driver.
  • STATUS_MORE_PROCESSING_REQUIRED, meaning nope, don’t return from stuff_handler. Because all execution is done with functions, the completion function has to return when there’s no more useful things to do, but in this game of faking an execution thread, this status means “let’s pretend I never returned”.

So finally, we have IoCompleteRequest(). It simply means “stuff_handler returns now”. It’s the opposite to STATUS_MORE_PROCESSING_REQUIRED. It’s not (necessarily) called from a completer function, but it has the effect of returning from it with a STATUS_CONTINUE_COMPLETION.

The completion routine doesn’t have to be set. This is like finishing a Linux-style handler with

  return call_lower_driver( ... arguments ... );

which is typically translated into a JMP.

Looking at the IRP framework as a mimic of an execution thread is quite counterintuitive, in particular the idea that returning from that faked function is done by calling a function in the actual C code. Nevertheless, it helps somewhat when trying to make sense of some of the whole IRP framework.

Disassembling amd64 files

At some point, I wanted to disassemble the obj files created in an amd64 compilation. Dumpbin gave me:

Microsoft (R) COFF/PE Dumper Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file thefile.obj

FileType: ANONYMOUS OBJECT

and indeed, it’s not a COFF file, but rather an Anonymous Object file, with a rather odd magic word:

00000000  00 00 ff ff 01 00 64 86  a5 68 78 60 38 fe b3 0c  |......d..hx`8...|
00000010  a5 d9 ab 4d ac 9b d6 b6  22 26 53 c2 9b 81 01 00  |...M...."&S.....|
00000020  13 0c 07 00 a5 68 78 60  89 80 01 00 0f 00 00 00  |.....hx`........|
00000030  00 00 00 00 2e 64 72 65  63 74 76 65 00 00 00 00  |.....drectve....|

After the first four bytes, there a version number (0x0001) and then an 0x8664 signature for amd64. This is most likely a result of compiling with the /GL option for a whole program optimization. Hence there is nothing to disassemble here: Unlike a COFF file, which is just relocated into the final output, it appears like the content of this kind of object file is up for some additional mangling. Hence a disassembly of this file would have been useless, if at all possible.

So bottom line: Don’t even try. The .sys file can be disassembled, but it’s without symbols. So the sane way seems to be to use the kd debugger when disassembly is needed.

Bugcheck for memory leak

This is what it looks like at !analyze -v (minus lots of mumbo-jumbo)

Loading unloaded module list
.................
3: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
A device driver attempting to corrupt the system has been caught.  This is
because the driver was specified in the registry as being suspect (by the
administrator) and the kernel has enabled substantial checking of this driver.
If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will
be among the most commonly seen crashes.
Arguments:
Arg1: 00000062, A driver has forgotten to free its pool allocations prior to unloading.
Arg2: 84caebd4, name of the driver having the issue.
Arg3: 84b9b418, verifier internal structure with driver information.
Arg4: 00000001, total # of (paged+nonpaged) allocations that weren't freed.
        Type !verifier 3 drivername.sys for info on the allocations
        that were leaked that caused the bugcheck.

[ ... ]

BUGCHECK_STR:  0xc4_62

IMAGE_NAME:  thedriver.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  60635055

MODULE_NAME: thedriver

FAULTING_MODULE: 99a9b000 thedriver

VERIFIER_DRIVER_ENTRY: dt nt!_MI_VERIFIER_DRIVER_ENTRY ffffffff84b9b418
Symbol nt!_MI_VERIFIER_DRIVER_ENTRY not found.

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  System

CURRENT_IRQL:  2

LAST_CONTROL_TRANSFER:  from 82d51f03 to 82af9d10

STACK_TEXT:
8afa8a28 82d51f03 000000c4 00000062 84caebd4 nt!KeBugCheckEx+0x1e
8afa8a48 82d565eb 84caebd4 84b9b418 99a9b000 nt!VerifierBugCheckIfAppropriate+0x30
8afa8a58 82a29e8a 84caeb78 99a9b000 40000000 nt!VfPoolCheckForLeaks+0x33
8afa8a94 82bae69f 84caeb78 99a9b000 40000000 nt!VfTargetDriversRemove+0x66
8afa8aa8 82bae338 82b657e0 849a4a70 00000000 nt!VfDriverUnloadImage+0x5e
8afa8ae0 82baf58d 84caeb78 ffffffff 00000000 nt!MiUnloadSystemImage+0x1c6
8afa8b04 82cd8517 84caeb78 849c09c8 84c59668 nt!MmUnloadSystemImage+0x36
8afa8b1c 82c3e6f4 84c59680 84c59680 84c59668 nt!IopDeleteDriver+0x38
8afa8b34 82a85f60 00000000 84ab9818 84ab9768 nt!ObpRemoveObjectRoutine+0x59
8afa8b48 82a85ed0 84c59680 82bd6cb2 849c4838 nt!ObfDereferenceObjectWithTag+0x88

8afa8b50 82bd6cb2 849c4838 84ab9750 84ab9768 nt!ObfDereferenceObject+0xd

[ ... ]

STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

FAILURE_BUCKET_ID:  0xc4_62_LEAKED_POOL_IMAGE_thedriver.sys

BUCKET_ID:  0xc4_62_LEAKED_POOL_IMAGE_thedriver.sys

Followup: MachineOwner
---------

Just in case I’ll see one of these one day.

Authoritative timestamping: Some silly games

Ever wanted to prove you knew something before everyone else? Here’s the way: You write it down in a file, and timestamp it. Then, when you want to prove you were there first, you’ll show the original file, the signed timestamp, and then everyone know you’re a hopeless geek.

But this can help a bit if something goes wrong with timestamp servers used to sign software.

Anyhow, I have to admit I know very little about this issue. In particular, I have no idea how to dissect the timestamp file (tsr). Ideas are welcome as comments below.

You may need to install a certain package (or several):

# yum install openssl-perl

And this can never hurt:

$ man ts

Now suppose that the file containing our genius knowledge is in thefile.txt.

$ openssl ts -query -data thefile.txt -cert -out request.tsq

This created the request to be sent to the timestamp server. The important part is that the hash of thefile.txt is in there, so the timestamping is specific to the file.

Now read the manual a bit again…

$ man tsget

and go timestamping!

$ /etc/pki/tls/misc/tsget -h http://tsa.starfieldtech.com -o timestamp.tsr request.tsq

Pardon my weird path. This is how yum installed it.

The server mentioned above is GoDaddy’s official timestamp server, so the reply stored in timestamp.tsr is a real proof. If you want to use this in court, expect to pay a lot to experts and hope the judge doesn’t kick you out before you’re done with all the technicalities. But hey, that’s a watertight proof you paid nothing to get (and it’s probably worth accordingly).

It would be wonderful to know what’s actually in that timestamp. I know for a fact that the file’s hash has to be there, as well as the digital signature of the issuer of the time stamp. Lacking a better tool to peek inside, I just went:

$ strings -n 10 timestamp.tsr
20120330152409Z
Starfield Technologies, Inc.1200
)Starfield Class 2 Certification Authority0
110503070000Z
160503070000Z0
Scottsdale1%0#
Starfield Technologies, Inc.1301
*http://certs.starfieldtech.com/repository/1&0$
Starfield Timestamp Authority0

http://ocsp.starfieldtech.com/08

'http://crl.starfieldtech.com/sfroot.crl0S
+https://certs.starfieldtech.com/repository/0
Starfield Technologies, Inc.1200
)Starfield Class 2 Certification Authority
120330152409Z0

So there are some clues here. I dissector to cleartext would be nice, though. Ideas, anyone?

When Cygwin gets fussy about forking a new process

Things like

     3 [main] bash 2216 child_info_fork::abort: data segment start: parent(0x26D000) !=     child(0x38D000)
-bash: fork: retry: Resource temporarily unavailable

It appears to be a recent Red Hat mishap. One of those bugs that happen because of the common belief that if you don’t play around with your software all the time, it will be considered dead.

Workaround: Close all Cygwin programs (and windows of course), find the ash.exe in the /bin directory, and double click it to open a primitive command window.

Go

$ /bin/rebaseall

And the problem is over with. I won’t bother understanding what went wrong, since the explanation is most like not very inspiring in cases like this. More about rebaseall here.