Compiling a kernel module after “make clean” on the sources.

The textbook says that if one wants to compile a module against a kernel, the headers must be there. Those who run distribution kernels are urged to apt-get or yum-install something, and their trouble is over. People like me, who cook their own food and download vanilla kernels, need to handle this themselves.

In the old days, I used to keep the compiled kernel on the disk, but nowadays the kernel subdirectory takes a few gigabytes after compilation, so one has to do a "make clean" sooner or later. Which should be OK, since the following comment can be found in the Makefile at the very root of the kernel source tree:

# make clean     Delete most generated files
#                Leave enough to build external modules

But unfortunately, it's not possible to compile any module after "make clean". It blows up with a nasty error message. (Ehm, see the updates at the bottom.)

As it turns out, "make clean" removes two header files, whose absence makes it impossible to compile a module: include/generated/asm-offsets.h and include/generated/bounds.h. As a matter of fact, the removal of these two files is the only change in the "include" subdirectory.

So a quick workaround is to make a copy of the “include” subdirectory, run “make clean” and then restore “include” to its pre-make-clean state.

Which makes you wonder why those files are removed in the first place. Did someone overlook this issue? No, no and no. Is there any real reason to remove these files? I don't know. Has this issue been fixed since 2.6.35? I'll check sometime.

If you know something I don’t, please comment below.

Update (Oct 1st, 2015):

$ make prepare scripts

on the machine that the kernel runs on solves this on v3.16.

Update (May 4th, 2016): Maybe this?

$ make modules_prepare

I'll try next time. See Documentation/kbuild/modules.txt in your favorite kernel tree.

Perl-only CRC32 function (without C code)

It may look like a stupid idea, since CRC32 has been implemented in several Perl modules which can be downloaded from CPAN. But all the functions I found involve an XS part, which essentially means that C code has to be compiled during the installation.

In other words, these modules can't just be attached to a bundle of Perl code; they have to be installed on a machine where that's permitted. Or at least, compilation capabilities are necessary. Which can turn into a mess if the target is a Windows computer that barely has Perl installed.

So I adapted one of the implementations of the most commonly used CRC32 calculation, and wrote it in pure Perl. It's really not an efficient way to obtain a CRC, and I wouldn't try it on long sequences. Its advantage is also its disadvantage: It's all in Perl.

Before giving the function away, I’d just point out that if one tries the following snippet

#!/usr/bin/perl

use warnings;
use strict;
require String::CRC32;
require Digest::CRC;

my $str = "This is just any string.";

my $refcrc = String::CRC32::crc32($str);
my $refcrc2 = Digest::CRC::crc32($str);
my $mycrc = mycrc32($str);

then $refcrc, $refcrc2 and $mycrc will have the same value (the last of which is calculated by the function at the end of this post).

And while we're at it, I'd also point out that for any string $str, the following code

my $append = mycrc32($str) ^ 0xffffffff;
my $res = mycrc32($str.pack("V", $append));

yields a CRC in $res which equals 0xffffffff. This is a well-known trick with CRCs: Append the CRC to the string for which it was calculated, and get a fixed value. And by the way, the CRC is done in little-endian convention, so the parallel Linux kernel operation would be crc32_le(~0, string, len), only that crc32_le returns the value XORed with 0xffffffff (or is it the Perl version being upside down? I don't know). Also note that the initial value of the CRC is set to ~0, which is effectively 0xffffffff again.

OK, time to show the function. It’s released to the public domain (Creative Commons’ CC0, if you like), so feel free to make any use of it.

sub mycrc32 {
 my ($input, $init_value, $polynomial) = @_;

 $init_value = 0 unless (defined $init_value);
 $polynomial = 0xedb88320 unless (defined $polynomial);

 my @lookup_table;

 for (my $i=0; $i<256; $i++) {
   my $x = $i;
   for (my $j=0; $j<8; $j++) {
     if ($x & 1) {
       $x = ($x >> 1) ^ $polynomial;
     } else {
       $x = $x >> 1;
     }
   }
   push @lookup_table, $x;
 }

 my $crc = $init_value ^ 0xffffffff;

 foreach my $x (unpack ('C*', $input)) {
   $crc = (($crc >> 8) & 0xffffff) ^ $lookup_table[ ($crc ^ $x) & 0xff ];
 }

 $crc = $crc ^ 0xffffffff;

 return $crc;
}

Workaround: Pending signal making wait_event_interruptible() return prematurely

It all starts with this: I'm not ready to return from my character device's release() method before I know that the underlying hardware has acknowledged the shutdown. It is actually expected to do so quickly, so I relied on a wait_event_interruptible() call within the release method to do the short wait for the acknowledgment.

And it actually worked well, until I hit specific cases where I pressed CTRL-C while a read() was blocking. I'm not exactly sure why, but if the signal arrived while blocking on wait_event_interruptible() within read(), the signal wouldn't be cleared, so release() was called with the signal pending. As was made quite evident by this little snippet in the release() code:

if (signal_pending(current))
  printk(KERN_WARNING "Signal is pending on release()...\n");

… which ended up with the wait_event_interruptible() in the release() method returning immediately, yelling that the hardware didn’t respond.

Blocking indefinitely is out of the question, so the simple workaround is to sleep another 100 ms if wait_event_interruptible() returns prematurely, and then check whether the hardware is done. That should be far more time than the hardware needs, and a fairly small penalty for the user.

So the waiting part in release() now goes:

if (wait_event_interruptible(fstr->wait, (!fstr->flag)))
  msleep(100);

The cute trick here is that the sleep takes place only if the wait was interrupted by a pending signal, so a normal release() call returns much faster.
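
Just to make the picture complete, the waiting-and-checking part in release() might go something like this. This is a sketch based on the snippet above, not the actual driver code; the warning message is made up, and msleep() requires <linux/delay.h>:

/* If a pending signal cut the wait short, give the hardware
   another 100 ms before checking on it */
if (wait_event_interruptible(fstr->wait, (!fstr->flag)))
  msleep(100);

if (fstr->flag)
  printk(KERN_WARNING "Hardware failed to acknowledge the shutdown on release()\n");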

Xilinx’ XST synthesizer bug II: Inferred RAM and mux

It looks like inferring RAMs and ROMs is the weak spot of XST. This is the second bug I've found using this synthesizer, this time in XST M.63c, which comes with ISE Release 12.2. The previous bug was ROM creation from a case statement. But hey, that was two years ago.

This time the code says (irrelevant parts eliminated):

   reg [3:0] 	 writeidx;
   reg [31:0] 	 buf_w0;
   reg [31:0] 	 buf_w1;
   reg 		 buf_wen;
   reg 		 buf_wen_d;
   reg [31:0] 	 buffer[0:15];
   reg [3:0] 	 counter;

   if (buf_wen)
     begin
	buffer[writeidx] <= buf_w0;
	writeidx <= writeidx + 1;
     end
   else if (buf_wen_d)
     begin
	buffer[writeidx] <= buf_w1;
	writeidx <= writeidx + 1;
     end

The slightly nasty thing about this clause is that “buffer” is an inferred distributed RAM (i.e. implemented in slices) because it's small, and there's an “if” statement which controls what is written to it. This messed things up. I'll forgive the synthesizer for failing to optimize away RAM elements that clearly have a constant value of zero, since their input is always zero. What I can't leave alone is that it created wrong logic. In particular, it completely ignored the existence of buf_w0, and generated logic as if only the buf_w1 assignment existed. As a matter of fact, buf_w0 wasn't even mentioned in the synthesis report. There was no warning about its disappearance. Like a good old Soviet elimination. I was lucky enough to read the synthesis warnings and learn that a register, which drives buf_w0, was optimized out, and I couldn't understand why. Until I checked what happened in the FPGA Editor, and saw that buf_w0 had gone up in smoke.

And here’s the silly workaround that fixed it. The code is logically equivalent, of course, but feeds XST with what I really want: A mux. Hurray. Not.

   if (buf_wen || buf_wen_d)
     begin
	buffer[writeidx] <= buf_wen ? buf_w0 :  buf_w1;
	writeidx <= writeidx + 1;
     end

PCI express from a Xilinx/Altera FPGA to a Linux machine: Making it easy

Update: The project is up and running, available for a large range of FPGAs. Click here to visit its home page.


Over the years in which I’ve worked on FPGA projects, I’ve always been frustrated by the difficulty of communicating with a PC. Or an embedded processor running a decent operating system, for that matter. It’s simply amazing that even though the FPGA’s code is implemented on a PC and programmed from the PC, it’s so complicated to move application data between the PC and the FPGA. Having a way to push data from the PC to the FPGA and pull other data back would be so useful not only for testing, but it could also be the actual application.

So wouldn’t it be wonderful to have a standard FIFO connection on the FPGA, and having the data written to the FIFO showing up in a convenient, readable way on the PC? Maybe several FIFOs? Maybe with configurable word widths? Well, that’s exactly the kind of framework I’m developing these days.

[Image: Xillybus usage example]

The usage idea is simple: A configurable core on the FPGA, which exposes standard FIFO interface lines to its user. On the other side we have a Linux-running PC or embedded processor, where there is a /dev device file for each FIFO in the FPGA. When a byte is written to the FIFO in the FPGA, it’s soon readable from the device file. Data is streamed naturally and seamlessly from the FIFO on the FPGA to a simple file descriptor in a userspace Linux application. No hassle with the I/O. Just simple FPGA design on one side, and a simple application on the Linux machine.

Ah, and the same goes for the reverse direction, of course.
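
Just to illustrate the point with something concrete, a userspace program consuming the FPGA's FIFO could be as plain as the sketch below. The device file name is made up for the sake of the example; the point is that it's nothing but ordinary file I/O:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  char buf[128];
  ssize_t n;
  int fd = open("/dev/fpga_fifo_0", O_RDONLY); /* Hypothetical device file */

  if (fd < 0) {
    perror("Failed to open device file");
    return 1;
  }

  /* Whatever the FPGA pushes into its FIFO shows up here */
  while ((n = read(fd, buf, sizeof(buf))) > 0)
    fwrite(buf, 1, n, stdout);

  close(fd);
  return 0;
}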

The transport is a PCI Express connection. With certain Spartan-6 and Virtex 5/6 devices, this boils down to connecting seven pins from the FPGA to the processor’s PCI Express port, or to a PCIe switch. Well, not exactly. A clock cleaner is most probably necessary. But it’s seven FPGA pins anyhow, with reference designs to copy from. It’s quite difficult to get this wrong.

No kernel programming will be necessary either. All that is needed is to compile a certain kernel module against the headers of the running Linux kernel. In many environments, this merely consists of typing "make" at the shell prompt. Plus copying a file or two.

So all in all, the package consists of a healthy chunk of Verilog code, which does the magic of turning plain FIFO interfaces into TLPs on the PCIe bus, and a kernel module on the Linux machine’s side, which talks with the hardware and presents the data so it can be read with simple file access.

If you could find an IP core like this useful, by all means have a look at the project's home page.

Nokia 6267 restarting itself and how I got around it

I know, I know. I have a very old cellular phone. But since I have enough electronic toys, I couldn’t care less about turning my phone into one. And it happens to be a good one.

Everything was OK until it failed to start. Or more precisely, it started, and then restarted itself. And again. And again.

It turned out that a defective MicroSD flash card caused it to go crazy. So I replaced the card, and everything looked fine again. But then it had a horrible relapse: It went back to this restarting pattern again, but this time taking out the MicroSD card didn't help. What turned out to be really bad was that it was impossible to connect it to a computer through USB for backup, because it would restart all the time.

It wasn't a power supply thing. I learned that from the fact that when the phone was started without a SIM, it asked me whether it should start the phone even so. And it didn't restart as long as I didn't press any button on that question.

So it was clear that the phone did something that went wrong a few seconds after being powered on. The trick was therefore to prevent it from getting on with its boot process, but still allow a USB connection.

Connecting the USB cord while in any of the pre-start menus turned out useless (Use without SIM? Exit from Flight mode?). So I looked a bit at the codes.

What did eventually work was to use the *#06# code, which is used to check the IMEI. The phone showed me the serial number and didn't restart, and when I plugged in the USB cord, I got the usual menu allowing me to choose a mode. From there on it was a lot of playing around, trying and retrying, until I finally recovered my phone list.

This also made it possible to reprogram the handset with Nokia’s Phoenix software, which didn’t work otherwise. Neither did the Green-*-3 three finger salute for a deep reset nor the infamous *#7370# code for the same purpose. These two never did anything, even when the phone appeared to be sane.

I should point out that this trick may have solved a very specific issue with my own phone's internal mess-up, and still, I thought it was best to have it written down for a rainy day.

Random notes as I wrote a PCI kernel module

These are a bunch of things I jotted down as I wrote a Linux kernel module for a PCI express peripheral I developed.

About kernel module Makefiles

A great guide here.

lspci and setpci

lspci is quite well-known. What is less known is that it can be used to get the tree structure of bridges and endpoints with

$ lspci -tv

lspci can also be used to get cleartext info about any card using the -x or -xxxx flag, possibly along with -v. The little trick with -x is that the dump shows the raw bytes in address order, so multi-byte registers (which are little-endian in PCI configuration space) appear byte-swapped when read left to right.

setpci is useful to look at specific registers in the card’s configuration space (and possibly alter them). For example, look at two registers of a PCI card picked by Vendor/Product IDs (no need to be root on my computer):

$ setpci -v -d 10b7:9300 BASE_ADDRESS_0 BASE_ADDRESS_1
0000:05:05.0 @10 = 0000bc01
0000:05:05.0 @14 = fbbfe000

To get a list of registers (such as BASE_ADDRESS_0) just go

$ setpci --dumpregs

__devexit_p ???

static struct pci_driver xillybus_driver = {
  .name = "xillybus",
  .id_table = xillyids,
  .probe = xilly_probe,
  .remove = __devexit_p(xilly_remove),
};

The __devexit_p is a macro turning into NULL when the module is compiled into the kernel itself (as opposed to a loadable module) and the kernel isn't configured for hotplugging. Since the module can't ever exit in that case, this gives the compiler an opportunity to optimize away the function altogether?
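
And for the record, the counterpart is marking the remove function itself with __devexit. The sketch below is hypothetical (it assumes the probe function stashed the iomapped pointer with pci_set_drvdata(), as the probe sketch further down does), but it shows the idea:

static void __devexit xilly_remove(struct pci_dev *pdev)
{
  void __iomem *registers = pci_get_drvdata(pdev);

  pci_iounmap(pdev, registers);
  pci_release_regions(pdev);
  pci_disable_device(pdev);
}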

Talking with your PCI device in 5 steps

(or: How to initialize a PCI device in the probe function)

  • Enable the device: pci_enable_device(pdev);
  • Check that the BAR is mapped as you expect (optional): if (!(pci_resource_flags(pdev, the_bar_you_want) & IORESOURCE_MEM)) { fail here }
  • Declare yourself the exclusive owner of the device (give or take /dev/mem): pci_request_regions(pdev, "your-driver-name");
  • Get a handle for I/O operations. Don't let the function's name confuse you; it applies to memory BARs as well as I/O BARs: pointer = pci_iomap(pdev, the_bar_you_want, length);
  • And when this is successful, reading and writing is possible with iowrite32(value, pointer) and ioread32(pointer).

"pointer" is just a pointer to any type (considered void by the kernel), and there are 16- and 8-bit operators as well. Keep an eye on endianness: if the logic at the other end treats the 32-bit word as big-endian, reading it as a 32-bit value on an x86 (little-endian) machine gets it twisted around.

It's also common to allow the device to be a master on the bus (necessary for DMA access and MSI interrupts): pci_set_master(pdev);

As for includes, I got away with only this set:

#include <linux/pci.h>
#include <linux/device.h>
#include <linux/io.h>
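
For orientation, here's how the five steps might fit together in a probe function. This is a hypothetical sketch rather than the driver discussed here: the BAR number, the mapping length and the register offsets are invented, and the error unwinding is kept minimal.

static int xilly_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
  void __iomem *registers;
  int rc;

  rc = pci_enable_device(pdev);
  if (rc)
    return rc;

  /* BAR 0 is expected to be a memory BAR in this made-up example */
  if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM)) {
    rc = -ENODEV;
    goto no_regions;
  }

  rc = pci_request_regions(pdev, "my-driver-name");
  if (rc)
    goto no_regions;

  registers = pci_iomap(pdev, 0, 128); /* First 128 bytes of BAR 0 */
  if (!registers) {
    rc = -EIO;
    goto no_iomap;
  }

  pci_set_master(pdev);             /* Bus mastering, for DMA and MSI */
  pci_set_drvdata(pdev, registers); /* So remove() can find the handle */

  iowrite32(1, registers);        /* Poke an imaginary control register */
  (void) ioread32(registers + 4); /* ... and read an imaginary status register */

  return 0;

no_iomap:
  pci_release_regions(pdev);
no_regions:
  pci_disable_device(pdev);
  return rc;
}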

Can the probe function sleep?

That was important for me, because I went for a heavy setup process involving DMA transfers at the very beginning. The following encouraging sentence was found in Documentation/pci.txt, section 1:

The probe function always gets called from process context, so it can sleep.

When PCIe card isn’t responsive

If the PCIe interface logic does nothing about requests sent to a legal BAR address, that's OK for iowrite32() operations (they're posted, so nobody bothers to check what happened to them), but ioread32() will make my computer freeze. It's a complete kernel crash without even an oops message. It looks like the processor locks up waiting for the completion packet, which never arrives.

Conclusion: Messing with the FPGA while the host PC is on will most likely hang the entire system, if the host attempts to do something with that interface.

This is actually surprising, because section 2.8 of the PCI Express spec 1.1 is very clear about a mandatory timeout mechanism, and that “The Completion Timeout timer must expire if a Request is not completed in 50 ms.” Is this my G31 chipset not meeting spec?

DMA through PCI calls?

Several write-ups resembling the kernel's own DMA-API-HOWTO.txt are out there. I go for this one, because it says a few words about using the PCI-specific functions, which are used extensively in the kernel code.

__get_free_page(s) vs. alloc_page(s)

That's confusing. Both get you either one page of PAGE_SIZE bytes of memory (4096 bytes in most cases), or several of them (the -s versions), but __get_free_pages returns an actual address, like kmalloc, while alloc_pages returns page information via a struct page pointer. For DMA, the former's memory chunk is mapped with e.g. pci_map_single(), but the latter with dma_map_page(). Some more details here.
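
To make the difference tangible, here's a sketch of the two flavors side by side. The sizes and the DMA direction are arbitrary, error checking is omitted, and the old-style PCI wrappers mentioned above are used:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/pci.h>

static void alloc_two_ways(struct pci_dev *pdev)
{
  unsigned long vaddr;
  struct page *pg;
  dma_addr_t bus_addr;

  /* __get_free_page() hands back a kernel virtual address, like kmalloc()... */
  vaddr = __get_free_page(GFP_KERNEL);
  bus_addr = pci_map_single(pdev, (void *) vaddr, PAGE_SIZE, PCI_DMA_FROMDEVICE);

  /* ... while alloc_pages() hands back a struct page pointer.
     Order 2 means 2^2 = 4 contiguous pages. */
  pg = alloc_pages(GFP_KERNEL, 2);
  bus_addr = dma_map_page(&pdev->dev, pg, 0, 4 * PAGE_SIZE, DMA_FROM_DEVICE);

  /* In both cases, bus_addr is what the hardware gets to see */
}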

MSI interrupt handlers

Unlike classic IRQ handlers, which may be called just because a shared interrupt line fired, there's no reason for an MSI handler to check whether an interrupt is really pending. Yet another good reason to use MSI.
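
So the handler can go right ahead and handle. A sketch (privdata stands for whatever private structure was registered with request_irq(); assume it carries an int flag and a wait_queue_head_t waitq):

static irqreturn_t my_msi_handler(int irq, void *data)
{
  struct my_privdata *privdata = data; /* Hypothetical private data structure */

  /* No "is this interrupt really ours?" check: with MSI it always is */
  privdata->flag = 1;
  wake_up_interruptible(&privdata->waitq);

  return IRQ_HANDLED;
}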

Volatile on wake-up conditions?

If I’m waking up on some condition like

wait_event_interruptible(privdata->waitq, (privdata->flag != 0));

should the “flag” entry be marked as “volatile” in the privdata structure? I mean, an interrupt will change its value, so “volatile” is the old way to do this. The answer is no, no, and again no. One should trust the kernel’s API to handle the volatility issue, since the kernel’s underlying opinion (which seems pretty justified) is that “volatile” is inherently bad.

Mutexes and spinlocks

Things I always heard people say about mutexes, and which are still worth emphasizing:

  • Use as much mutex granularity as possible, as long as it's structured enough not to end up in a deadlock.
  • Set up a logical who's-locked-first scheme, and write it as a comment in the code. It goes something like mutex_rd -> mutex_wr -> spinlock_rd -> spinlock_wr and so on. So it's absolutely clear that if mutex_wr is taken, mutex_rd must not be asked for, but spinlock_wr may. One is allowed to skip locks in the list, but not go backwards.
  • Whenever possible, release mutexes before going to sleep (waiting for events in particular, and even more in particular if the waiting has no timeout). For example, waiting for data in a read() method handler with the mutex taken may sound reasonable if the mutex is only related to read() operations, but what if the device is open()ed a second time during this sleep? All of a sudden the open() blocks for no apparent reason.
  • … but this makes it a bit tricky: We're woken up because our condition was met (say, data arrived), but while we were trying to take the mutex for handling the event, the data was read() by some other process, and the state data structure was also completely changed. So after regaining the mutex, the overall situation must be reassessed from scratch (see the sketch after this list).
  • And of course, no sleeping with spinlocks held.
  • Use the wait_event_* functions' return values to tell why we were woken up, and don't rely on the tested condition for that. In particular if we slept without the mutex, because the flag may be altered by another process between the moment the event was triggered and the moment our process actually got the CPU.
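
Here's a sketch of the release-the-mutex-before-sleeping pattern from the list above, in the form of a read() method. Everything is hypothetical (the device structure, the field names, the buffer handling); the point is the loop that reassesses the situation after regaining the mutex:

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

struct my_dev {                  /* Hypothetical driver state */
  struct mutex mutex;
  wait_queue_head_t waitq;
  size_t bytes_available;
  char buffer[4096];
};

static ssize_t my_read(struct file *filp, char __user *userbuf,
                       size_t count, loff_t *f_pos)
{
  struct my_dev *dev = filp->private_data;
  ssize_t rc;

  while (1) {
    if (mutex_lock_interruptible(&dev->mutex))
      return -ERESTARTSYS;

    if (dev->bytes_available)
      break;                     /* There's data; the mutex is still held */

    mutex_unlock(&dev->mutex);   /* Never sleep with the mutex taken */

    if (wait_event_interruptible(dev->waitq, dev->bytes_available))
      return -ERESTARTSYS;       /* The return value says: a signal */

    /* Another process may have grabbed the data between the wakeup and
       this point, so loop and reassess with the mutex held */
  }

  rc = min_t(size_t, count, dev->bytes_available);
  if (copy_to_user(userbuf, dev->buffer, rc))
    rc = -EFAULT;
  else
    dev->bytes_available -= rc;

  mutex_unlock(&dev->mutex);
  return rc;
}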

Cygwin: rm is actually move to Recycle Bin

This one really pissed me off. I installed Cygwin recently on an XP machine. All kinds of .cyg000(mumbo jumbo) directories started to show up, and I had no idea why.

Then I got it: The rm command in recent Cygwin (versions > 1.7?) is “friendly”: Rather than unlinking the files, they are moved to a “hidden” directory. Isn’t it wonderful that Cygwin is nice to me? Not. I want the utilities to do what I expect them to do, nothing else. One day, I hope, people will realize that making “improvements” in software is just a way to make things break. In this case, it was my “make clean” not doing what I expected.

I don't know if there is an environment variable to fiddle with. If you know of one, please comment below. Impatient as I was to get rid of these pests, I downloaded the clean GNU version for Windows (a non-Cygwin one) and kissed the old one goodbye. As simple as that.

Thunderbird: Recovering corrupt address book

It all started with a blue screen (Windows, right?), which seemed to have something to do with Firefox. Anyhow, after that crash Thunderbird told me it couldn't open the abook.mab file, and hence my contacts were lost. Which means it wouldn't autocomplete email addresses as I type them, which I'm not ready to take.

The solution was given in a forum thread in the form of a PHP script. Perl would be the correct language to do this, but since it saved my bottom, who am I to complain.

Since it’s so good, I hope it’s OK that I’ll repeat it here. It’s by the Ubuntu forum user mikerobinson:

<?php
error_reporting(E_ALL);
$abook = file_get_contents('abook.mab-1.bak');

preg_match_all('/\((.*)\)/Ums', $abook, $matches);

$matches = $matches[1];

foreach ($matches as $key => $match) {
    $entry = explode('=', $match);
    if (isset($entry[1]) && strlen($entry[1]) > 4 && !isset($skipnext)) {
        $entry[1] = str_replace("\\\n", '', $entry[1]);
        $entry[1] = str_replace('\\\\', '', $entry[1]);
        $entry[1] = str_replace('\\', ')', $entry[1]); // the backslashes SHOULD be at the end of each line

        // Unicode characters
        if (strstr($entry[1],'$')) {
            $entry[1] = str_replace('$', "\\x", $entry[1]);
            $entry[1] = preg_replace("#(\\\x[0-9A-F]{2})#e", "chr(hexdec('\\1'))", $entry[1]);
        }

        $matches[$key] = utf8_decode($entry[1]);
        if (strstr($entry[1],'@')) $skipnext = true;
    }
    else {
        unset($matches[$key]);
        unset($skipnext);
        if (strstr($entry[1],'@')) $skipnext = true;
    }
    unset($entry);
}

$previous = null;
foreach ($matches as $match) {
    if (strstr($match,'@')) {
        if (strtolower($match) != strtolower($previous)) {
            if (isset($addy)) $addressbook[] = array($match, end($addy));
            else $addressbook[] = array($match, $match);
            unset($addy);
            $previous = $match;
        }
    }
    else {
        $addy[] = $match;
    }
}

echo "First Name\tLast Name\tDisplay Name\tNickname\tPrimary Email\tSecondary Email\tScreen Name\tWork Phone\tHome Phone\tFax Number\tPager Number\tMobile Number\tHome Address\tHome Address 2\tHome City\tHome State\tHome ZipCode\tHome Country\tWork Address\tWork Address 2\tWork City\tWork State\tWork ZipCode\tWork Country\tJob Title\tDepartment\tOrganization\tWeb Page 1\tWeb Page 2\tBirth Year\tBirth Month\tBirth Day\tCustom 1\tCustom 2\tCustom 3\tCustom 4\tNotes\t";
foreach ($addressbook as $addy) {
    echo "\t\t{$addy[1]}\t\t{$addy[0]}\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n";
}

After saving this script as savior.php, one goes, at the shell prompt (on Linux, right?):

# php savior.php > book.txt

And then manually edit any garbage away from the text file. Keep in mind that a title line should be kept as the first line.

And then go back to Thunderbird, click Tools > Address Book > Tools > Import… and import the file as a tab-separated file. That's it. The new addresses will be in a new folder, but who cares. Autocompletion is back!

Root over NFS: Diskless boot from network

[Image: Motherboard, power supply and optional screen]

Introduction

Since I wanted to develop a kernel module on Linux for a specific piece of hardware, I thought it would be a nice idea to do that on a computer I wouldn't mind crashing every now and then. Wanting physical access to the target hardware, a bare motherboard with nothing but the absolute minimum seemed like a nice direction. The idea was that this motherboard wouldn't have any disk, hence no fsck's every time I made a mistake in kernel mode.

Using cobbler for this was overkill, but this is how I did it back in 2011.

So I went for booting with PXE, and mounting the root over NFS. I also chose to do so using the kernel's native support, rather than doing it with an initrd image. I should mention that root over NFS has the drawback of not having a disk cache. This means that every time a file is read, there's some network activity, as opposed to files read from a local disk, which are actually read from RAM if the file has been accessed recently. The difference is evident when running bash scripts and make builds (compiling a kernel, for example), because each invocation of any executable involves a lot of access to more or less the same set of supporting files (libraries, locale configurations, C header files, you name it). This massive file access is not noticed when the disk cache is used, and neither is it evident when running a single application. But on the diskless machine, this is what I got compiling a recent kernel:

# time { make bzImage && make modules ; }

<< ... lots of output ... >>

real	95m22.497s
user	51m46.023s
sys	6m7.257s

Hmmm… We have some 55% of the time for actual crunching. Not as horrible as I expected, and still not so impressive.

Maybe running a virtualized machine on a disk image over the NFS mount would solve this, since the guest would maintain a cache of its own. I would also think about running on a network block device, which would most likely be an elegant solution. I did neither, because my main objective was tolerance to kernel crashes and not necessarily running heavy scripts.

The machine with the real disk was running Fedora Core 12.

Powering up a bare motherboard

Just a few things to keep in mind:

  • An ATX power supply will not supply power (and neither will its fan turn) unless connected to a motherboard. A couple of pins on the 20/24 pin connector are used for the purpose of the motherboard telling the power supply when to go on and off.
  • A motherboard won’t power on just like that. When external power is applied to it, it expects the power button to be pressed. Locate the header connector on your motherboard, and find the two pins that are normally connected to the power button (the motherboard’s manual will help). Use a jumper or just a piece of metal to short-circuit these two with the power supply connected, and power should go on.
  • It's possible to change the BIOS settings so that power goes on immediately when external power is applied. It's somewhere in the BIOS' menus under "Power Management".
  • Once the fans start rolling, the BIOS should show something on the on-board VGA output. If it doesn't, either the CPU or the memory is not seated properly. Or check the 12V power connector.
  • Also, there may be some trouble getting the on-board gigabit NIC to talk nicely with an Edimax Ethernet switch. Running a SMART LAN test and enabling the NIC's boot ROM (both in the BIOS menus) solved this. I don't know which of the two did the job.

The boot process

These are the steps of the boot sequence I eventually ended up with:

  • The diskless computer wakes up and makes a DHCP broadcast request on the network. The host computer assigns it an IP address (possibly detecting the NIC's MAC address and giving it a constant address), and also tells it that pxelinux.0 is the file to load if PXE is desired.
  • The diskless machine loads pxelinux.0 through TFTP. The TFTP daemon's root is /var/lib/tftpboot.
  • pxelinux.0 then attempts to find a configuration file, ending up loading /pxelinux.cfg/default, and then /menu.c32 through TFTP.
  • If all is well, a GRUB-like menu appears.
  • Pxelinux then loads the appropriate kernel into memory and starts it (I didn't run with an initrd) with the given parameters.
  • The kernel makes its own DHCP broadcast, gets its IP address, and connects to the NFS server, according to kernel parameters.
  • The NFS mount is used as read-only root.
  • Boot proceeds normally like any Linux system (executing /sbin/init or the like).

Getting started: cobbler

(Once again, my year 2019 wisdom is that cobbler is overkill)

In order to get a quick start, cobbler is a nice tool for getting the right files in the right places. In the case of setting up a diskless computer, cobbler is useful up to the stage in which a boot is successful. After that, I stopped its daemon and made sure the service is off in all runlevels, in order to ensure that it won't mess with the configuration files.

First of all:

# yum install cobbler
# yum install dhcp

(and agree to a zillion other packages for dependencies, tftp-server in particular)

It’s written in Python (yuck!) so when things go wrong, one gets these ugly exception reports.

And I'll say this now: You may want to run this every now and then to check how much disk space cobbler is eating (you may be surprised):

# du -sh /var/www/cobbler/

Edited /etc/cobbler/dhcp.template (and NOT /etc/dhcp/dhcpd.conf) to set up the DHCP service properly: Set ‘subnet’ to match my internal network, ‘routers’ to the default gateway to WAN, ‘domain-name-servers’ to the DNSes available, ‘range dynamic-bootp’ to the range of addresses free for lease.

I also added an entry of this form, to give the target a constant address (which is outside the dynamic range, so it doesn't get stolen):

host mylaptop {
     hardware ethernet 08:00:2b:4c:59:23;
     fixed-address 192.168.1.222;
}

Enable (with chkconfig or the GUI equivalent) and start (with service start or the GUI equivalent) the cobblerd, dhcpd and httpd services. And then

# cobbler check

Which came back with a lot of remarks. Since I run SELinux in permissive mode, I ignored all SELinux-related remarks. iptables is doing nothing right now, so I didn't do anything special with the firewall either. I couldn't care less about debmirror or the password for sample templates.

I did change the server and next_server entries in /etc/cobbler/settings to the computer's IP as seen through the relevant Ethernet card.

Also, I changed manage_dhcp to 1 in the same file, since I didn’t have DHCP running previously, and I’m somewhat lazy.

Finally, I went

# cobbler get-loaders

otherwise cobbler yells at me when trying to sync.

And then

# cobbler sync

which is where I got most of the error messages (and wrote their fixes above as if I had a clue).

When I got an OK from this command, I saw DHCP up and running, and also the first stage of PXE boot going into my laptop. But there was nothing to boot yet, so I copied the laptop’s running kernel and initrd to the boot server, and configured it as follows:

# cobbler distro add --name=justatest --kernel=/path/to/vmlinuz-2.6.27 --initrd=/path/to/initrd-2.6.27.img
# cobbler profile add --name=testingPXE --distro=justatest
# cobbler sync

At this point, PXELINUX was clearly loaded and reported IP addresses correctly, but hung after the line "Trying to load: pxelinux.cfg/default". Checking with Wireshark revealed that the specific file had indeed loaded, and so had menu.c32, but after that nothing happened. A menu should have appeared, as it did when I did the same thing on more recent hardware. Which means, as usual, that maintainers are so happy about playing with their software that they forget that some people are actually supposed to use it, including those not running top-gun hardware.

Anyhow, editing /var/lib/tftpboot/pxelinux.cfg/default so that PROMPT is 1 and not 0, I got a “boot:” prompt, on which I could enter the word “testingPXE”, and that started a successful boot of the laptop.

A useful command when things don't really work (apart from checking /var/log/messages):

# cobbler report

Since the target motherboard is a piece of recent hardware, I could declare this phase successful. At this point, cobbler should be disabled by stopping the cobblerd service and making sure with chkconfig that it will never wake up again. It's better to edit the target files directly from now on.

Make sure that tftp and dhcpd services are enabled in the relevant runlevels.

Preparing the NFS share

The next step was to install Linux on the to-be remote disk, so that the computer will boot from the network and behave as if it ran on a local disk. Only the data is somewhere else. I went for CentOS-5.5-i386. I really wanted to try Slackware, but their BitTorrent ISO image took way too much time to download.

First I set up an NFS share with an empty directory, so that the installation can go somewhere. I did it with Fedora's GUI interface (shame on me, I'm sure I'll pay for this somehow), so it was just System > Administration > Server Settings > NFS. I added a share, pointed it at the directory, set hosts to "*", allowed Read/Write, and most importantly, checked "Treat remote root as local root". After all, I want to install a system, so root squashing is out of the question. This is not the safest setting in the world, and is a clear opening for implanting rootkits on your system. The only thing left to protect the computer at this stage is a firewall.

My /etc/exports now read:

/storage/diskless/cent55_root      *(rw,sync,no_root_squash)

which I reduced to (after copying the files from the virtual machine, as described below)

/storage/diskless/cent55_root      10.0.0.0/255.255.255.0(rw,sync,no_root_squash)

so that only computers on my little private LAN will get access. I have my limits too. After restarting the NFS service (was it necessary?) I mounted the share easily.

My original intention was to run the installation on the diskless machine right away, but I found out the hard way that installation software behaves exactly the way Microsoft thinks it should: Either you run it the way it was intended, or forget about it. After a while I realized that the simplest way was to create a virtual machine, run the full installation on it, and after everything is finished, copy the files into my to-be NFS directory. Not very elegant, not quick at all, but at least I knew this would work.

After the installation was done, we're left with copying the files. I tried doing so by booting the virtual machine in rescue mode, mounting the disk and the NFS share, and copying everything. But since I didn't set up any paravirtualization on either the disk or the NIC, it turned out horribly slow. Instead, I did a dirty mount-by-probing on the disk image, and mounted the first and only partition with

# i=512; while ! mount -o loop,offset=$i centos.img mnt ; do i=$((i+512)); echo Now trying $i ; done

Which is such a dirty trick it should be outlawed, in particular because it's unnecessary. After spitting out a lot of junk, it stopped after writing "Now trying 32256", which happens to be (virtual) sector #63 (the first sector being zero). And then I changed directory to mnt, and just went

# { tar -c --one-file-system --to-stdout --preserve * ; } | { cd /path/to/new/root && tar --preserve -v -x ; }

which took a few minutes or so.

Year 2019 update: Recent Linuxes might not support NFSv2 out of the box, which might prevent mounting by the kernel. Try:

# cat /proc/fs/nfsd/versions
-2 +3 +4 +4.1 +4.2

If you got that "-2", NFSv2 isn't supported. Edit /etc/default/nfs-kernel-server, possibly following this page.

Preparing the client for running root over NFS

It turns out that the idea is pretty unpopular. It's one thing to boot a system from the network to install something on the local disk, but judging from the availability of documentation and ready-to-go utilities, running on a remote disk seems to be an exotic idea.

For example, the tools for generating a root-over-NFS setup are in general Archlinux-dependent, and are based upon working with an initrd to get things in place. Moreover, booting from NFS without an initrd requires certain kernel options, which weren't even enabled as modules in a few distribution kernels I checked.

I made the decision not to have an initrd in the bootup process, simply because it doesn't make sense. The kernel has the ability to do the DHCP discovery and mount the root from NFS, so I can't see why I should mess with maintaining an initrd image. The lack of an initrd is why cobbler isn't a candidate anymore: It's not ready to import a kernel without an initrd.

So the first phase is to compile a kernel with root-over-NFS support. Following {kernel source directory}/Documentation/filesystems/nfs/nfsroot.txt and a nice HOWTO, I enabled (as in Y, not M) the following flags in my kernel configuration: CONFIG_NFS_FS, CONFIG_ROOT_NFS, CONFIG_NET_ETHERNET, CONFIG_IP_PNP, CONFIG_IP_PNP_RARP, CONFIG_IP_PNP_BOOTP, CONFIG_IP_PNP_DHCP.

I also made sure my specific on-board Ethernet card's driver was in the kernel itself, and not a module. My motherboard won't boot from an external NIC, but it's actually possible to load the kernel through one NIC and let the kernel use another one, if that makes life easier (even though it's weird having two network cables going to a bare motherboard. Oh well).

Then I compiled the kernel as usual (some of us do that from time to time). For the heck of it, I copied bzImage and friends to /boot, but the real thing was to copy vmlinuz to /var/lib/tftpboot/images/ and edit /var/lib/tftpboot/pxelinux.cfg/default so it has an entry saying:

LABEL nfs-centos55
        kernel /images/vmlinuz-2.6.18-NFS1
        MENU LABEL nfs-centos55
        append ip=::::::dhcp text rootfstype=nfs root=/dev/nfs nfsroot=10.0.0.1:/diskless/cent55_root
        ipappend 2

Note that there is no initrd image mentioned here at all!

I should also mention that sometimes there's a problem mounting NFS in the standard way (using UDP), probably due to problems with packet fragmentation. The symptom is that the DHCP goes fine, but a message saying "server X not responding, still trying" appears and the boot process is stuck. The immediate solution is to run NFS over TCP, which is done simply by adding the "tcp" mount option, e.g. "nfsroot=10.0.0.1:/diskless/cent55_root,tcp".

Finally, under the to-be-NFS’ed directory, I changed /etc/fstab, so that the first line reads

10.0.0.1:/diskless/cent55_root   /      nfs     defaults        0 0

Otherwise the boot fails when attempting to remount the root directory as read-write.

I booted my little motherboard, and apparently nothing happened. But looking at /var/log/messages under the NFS'ed directory, it was clear that the boot process had kicked off, but went horribly wrong. Why? Because I skipped the initram phase. And a few things happen there.

Making up for not using initram

So I had a look at the initram image that comes along with the distribution, with a

# zcat cent55_root/boot/initrd-2.6.18-194.el5.img | cpio -i

and looked at the file named init in the initrd's root directory. It turns out that the /dev subdirectory is mounted as a tmpfs at the initram stage, with the necessary regular elements mknod'ed every time. I have to say there is something clever about this, since it keeps /dev clean. On the other hand, it's likely to confuse traditional UNIXers, and it makes the whole system depend on the initram stage, both of which are pretty yuck.

The simple solution is to create these files for real in the /dev directory. So I copied that init file into cent55_root, and edited it to look like this:

mkdir /dev/pts
mkdir /dev/shm
mkdir /dev/mapper
mknod /dev/null c 1 3
mknod /dev/zero c 1 5
mknod /dev/urandom c 1 9
mknod /dev/systty c 4 0
mknod /dev/tty c 5 0
mknod /dev/console c 5 1
mknod /dev/ptmx c 5 2
mknod /dev/rtc c 10 135
mknod /dev/tty0 c 4 0
mknod /dev/tty1 c 4 1
mknod /dev/tty2 c 4 2
mknod /dev/tty3 c 4 3
mknod /dev/tty4 c 4 4
mknod /dev/tty5 c 4 5
mknod /dev/tty6 c 4 6
mknod /dev/tty7 c 4 7
mknod /dev/tty8 c 4 8
mknod /dev/tty9 c 4 9
mknod /dev/tty10 c 4 10
mknod /dev/tty11 c 4 11
mknod /dev/tty12 c 4 12
mknod /dev/ttyS0 c 4 64
mknod /dev/ttyS1 c 4 65
mknod /dev/ttyS2 c 4 66
mknod /dev/ttyS3 c 4 67
chmod 0755 /dev/{null,zero,urandom,systty,tty,console,ptmx} /dev/tty*

And then chrooted myself in the cent55_root directory:

# chroot .

after which I ran the little snippet above. It's important to check that the /dev directory doesn't contain any regular files by any chance. Mine had urandom and null, which resulted in mknod failing, and in turn I got errors saying /dev/null is on a read-only file system.

And now the machine booted pretty OK. I disabled the microcode_ctl and yum_updatesd services, the first one because it held the boot stuck for a minute or so, and the second because I find it annoying.

And that's it! The motherboard now booted like clockwork!


Kind-of Appendix


My failed attempt to install directly on the diskless machine

Oh, I was so naive thinking this would be quick and easy. Having the ISO images at hand, I went

# mount -o loop CentOS-5.5-i386-bin-DVD.iso mnt
# cobbler import --path=/home/eli/mnt --name=centos55 --arch=i386
# umount mnt
# cobbler sync

Note to self: Use absolute paths! Going --path=mnt made rsync go to /mnt, and copy all the remote mounts I had, which is a nice backup of my system, but no thanks. Ah, and it didn't end there! When I discovered my mistake and CTRL-C'ed the process, I saw how the cobbler subdirectory kept on growing even so. Using pstree I found out that cobblerd (cobbler's daemon, right?) continued to rsync data into my false repository in the background. I had previously wondered why a daemon was necessary, and I also wondered when that moment of annoyance of using a do-it-all tool would come, and there I had both answers at once. And one can't just turn off cobblerd, because everything revolves around that daemon. Yuck!

And even if I had got this right, I don't think I'd get the Nobel prize for it, in particular since this made cobbler call rsync, requesting it to basically duplicate the ISO image's content. It also scanned the files in all kinds of ways to support kickstart and virtual stuff, which I have no interest in. And there went an extra 4 GB.

Reinstalling

Since cobbler started backing up the neighbouring computers (as a result of my mistake, I have to admit) and filled up my root partition, I had to take some violent measures (that is, some rm -rf), which soon ended up with cobblerd refusing to start at all (producing yucky and meaningless Python error messages, what else). The only solution was to save my /etc/cobbler/dhcp.template and /etc/cobbler/settings, then yum remove cobbler, and remove the directories /etc/cobbler, /var/www/cobbler and /var/lib/cobbler. Go yum install cobbler again, and copy the two saved files back. And cobbler get-loaders.

That’s the only thing these know-it-all tools understand: violence.