LVM volume resizing jots

These are my jots as I resized a partition containing an encrypted LVM physical volume, and then took advantage of that extra space by extending a logical volume containing an ext4 file system. The system is an Ubuntu 14.04.1 with a 3.13.0-35-generic kernel.

There are several HOWTOs on this, but somehow I struggled a bit before I got it working. Since I’ll do this again sometime in the future (there’s still some space left on the physical volume) I wrote it down. I mainly followed some of the answers to this question.

The overall setting:

$ ls -lR /dev/mapper
/dev/mapper:
total 0
crw------- 1 root root 10, 236 Nov 17 16:35 control
lrwxrwxrwx 1 root root       7 Nov 17 16:35 cryptdisk -> ../dm-0
lrwxrwxrwx 1 root root       7 Nov 17 16:35 vg_main-lv_home -> ../dm-3
lrwxrwxrwx 1 root root       7 Nov 17 16:35 vg_main-lv_root -> ../dm-2
lrwxrwxrwx 1 root root       7 Nov 17 16:35 vg_main-lv_swap -> ../dm-1

$ ls -lR /dev/vg_main/
/dev/vg_main/:
total 0
lrwxrwxrwx 1 root root 7 Nov 17 16:35 lv_home -> ../dm-3
lrwxrwxrwx 1 root root 7 Nov 17 16:35 lv_root -> ../dm-2
lrwxrwxrwx 1 root root 7 Nov 17 16:35 lv_swap -> ../dm-1

And the LVM players after the operation described below:

lvm> pvs
 PV                    VG      Fmt  Attr PSize   PFree 
 /dev/mapper/cryptdisk vg_main lvm2 a--  465.56g 121.56g
lvm> vgs
 VG      #PV #LV #SN Attr   VSize   VFree 
 vg_main   1   3   0 wz--n- 465.56g 121.56g
lvm> lvs
 LV      VG      Attr      LSize   Pool Origin Data%  Move Log Copy%  Convert
 lv_home vg_main -wi-ao--- 300.00g                                          
 lv_root vg_main -wi-ao---  40.00g                                          
 lv_swap vg_main -wi-ao---   4.00g

Invoke “lvm” in order to run LVM related commands (probably not really required)

Make LVM detect that I’ve resized the underlying partition (mapped as cryptdisk). Note that the -t flag makes this a test run only; drop it to actually apply the change:

lvm> pvresize -t /dev/mapper/cryptdisk

Now to resizing the logical volume. Unfortunately, the Logical Volume Management GUI tool refused that, saying that the volume is not mounted, but in use (actually, I think it *was* mounted). So I went for the low-level way.

Under “Advanced Options” I went for a rescue boot, and chose a root shell.

Check the filesystem in question

fsck -f /dev/mapper/vg_main-lv_home

Back to the “lvm” shell. A little test, note the -t flag (making lv_home, under vg_main, 200 GiB larger):

lvm> lvextend -t -L +200g /dev/vg_main/lv_home

It should write out the desired final size (e.g. 300 GiB)

Then for real:

lvm> lvextend -L +200g /dev/vg_main/lv_home

Oops, I got “Command failed with status code 5”. The reason was that the root filesystem was mounted read-only. After fixing that, I got “Logical volume successfully resized”.

But wait! There is no device file /dev/vg_main/lv_home

Now resize the ext4 filesystem

resize2fs /dev/mapper/vg_main-lv_home

And run a final check again:

fsck -f /dev/mapper/vg_main-lv_home

And rebooted the computer normally.
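For next time, the sequence above can be summed up as a small helper that only prints the commands, in the order I ran them, without touching anything. This is just a sketch: the function name print_resize_plan is made up, and it assumes the volume group name contains no hyphens (device-mapper doubles those in the /dev/mapper name).

```shell
#!/bin/bash
# Print (not run) the resize sequence from these notes for one logical volume.
# Usage: print_resize_plan VG LV SIZE_INCREMENT
# print_resize_plan is a hypothetical helper name, not an LVM tool.
print_resize_plan() {
  local vg=$1 lv=$2 inc=$3
  # Assumes no hyphens in the VG / LV names
  local mapped="/dev/mapper/${vg}-${lv}"

  echo "fsck -f $mapped"                        # check before touching anything
  echo "lvm lvextend -t -L +$inc /dev/$vg/$lv"  # test run
  echo "lvm lvextend -L +$inc /dev/$vg/$lv"     # for real
  echo "resize2fs $mapped"                      # grow ext4 to fill the volume
  echo "fsck -f $mapped"                        # final check
}

print_resize_plan vg_main lv_home 200g
```

Printing rather than running is deliberate: each step deserves a look at its output before going on to the next one.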

systemd jots

I’m not an expert on this

These are just my what-on-earth-is-going-on-here notes as I tried to understand how my Debian 8.2 (“Jessie”) machine boots up. Conclusion: It’s a mess. More specifically, it’s a weird mix between good-old SystemV init scripts and a nasty flavor of upstart. And they say it’s here to stay. Maybe. But I doubt those init.d scripts will remain for long.

General notes

  • systemctl is the Swiss knife. Most notable commands: systemctl {halt, poweroff, reboot}
  • Also: systemctl status (for a general view, with PIDs for jobs) or with the name of a service to get more specific info
  • For analysis of what’s going on: systemctl {cat, list-dependencies}
  • Reload configuration files (after making changes): systemctl daemon-reload
  • LSB stands for Linux Standard Base. In systemd context, it’s the standard Linux services
  • There are several special units: man systemd.special
  • An example for a service definition file (for SSH): /etc/systemd/system/sshd.service. There aren’t so many of these.

The general view

My atd daemon didn’t kick off, so I got this:

(the numbers are process IDs, which is quite nice, but don’t kill them directly — use systemctl for that too)

$ systemctl status
 diskless
    State: degraded
     Jobs: 0 queued
   Failed: 1 units
    Since: Wed 2015-11-11 14:45:42 IST; 4min 39s ago
   CGroup: /
           ├─1 /sbin/init text
           └─system.slice
             ├─dbus.service
             │ └─352 /usr/bin/dbus-daemon --system --address=systemd: --nofork -
             ├─cron.service
             │ └─345 /usr/sbin/cron -f
             ├─nfs-common.service
             │ ├─299 /sbin/rpc.statd
             │ └─342 /usr/sbin/rpc.idmapd
             ├─exim4.service
             │ └─632 /usr/sbin/exim4 -bd -q30m
             ├─systemd-journald.service
             │ └─127 /lib/systemd/systemd-journald
             ├─ssh.service
             │ ├─347 /usr/sbin/sshd -D
             │ ├─639 sshd: fake [priv]
             │ ├─641 sshd: fake@pts/0
             │ ├─642 -bash
             │ ├─666 systemctl status
             │ └─667 systemctl status
             ├─systemd-logind.service
             │ └─349 /lib/systemd/systemd-logind
             ├─system-getty.slice
             │ └─getty@tty1.service
             │   └─402 /sbin/agetty --noclear tty1 linux
             ├─systemd-udevd.service
             │ └─139 /lib/systemd/systemd-udevd
             ├─rpcbind.service
             │ └─266 /sbin/rpcbind -w
             ├─irqbalance.service
             │ └─370 /usr/sbin/irqbalance --pid=/var/run/irqbalance.pid
             └─rsyslog.service
               └─398 /usr/sbin/rsyslogd -n

Networking service who-does-what

What about the networking service? Just

$ systemctl

(not necessarily as root) listed all known services (including those that didn’t start), and among others

  networking.service                 loaded active exited    LSB: Raise network interfaces.

so let’s take a closer look at the networking service:

$ systemctl status networking.service
 networking.service - LSB: Raise network interfaces.
   Loaded: loaded (/etc/init.d/networking)
  Drop-In: /run/systemd/generator/networking.service.d
           └─50-insserv.conf-$network.conf
        /lib/systemd/system/networking.service.d
           └─network-pre.conf
   Active: active (exited) since Wed 2015-11-11 11:56:35 IST; 1h 16min ago
  Process: 242 ExecStart=/etc/init.d/networking start (code=exited, status=0/SUCCESS)

OK, let’s start with the drop-in file:

$ cat /run/systemd/generator/networking.service.d/50-insserv.conf-\$network.conf
# Automatically generated by systemd-insserv-generator

[Unit]
Wants=network.target
Before=network.target

Not really informative. Note that /run is a tmpfs, so no doubt the file was automatically generated. So what about

$ cat /lib/systemd/system/networking.service.d/network-pre.conf
[Unit]
After=network-pre.target

Even more internal mumbo-jumbo. So much for the drop-ins.

Now, why am I working so hard? There’s the “cat” command!

$ systemctl cat networking.service
# /run/systemd/generator.late/networking.service
# Automatically generated by systemd-sysv-generator

[Unit]
SourcePath=/etc/init.d/networking
Description=LSB: Raise network interfaces.
DefaultDependencies=no
Before=sysinit.target shutdown.target
After=mountkernfs.service local-fs.target urandom.service
Conflicts=shutdown.target

[Service]
Type=forking
Restart=no
TimeoutSec=0
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
SysVStartPriority=12
ExecStart=/etc/init.d/networking start
ExecStop=/etc/init.d/networking stop
ExecReload=/etc/init.d/networking reload

# /run/systemd/generator/networking.service.d/50-insserv.conf-$network.conf
# Automatically generated by systemd-insserv-generator

[Unit]
Wants=network.target
Before=network.target

# /lib/systemd/system/networking.service.d/network-pre.conf
[Unit]
After=network-pre.target

Say what? The actual networking.service was generated on the fly? Based on what?

Say what II? /etc/init.d/networking??? Really? Besides, what are all those /etc/rcN.d/ directories? Are they used for something?

OK, so it goes like this: According to its man page, systemd-sysv-generator (the program that generated these service files) scans through /etc/init.d/* and reads their LSB headers. It probably also scanned /etc/rcS.d, where it found S12networking symlinking to ../init.d/networking. That’s where it got the SysVStartPriority=12 part, I suppose.
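To illustrate my guess about where the 12 comes from, here is a tiny sketch that pulls the two-digit priority out of an rcN.d symlink name. It only demonstrates the naming convention; I haven’t checked that systemd-sysv-generator does it this way, and sysv_priority is a made-up name.

```shell
#!/bin/bash
# Extract the two-digit start priority from an rcN.d symlink name such as
# S12networking. Prints nothing and returns 1 if the name doesn't match.
# sysv_priority is a hypothetical helper for illustration only.
sysv_priority() {
  local name=${1##*/}            # strip the directory part
  if [[ $name =~ ^S([0-9][0-9])(.+)$ ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

sysv_priority /etc/rcS.d/S12networking
```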

So this is how systemd emulated SystemV.

Networking service: What actually happens

  • Systemd calls /etc/init.d/networking start (systemctl status networking.service supplied that info)
  • /etc/init.d/networking runs /etc/default/networking (if it exists), which allows overriding the parameters
  • Then calls “ifup -a”, unless CONFIGURE_INTERFACES has been set to “no”, and with due exclusions
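The decision described in the list above can be modeled in a few lines. This is a simplified sketch and not the real init script: it only prints what would happen, the initial “yes” value is my assumption about the script’s default, and the optional argument for the defaults file exists only to make the function easy to try out.

```shell
#!/bin/bash
# Simplified model of what /etc/init.d/networking decides on "start".
# The real script does more (exclusions etc.); this only prints the action.
# Usage: networking_start_action [DEFAULTS_FILE]
networking_start_action() {
  local defaults=${1:-/etc/default/networking}
  local CONFIGURE_INTERFACES=yes      # assumed default

  # The defaults file, if present, may override the parameters
  [ -r "$defaults" ] && . "$defaults"

  if [ "$CONFIGURE_INTERFACES" = no ]; then
    echo "skip ifup"
  else
    echo "ifup -a"
  fi
}
```

Calling networking_start_action with no argument shows what the machine’s own /etc/default/networking would lead to.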

Windows 10 is Windows 7 with a new-old skin

… with one little difference: It seems like you can install Windows 10 without any product key. According to this post, one can install Windows 10 from scratch, and when prompted for the product key, click “Skip for now”. Twice. The installation will have this “Activate Now” watermark, and personalization will be off. But otherwise, the post says, everything will work fine. Never tried this myself, though.

Either way, it’s the regular Windows 10 you want to download. Not the N or KN or something.

Wanting to be sure that some driver I’ve released will work with Windows 10, I upgraded from Windows 7, where the driver was installed, to Windows 10.

To my great surprise, Windows 10 started with the same desktop, including program shortcuts, all running as before. Only a new look and feel, which resembles Windows 8, just slightly less annoying.

I should mention that at the “Get going fast” stage of the installation, I went for Customize Settings and turned off basically everything. That’s where all the “Send Microsoft X and Y” goes.

The real surprise was that my own driver was already installed and running on the upgraded Windows 10. If I was looking for a sign that everything is the same under the hood, an automatic adoption of an already installed driver is a good one. I don’t think Microsoft would risk doing that unless there was really nothing new.

Needless to say, removing the driver and reinstalling it went as smoothly as always. Same device manager, same everything.

IMPORTANT: For a bare-metal install, boot the USB stick with the ISO image (possibly generated with winusb under Ubuntu) in non-UEFI mode, or the installer refuses to use existing MBR partitions (unless the partition table is GPT anyhow).

VirtualBox installation notes

Jots while installing a fresh Windows 10 on VirtualBox v4.3.12 under 64-bit Linux. Correction: I eventually upgraded to v5.0.12, which lists Windows 10 as a target OS. This was required to make the Windows Addons work.

  • Set OS to Other Windows 32 bit (I suppose other Microsoft selections will work as well)
  • Under processor, enable PAE/NX
  • Attempting a 64-bit installation got stuck on the initial Windows splash image, no matter what I tried (maybe this was solved with 5.0.12, didn’t try this)
  • Turn off screen saver after installation
  • The installation will ask for the product key on two different occasions. Just skip.
  • Didn’t work:
    In order to install VirtualBox Windows Additions, pick Devices > Insert Guest Additions CD Image… on the host’s VirtualBox menu. Then start the virtual machine. VirtualBox v4.3.12 doesn’t support Windows 10, so refuse the automatic run of the CD. Instead, browse the disc’s content and right-click VBoxWindowsAdditions-x86.exe. Choose Properties. Pick the Compatibility tab, check “Run this program in compatibility mode” and pick Windows 8 (as suggested on this post). Then run this program, which will then install the drivers properly. Windows will complain that the display adapter doesn’t work, but that’s fine. Just reboot.

Reading the firmware ROM from a Renesas uPD720202 USB 3.0 Host Controller using Linux

Pretty much as a side note, I should mention that the firmware should and can be loaded with a Windows utility named K2024FWUP1.exe. Get it from wherever you can, and verify it isn’t dirty with

$ shasum K2024FWUP1.exe
c9414cb825af79f5d87bd9772e10e87633fbf125  K2024FWUP1.exe

If this isn’t done, Windows’ Device Manager will say that the device can’t be started, and the Linux kernel will complain with

pci 0000:06:00.0: xHCI HW not ready after 5 sec (HC bug?) status = 0x1801

[...]

xhci_hcd 0000:06:00.0: can't setup: -110
xhci_hcd 0000:06:00.0: USB bus 3 deregistered
xhci_hcd 0000:06:00.0: init 0000:06:00.0 fail, -110
xhci_hcd: probe of 0000:06:00.0 failed with error -110

Now to the Linux part. This is just the series of commands I used to read from the firmware ROM of a Renesas USB controller detected as:

# lspci -s 06:00
06:00.0 USB controller: Renesas Technology Corp. uPD720202 USB 3.0 Host Controller (rev 02)

The point was to check if the ROM was erased (it was). I followed the instructions in the “μPD720201/μPD720202 User’s Manual: Hardware” (R19UH0078EJ0600, Rev.6.00), section 6.

Check if ROM exists:

# setpci -s 06:00.0 f6.w
8000

Bit 15=1, so yes, ROM exists. Check type and parameter:

# setpci -s 06:00.0 ec.l
00c22210
# setpci -s 06:00.0 f0.l
00000500

OK, according to table 6-1 of the Hardware User Manual, it’s a MX25L5121E.

Write magic word to DATA0:

# setpci -s 06:00.0 f8.l=53524F4D

Set “External ROM Access Enable”:

# setpci -s 06:00.0 f6.w=8001

Check “Result Code”:

# setpci -s 06:00.0 f6.w
8001

Indeed, bits 6:4 are zero — no result yet, as required for this stage in the Guide.

Now set Get DATA0 and Get DATA1, and check that they have been cleared:

# setpci -s 06:00.0 f6.w=8c01
# setpci -s 06:00.0 f6.w
8001

Get first piece of data from DATA0:

# setpci -s 06:00.0 f8.l
ffffffff

The ROM appears to be erased… Set Get DATA0 again, and read DATA1 (this is really what the Guide says)

# setpci -s 06:00.0 f6.w=8401
# setpci -s 06:00.0 fc.l
ffffffff

Yet another erased word. And now the other way around: Set Get DATA1 and read DATA0 again:

# setpci -s 06:00.0 f6.w=8801
# setpci -s 06:00.0 f8.l
ffffffff

And the other way around again…

# setpci -s 06:00.0 f6.w=8401
# setpci -s 06:00.0 fc.l
ffffffff

When done, clear “External ROM Access Enable”

# setpci -s 06:00.0 f6.w=8000

This rewinds the next set of operations to the beginning of the ROM, as I’ve seen by trying it out, even though the Guide wasn’t so clear about it. So if the sequence shown above is started over, we read the beginning of the ROM again.
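To make the hex values written to and read from offset f6 above easier to follow, here’s a sketch that splits such a word into the fields used in these notes. Bit 15, bits 6:4 and bit 0 are as mentioned in the text. The positions of Get DATA0 (bit 10) and Get DATA1 (bit 11) are merely what the values 8401 and 8801 imply, so check them against the manual. decode_rom_ctl is a made-up name.

```shell
#!/bin/bash
# Decode the 16-bit word at config offset f6, as used in the sequence above.
# Argument: the value in hex, without a 0x prefix (as setpci prints it).
# Bit positions of get_data0 / get_data1 are inferred from these notes.
decode_rom_ctl() {
  local v=$(( 16#$1 ))
  echo "rom_exists=$(( (v >> 15) & 1 ))"      # bit 15
  echo "get_data1=$(( (v >> 11) & 1 ))"       # bit 11 (inferred from 8801)
  echo "get_data0=$(( (v >> 10) & 1 ))"       # bit 10 (inferred from 8401)
  echo "result_code=$(( (v >> 4) & 7 ))"      # bits 6:4
  echo "access_enable=$(( v & 1 ))"           # bit 0
}

decode_rom_ctl 8c01
```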

Again, with the ROM loaded with firmware

# setpci -s 06:00.0 f6.w
8000
# setpci -s 06:00.0 f8.l=53524F4D
# setpci -s 06:00.0 f6.w=8001
# setpci -s 06:00.0 f6.w
8001
# setpci -s 06:00.0 f6.w=8c01
# setpci -s 06:00.0 f6.w
8001
# setpci -s 06:00.0 f8.l
7da655aa
# setpci -s 06:00.0 f6.w=8401
# setpci -s 06:00.0 fc.l
00f60014
# setpci -s 06:00.0 f6.w=8801
# setpci -s 06:00.0 f8.l
004c010c
# setpci -s 06:00.0 f6.w=8401
# setpci -s 06:00.0 fc.l
2ffc015c
# setpci -s 06:00.0 f6.w=8801
# setpci -s 06:00.0 f8.l
0008315c
# setpci -s 06:00.0 f6.w=8401
# setpci -s 06:00.0 fc.l
1a5c2024
# setpci -s 06:00.0 f6.w=8000

I stopped after a few words, of course. Note that the first word is indeed the correct signature.
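Typing all this by hand gets old quickly, so here’s a sketch that generates the same command sequence for a given number of 32-bit words. It only prints the setpci commands (nothing is written to the device), and it skips the status reads I made along the way. rom_read_plan is a made-up name.

```shell
#!/bin/bash
# Print the setpci sequence used above for reading N 32-bit words from the ROM.
# Usage: rom_read_plan BUS_ADDRESS N
# Nothing is executed; the output can be reviewed and then piped to a shell.
rom_read_plan() {
  local dev=$1 n=$2 i

  echo "setpci -s $dev f8.l=53524F4D"   # magic word to DATA0
  echo "setpci -s $dev f6.w=8001"       # External ROM Access Enable
  echo "setpci -s $dev f6.w=8c01"       # Get DATA0 + Get DATA1

  for ((i = 0; i < n; i++)); do
    if (( i % 2 )); then
      echo "setpci -s $dev f6.w=8401"   # set Get DATA0 ...
      echo "setpci -s $dev fc.l"        # ... and read DATA1
    else
      # The first word needs no extra request (8c01 was just written)
      (( i > 0 )) && echo "setpci -s $dev f6.w=8801"  # set Get DATA1 ...
      echo "setpci -s $dev f8.l"        # ... and read DATA0
    fi
  done

  echo "setpci -s $dev f6.w=8000"       # clear External ROM Access Enable
}

rom_read_plan 06:00.0 4
```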

Cursor control characters in a bash script

To control the cursor’s position with a plain bash “echo” command, use the fact that the $'something' quoting interprets that something more or less like a C string with escape sequences. So the ESC character, having ASCII code 0x1b, can be generated with $'\x1b'. $'\e' is also OK, by the way.

There are plenty of sources for TTY commands, for example this and this.

So, to jump to the upper-left corner of the screen, just go

$ echo -n $'\x1b'[H

Alternatively, one can use echo’s -e flag, which is the method chosen in /etc/init.d/functions to produce color-changing escape characters. So the “home” sequence could likewise be

$ echo -en \\033[H

As easy as that.
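The same idea extends to jumping to an arbitrary position, using the standard ESC [ row ; column H sequence (both counted from 1). A minimal sketch with printf, which interprets \033 by itself; the function name cursor_to is my own:

```shell
#!/bin/bash
# Move the cursor to a given row and column (both 1-based) by emitting
# the ESC [ row ; col H sequence. cursor_to is a hypothetical helper name.
cursor_to() {
  printf '\033[%d;%dH' "$1" "$2"
}
```

For example, cursor_to 1 1 is equivalent to the “home” sequence above, and cursor_to 10 20; echo hello prints at row 10, column 20.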

Using Linux’ setpci to program an EEPROM attached to a PLX / Avago PCIe switch

Introduction

These are my notes as I programmed an Atmel AT25128 EEPROM, attached to a PEX 8606 PCIe switch, using PCIe configuration-space writes only (that is, no I2C / SMBus cable). This is frankly quite redundant, as Avago supplies software tools for doing this.

In fact, in order to get their tools, register at Avago’s site, then make the extra registration in PLX Tech’s site. None of these registrations require signing an NDA. At PLX Tech’s site, pick SDK -> PEX at the bottom of the list of devices to get documentation for, and download the PLX SDK. Among others, this suite includes the PEX Device Editor, which is quite a useful tool regardless of switches, as it gives a convenient tree view of the bus. The Device Editor, as well as other tools, allow programming the EEPROM from the host, with or without an I2C cable.

There are also other tools in the SDK that do the same thing, PLXMon in particular. If you have an Aardvark I2C to USB cable, the PLXMon tool allows reading and writing to the EEPROM through I2C. And there’s a command line interface, probably for all functionality. So really, this is for those who want to get down to the gory details.

All said below will probably work with the entire PEX 86xx family, and possibly with other Avago devices as well. The Data Book is your friend.

The EEPROM format

The organization of data in the EEPROM is outlined in the Data Book, but to keep it short and concise: It’s a sequence of bytes, consisting of a concatenation of the following words, all represented in Little Endian format:

  1. The signature, always 0x5a, occupying one byte
  2. A zero (0x00), occupying one byte
  3. The number of bytes of payload data to come, given as a 16-bit word (two bytes). Or equivalently, the number of registers to be written to, multiplied by 6.
  4. The address of the register to be written to, divided by 4, and ORed with the port number, left shifted by 10 bits. See the data book for how NT ports are addressed. This field occupies 16 bits (two bytes). Or to put it in C’ish:
    unsigned short addr_field = (reg_addr >> 2) | (port << 10)
  5. The data to be written: 32 bits (four bytes)

Items #4 and #5 are repeated for each register write. There is no alignment, so when this stream is organized in 32-bit words, it becomes somewhat inconvenient.

And as the Data Book keeps saying all over the place: If the Debug Control register (at 0x1dc) is written to, it has to be the first entry (occupying bytes 4 to 9 in the stream). Its address representation in the byte stream is 0x0077, for example (or more precisely, the byte 0x77 followed by 0x00).
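Since the lack of alignment makes it easy to get this wrong by hand, here’s a sketch that builds the byte stream from (port, register address, value) triplets, and then regroups it into the 32-bit words that are written through setpci. The function names are made up, and padding the last word with 0xff is my own choice (it’s what an erased EEPROM reads as), not something taken from the Data Book.

```shell
#!/bin/bash
# Print a 16-bit value as two hex bytes, least significant byte first.
le16() { printf '%02x %02x' $(( $1 & 0xff )) $(( ($1 >> 8) & 0xff )); }

# Print a 32-bit value as four hex bytes, least significant byte first.
le32() {
  printf '%02x %02x %02x %02x' \
    $(( $1 & 0xff )) $(( ($1 >> 8) & 0xff )) \
    $(( ($1 >> 16) & 0xff )) $(( ($1 >> 24) & 0xff ))
}

# Build the EEPROM byte stream: signature, zero, payload length, entries.
# Usage: eeprom_stream PORT REG VALUE [PORT REG VALUE ...]
eeprom_stream() {
  local nregs=$(( $# / 3 ))
  local out="5a 00 $(le16 $(( nregs * 6 )))"
  while [ $# -ge 3 ]; do
    local addr=$(( ($2 >> 2) | ($1 << 10) ))   # item #4 in the list above
    out="$out $(le16 $addr) $(le32 $3)"
    shift 3
  done
  echo "$out"
}

# Regroup hex bytes into little-endian 32-bit words, padding with ff
# (the padding value is an assumption, see above).
to_dwords() {
  local bytes=("$@") out=() i
  while (( ${#bytes[@]} % 4 )); do bytes+=(ff); done
  for ((i = 0; i < ${#bytes[@]}; i += 4)); do
    out+=("${bytes[i+3]}${bytes[i+2]}${bytes[i+1]}${bytes[i]}")
  done
  echo "${out[*]}"
}

# One write of 0x000100ff to register 0x204 of port 0
to_dwords $(eeprom_stream 0 0x204 0x000100ff)
```

The last line yields exactly the three words that are fed to the write script in the usage example further down.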

Accessing configuration space registers

Given the following PCI bus setting:

02:00.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:01.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:05.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:07.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:09.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)

In particular, note that the switch’s upstream port 0 is at 02:00.0.

Reading from the Serial EEPROM Buffer register at 264h (as root, of course):

# setpci -s 02:00.0 264.l
00000000

The -s 02:00.0 part selects the device by its bus position (see above).

Note that all arguments as well as return values are given in hexadecimal. An 0x prefix is allowed, but it’s redundant.

Making a dry-run of writing to this register, and verifying nothing happened:

# setpci -Dv -s 02:00.0 264.l=12345678
02:00.0:264 12345678
# setpci -s 02:00.0 0x264.l
00000000

Now let’s write for real:

# setpci -s 02:00.0 264.l=12345678
# setpci -s 02:00.0 264.l
12345678

(Yey, it worked)

Reading from the EEPROM

Reading four bytes from the EEPROM at address 0:

# setpci -s 02:00.0 260.l=00a06000
# setpci -s 02:00.0 264.l
0012005a

The “a0” part above sets the address width explicitly to 2 bytes on each operation. There may be some confusion otherwise, in particular if the device wasn’t detected properly at bringup. The “60” part means “read”.

Just checking the value of the status register after this:

# setpci -s 02:00.0 260.l
00816000

Same, but read from EEPROM address 4. The 13 LSBs of the command word select the 32-bit word to access, so they are used as bits [14:2] of the EEPROM’s byte address. It’s also possible to access higher addresses (see the respective Data Book).

# setpci -s 02:00.0 260.l=00a06001
# setpci -s 02:00.0 264.l
0008c03a

Or, to put it in a simple Bash script, which reads the first 16 DWords (i.e. 64 bytes) from the EEPROM of the switch located at the bus address given as the argument to the script (see example below):

#!/bin/bash

DEVICE=$1

for ((i=0; i<16; i++)); do
  setpci -s $DEVICE 260.l=`printf '%08x' $((i+0xa06000))`
  usleep 100000
  setpci -s $DEVICE 264.l
done

Rather than checking the status bit for the read to be finished, the script waits 100 ms. Quick and dirty solution, but works.

Note: usleep is deprecated as a command-line utility. Instead, odds are that “sleep 0.1” replaces “usleep 100000”. Yes, sleep takes non-integer arguments in non-ancient UNIXes.

Writing to the EEPROM

Important: Writing to the EEPROM, in particular the first word, can make the switch ignore the EEPROM or load faulty data into the registers. On some boards, the EEPROM is essential for the detection of the switch by the host and its enumeration. Consequently, writing junk to the EEPROM can make it impossible to rectify this through the PCIe interface. This can render the PCIe switch useless, unless this is fixed with I2C access.

Before starting to write, the EEPROM’s write enable latch needs to be set. This is done once for each write as follows, regardless of the desired target address:

# setpci -s 02:00.0 260.l=00a0c000

Now we’ll write 0xdeadbeef to the first 4 bytes of the EEPROM.

# setpci -s 02:00.0 264.l=deadbeef
# setpci -s 02:00.0 260.l=00a04000

If another address is desired, add the address in bytes, divided by 4, to the 00a04000 above. The write enable latch command stays the same (no change in the lower bits is required).
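The arithmetic for the word written to register 260h can be wrapped in a small function. This is a sketch based only on the values used in these notes (00a06000 for read, 00a04000 for write, 00a0c000 for the write enable latch); eeprom_cmd is a made-up name, and it assumes the byte address is a multiple of 4 within the range covered by the lower bits.

```shell
#!/bin/bash
# Compute the command word for register 260h, from the values in these notes.
# Usage: eeprom_cmd read|write|wren [BYTE_ADDRESS]
# The byte address is assumed to be a multiple of 4.
eeprom_cmd() {
  local op=$1 byte_addr=${2:-0} base
  case $op in
    read)  base=0x00a06000 ;;
    write) base=0x00a04000 ;;
    wren)  base=0x00a0c000; byte_addr=0 ;;   # same for any target address
    *) echo "unknown operation: $op" >&2; return 1 ;;
  esac
  printf '%08x\n' $(( base + byte_addr / 4 ))
}

eeprom_cmd read 4
```

The output can be used directly, as in setpci -s 02:00.0 260.l=$(eeprom_cmd read 4).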

Here’s an example of the sequence for writing to bytes 4-7 of the EEPROM (all three lines are always required)

# setpci -s 02:00.0 260.l=00a0c000
# setpci -s 02:00.0 264.l=010d0077 # Just any value goes
# setpci -s 02:00.0 260.l=00a04001

Or making a script of this, which writes the arguments from address 0 and on (for those who like to make big mistakes…)

#!/bin/bash

numargs=$#
DEVICE=$1

shift

for ((i=0; i<(numargs-1); i++)); do
  setpci -s $DEVICE 260.l=00a0c000
  setpci -s $DEVICE 264.l=$1
  setpci -s $DEVICE 260.l=`printf '%08x' $((i+0xa04000))`
  usleep 100000
  shift
done

Again, usleep can be replaced with a plain sleep with a non-integer argument. See above.

Example of using these scripts

# ./writeeeprom.sh 02:00.0 0006005a 00ff0081 ffff0001
# ./readeeprom.sh 02:00.0
0006005a
00ff0081
ffff0001
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff

When the EEPROM gets messed up

It’s more than possible that the switch becomes unreachable to the host as a result of messing up the EEPROM’s registers, for example by changing the upstream port setting. A simple way out, if a blank EEPROM is good enough for talking with the switch, is to force the EEPROM to go undetected by e.g. short-circuiting the EEPROM’s SO pin (pin number 2 on AT25128) to ground with a 33 Ohm resistor or so. This prevents the data from being loaded, but the commands above will nevertheless work, so the content can be altered. Yet another “dirty, but works” solution.

Moving a Windows 7-installed hard disk to a new computer

This has been documented elsewhere, but it’s important enough to have a note about here.

In short, before switching to new hardware, it’s essential to prepare the Windows installation for that, or an 0x0000007b blue screen will occur on the new hardware.

The trick is to run sysprep.exe (under windows\system32\sysprep\) before the transition. Have “Generalize” checked, and choose “shutdown” at the end of the operation (“Shutdown Options”).

Once the computer shuts down, move the hard disk to the new computer. Windows should boot smoothly, and start a series of installation stages, including feeding the license key and language settings. Also, an account needs to be created. This account can be deleted afterwards, as the old account is kept. Quite silly, as a matter of fact.

 

Linux kernel hack for calming down a flood of PCIe AER messages

While working on a project involving a custom PCIe interface, Linux’ message log became flooded with messages like

pcieport 0000:00:1c.6:   device [8086:a116] error status/mask=00001081/00002000
pcieport 0000:00:1c.6:    [ 0] Receiver Error
pcieport 0000:00:1c.6:    [ 7] Bad DLLP
pcieport 0000:00:1c.6:    [12] Replay Timer Timeout
pcieport 0000:00:1c.6:   Error of this Agent(00e6) is reported first
pcieport 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
pcieport 0000:02:00.0:   device [10b5:8606] error status/mask=00003000/00002000
pcieport 0000:02:00.0:    [12] Replay Timer Timeout
pcieport 0000:00:1c.6: AER: Corrected error received: id=00e6
pcieport 0000:00:1c.6: can't find device of ID00e6
pcieport 0000:00:1c.6: AER: Corrected error received: id=00e6
pcieport 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)

And before long, some 400 MB of log messages accumulated in /var/log/messages. In this context, they are merely informative AER (Advanced Error Reporting) messages, telling me that errors have occurred in the link between the computer’s PCIe controller and the PCIe switch on the custom board. But all of these errors were correctable (presumably with retransmits) so from a functional standpoint, the hardware worked.

Advanced Error Reporting and its Linux driver were explained in OLS 2007 (pdf).

Had it not been for these messages, I could have been misled to think that all was fine, even though there’s a method to tell, which I’ve dedicated an earlier post to. So they’re precious, but they flood the system logs, and even worse, the system is so busy handling them that the boot is slowed down, and sometimes the boot process got stuck in the middle.

At first I thought that it would be enough to just turn off the logging of these messages, but it seems like the flood of interrupts was the problem.

So one way out is to disable the handler of AER altogether: Use the pci=noaer kernel parameter on boot, or disable the CONFIG_PCIEAER kernel configuration flag, and recompile the kernel. This removes the piece of code that configures the computer’s root port to send interrupts if and when an AER message arrives, but that way I won’t be alerted that a problem exists.

So I went for hacking the kernel code. In an early attempt, I went for not producing error messages for each event, but to keep it down to no more than 5 per second. It worked in the sense that the log wasn’t flooded, but didn’t solve the problem of a slow or impossible boot. As mentioned earlier, the core problem seems to be a bombardment of interrupts.

So the hack that eventually did the job for me tells the root port to stop generating interrupts after 100 kernel messages have been produced. That’s enough to inform me that there’s a problem, and give me an idea of where it is, but it stops soon enough to let the system live.

The only file I modified was drivers/pci/pcie/aer/aerdrv_errprint.c on a 4.2.0 Linux kernel. In retrospect, I could have done it more elegantly. But hey, now that it works, why should I care…?

It goes like this: I defined a static variable, countdown, and initialized it to 100. Before a message is produced, a piece of code like this runs:

	if (!countdown--)
		aer_enough_is_enough(dev);

aer_enough_is_enough() is merely a copy of aerdrv.c’s aer_disable_rootport(), which is defined as static there, and requires an uncomfortable argument. It would have made more sense to make aer_disable_rootport() a wrapper of another function, which could have been used both by aerdrv.c and my little hack. That would have been much more elegant.

Instead, I copied two additional static functions that are required by aer_disable_rootport() into aerdrv_errprint.c, and ended up with an ugly hack that solves the problem.
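A word on the “if (!countdown--)” test: the condition is evaluated before the decrement, so it’s true exactly once, on the call that finds the counter at zero. That’s call number 101, meaning that 100 messages get through before the interrupts are turned off. A quick simulation in bash arithmetic, which has the same post-decrement semantics as C (the loop length of 300 is arbitrary):

```shell
#!/bin/bash
# Simulate 300 calls to a print function guarded by "if (!countdown--)".
countdown=100
fired_at=()

for ((call = 1; call <= 300; call++)); do
  # Same as the C expression !countdown--: compare first, then decrement
  if (( countdown-- == 0 )); then
    fired_at+=("$call")
  fi
done

# Lists the call numbers on which the "enough is enough" function would run
echo "${fired_at[*]}"
```

The counter keeps going down into negative numbers afterwards, which is why the function doesn’t run again on later calls.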

With all due shame, here are the changes in patch format. It’s not intended to apply on your kernel as is. It’s more intended to be a guideline to how to get it done. And by all means, take a look at aerdrv.c’s relevant functions, and see if they’re different, by any chance.

From b007850486167288ea4c6c6a1bf30ddd1a299f24 Mon Sep 17 00:00:00 2001
From: Eli Billauer <my-mail@gmail.com>
Date: Sat, 17 Oct 2015 07:37:19 +0300
Subject: [PATCH] PCIe AER handler: Turn off interrupts from root port after 100 messages

---
 drivers/pci/pcie/aer/aerdrv_errprint.c |   78 ++++++++++++++++++++++++++++++++
 1 files changed, 78 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 167fe41..31a8572 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -20,6 +20,7 @@
 #include <linux/pm.h>
 #include <linux/suspend.h>
 #include <linux/cper.h>
+#include <linux/pcieport_if.h>

 #include "aerdrv.h"
 #include <ras/ras_event.h>
@@ -129,6 +130,74 @@ static const char *aer_agent_string[] = {
 	"Transmitter ID"
 };

+/* Two functions copied from aerdrv.c, to prevent name space pollution */
+
+static int set_device_error_reporting(struct pci_dev *dev, void *data)
+{
+	bool enable = *((bool *)data);
+	int type = pci_pcie_type(dev);
+
+	if ((type == PCI_EXP_TYPE_ROOT_PORT) ||
+	    (type == PCI_EXP_TYPE_UPSTREAM) ||
+	    (type == PCI_EXP_TYPE_DOWNSTREAM)) {
+		if (enable)
+			pci_enable_pcie_error_reporting(dev);
+		else
+			pci_disable_pcie_error_reporting(dev);
+	}
+
+	if (enable)
+		pcie_set_ecrc_checking(dev);
+
+	return 0;
+}
+
+/**
+ * set_downstream_devices_error_reporting - enable/disable the error reporting  bits on the root port and its downstream ports.
+ * @dev: pointer to root port's pci_dev data structure
+ * @enable: true = enable error reporting, false = disable error reporting.
+ */
+static void set_downstream_devices_error_reporting(struct pci_dev *dev,
+						   bool enable)
+{
+	set_device_error_reporting(dev, &enable);
+
+	if (!dev->subordinate)
+		return;
+	pci_walk_bus(dev->subordinate, set_device_error_reporting, &enable);
+}
+
+/* Allow 100 messages, and then stop it. Since the print functions are called
+   from a work queue, it's safe to call anything, aer_disable_rootport()
+   included. */
+
+static int countdown = 100;
+
+/* aer_enough_is_enough() is a copy of aer_disable_rootport(), only the
+   latter requires to get the aer_rpc structure from the pci_dev structure,
+   and then uses it to get the pci_dev structure. So enough with that too.
+*/
+
+static void aer_enough_is_enough(struct pci_dev *pdev)
+{
+	u32 reg32;
+	int pos;
+
+	dev_err(&pdev->dev, "Exceeded limit of AER errors to report. Turning off Root Port interrupts.\n");
+
+	set_downstream_devices_error_reporting(pdev, false);
+
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+	/* Disable Root's interrupt in response to error messages */
+	pci_read_config_dword(pdev, pos + PCI_ERR_ROOT_COMMAND, &reg32);
+	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
+	pci_write_config_dword(pdev, pos + PCI_ERR_ROOT_COMMAND, reg32);
+
+	/* Clear Root's error status reg */
+	pci_read_config_dword(pdev, pos + PCI_ERR_ROOT_STATUS, &reg32);
+	pci_write_config_dword(pdev, pos + PCI_ERR_ROOT_STATUS, reg32);
+}
+
 static void __print_tlp_header(struct pci_dev *dev,
 			       struct aer_header_log_regs *t)
 {
@@ -168,6 +237,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	int layer, agent;
 	int id = ((dev->bus->number << 8) | dev->devfn);

+	if (!countdown--)
+		aer_enough_is_enough(dev);
+
 	if (!info->status) {
 		dev_err(&dev->dev, "PCIe Bus Error: severity=%s, type=Unaccessible, id=%04x(Unregistered Agent ID)\n",
 			aer_error_severity_string[info->severity], id);
@@ -200,6 +272,9 @@ out:

 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
 {
+	if (!countdown--)
+		aer_enough_is_enough(dev);
+
 	dev_info(&dev->dev, "AER: %s%s error received: id=%04x\n",
 		info->multi_error_valid ? "Multiple " : "",
 		aer_error_severity_string[info->severity], info->id);
@@ -226,6 +301,9 @@ void cper_print_aer(struct pci_dev *dev, int cper_severity,
 	u32 status, mask;
 	const char **status_strs;

+	if (!countdown--)
+		aer_enough_is_enough(dev);
+
 	aer_severity = cper_severity_to_aer(cper_severity);

 	if (aer_severity == AER_CORRECTABLE) {
--
1.7.2.3

And again: it's given as a patch, but it's really not intended to be applied as is. If you need to do this yourself, read through the patch, understand what it does, and make the equivalent changes to your own kernel's sources. Otherwise, your system may just hang.

syslogd notes

A few jots on playing with the system logger (the one that writes to /var/log/messages) on an ancient CentOS 5.5.

First, check the version. The log says:

Oct  6 15:12:06 diskless syslogd 1.4.1: restart.

So it's quite an old revision of syslogd, unfortunately, and there are no filter conditions to rely on.

The relevant configuration file is /etc/syslog.conf. First, one may divert the kernel's log messages from /var/log/messages to /var/log/kernel-junk by changing

*.info;mail.none;authpriv.none;cron.none                /var/log/messages

to

*.info;mail.none;authpriv.none;cron.none;kern.none              /var/log/messages

kern.*                                                          /var/log/kernel-junk

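Or, alternatively, keep kernel messages of warning level and up in /var/log/messages as well, so that only the less-than-warning ones end up in kernel-junk alone (written with lazy flushing, which is what the leading dash requests):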

*.info;mail.none;authpriv.none;cron.none;kern.none;kern.warn		/var/log/messages

kern.*							-/var/log/kernel-junk

The trick is that kern.none disables all kernel messages to /var/log/messages. The kern.warn that follows turns warnings and up back on. kernel-junk still gets every kernel message.
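To make the routing explicit, here's a toy model of where a kernel message of each severity ends up under the second configuration. It's only a sketch of the rule as described above: kernel_destinations is a made-up helper name, and nothing here talks to syslogd or reads /etc/syslog.conf.

```shell
# Toy model of "kern.none;kern.warn" on /var/log/messages plus "kern.*"
# on kernel-junk. kernel_destinations is a hypothetical helper for
# illustration only, not a part of syslogd.
kernel_destinations() {
    case "$1" in
        debug|info|notice)
            # Below warning: kept out of /var/log/messages
            echo "/var/log/kernel-junk" ;;
        warning|err|crit|alert|emerg)
            # Warning and up: turned back on by kern.warn
            echo "/var/log/messages /var/log/kernel-junk" ;;
        *)
            echo "unknown severity: $1" >&2
            return 1 ;;
    esac
}

# List the severities from least to most severe
for sev in debug info notice warning err crit alert emerg; do
    printf '%-8s -> %s\n' "$sev" "$(kernel_destinations "$sev")"
done
```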

Hexdump notes

General notes

For a plain byte-per-byte hex dump:

$ hexdump -C

To dump a limited number of bytes, use the -n flag:

$ hexdump -C -n 64 /dev/urandom
00000000  9c 72 b0 43 da 6e 27 2f  f9 f1 34 06 60 d5 71 ad  |.r.C.n'/..4.`.q.|
00000010  cc 07 89 02 f7 f9 5f 85  f6 ba a5 24 cc 9f 2d d5  |......_....$..-.|
00000020  6d da 5b 91 a6 23 d4 94  51 1d 96 a7 5c 34 1a 48  |m.[..#..Q...\4.H|
00000030  6e 13 d4 3a 54 5d c5 c4  7b 1e f3 7b 6f 84 af 8b  |n..:T]..{..{o...|
00000040

And possibly add the -v flag, so that repeated lines are printed out explicitly instead of being collapsed into a "*":

$ hexdump -C -n 64 /dev/zero
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040
$ hexdump -C -v -n 64 /dev/zero
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040

Hexdump scripting

Hexdump has a somewhat weird one-liner scripting syntax. It consists of one or more -e flags, each followed by a format string, which should be enclosed in single quotes so the shell leaves it alone. Within this string, there may be several double-quoted strings containing the formatting info. Probably, the only way to really figure this out is by trying some examples.

  • Everything in the expression runs as a loop.
  • n/m (n and m are integers) means: apply the format that follows immediately n times, consuming m bytes each time.
  • If there is more than one -e, each of them is applied to the same input data.
  • %08_ax is the data offset in hex. Also try “%10_ad: ” for decimal position.
  • Anything not interpreted is printed (a bit like printf). That includes, of course, “\n”.
  • For editing hex data, ghex can be handy
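To see these rules at work, here's a small sketch on eight known bytes. The function names (sample, one_pass, two_pass) are made up for this example. Each command should print two lines, one per group of four bytes, at offsets 00000000 and 00000004. Don't count on the exact spacing at the end of a repeated format, as hexdump has its own ideas about trailing whitespace.

```shell
# Eight known bytes, so the output is easy to check by eye.
sample() { printf 'ABCDEFGH'; }

# One -e: the offset, then four single bytes in hex, then a newline.
# The whole expression repeats until the input runs out.
one_pass() { sample | hexdump -v -e '"%08_ax: " 4/1 "%02x " "\n"'; }

# Two -e's: both walk over the same four bytes in each round,
# first as hex, then as printable characters.
two_pass() { sample | hexdump -v -e '4/1 "%02x "' -e '"  " 4/1 "%_p" "\n"'; }

one_pass
two_pass
```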

Scripting examples

Print out the input as 32-bit hex integers, one per line:

$ hexdump -v -e '1/4 "%08x " "\n"'

Same, but as 32-bit decimal numbers:

$ hexdump -v -e '1/4 "%08d " "\n"'
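One thing worth keeping in mind with multi-byte units: hexdump assembles each 32-bit word in the host's byte order, so the digits don't necessarily appear in file order. A quick check, as a sketch (as_words and as_bytes are names made up for this example):

```shell
# Four bytes, 01 02 03 04 in file order (octal escapes, for portability).
four_bytes() { printf '\001\002\003\004'; }

# As one 32-bit word: a little-endian host (x86, for example) prints
# 04030201, a big-endian host prints 01020304.
as_words() { four_bytes | hexdump -v -e '1/4 "%08x" "\n"'; }

# One byte at a time always follows file order: 01020304.
as_bytes() { four_bytes | hexdump -v -e '4/1 "%02x" "\n"'; }

as_words
as_bytes
```

So if the byte order in the file matters, dumping byte by byte is the safer choice.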

Dump mouse raw motion data, three bytes per line, each as a hex number:

$ hexdump -v -e '3/1 "%02x " "\n"' /dev/input/mice

Roughly like “hexdump -C”, only spelled out explicitly:

$ hexdump -e '"%08_ax " 16/1 "%02x "' -e '" |" 16/1 "%_p" "|\n"'
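For a side-by-side look, the following sketch runs the hand-written format and the real -C on the same 16 bytes (sample and dump_custom are names made up for this example). The two aren't identical: judging by the -C dumps earlier in these notes, -C puts extra spacing between the two groups of eight bytes and ends with a line holding the final offset, neither of which this format does.

```shell
# Exactly 16 bytes of input, so each dump is a single full line.
sample() { printf 'Hello, hexdump!\n'; }

# The hand-written format, reading from standard input.
dump_custom() {
    hexdump -e '"%08_ax " 16/1 "%02x "' -e '" |" 16/1 "%_p" "|\n"'
}

echo "custom:"
sample | dump_custom
echo "-C:"
sample | hexdump -C
```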

The manpage offers a lot more detail on this.