The sledge hammer: Forcing a permanent screen resolution mode on Linux

When to do this

Because Gnome desktop is sure it knows what’s best for me, and it’s virtually impossible to just tell it that I want this screen resolution mode and no other, there is only one option left: Lie about the monitor’s graphics mode capabilities. Make the kernel feed it with fake screen information (EDID), that basically says “there is only this resolution”. Leave it with one choice only.

What is EDID? It’s a tiny chunk of information that is stored on a small EEPROM memory on the monitor. The graphics card fetches this blob through two I2C wires on the cable, and deduces from it what graphics mode (with painfully detailed timing parameters) the monitor supports. It’s that little hex blob that appears when you go xrandr –verbose.

I should mention a post in Gentoo forum, which suggests making X ignore EDID info by using

Option       "UseEDID" "false"
Option       "UseEDIDFreqs" "false"

in /etc/X11/xorg.conf, or is it a file in /usr/share/X11/xorg.conf.d/? And then just set the screen mode old-school. Didn’t bother to check this. There are too many players in this game. Faking EDID seemed to be a much better idea than to ask politely not to consider it.

How to feed a fake EDID

The name of the game is Kernel Mode Setting (KMS). Among others, it allows loading a file from /lib/firmware which is used as the screen information (EDID) instead of getting it from the screen.

For this to work, the CONFIG_DRM_LOAD_EDID_FIRMWARE kernel compilation must be enabled (set to “y”).

Note that unless Early KMS is required, the firmware file is loaded after the initramfs stage. In other words, it’s not necessary to push the fake EDID file into the initramfs, but it’s OK to have it present only in the filesystem that is mounted after the initramfs.

The EDID file should be stored in /lib/firmware/edid (create the directory if necessary) and the following command should be added to the kernel command line:

drm_kms_helper.edid_firmware=edid/fake_edid.bin

(for kernels 4.15 and later, there’s a drm.edid_firmware parameter that is supposed to be better in some way).

Generating a custom EDID file

I needed a special graphics mode to solve a problem with my OLED screen. Meaning I had to cook my own EDID file. It turned out quite easy, actually.

The kernel’s doc for this is Documentation/admin-guide/edid.rst

In the kernel’s tools/edid, edit one of the asm files (e.g. 1920x1080.S) and set the parameters to the correct mode. This file has just defines. The actual data format is produced in edid.S, which is included at the bottom. The output in this case is 1920x1080.bin. Note that the C file (1920x1080.c) is an output as well in this case — for reference of some other use, I guess.

And then just type “make” in tools/edid/ (don’t compile the kernel, that’s really not necessary for this).

The numbers in the asm file are in a slightly different notation, as explained in the kernel doc. Not a big deal to figure out.

In my case, I translated this xrandr mode line

  oledblack (0x10b) 173.000MHz -HSync +VSync
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

to this:

/* EDID */
#define VERSION 1
#define REVISION 3

/* Display */
#define CLOCK 173000 /* kHz */
#define XPIX 1920
#define YPIX 1080
#define XY_RATIO XY_RATIO_16_9
#define XBLANK 656
#define YBLANK 40
#define XOFFSET 128
#define XPULSE 200
#define YOFFSET 3
#define YPULSE 5
#define DPI 96
#define VFREQ 60 /* Hz */
#define TIMING_NAME "Linux FHD"
/* No ESTABLISHED_TIMINGx_BITS */
#define HSYNC_POL 0
#define VSYNC_POL 0

#include "edid.S"

There seems to be a distinction between standard resolution modes and those that aren’t. I got away with this, because 1920x1080 is a standard mode. It may be slightly trickier with a non-standard mode.

When it works

This is what it looks like when all is well. First, the kernel logs. In my case, because I didn’t put the file in the initramfs, loading it fails twice:

[    3.517734] platform HDMI-A-3: Direct firmware load for edid/1920x1080.bin failed with error -2
[    3.517800] [drm:drm_load_edid_firmware [drm_kms_helper]] *ERROR* Requesting EDID firmware "edid/1920x1080.bin" failed (err=-2)

and again:

[    4.104528] platform HDMI-A-3: Direct firmware load for edid/1920x1080.bin failed with error -2
[    4.104580] [drm:drm_load_edid_firmware [drm_kms_helper]] *ERROR* Requesting EDID firmware "edid/1920x1080.bin" failed (err=-2)

But then, much later, it loads properly:

[   19.864966] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"
[   93.298915] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"
[  109.573124] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"
[ 1247.290084] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"

Why several times? Well, the screen resolution is probably set up several times as the system goes up. There’s clearly a quick screen flash a few seconds after the desktop goes up. I don’t know exactly why, and at this stage I don’t care. The screen is at the only mode allowed, and that’s it.

And now to how xrandr sees the situation:

$ xrandr -d :0 --verbose
[ ... ]
HDMI3 connected primary 1920x1080+0+0 (0x10c) normal (normal left inverted right x axis y axis) 500mm x 281mm
 Identifier: 0x48
 Timestamp:  21339
 Subpixel:   unknown
 Gamma:      1.0:1.0:1.0
 Brightness: 1.0
 Clones:   
 CRTC:       0
 CRTCs:      0
 Transform:  1.000000 0.000000 0.000000
 0.000000 1.000000 0.000000
 0.000000 0.000000 1.000000
 filter:
 EDID:
 00ffffffffffff0031d8000000000000
 051601036d321c78ea5ec0a4594a9825
 205054000000d1c00101010101010101
 010101010101944380907238284080c8
 3500f41911000018000000ff004c696e
 75782023300a20202020000000fd003b
 3d424412000a202020202020000000fc
 004c696e7578204648440a2020200045
 aspect ratio: Automatic
 supported: Automatic, 4:3, 16:9
 Broadcast RGB: Automatic
 supported: Automatic, Full, Limited 16:235
 audio: auto
 supported: force-dvi, off, auto, on
 1920x1080 (0x10c) 173.000MHz -HSync -VSync *current +preferred
 h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
 v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

Compare the EDID part with 1920x1080.c, which was created along with the binary:

{
 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00,
 0x31, 0xd8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
 0x05, 0x16, 0x01, 0x03, 0x6d, 0x32, 0x1c, 0x78,
 0xea, 0x5e, 0xc0, 0xa4, 0x59, 0x4a, 0x98, 0x25,
 0x20, 0x50, 0x54, 0x00, 0x00, 0x00, 0xd1, 0xc0,
 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x94, 0x43,
 0x80, 0x90, 0x72, 0x38, 0x28, 0x40, 0x80, 0xc8,
 0x35, 0x00, 0xf4, 0x19, 0x11, 0x00, 0x00, 0x18,
 0x00, 0x00, 0x00, 0xff, 0x00, 0x4c, 0x69, 0x6e,
 0x75, 0x78, 0x20, 0x23, 0x30, 0x0a, 0x20, 0x20,
 0x20, 0x20, 0x00, 0x00, 0x00, 0xfd, 0x00, 0x3b,
 0x3d, 0x42, 0x44, 0x12, 0x00, 0x0a, 0x20, 0x20,
 0x20, 0x20, 0x20, 0x20, 0x00, 0x00, 0x00, 0xfc,
 0x00, 0x4c, 0x69, 0x6e, 0x75, 0x78, 0x20, 0x46,
 0x48, 0x44, 0x0a, 0x20, 0x20, 0x20, 0x00, 0x45,
};

So it definitely took the bait.

Root over NFS remains read only with Linux v5.7

Upgrading the kernel should be quick and painless…

After upgrading the kernel from v5.3 to 5.7, a lot of systemd services failed (Debian 8), in particular systemd-remount-fs:

● systemd-remount-fs.service - Remount Root and Kernel File Systems
   Loaded: loaded (/lib/systemd/system/systemd-remount-fs.service; static)
   Active: failed (Result: exit-code) since Sun 2020-07-26 15:28:15 IDT; 17min ago
     Docs: man:systemd-remount-fs.service(8)

http://www.freedesktop.org/wiki/Software/systemd/APIFileSystems

  Process: 223 ExecStart=/lib/systemd/systemd-remount-fs (code=exited, status=1/FAILURE)
 Main PID: 223 (code=exited, status=1/FAILURE)

Jul 26 15:28:15 systemd[1]: systemd-remount-fs.service: main process exited, code=exited, status=1/FAILURE
Jul 26 15:28:15 systemd[1]: Failed to start Remount Root and Kernel File Systems.
Jul 26 15:28:15 systemd[1]: Unit systemd-remount-fs.service entered failed state.

and indeed, the root NFS remained read-only (checked with “mount” command), which explains why so many other services failed.

After an strace session, I managed to nail down the problem: The system call to mount(), which was supposed to do the remount, simply failed:

mount("10.1.1.1:/path/to/debian-82", "/", 0x61a250, MS_REMOUNT, "addr=10.1.1.1") = -1 EINVAL (Invalid argument)

On the other hand, any attempt to remount another read-only NFS mount, which had been mounted the regular way (i.e. after boot) went through clean, of course:

mount("10.1.1.1:/path/to/debian-82", "/mnt/tmp", 0x61a230, MS_REMOUNT, "addr=10.1.1.1") = 0

The only apparent difference between the two cases is the third argument, which is ignored for MS_REMOUNT according to the manpage.

The manpage also says something about the EINVAL return value:

EINVAL A remount operation (MS_REMOUNT) was attempted, but source was not already mounted on target.

A hint to the problem could be that the type of the mount, as listed in /proc/mounts, is “nfs” for the root mounted filesystem, but “nfs4″ for the one in /mnt/tmp. The reason for this difference isn’t completely clear.

The solution

So it’s all about that little hint: If the nfsroot is selected to boot as version 4, then there’s no problem remounting it. Why it made a difference from one kernel version to another is beyond me. So the fix is to add nfsvers=4 to the nfsroot assignment. Something like

root=/dev/nfs nfsroot=10.1.1.1:/path/to/debian-82,nfsvers=4

For the record, I re-ran the remount command with strace again, and exactly the same system call was made, including that most-likely-ignored 0x61a250 argument, and it simply returned success (zero) instead of EINVAL.

As a side note, the rootfstype=nfs in the kernel command line is completely ignored. Write any junk instead of “nfs” and it makes no difference.

Another yak shaved successfully.

Linux kernel OOPS dissection notes

What’s this

Every now and then I find myself looking at an Oops or kernel panic, reminding myself how to approach this. So this is where I write down the jots as I go. This isn’t very organized.

Disassembly

  • First thing first: Disassemble the relevant parts:
    $ objdump -DS driver.ko > ~/Desktop/driver.asm
  • Doing this on the .o or .ko file gives exactly the same result. Like diff-exactly.
  • Or if the region of interest belongs to the kernel itself, this can be done (in the kernel tree after a matching compilation):
    $ objdump -DS kernel/locking/spinlock.o > ~/Desktop/spinlock.asm
  • Or, directly on the entire kernel image (at the root if the kernel tree of a compiled kernel). This doesn’t just saves looking up where the relevant function is defined (which object file), but the labels used in the function will be correct, even when using -d instead of -DS.
    $ objdump -d vmlinux

    Then search for the function with a colon at the end, so it matches the beginning of the function, and not references to it. E.g.
    <_raw_spin_lock_irqsave>:

  • The -DS flag adds inline source in the disassembly when possible. If it fails, go for plain -d instead.
  • With the -d flag, usage of labels (in particular calls to functions) outside the disassembled module will appear to be to the address following the command, because the address after the opcode is zero. The disassembly is done on binary that is before linking.

Stack trace

  • The offending command is where the RIP part points at. It’s given in the same hex format as the stack trace.
  • The stack trace contains the offset points in a function (i.e. after a label) to after the call to the function. It’s label+hex/hex.
  • The hex notion is relative to a label as seen in the disassembly (the title row before each segment). The first offset number is the offset relative to the label, and the second is the length of the function (i.e. the offset to the next label).
  • <IRQ> and </IRQ> markups show the part that ran as an interrupt (not surprisingly).
  • In x86 assembly notation, the first argument is source, the second is destination.
And here’s just a sample of a stack trace within an IRQ:
<IRQ>
dump_stack+0x46/0x59
__report_bad_irq+0x40/0xa9
note_interrupt+0x1c9/0x217
handle_irq_event_percpu+0x4c/0x6a
handle_irq_event+0x2e/0x4c
handle_fasteoi_irq+0x9b/0xff
handle_irq+0x19/0x1c
do_IRQ+0x61/0x106
common_interrupt+0xf/0xf
</IRQ>

To be continued, of course…

 

Turning off DSN on sendmail to prevent backscatter

I sent that?

One morning, I got a bounce message from my own mail sendmail server, saying that it failed to deliver a message I never sent. That’s red alert. It means that someone managed to provoke my mail server to send an outbound message. It’s red alert, because my mail server effectively relays spam to any destination that the spammer chooses. This could ruin the server’s reputation horribly.

It turned out that an arriving mail required a return receipt, which was destined to just some mail address. There’s an SMTP feature called Delivery Status Notification (DSN), which allows the client connecting to the mail server to ask for a mail “in return”, informing the sender of the mail if it was properly delivered. The problem is that the MAIL FROM / From addresses could be spoofed, pointing at a destination to spam. Congratulations, your mail server was just tricked into sending spam. This kind of trickery is called backscatter.

Checking my own mail logs, the DSN is a virtually unused feature. So it’s probably just something spammers can take advantage of.

The relevant RFC for DSN is RFC1891. Further explanations on DSN can be found in one of sendmail’s tutorial pages.

How to turn DSN off

First, I recommend checking if it’s not disabled already, as explained below. In particular, if the paranoid-level “goaway” privacy option is used, DSN is turned off anyhow.

It’s actually easy. Add the noreceipts option to PrivacyOptions. More precisely, edit /etc/mail/sendmail.mc and add noreceipts to the list of already existing options. In my case, it ended up as

define(`confPRIVACY_FLAGS',dnl
`needmailhelo,needexpnhelo,needvrfyhelo,restrictqrun,restrictexpand,nobodyreturn,noetrn,noexpn,novrfy,noactualrecipient,noreceipts')dnl

and then run “make” in /etc/mail, and restart sendmail.

Turning off DSN is often recommended against in different sendmail guides, because it’s considered a “valuable feature” or so. As mentioned above, I haven’t seen it used by anyone else than spammers.

Will my mail server do DSN?

Easy to check, because the server announces its willingness to fulfill DSN requests at the beginning of the SMTP session, with the line marked in red in the sample session below:

<<< 220 mx.mymailserver.com ESMTP MTA; Wed, 15 Jul 2020 10:22:32 GMT
>>> EHLO localhost.localdomain
<<< 250-mx.mymailserver.com Hello 46-117-33-227.bb.netvision.net.il [46.117.33.227], pleased to meet you
<<< 250-ENHANCEDSTATUSCODES
<<< 250-PIPELINING
<<< 250-8BITMIME
<<< 250-SIZE
<<< 250-DSN
<<< 250-DELIVERBY
<<< 250 HELP
>>> MAIL FROM:<spamvictim@billauer.co.il>
<<< 250 2.1.0 <spamvictim@billauer.co.il>... Sender ok
>>> RCPT TO:<legal_address@billauer.co.il> NOTIFY=SUCCESS
<<< 250 2.1.5 <legal_address@billauer.co.il>... Recipient ok
>>> DATA
<<< 354 Enter mail, end with "." on a line by itself
>>> MIME-Version: 1.0
>>> From: spamvictim@billauer.co.il
>>> To: legal_address@billauer.co.il
>>> Subject: Testing email.
>>>
>>>
>>> Just a test, please ignore
>>> .
<<< 250 2.0.0 06FAMWa1014200 Message accepted for delivery
>>> QUIT
<<< 221 2.0.0 mx.mymailserver.com closing connection

To test a mail server for its behavior with DSN, the script that I’ve already published can be used. To make it request a return receipt, the two lines that set the SMTP recipient should be changed to

  die("Failed to set receipient\n")
    if (! ($smtp->recipient( ($to_addr ), { Notify => ['SUCCESS'] } ) ) );

This change causes the NOTIFY=SUCCESS part in the RCPT TO line, which effectively requests a receipt from the server when the mail is properly delivered.

Note that if DSN isn’t supported by the mail server (possibly because of the privacy option fix shown above), the SMPT session looks exactly the same, except that the SMTP line marked in red will be absent. Then the mail server just ignores the NOTIFY=SUCCESS part silently, and responds exactly as before.

However when running the Perl script, the Net::SMTP will be kind enough to issue a warning to its stderr:

Net::SMTP::recipient: DSN option not supported by host at ./testmail.pl line 36.

The mail addresses I used in the sample session above are bogus, of courses, but note that the spam victim is the sender of the email, because that’s where the return receipt goes. On top of that, the RCPT TO address will also get a spam message, but that’s the smaller problem, as it’s yet another spam message arriving — not one that is sent away from our server.

I should also mention that Notify can be a comma-separated list of events, e.g.

RCPT TO:<bad_address@billauer.co.il> NOTIFY=SUCCESS,FAILURE,DELAY

however FAILURE doesn’t include the user not being known to the server, in which case the message is dropped anyhow without any DSN message generated. So as a spam trick, one can’t send mails to random addresses, and issue spam bounce messages because they failed. That would have been too easy.

In the mail logs

The sample session shown above causes the following lines in mail.log. Note the line marked in red, which indicates that the return receipt mechanism was fired off.

Jul 15 10:15:31 sm-mta[12697]: 06FAFTbL012697: from=<spamvictim@billauer.co.il>, size=121, class=0, nrcpts=1, msgid=<202007151015.06FAFTbL012697@mx.mymailserver.com>, proto=ESMTP, daemon=IPv4-port-587, relay=46-117-33-227.bb.netvision.net.il
[46.117.33.227]
Jul 15 10:15:31 sm-mta[12698]: 06FAFTbL012697: to=<legal_address@billauer.co.il>, ctladdr=<spamvictim@billauer.co.il> (1010/500), delay=00:00:01, xdelay=00:00:00, mailer=local, pri=30456, dsn=2.0.0, stat=Sent
Jul 15 10:15:31 sm-mta[12698]: 06FAFTbL012697: 06FAFVbL012698: DSN: Return receipt
Jul 15 10:15:31 sm-mta[12698]: 06FAFVbL012698: to=<spamvictim@billauer.co.il>, delay=00:00:00, xdelay=00:00:00, mailer=local, pri=30000, dsn=2.0.0, stat=Sent

The receipt

Since I’m at it, this is what a receipt message for the sample session above looks like:

Received: from localhost (localhost)	by mx.mymailserver.com
 (8.14.4/8.14.4/Debian-8+deb8u2) id 06FAFVbL012698;	Wed, 15 Jul 2020
 10:15:31 GMT
Date: Wed, 15 Jul 2020 10:15:31 GMT
From: Mail Delivery Subsystem <MAILER-DAEMON@billauer.co.il>
Message-ID: <202007151015.06FAFVbL012698@mx.mymailserver.com>
To: <spamvictim@billauer.co.il>
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status;
 boundary="06FAFVbL012698.1594808131/mx.mymailserver.com"
Subject: Return receipt
Auto-Submitted: auto-generated (return-receipt)
X-Mail-Filter: main

This is a MIME-encapsulated message

--06FAFVbL012698.1594808131/mx.mymailserver.com

The original message was received at Wed, 15 Jul 2020 10:15:30 GMT
from 46-117-33-227.bb.netvision.net.il [46.117.33.227]

   ----- The following addresses had successful delivery notifications -----
<legal_address@billauer.co.il>  (successfully delivered to mailbox)

   ----- Transcript of session follows -----
<legal_address@billauer.co.il>... Successfully delivered

--06FAFVbL012698.1594808131/mx.mymailserver.com
Content-Type: message/delivery-status

Reporting-MTA: dns; mx.mymailserver.com
Received-From-MTA: DNS; 46-117-33-227.bb.netvision.net.il
Arrival-Date: Wed, 15 Jul 2020 10:15:30 GMT

Final-Recipient: RFC822; legal_address@billauer.co.il
Action: delivered (to mailbox)
Status: 2.1.5
Last-Attempt-Date: Wed, 15 Jul 2020 10:15:31 GMT

--06FAFVbL012698.1594808131/mx.mymailserver.com
Content-Type: text/rfc822-headers

Return-Path: <spamvictim@billauer.co.il>
Received: from localhost.localdomain (46-117-33-227.bb.netvision.net.il [46.117.33.227])
	by mx.mymailserver.com (8.14.4/8.14.4/Debian-8+deb8u2) with ESMTP id 06FAFTbL012697
	for <legal_address@billauer.co.il>; Wed, 15 Jul 2020 10:15:30 GMT
Date: Wed, 15 Jul 2020 10:15:29 GMT
Message-Id: <202007151015.06FAFTbL012697@mx.mymailserver.com>
MIME-Version: 1.0
From: spamvictim@billauer.co.il
To: legal_address@billauer.co.il
Subject: Testing email.

--06FAFVbL012698.1594808131/mx.mymailserver.com--

But note that if DSN is used by a spammer to trick our mail server, we will get the failure notice that results from sending this message to the other server. If we’re lucky enough to get anything at all: If the message is accepted, we’ll never know our server has been sending spam.

When umount says target is busy, but no process can be blamed

A short one: What to do if unmount is impossible with a

# umount /path/to/mount
umount: /path/to/mount: target is busy

but grepping the output of lsof for the said path yields nothing. In other words, the mount is busy, but no process can be blamed for accessing it (even as a home directory).

If this happens, odds are that it’s an NFS mount, held by some remote machine. The access might have been over long ago, but the mount is still considered busy. So the solution for this case is simple: Restart the NFS daemon. On Linux Mint 19 (and probably a lot of others) it’s simply

# systemctl restart nfs-server

and after this, umount is sucessful (hopefully…)

LG OLED with a Linux computer: Getting that pitch black

Introduction

So I got myself an LG OLED65B9. It’s huge and a really nice piece of electronics. I opted out the bells and whistles, and connected it via HDMI to my already existing media computer, running Linux Mint 18.1. All I wanted was a plain (yet very high quality) display.

However at some point I noticed that black wasn’t displayed as black. I opened GIMP, drew a huge black rectangle, and it displayed as dark grey. At first I thought that the screen was defective (or that I was overly optimistic expecting that black would be complete darkness), but then I tried an image from a USB stick, and reassured myself that black is displayed as complete darkness. As it should. Or why did I pay extra for an OLED?

Because I skipped the “play with the new toy” phase with this display, I’m 100% it’s with its factory settings. It’s not something I messed up.

I should mention that I use plain HD resolution of 1920x1080. The screen can do much better than that (see list of resolutions below), and defaults at 3840x2160 with my computer, but it’s quite pointless: Don’t know about you, I have nothing to show that goes higher than 1080p. And the computer’s graphics stutters at 4k UHD. So why push it?

I have a previous post on graphics modes, and one on the setup of the Brix media center computer involved.

So why is black displayed LCD style?

The truth is that I don’t know for sure. But it seems to be a problem only with standard 16:9 graphics modes. When switching to modes that are typical for computers (5:4 and 4:3 aspect ratios), the image was stretched to the entire screen, and black areas showed as pitch black. I’m not sure about this conclusion, and even less do I have an idea why this would happen or why a properly designed display would “correct” a pixel that arrives as RGB all zeros to something brighter.

Also, the black level on text consoles (Ctrl-Shift-F1 style) is horrible. But I’m not sure about which resolution they use.

An idea that crossed my mind is that maybe the pixels are sent as YCbCr in some modes or maybe the computer goes “Hey, I’m a TV now, let’s do some color correction nobody asked for” when standard HDTV aspect ratios are used. If any, I would go for the second possibility. But xrandr’s verbose output implies that both brightness and gamma are set to 1.0 for the relevant HDMI output, even when black isn’t black.

The graphics adapter is Intel Celeron J3160′s on-chip “HD Graphics” processor (8086:22b1) so nothing fancy here.

Update, 24.2.21: It actually seems like there’s an explanation. The computer uses the i915 module on a Linux kernel 4.4.0-53-generic. Why is that important? Well, look at the output of xrandr below. It says “Broadcast RGB: Automatic”. That means that the color range is set automatically by the driver. And indeed one of the options is “Limited 16:235″ which means that the darkest black is actually level 16 out of 255, as mentioned here, and passionately discussed here and here.

The automatic mode was added to the driver in kernel commit 55bc60db5988c8366751d3d04dd690698a53412c.

The interesting part seems to be this change in drivers/gpu/drm/i915/intel_hdmi.c (became drivers/gpu/drm/i915/display/intel_hdmi.c later):

+	if (intel_hdmi->color_range_auto) {
+		/* See CEA-861-E - 5.1 Default Encoding Parameters */
+		if (intel_hdmi->has_hdmi_sink &&
+		    drm_mode_cea_vic(adjusted_mode) > 1)
+			intel_hdmi->color_range = SDVO_COLOR_RANGE_16_235;
+		else
+			intel_hdmi->color_range = 0;
+	}

The call to drm_mode_cea_vic() apparently says, that if there’s a known Video Identification Code (VIC) for the timing parameters in use (i.e., there’s a standard mode corresponding what is used), then the limited color range is applied. Why? Because. I don’t know if this was fixed since.

Thanks to Nate Long for pointing out the discussions on the kernel module.

The fix

This just worked for me, and I didn’t feel like playing with it further. So I can’t assure that this is a consistent solution, but it actually seems that way.

The idea is that since the problem arises with standard 16:9 modes, maybe make up a non-standard one?

Unlike the case with my previous TV, using cvt to calculate the timing parameters turned out to be a good idea.

$ cvt 1920 1080 60
# 1920x1080 59.96 Hz (CVT 2.07M9) hsync: 67.16 kHz; pclk: 173.00 MHz
Modeline "1920x1080_60.00"  173.00  1920 2048 2248 2576  1080 1083 1088 1120 -hsync +vsync
$ xrandr -d :0 --newmode "try" 173.00  1920 2048 2248 2576  1080 1083 1088 1120 -hsync +vsync
$ xrandr -d :0 --addmode HDMI3 try
$ xrandr -d :0 --output HDMI3 --mode try

At this point I got a proper 1920x1080 on the screen, with black pixels as dark as when the display is powered off. The output of xrandr after this was somewhat unexpected, yet functionally what I wanted:

$ xrandr -d :0 --verbose
  1280x720 (0x4b) 74.250MHz +HSync +VSync +preferred
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1920x1080 (0x141) 173.000MHz -HSync +VSync *current
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz
  1920x1080 (0x10c) 148.500MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.50KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  60.00Hz
 [ ... ]
  try (0x13e) 173.000MHz -HSync +VSync
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

So the mode in effect didn’t turn out the one I generated (“try”), but a replica of its parameters, marked as 0x141 (and 0x13a on another occasion). This mode wasn’t there before.

I’m don’t quite understand how this happened. Maybe Cinnamon’s machinery did this. It kind of gets in the way all the time, and at times it didn’t let me set just any mode I liked with xrandr, so maybe that. This whole thing with graphics modes is completely out of control.

I should mention that there is no problem with sound in this mode (or any other situation I tried). Not that there should be, but at some point I thought maybe there would be, because the mode implies a computer and not a TV-something. But no issues at all. Actually, the screen’s loudspeakers are remarkably good, with a surprisingly present bass, but that’s a different story.

As for making this special mode permanent, that turned out to be a problem in itself. This post shows how I eventually solved it.

List of graphics modes

Just in case this interests anyone, this is the output of a full resolution list:

$ xrandr -d :0 --verbose
[ ... ]
HDMI3 connected primary 3840x2160+0+0 (0x1ba) normal (normal left inverted right x axis y axis) 1600mm x 900mm
	Identifier: 0x48
	Timestamp:  -1469585217
	Subpixel:   unknown
	Gamma:      1.0:1.0:1.0
	Brightness: 1.0
	Clones:
	CRTC:       0
	CRTCs:      0
	Transform:  1.000000 0.000000 0.000000
	            0.000000 1.000000 0.000000
	            0.000000 0.000000 1.000000
	           filter:
	EDID:
		00ffffffffffff001e6da0c001010101
		011d010380a05a780aee91a3544c9926
		0f5054a1080031404540614071408180
		d1c00101010104740030f2705a80b058
		8a0040846300001e023a801871382d40
		582c450040846300001e000000fd0018
		781e871e000a202020202020000000fc
		004c472054560a20202020202020012d
		02035af1565f101f0413051403021220
		212215015d5e6263643f403209570715
		07505707016704033d1ec05f7e016e03
		0c001000b83c20008001020304e200cf
		e305c000e50e60616566eb0146d0002a
		1803257d76ace3060d01662150b05100
		1b304070360040846300001e00000000
		0000000000000000000000000000008b
	aspect ratio: Automatic
		supported: Automatic, 4:3, 16:9
	Broadcast RGB: Automatic
		supported: Automatic, Full, Limited 16:235
	audio: auto
		supported: force-dvi, off, auto, on
  3840x2160 (0x1ba) 297.000MHz +HSync +VSync *current +preferred
        h: width  3840 start 4016 end 4104 total 4400 skew    0 clock  67.50KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  30.00Hz
  4096x2160 (0x1bb) 297.000MHz +HSync +VSync
        h: width  4096 start 5116 end 5204 total 5500 skew    0 clock  54.00KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  24.00Hz
  4096x2160 (0x1bc) 296.703MHz +HSync +VSync
        h: width  4096 start 5116 end 5204 total 5500 skew    0 clock  53.95KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  23.98Hz
  3840x2160 (0x1bd) 297.000MHz +HSync +VSync
        h: width  3840 start 4896 end 4984 total 5280 skew    0 clock  56.25KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  25.00Hz
  3840x2160 (0x1be) 297.000MHz +HSync +VSync
        h: width  3840 start 5116 end 5204 total 5500 skew    0 clock  54.00KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  24.00Hz
  3840x2160 (0x1bf) 296.703MHz +HSync +VSync
        h: width  3840 start 4016 end 4104 total 4400 skew    0 clock  67.43KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  29.97Hz
  3840x2160 (0x1c0) 296.703MHz +HSync +VSync
        h: width  3840 start 5116 end 5204 total 5500 skew    0 clock  53.95KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  23.98Hz
  1920x1080 (0x1c1) 297.000MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock 135.00KHz
        v: height 1080 start 1084 end 1089 total 1125           clock 120.00Hz
  1920x1080 (0x1c2) 297.000MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock 112.50KHz
        v: height 1080 start 1084 end 1094 total 1125           clock 100.00Hz
  1920x1080 (0x1c3) 296.703MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock 134.87KHz
        v: height 1080 start 1084 end 1089 total 1125           clock 119.88Hz
  1920x1080 (0x16c) 148.500MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.50KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  60.00Hz
  1920x1080 (0x1c4) 148.500MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  56.25KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  50.00Hz
  1920x1080 (0x16d) 148.352MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.43KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  59.94Hz
  1920x1080i (0x10c) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  60.00Hz
  1920x1080i (0x10d) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  28.12KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  50.00Hz
  1920x1080 (0x1c5) 74.250MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  30.00Hz
  1920x1080 (0x1c6) 74.250MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  28.12KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  25.00Hz
  1920x1080 (0x1c7) 74.250MHz +HSync +VSync
        h: width  1920 start 2558 end 2602 total 2750 skew    0 clock  27.00KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  24.00Hz
  1920x1080i (0x10e) 74.176MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.72KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  59.94Hz
  1920x1080 (0x1c8) 74.176MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.72KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  29.97Hz
  1920x1080 (0x1c9) 74.176MHz +HSync +VSync
        h: width  1920 start 2558 end 2602 total 2750 skew    0 clock  26.97KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  23.98Hz
  1280x1024 (0x1b5) 108.000MHz +HSync +VSync
        h: width  1280 start 1328 end 1440 total 1688 skew    0 clock  63.98KHz
        v: height 1024 start 1025 end 1028 total 1066           clock  60.02Hz
  1360x768 (0x4b) 85.500MHz +HSync +VSync
        h: width  1360 start 1424 end 1536 total 1792 skew    0 clock  47.71KHz
        v: height  768 start  771 end  777 total  795           clock  60.02Hz
  1152x864 (0x1ca) 81.579MHz -HSync +VSync
        h: width  1152 start 1216 end 1336 total 1520 skew    0 clock  53.67KHz
        v: height  864 start  865 end  868 total  895           clock  59.97Hz
  1280x720 (0x110) 74.250MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1280x720 (0x111) 74.250MHz +HSync +VSync
        h: width  1280 start 1720 end 1760 total 1980 skew    0 clock  37.50KHz
        v: height  720 start  725 end  730 total  750           clock  50.00Hz
  1280x720 (0x112) 74.176MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  44.96KHz
        v: height  720 start  725 end  730 total  750           clock  59.94Hz
  1024x768 (0x113) 65.000MHz -HSync -VSync
        h: width  1024 start 1048 end 1184 total 1344 skew    0 clock  48.36KHz
        v: height  768 start  771 end  777 total  806           clock  60.00Hz
  800x600 (0x115) 40.000MHz +HSync +VSync
        h: width   800 start  840 end  968 total 1056 skew    0 clock  37.88KHz
        v: height  600 start  601 end  605 total  628           clock  60.32Hz
  720x576 (0x116) 27.000MHz -HSync -VSync
        h: width   720 start  732 end  796 total  864 skew    0 clock  31.25KHz
        v: height  576 start  581 end  586 total  625           clock  50.00Hz
  720x576i (0x117) 13.500MHz -HSync -VSync Interlace
        h: width   720 start  732 end  795 total  864 skew    0 clock  15.62KHz
        v: height  576 start  580 end  586 total  625           clock  50.00Hz
  720x480 (0x118) 27.027MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.50KHz
        v: height  480 start  489 end  495 total  525           clock  60.00Hz
  720x480 (0x119) 27.000MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.47KHz
        v: height  480 start  489 end  495 total  525           clock  59.94Hz
  640x480 (0x11c) 25.200MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.50KHz
        v: height  480 start  490 end  492 total  525           clock  60.00Hz
  640x480 (0x11d) 25.175MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.47KHz
        v: height  480 start  490 end  492 total  525           clock  59.94Hz
  720x400 (0x1cb) 28.320MHz -HSync +VSync
        h: width   720 start  738 end  846 total  900 skew    0 clock  31.47KHz
        v: height  400 start  412 end  414 total  449           clock  70.08Hz

So it even supports fallback mode with a 25.175 MHz clock if one really insists.

Linux Wine jots

General

These are just a few jots on Wine. I guess this post will evolve over time.

I’m running Wine version 4.0 on Linux Mint 19, running on an x86_64.

First run

Every time Wine is run on a blank (or absent) directory given by WINEPREFIX, it installs a Windows environment. Which Windows version an several other attributes can be set with Wine Configuration:

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/winecfg

Note that Wine installs a 64-bit environment by default. Add WINEARCH=win32 after the WINEPREFIX setting on the first run of Wine for the relevant directory to install a 32-bit environment instead, which is recommended for its better support unless 64-bit applications are going to be used. No need for WINEARCH afterwards.

It often suggests to install Wine Mono and Wine Gecko. I usually tend to agree.

This installation downloads three files into .cache/wine/: wine_gecko-2.47-x86_64.msi, wine_gecko-2.47-x86.msi and wine-mono-4.7.5.msi. This is why Wine doesn’t ask for permission to install these when setting up new Windows environments after the first time.

Install and use Winetricks

It’s a good idea in general, and it allows installation of Microsoft runtime environment easily:

# apt install winetricks
# apt install wine32-development

And now to install Virtual Studio 6 runtime environment, for example (solving some error message on not being able to import isskin.dll or isskinu.dll)

$ WINEPREFIX=/path/to/winedir winetricks vcrun6sp6

For a list of all packages available, go

$ WINEPREFIX=/path/to/winedir winetricks list-all | less

Prevent browser popup

Wine has this thing that it opens a browser when so requested by the Windows application. That can be annoying at times, and get the program stuck when run inside a firejail. To prevent this altogether, just delete two files:

  • drive_c/windows/syswow64/winebrowser.exe
  • drive_c/windows/system32/winebrowser.exe

but that didn’t work with Picasa, because it opened the browser through its own xdg-open located at /opt/picasa/bin/xdg-utils-1.0.2/scripts/xdg-open. So I replaced it with the following lame script

#!/bin/bash
/usr/bin/konqueror "$1"

so at least it doesn’t mix in the real browser (which caused a mess at times).

Open explorer

The simplest way to start: Open the file explorer:

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine explorer

DOS command line

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine cmd

This is better than expected: The command session is done directly in the console (no new window opened). Like invoking a shell.

Use with firejail

Windows equals viruses, and Wine doesn’t offer any protection against that. Since the entire filesystem is accessible from Z: (more on that below), it’s a good idea to run Wine from within a firejail mini-container. I have a separate post on firejail.

The execution of the program then looks something like (non-root user):

$ firejail --profile=~/my.profile --env=WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine 'C:\Program Files\Malsoft\Malsoft.exe' &

The my.profile file depends on what the Windows program is expected to do. I discuss that briefly in that post, however this is something that worked for me:

include /etc/firejail/disable-common.inc
include /etc/firejail/disable-passwdmgr.inc
private-tmp
private-dev

# All relevant directories are read-only by default, not /opt. So add it.
read-only /opt
#
# This whitelisting protects the entire home directory.
# .cache/wine is where the Gecko + Mono installation files are kept.
# They can't be downloaded, because of "net none" below
mkdir ~/sandboxed/
mkdir ~/.cache/wine
whitelist ~/sandboxed/
whitelist ~/.cache/wine

net none
nonewprivs
caps.drop all
noroot
# blacklist everything that can be harmed
#
blacklist /mnt
blacklist /cdrom
blacklist /media
blacklist /boot

Notes:

  • Note the “net none” part. Networking completely disabled. No access to the internet nor the local network.
  • Be sure to blacklist any system-specific mount, in particular those that are writable by the regular user. Do you have a /hugestorage mount? That one.
  • There’s a seccomp filter option that often appears in template profiles. It got a program in Wine completely stuck. It prevents certain system calls, so no doubt it adds safety, but it came in the way of something in my case.

Poor man’s sandboxing

If you’re too lazy to use firejail, you can remove some access to the local storage by virtue of Wine’s file system bindings. This is worth almost nothing, but almost nothing is more than nothing.

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/winecfg

In the “Drives” tab, remove Z:, and in the Desktop Integration tab, go through each of the folders and uncheck “Link to”.

This doesn’t prevent a Wine-aware Windows program to accessing the machine with plain Linux API with your user permissions just like any Linux program, and the root directory is still visible in Windows’ file browsing utilities. Yet, simple Windows programs expect any file system to be mapped to a drive letter, and these steps prevent that. Not much, but once again, better than nothing.

Firejail: Putting a program in its own little container

Introduction

Firejail is a lightweight security utility which ties the hands of running processes, somewhat like Apparmor and SELinux. However it takes the mission towards Linux kernel’s cgroups and namespaces. It’s in fact a bit of a container-style virtualization utility, which creates sandboxes for running specific programs: Instead of a container for an entire operating system, it makes one for each application (i.e. the main process and its children). Rather than disallowing access from files and directories by virtue of permissions, simply make sure they aren’t visible to the processes. Same goes for networking.

By virtue of Cgroups, several security restrictions are also put in place regardless if so desired. Certain syscalls can be prevented etc. But in the end of the day, think container virtualization. A sandbox is created, and everything happens inside it. It’s also easy to add processes to an existing sandbox (in particular, start a new shell). Not to mention the joy of shutting down a sandbox, that is, killing all processes inside it.

While the main use of Firejail to protect the file system from access and tampering by malicious or infected software, it also allows more or less everything that a container-style virtual machine does: Control of network traffic (volume, dedicated firewall, which physical interfaces are exposed) as well as activity (how many subprocesses, CPU and memory utilization etc.). And like a virtual machine, it also allows statistics on resource usage.

Plus spoofing the host name, restricting access to sound devices, X11 capabilities and a whole range of stuff.

And here’s the nice thing: It doesn’t require root privileges to run. Sort of. The firejail executable is run with setuid.

It’s however important to note that firejail doesn’t create a stand-alone container. Rather, it mixes and matches files from the real file system and overrides selected parts of the directory tree with temporary mounts. Or overlays. Or whiteouts.

In fact, compared with the accurate rules of a firewall, its behavior is quite loose and inaccurate. For a newbie, it’s a bit difficult to predict exactly what kind of sandbox it will set up given this or other setting. It throws in all kind of files of its own into the temporary directories it creates, which is very helpful to get things up and running quickly, but that doesn’t give a feeling of control.

Generally speaking, everything that isn’t explicitly handled by blacklisting or whitelisting (see below) is accessible in the sandbox just like outside it. In particular, it’s the user’s responsibility to hide away all those system-specific mounted filesystems (do you call them /mnt/storage?). If desired, of course.

Major disclaimer: This post is not authoritative in any way, and contains my jots as I get to know the beast. In particular, I may mislead you to think something is protected even though it’s not. You’re responsible to your own decisions.

The examples below are with firejail version 0.9.52 on a Linux Mint 19.

Install

# apt install firejail
# apt install firetools

By all means, go

$ man firejail

after installation. It’s also worth to look at /etc/firejail/ to get an idea on what protection measures are typically used.

Key commands

Launch FireTools, a GUI front end:

$ firetools &

And the “Tools” part has a nice listing of running sandboxes (right-click the ugly thing that comes up).

Now some command line examples. I name the sandboxes in these examples, but I’m not sure it’s worth bothering.

List existing sandboxes (or use FireTools, right-click the panel and choose Tools):

$ firejail --list

Assign a name to a sandbox when creating it

$ firejail --name=mysandbox firefox

Shut down a sandbox (kill all its processes, and clean up):

$ firejail --shutdown=mysandbox

If a name wasn’t assigned, the PID given in the list can be used instead.

Disallow the root user in the sandbox

$ firejail --noroot

Create overlay filesystem (mounts read/write, but changes are kept elsewhere)

$ firejail --overlay firefox

There’s also –overlay-tmpfs for overlaying tmpfs only, as well as –overlay-clean to clean the overlays, which are stored in $HOME/.firejail.

To create a completely new home directory (and /root) as temporary filesystems (private browsing style), so they are volatile:

$ firejail --private firefox

Better still,

$ firejail --private=/path/to/extra-homedir firefox

This uses the directory in the given path as a persistent home directory (some basic files are added automatically). This path can be anywhere in the filesystem, even in parts that are otherwise hidden (i.e. blacklisted) to the sandbox. So this is probably the most appealing choice in most scenarios.

Don’t get too excited, though: Other mounted filesystems remain unprotected (at different levels). This just protects the home directory.

By default, a whole bunch of security rules are loaded when firejail is invoked. To start the container without this:

$ firejail --noprofile

A profile can be selected with the –profile=filename flag.

Writing a profile

If you really want to have a sandbox that protects your computer with relation to a specific piece of software, you’ll probably have to write your own profile. It’s no big deal, except that it’s a bit of trial and error.

First read the manpage:

$ man firejail-profile

It’s easiest to start from a template: Launch FireTools from a shell, right-click the ugly thing that comes up, and pick “Configuration Wizard”, and create a custom security profile for one of the listed application — the one that resembles most the one for which the profile is set up.

Then launch the application from FireTools. The takeaway is that it writes out the configuration file to the console. Start with that.

Whilelisting and blacklisting

First and foremost: Always run a

$ df -h

inside the sandbox to get an idea of what is mounted. Blacklist anything that isn’t necessary. Doing so to entire mounts removes the related mount from the df -h list, which makes it easier to spot things that shouldn’t be there.

It’s also a good idea to start a sample bash session with the sandbox, and get into the File Manager in the Firetool’s “Tools” section for each sandbox.

But then, what is whitelisting and blacklisting, exactly? These two terms are used all over the docs, somehow assuming we know what they mean. So I’ll try to nail it down.

Whitelisting isn’t anywhere near what one would think it is: By whitelisting certain files and/or directories, the original files/directories appear in the sandbox but all other files in their vicinity are invisible. Also, changes in the same vicinity are temporary to the sandbox session. The idea seems to be that if files and/or directories are whitelisted, everything else close to it should be out of sight.

Or as put in the man page:

A temporary file system is mounted on the top directory, and the whitelisted files are mount-binded inside. Modifications to whitelisted files are persistent, everything else is discarded when the sandbox is closed. The top directory could be user home, /dev, /media, /mnt, /opt, /srv, /var, and /tmp.

So for example, if any file or directory in the home directory is whitelisted, the entire home directory becomes overridden by an almost empty home directory plus the specifically whitelisted items. For example, from my own home directory (which is populated with a lot of files):

$ firejail --noprofile --whitelist=/home/eli/this-directory
Parent pid 31560, child pid 31561
Child process initialized in 37.31 ms

$ find .
.
./.config
./.config/pulse
./.config/pulse/client.conf
./this-directory
./this-directory/this-file.txt
./.Xauthority
./.bashrc

So there’s just a few temporary files that firejail was kind enough to add for convenience. Changes made in this-directory/ are persistent since it’s bind-mounted into the temporary directory, but everything else is temporary.

Quite unfortunately, it’s not possible to whitelist a directory outside the specific list of hierarchies (unless bind mounting is used, but that requires root). So if the important stuff is one some /hugedisk, only a bind mount will help (or is this the punishment for not putting it has /mnt/hugedisk?).

But note that the –private= flag allows setting the home directory to anywhere on the filesystem (even inside a blacklisted region). This ad-hoc home directory is persistent, so it’s not like whitelisting, but even better is some scenarios.

Alternatively, it’s possible to blacklist everything but a certain part of a mount. That’s a bit tricky, because if a new directory appears after the rules are set, it remains unprotected. I’ll explain why below.

Or if that makes sense, make the entire directory tree read-only, with only a selected part read-write. That’s fine if there’s no issue with data leaking, just the possibility of malware sabotage.

So now to blacklisting: Firejail implements blacklisting by mounting an empty, read-only-by-root file or directory on top of the original file. And indeed,

$ firejail --blacklist=delme.txt
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 30288, child pid 30289
Child process initialized in 57.75 ms
$ ls -l
[ ... ]
-r--------  1 nobody nogroup     0 Jun  9 22:12 delme.txt
[ ... ]
$ less delme.txt
delme.txt: Permission denied

There are –noblacklist and –nowhitelist flags as well. However these merely cancel future or automatic black- or whitelistings. In particular, one can’t blacklist a directory and whitelist a subdirectory. It would have been very convenient, but since the parent directory is overridden with a whiteout directory, there is no access to the subdirectory. So each and every subdirectory must be blacklisted separately with a script or something, and even then if a new subdirectory pops up, it’s not protected at all.

There’s also a –read-only flag allows setting certain paths and files as read-only. There’s –read-write too, of course. When a directory or file is whitelisted, it must be flagged read-only separately if so desired (see man firejail).

Mini-strace

Trace all processes in the sandbox (in particular accesses to files and network). Much easier than using strace, when all we want is “which files are accessed?”

$ firejail --trace

And then just run any program to see what files and network sockets it accesses. And things of that sort.

Microsoft’s outlook.com servers and the art of delivering mails to them

Introduction

Still in 2020, it seems like Microsoft lives up to its reputation: Being arrogant, thinking that anyone in business must be a huge corporate, and in particular ending up completely ridiculous. Microsoft’s mail servers, which accept on behalf of Hotmail, MSN, Office 365, Outlook.com, or Live.com users are no exception. This also affects companies and other entities which use their own domain names, but use Microsoft’s services for handling mail.

This post summarizes my personal experience and accumulated knowledge with delivering mail to their servers. I use a simple Linux sendmail SMTP MTA on a virtual server for handling the delivery of my own private mails as well as a very low traffic of transactional mails from a web server. All in all, it’s about 100 mails / month coming out from that server to all destinations.

So one server, one IP address with a perfect reputation on all open spam reputation trackers, with SPF, DKIM and DMARC records all in place properly.

One may ask why I’m not relying on existing mail delivery services or my ISP. Answer is simple: Any commercial mail delivery server is likely to have its reputation contaminated by some spammer, no matter what protection measures they take. When that happens, odds are that emails will just disappear, because the ISP has little interest in forwarding the bounce message saying that delivery failed. On a good day, they will be handling the problem quickly, and yet the sender of the lost mail won’t be aware that the correspondence is broken.

For this reason, it’s quite likely that small businesses will go on keeping their own, small, email delivery servers, maintaining their own reputation. So when Outlook’s servers are nasty with a single-IP server, they’re not just arrogant, but they are causing delivery issues with small to medium businesses.

To do when setting up the server

For starter info, go here. Microsoft is pretty upfront about not being friendly to new IP addresses (see troubleshooting page for postmasters).

So it’s a very good idea to create a Microsoft account to log into their services, and then join their Smart Network Data Service (SDNS) and Junk Mail Reporting Program. This is the start page for both of these services.

SDNS allows the owner of a mail server to register its IP address range (“Request Access“), so its status can be monitored (“View IP Status”) over time. When all is fine, the IP Status page says “All of the specified IPs have normal status”, and when they don’t like this or other IP address, it’s more like this (click to enlarge):

Microsoft SDNS blocked IP

The Junk Mail Reporting Program (JMRP) allows the owner of the mail server to receive notifications (by email) when a mail message is delivered however deemed suspicious, either by an end-user (marking it as spam) or by automatic means. So it’s a good idea to create a special email address for this purpose and fill in the JMRP form. Even for the sake of claiming that you got no complaints when contacting support later on.

Note that this is important for delivery of mail to any institution relies on Microsoft’s mail infrastructure. A proper IP address blacklist delisting takes you from

Mar 11 20:18:23 sm-mta[5817]: x2BKIL2H005815: to=<xxxxxxx@mit.edu>, delay=00:00:02, xdelay=00:00:02, mailer=esmtp, pri=121914, relay=mit-edu.mail.protection.outlook.com. [104.47.42.36], dsn=5.7.606, stat=User unknown

(but the bounce message indicated that it’s not an unknown user, but a blacklisted IP number) to

Mar 11 21:15:12 sm-mta[6170]: x2BLF8rT006168: to=<xxxxxxx@mit.edu>, delay=00:00:03, xdelay=00:00:03, mailer=esmtp, pri=121915, relay=mit-edu.mail.protection.outlook.com. [104.47.42.36], dsn=2.0.0, stat=Sent (<5C86CFDC.6000206@example.com> [InternalId=11420318042095, Hostname=DM5PR01MB2345.prod.exchangelabs.com] 11012 bytes in 0.191, 56.057 KB/sec Queued mail for delivery)

Note that the session response said nothing about a blacklisted IP, however the bounce message (not shown here) did.

Finally, Microsoft suggest getting a certification from Return Path. A paid-for service, clearly intended for large companies and in particular mass mailers to get their spam delivered. Microsoftish irony at its best.

To do when things go wrong

First thing first, read the bounce message. If it says that it’s on Microsoft’s IP blacklist, go to the Office 365 Anti-Spam IP Delist Portal and delist it.

Then check the IP’s status (requires logging in). If you’re blocked, contact support. This doesn’t require a Microsoft login account, by the way. I’m not sure if this link to the support page is valid in the long run, so it’s on SNDS’ main page (“contact sender support”) as well as Troubleshooting page.

My own ridiculous experience

I kicked off my mail server a bit more than a year ago. There was some trouble in the beginning, but that was no surprise. Then things got settled and working for a year, and only then, suddenly & out of the blue, a mail to a Hotmail address bounced with:

Action: failed
Status: 5.7.1
Diagnostic-Code: SMTP; 550 5.7.1 Unfortunately, messages from [193.29.56.92] weren't sent. Please contact your Internet service provider since part of their network is on our block list (S3140). You can also refer your provider to http://mail.live.com/mail/troubleshooting.aspx#errors. [VE1EUR01FT021.eop-EUR01.prod.protection.outlook.com]

And indeed, checking the IP status indicated that is was blocked “because of user complaints or other evidence of spamming”.

So first I went to the mail logs. Low traffic. No indication that the server has been tricked into sending a lot of mails. No indication that it has been compromised in any way. And when a server has been compromised, you know it.

No chance that there were user complaints, because I got nothing from JMRP. So what the “evidence of spamming”?

My best guess: A handful transactional mail messages (at most) to their servers for authenticating email addresses that were marker suspicious by their super software. Putting these messages in quarantine for a few hours is the common solution when that happens. Spam is about volume. If all you got was 4-5 messages, how could that be a spam server? Only if you look at percentage. 100% suspicious. Silly or what?

So I filled in the contact support form, and soon enough I got a message saying a ticket has been opened, and 30 minutes later saying

We have completed reviewing the IP(s) you submitted. The following table contains the results of our investigation.

Not qualified for mitigation
193.29.56.92
Our investigation has determined that the above IP(s) do not qualify for mitigation. These IP(s) have previously received mitigations from deliverability support, and have failed to maintain patterns within our guidelines, so they are ineligible for additional mitigation at this time.

Cute, heh? And that is followed by a lot of general advice, basically copied from the website, recommending to join JMRP and SDNS. Which I had a year earlier, of course. The script that responded didn’t even bother to check that.

But it also said:

To have Deliverability Support investigate further, please reply to this email with a detailed description of the problem you are having, including specific error messages, and an agent will contact you.

And so I did. I wrote that I had joined those two programs a year ago, that the mail volume is low and so on. I doubt it really made a difference. After sending the reply, I got a somewhat automated response rather quickly, now with a more human touch:

Hello,

My name is Ayesha and I work with the Outlook.com Deliverability Support Team.

IP: 193.29.56.92

We will be looking into this issue along with the Escalations Team. We understand the urgency of this issue and will provide an update as soon as this is available. Rest assured that this ticket is being tracked and we will get back to you as soon as we have more information to offer.

Thank you for your patience.

Sincerely,
Ayesha

Outlook.com Deliverability Support

And then, a few days later, another mail:

Hello,

My name is Yaqub and I work with the Outlook.com Deliverability Support Team.

Recent activity coming from your IP(s): ( 193.29.56.92) has been flagged by our system as suspicious, causing your IP to become blocked. I have conducted an investigation into the emails originating from your IP space and have implemented mitigation for your deliverability problem. This process may take 24 – 48 hours to replicate completely throughout our system.

Please note that lifting the block does not guarantee that your email will be delivered to a user’s inbox. However, here are some things that can help you with delivery:

(and here came the same suggestions on JMRP and SDNS)

And about 24 hours later, the IP status went back to OK again. And my emails went through normally.

Well, almost. A few days even further down, I attempted to send an email to a live.co.uk destination, and once again, I got the same rejection message (in block list, S3140). The only difference was that the mail server on the other side was hotmail-com.olc.protection.outlook.com (residing in the US), and now eur.olc.protection.outlook.com (somewhere in Europe).

I checked the IP’s status in SDNS and it was fine. So updating the Europeans on the updated IP status takes a bit time, or what?

So I replied to last email I got from Microsoft’s support, saying it failed with live.co.uk. I didn’t get any reply, but a few hours later I tried again, and the mail went through. Coincidence or not.

This time I also caught the related messaged from the mail log. It’s

May 01 15:10:28 sm-mta[2239]: 041FASMh002237: to=<xxxxx@live.co.uk>, ctladdr=<eli@billauer.co.il> (510/500), delay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=121816, relay=eur.olc.protection.outlook.com. [104.47.1.33], dsn=5.0.0, stat=Service unavailable
May 01 15:10:28 sm-mta[2239]: 041FASMh002237: 041FASMh002239: DSN: Service unavailable

for a failure, and

May 02 06:23:00 sm-mta[4024]: 0426Mx1I004021: to=<xxxxx@live.co.uk>, ctladdr=<eli@billauer.co.il> (510/500), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=121808, relay=eur.olc.protection.outlook.com. [104.47.18.97], dsn=2.0.0, stat=Sent (<5EAD11C3.20105@billauer.co.il> [InternalId=21887153366859, Hostname=AM6EUR05HT060.eop-eur05.prod.protection.outlook.com] 10627 bytes in 0.246, 42.064 KB/sec Queued mail for delivery -> 250 2.1.5)

for success.

Lesson learned: Contact support and insist.

And the lesson to all those using Microsoft’s mail services: Your provider cuts off your email contacts arbitrarily. Because they are Microsoft.

Ftrace: The Linux kernel hacker’s swiss knife

Introduction

I ran into ftrace completely by chance, while trying to figure out why the call to usb_submit_urb() took so long time. In fact, it wasn’t. It was pr_info() that delayed the output. And it was ftrace that got me to realize that.

Whether you’re into dissecting existing kernel code, and want to know which function calls which, or if you need a close look on what your own code actually does (and when), ftrace is your companion. And it does a lot more.

And that’s maybe its problem: It offers so many different possibilities, that its documentation gets not so inviting to read. Add some confusing terminology and focus on advanced issues, and one gets the impression that starting to use it is a project in itself.

This is definitely not the case. This post consists of some simple & useful tasks. It’s not much about accuracy, doing it the right way nor showing the whole picture. It’s about about getting stuff done. If you’ll need to nail down something more specific, read the docs. It looks like they got virtually any useful scenario covered.

I’ll divert from keeping things simple in part on events at the end of this post. The concept of events is fairly simple, but the implementation, well, well. But the point is partly to demonstrate exactly that.

Does your kernel support ftrace?

Ftrace is often enabled in compiled kernels, but not always. Look for /sys/kernel/debug/tracing/ (as root) and in particular go

# cat available_tracers

If function_graph and function aren’t listed as available_tracers, the kernel needs to be recompiled with the correct options. Namely, CONFIG_FUNCTION_TRACER, CONFIG_FUNCTION_GRAPH_TRACER, CONFIG_STACK_TRACER and CONFIG_DYNAMIC_FTRACE.

References

These are some good resources. I would usually put them at the end of the post, but noone is expected to get there.

  • A gentle introduction by Alex Dzyoba.
  • /sys/kernel/debug/tracing/README — it’s like the -h flag on user space utilities
  • Detailed documentation from the kernel tree: Documentation/trace/ftrace.rst (and other files in the same directory)
  • A tutorial in LWN, part 1 and part 2.
  • Some good slides on the topic.

Getting around a bit

They’ll tell you that there’s a special tracefs to mount, but it’s easily available under /sys/kernel:

# cd /sys/kernel/debug/tracing/

For tracing invocations of functions, there are two formats: The simple “function” and it’s more sophisticated “function_graph”. It’s easiest to just try them out: Select the function graph tracer

# echo function_graph > current_tracer

Watch the tracing output

# less trace

There’s also trace_pipe with a “tail -f” flavor.

The output isn’t very helpful for now, as it shows every function invoked on all processors. Too much information. We’ll get to filtering just below.

Turn tracing off

# echo 0 > tracing_on

or on

# echo 1 > tracing_on

and clear (empty) the trace:

# echo > trace

Note that turning tracing off turns off everything, including trace_printk() discussed below. It’s a complete halt. To just stop one of function tracers, better go

# echo nop > current_tracer

The trace data is stored in circular buffers, so old entries are overwritten by newer ones if these buffers get full. The problem is that there’s a separate buffer for each CPU, so once this overwriting begins, the overall trace may miss out traces from only some CPUs on those time segments. Therefore I prefer turning off overwriting completely. At least the beginning of the trace reflects what actually happened (and the mess is left to the end):

# echo 0 > options/overwrite

Now to the “function” variant, just which function was invoked by which along with a timestamp. There is however no information on when the function returned (how much time it took), but the absolute timestamp can be matched with dmesg stamp. But this requires selecting ftrace’s clock as global:

# echo function > current_tracer
# echo global > trace_clock

It’s very important to note that printk (and its derivatives) can take up to a few milliseconds, so an apparent mismatch in the timestamps between a printk and a function called immediately after it may be a result of that.

Using trace_printk()

It’s quite common to use printk and its derivatives for debugging kernel code, which is fairly OK if there’s no problem if these slow down the execution — I’ve seen printk taking a few milliseconds (in process context, I should say).

So for quick and time-accurate printk-like debugging, definitely go for trace_printk(). Plus you get some extra info, such as the process / interrupt context and which CPU it ran on. It’s just a matter of adding something like

trace_printk("Submitting buffer ID=%d, len=%d\n", id, count);

in your kernel code, instead of pr_info(), dev_info() or whatever. No extra header file or anything of that sort needed, as far as I can tell.

I should mention that trace_printk() is intended for temporary debug messages, and these can’t be left in production code. If you want debug messages that stay there, go for events. Trickier, but the right way to go.

trace_printk() messages are logged by default, even when current_tracer is set to “nop”. However tracing must be on. In short, a newly booted system will show trace_printk()’s output in “trace”.

These messages will be interleaved with the other traces if current_tracer is in non-nop mode. This can be useful when the situation of interest occurs rarely — for example a code segments takes too long to run. Since the outcome is known only at the end, the simple solution is to run function tracing, and call trace_printk() when the time difference exceeds a threshold. This message will of course appear after the function sequence in the trace, but it’s easily looked up with plain text tools (e.g. “less”).

The system boots up with tracing on and in “nop” mode, but if you’ve been fiddling with these, bring them back with

# echo nop > current_tracer
# echo 1 > tracing_on

And then view the messages just like before (“trace”, “trace_pipe” or the CPU-individual counterparts).

The reason you don’t want to leave trace_printk() in production code is that the first call will leave a message of this form in the kernel log:

[ 1954.872204] **********************************************************
[ 1954.875230] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[ 1954.878242] **                                                      **
[ 1954.881257] ** trace_printk() being used. Allocating extra memory.  **
[ 1954.884281] **                                                      **
[ 1954.887345] ** This means that this is a DEBUG kernel and it is     **
[ 1954.890397] ** unsafe for production use.                           **
[ 1954.893470] **                                                      **
[ 1954.896496] ** If you see this message and you are not debugging    **
[ 1954.899527] ** the kernel, report this immediately to your vendor!  **
[ 1954.902539] **                                                      **
[ 1954.905598] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[ 1954.908713] **********************************************************

This is of course nothing to be alarmed about (if you’re the one who made the trace_printk() calls, that is). Seems like the purpose of this message is to gently convince programmers to use events in production code instead.

Tracing a specific function call

Ftrace comes with great filtering capabilities, which can be really complicated. So let’s take a simple usage case. Say that I want to see how much time elapses from the call to usb_submit_urb() in my driver and the callback function. Never mind the details. First, I might want to verify that the function is really traceable — it might not be if it has been optimized away by the compiler or if it’s a #define macro (which definitely isn’t the case for usb_submit_urb(), since it’s an exported function, but anyhow).

So first, look it up in the list of available functions (cwd is /sys/kernel/debug/tracing):

# grep submit_urb available_filter_functions
usb_hcd_submit_urb
usb_submit_urb

Yep, it’s there. So add it to the functions to filter, along with my driver’s callback function, which is bulk_out_completer():

# echo usb_submit_urb bulk_out_completer > set_ftrace_filter
# echo function > current_tracer
# echo > trace

Note that multiple functions can be given to set_ftrace_filter, delimited by plain whitespace. Wildcards can also be used, as shown below.

Now perform the operation (use dd command in my case) that makes the driver active, and harvest the output:

# head -30 trace
# tracer: function
#
# entries-in-buffer/entries-written: 64/64   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
              dd-847   [001] ....  1787.067325: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067335: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067342: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067348: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067353: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067358: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067363: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067369: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.068019: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.068080: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.068528: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.068557: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.069039: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.069062: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.069533: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.069556: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.070012: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.070036: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.070506: bulk_out_completer <-__usb_hcd_giveback_urb

Not sure how useful the data above is, but it demonstrates how a sequence of events can be analyzed easily.

Just to close this issue, let’s take a look on wildcards. This is easiest shown with an example:

# echo '*_submit_urb' > set_ftrace_filter
# cat set_ftrace_filter
usb_hcd_submit_urb
usb_submit_urb

So quite evidently, the wildcard was applied when the filter was set, and its usage is quite trivial.

Don’t forget to remove the filter when it’s not necessary anymore. These filters won’t go away otherwise, and tracing may appear to suddenly not work if you go on doing something else. So simply:

# echo > set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####

It’s not a coincidence that I didn’t use function_graph above — the functions that are called by the selected function(s) won’t show anyhow. It does show the duration of the call, which may be helpful. This brings us to

Which functions does function X call?

This is a great way to get an idea of who-calls-what. Let’s stick with usb_submit_urb(). What other functions does it call, specifically in my case?

Simple. Be sure to have removed any previous filters (see above) and then just go:

# echo usb_submit_urb > set_graph_function
# echo function_graph > current_tracer
# echo > trace

Then do whatever causes the function to be called, after which “trace” reads something like:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  usb_submit_urb() {
 0)   0.979 us    |    usb_urb_ep_type_check();
 0)               |    usb_hcd_submit_urb() {
 0)   0.427 us    |      usb_get_urb();
 0)               |      xhci_map_urb_for_dma [xhci_hcd]() {
 0)               |        usb_hcd_map_urb_for_dma() {
 0)   0.658 us    |          dma_direct_map_page();
 0)   1.739 us    |        }
 0)   2.642 us    |      }

[ ... ]

and it goes on.

This is a good place to remind that if all you wanted was to get the stack trace of calls at a certain point in your code, WARN() and WARN_ONCE() may be handier.

Tracing a segment: quick and dirty

This method doesn’t really trace a specific segment. It’s a bit of a dirty trick: Tracing is disabled at first, and then enabled before the function call, and disabled after it. All function calls that take place on all CPUs during that time are recorded. It may mean additional noise, but this noise may also be the explanation to why something went wrong (if it went wrong).

There is probably a more elegant solution. In particular, ftrace triggers have “traceon” and “traceoff” actions, so this is probably the classier way to go. But let’s do something quick, dirty and simple to understand.

So say that I want to understand what usb_submit_urb() is up to and what other functions it calls along. I could go, in my own driver:

tracing_on();
rc = usb_submit_urb(urb, GFP_KERNEL);
tracing_off();

With this driver loaded into the kernel, then go something like:

# cd /sys/kernel/debug/tracing/
# echo global > trace_clock
# echo 32768 > buffer_size_kb
# echo 0 > tracing_on
# echo function > current_tracer

This assumes that current_tracer was “nop” before this (it’s by default), so the trace buffer begins empty. And when it’s changed to “function” at the last command, nothing happens, because tracing was just turned off. The buffer size is set to 32 MB per CPU (which is quite a lot).

Then run whatever makes the relevant code execute, and then

# less trace

The problem is that trace output from several CPUs is interleaved in this output. But we know which CPU the relevant command ran on from the trace itself, so obtain a filtered version (e.g. for CPU #2):

# less per_cpu/cpu2/trace

Events

The term “events” is somewhat misleading. An event is just a piece of printk-like debug message that can be enabled or disabled in runtime. It’s a convenient way to leave the debugging messages in place even in production code, and make it possible even for end-users to harvest the information without needing to compile anything.

I’ll try to explain this by a (ehm-ehm) simple example. Let’s enable the kmalloc event:

# echo nop > current_tracer
# echo kmalloc > set_event

This turns off the function tracer, and enables the kmalloc event. Looking at “trace” we now have something like:

# tracer: nop
#
# entries-in-buffer/entries-written: 1047/1047   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            sshd-781   [003] ....  5317.446761: kmalloc: call_site=ffffffff8153b210 ptr=000000001f5ab582 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.446788: kmalloc: call_site=ffffffff8153cbf1 ptr=000000002525fcc0 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ....  5317.748476: kmalloc: call_site=ffffffff8153b210 ptr=0000000095781cbc bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.748501: kmalloc: call_site=ffffffff8153cbf1 ptr=00000000c9801d3d bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ....  5317.900662: kmalloc: call_site=ffffffff8153b210 ptr=000000008e7d4585 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.900687: kmalloc: call_site=ffffffff8153cbf1 ptr=0000000004406a83 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420356: kmalloc: call_site=ffffffff8119f2c0 ptr=00000000a6835237 bytes_req=184 bytes_alloc=192 gfp_flags=GFP_KERNEL_ACCOUNT|__GFP_ZERO
            bash-792   [000] ....  5318.420365: kmalloc: call_site=ffffffff8119f34d ptr=000000008e7d4585 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL_ACCOUNT|__GFP_ZERO
            bash-792   [000] ....  5318.420404: kmalloc: call_site=ffffffff8116feff ptr=00000000c30e90f8 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|__GFP_ZERO
            sshd-781   [003] ....  5318.420408: kmalloc: call_site=ffffffff8153b210 ptr=000000000b7b85e5 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420415: kmalloc: call_site=ffffffff811701e4 ptr=00000000f54127a0 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL|__GFP_ZERO
            bash-792   [000] ....  5318.420431: kmalloc: call_site=ffffffff812eb5e2 ptr=0000000084bbe3b4 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|__GFP_ZERO
            sshd-781   [003] ...1  5318.420435: kmalloc: call_site=ffffffff8153cbf1 ptr=00000000a6a3ac50 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420473: kmalloc: call_site=ffffffff811b1aac ptr=0000000095972087 bytes_req=56 bytes_alloc=64 gfp_flags=GFP_KERNEL_ACCOUNT

These are all kmalloc calls that were executed since the event was enabled. There are of course ways to filter the events to specific occasions, but this is really not something I’ve gotten into (yet?).

It’s however interesting to see how these messages came about. To do this, try the following (after enabling the kmalloc event as shown above):

# echo '*kmalloc*' > set_graph_function
# echo function_graph > current_tracer
# echo > trace

And then the trace output can be something like:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  finish_task_switch() {
 0)               |    _raw_spin_unlock_irq() {
 0)   0.794 us    |      do_raw_spin_unlock();
 0)   0.777 us    |      preempt_count_sub();
 0)   4.402 us    |    }
 0)   8.919 us    |  }
 0)               |  __kmalloc_reserve.isra.9() {
 0)               |    __kmalloc_track_caller() {
 0)   0.784 us    |      kmalloc_slab();
 0)   0.654 us    |      should_failslab();
 0)   0.711 us    |      check_irq_off();
 0)               |      cache_alloc_debugcheck_after() {
 0)   4.691 us    |        check_poison_obj();
 0)   0.685 us    |        poison_obj();
 0)   7.353 us    |      }
 0)               |      /* kmalloc: call_site=ffffffff81801a80 ptr=000000000493ffc2 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC */
 0) + 15.571 us   |    }
 0) + 17.452 us   |  }

And it goes on like this several times. So now we have the event output in the middle of the function graph, and we can also see what happened: __kmalloc_reserve(), which is defined in net/core/skbuff.c, calls kmalloc_node_track_caller(), which is translated into __kmalloc_track_caller() by virtue of a #define in slab.h. That function is defined in mm/slab.c, but it just redirects the call to __do_kmalloc_node(), which makes the calls visible in the trace, and eventually calls kmem_cache_alloc_node_trace(). This call isn’t registered, most likely because it was optimized away by the compiler. And it reads:

#ifdef CONFIG_TRACING
void *
kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
{
	void *ret;

	ret = slab_alloc(cachep, flags, _RET_IP_);

	ret = kasan_kmalloc(cachep, ret, size, flags);
	trace_kmalloc(_RET_IP_, ret,
		      size, cachep->size, flags);
	return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_trace);
#endif

So there we have it. But wait! Where is trace_malloc defined? The answer lies at the top of slab.c:

#include <trace/events/kmem.h>

which includes include/trace/events/kmem.h, the beginning of which reads

 /* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM kmem

#if !defined(_TRACE_KMEM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_KMEM_H

#include <linux/types.h>
#include <linux/tracepoint.h>
#include <trace/events/mmflags.h>

DECLARE_EVENT_CLASS(kmem_alloc,

	TP_PROTO(unsigned long call_site,
		 const void *ptr,
		 size_t bytes_req,
		 size_t bytes_alloc,
		 gfp_t gfp_flags),

	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags),

	TP_STRUCT__entry(
		__field(	unsigned long,	call_site	)
		__field(	const void *,	ptr		)
		__field(	size_t,		bytes_req	)
		__field(	size_t,		bytes_alloc	)
		__field(	gfp_t,		gfp_flags	)
	),

	TP_fast_assign(
		__entry->call_site	= call_site;
		__entry->ptr		= ptr;
		__entry->bytes_req	= bytes_req;
		__entry->bytes_alloc	= bytes_alloc;
		__entry->gfp_flags	= gfp_flags;
	),

	TP_printk("call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s",
		__entry->call_site,
		__entry->ptr,
		__entry->bytes_req,
		__entry->bytes_alloc,
		show_gfp_flags(__entry->gfp_flags))
);

DEFINE_EVENT(kmem_alloc, kmalloc,

	TP_PROTO(unsigned long call_site, const void *ptr,
		 size_t bytes_req, size_t bytes_alloc, gfp_t gfp_flags),

	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags)
);

Tired already? We’re almost there. I won’t get into the details of this (mostly because I don’t know), but note two crucial points: One is the TP_printk(), which matches the output format of the event’s output. The second part is the DEFINE_EVENT, which defines a function linked with the kmem_alloc class, named trace_kmalloc(). The define-macro magic takes place in include/linux/tracepoint.h, and essentially translates DEFINE_EVENT into a DECLARE_TRACE, which is then turned into a __DECLARE_TRACE, which begins with:

#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
	extern struct tracepoint __tracepoint_##name;			\
	static inline void trace_##name(proto)				\
	{								\
		if (static_key_false(&__tracepoint_##name.key))		\
			__DO_TRACE(&__tracepoint_##name,		\
				TP_PROTO(data_proto),			\
				TP_ARGS(data_args),			\
				TP_CONDITION(cond), 0);			\
		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
			rcu_read_lock_sched_notrace();			\
			rcu_dereference_sched(__tracepoint_##name.funcs);\
			rcu_read_unlock_sched_notrace();		\
		}							\
	}								\
[ ... ]

And then it goes on with several related functions. The point with this long story, except for the obvious masochism, was to show that events really are just some kind of printk, only not all that easy to use. It also explains why that scary message is needed to get people off trace_printk().

But since there’s some code to copy from, it shouldn’t be all that bad to set up new events.

Which process opens file X?

In case you want to to know what process opens which file, system-wide, this is quick recipe I’ve copied from these slides:

# echo 'p:open do_sys_open file=+0(%si):string' > kprobe_events
# echo 1 > events/kprobes/open/enable
# cat trace_pipe

Note that “cat” can be replaced with “grep” if you’re looking which process opens a specific file (or is there some kernel filter for this? Not sure it’s worth bother checking).

I haven’t taken the time to figure out exactly how and why it works (and I guess it’s possible to use filters here as well).