Turning off DSN on sendmail to prevent backscatter

I sent that?

One morning, I got a bounce message from my own mail sendmail server, saying that it failed to deliver a message I never sent. That’s red alert. It means that someone managed to provoke my mail server to send an outbound message. It’s red alert, because my mail server effectively relays spam to any destination that the spammer chooses. This could ruin the server’s reputation horribly.

It turned out that an arriving mail required a return receipt, which was destined to just some mail address. There’s an SMTP feature called Delivery Status Notification (DSN), which allows the client connecting to the mail server to ask for a mail “in return”, informing the sender of the mail if it was properly delivered. The problem is that the MAIL FROM / From addresses could be spoofed, pointing at a destination to spam. Congratulations, your mail server was just tricked into sending spam. This kind of trickery is called backscatter.

Checking my own mail logs, the DSN is a virtually unused feature. So it’s probably just something spammers can take advantage of.

The relevant RFC for DSN is RFC1891. Further explanations on DSN can be found in one of sendmail’s tutorial pages.

How to turn DSN off

First, I recommend checking if it’s not disabled already, as explained below. In particular, if the paranoid-level “goaway” privacy option is used, DSN is turned off anyhow.

It’s actually easy. Add the noreceipts option to PrivacyOptions. More precisely, edit /etc/mail/sendmail.mc and add noreceipts to the list of already existing options. In my case, it ended up as

define(`confPRIVACY_FLAGS',dnl
`needmailhelo,needexpnhelo,needvrfyhelo,restrictqrun,restrictexpand,nobodyreturn,noetrn,noexpn,novrfy,noactualrecipient,noreceipts')dnl

and then run “make” in /etc/mail, and restart sendmail.

Turning off DSN is often recommended against in different sendmail guides, because it’s considered a “valuable feature” or so. As mentioned above, I haven’t seen it used by anyone else than spammers.

Will my mail server do DSN?

Easy to check, because the server announces its willingness to fulfill DSN requests at the beginning of the SMTP session, with the line marked in red in the sample session below:

<<< 220 mx.mymailserver.com ESMTP MTA; Wed, 15 Jul 2020 10:22:32 GMT
>>> EHLO localhost.localdomain
<<< 250-mx.mymailserver.com Hello 46-117-33-227.bb.netvision.net.il [46.117.33.227], pleased to meet you
<<< 250-ENHANCEDSTATUSCODES
<<< 250-PIPELINING
<<< 250-8BITMIME
<<< 250-SIZE
<<< 250-DSN
<<< 250-DELIVERBY
<<< 250 HELP
>>> MAIL FROM:<spamvictim@billauer.co.il>
<<< 250 2.1.0 <spamvictim@billauer.co.il>... Sender ok
>>> RCPT TO:<legal_address@billauer.co.il> NOTIFY=SUCCESS
<<< 250 2.1.5 <legal_address@billauer.co.il>... Recipient ok
>>> DATA
<<< 354 Enter mail, end with "." on a line by itself
>>> MIME-Version: 1.0
>>> From: spamvictim@billauer.co.il
>>> To: legal_address@billauer.co.il
>>> Subject: Testing email.
>>>
>>>
>>> Just a test, please ignore
>>> .
<<< 250 2.0.0 06FAMWa1014200 Message accepted for delivery
>>> QUIT
<<< 221 2.0.0 mx.mymailserver.com closing connection

To test a mail server for its behavior with DSN, the script that I’ve already published can be used. To make it request a return receipt, the two lines that set the SMTP recipient should be changed to

  die("Failed to set receipient\n")
    if (! ($smtp->recipient( ($to_addr ), { Notify => ['SUCCESS'] } ) ) );

This change causes the NOTIFY=SUCCESS part in the RCPT TO line, which effectively requests a receipt from the server when the mail is properly delivered.

Note that if DSN isn’t supported by the mail server (possibly because of the privacy option fix shown above), the SMPT session looks exactly the same, except that the SMTP line marked in red will be absent. Then the mail server just ignores the NOTIFY=SUCCESS part silently, and responds exactly as before.

However when running the Perl script, the Net::SMTP will be kind enough to issue a warning to its stderr:

Net::SMTP::recipient: DSN option not supported by host at ./testmail.pl line 36.

The mail addresses I used in the sample session above are bogus, of courses, but note that the spam victim is the sender of the email, because that’s where the return receipt goes. On top of that, the RCPT TO address will also get a spam message, but that’s the smaller problem, as it’s yet another spam message arriving — not one that is sent away from our server.

I should also mention that Notify can be a comma-separated list of events, e.g.

RCPT TO:<bad_address@billauer.co.il> NOTIFY=SUCCESS,FAILURE,DELAY

however FAILURE doesn’t include the user not being known to the server, in which case the message is dropped anyhow without any DSN message generated. So as a spam trick, one can’t send mails to random addresses, and issue spam bounce messages because they failed. That would have been too easy.

In the mail logs

The sample session shown above causes the following lines in mail.log. Note the line marked in red, which indicates that the return receipt mechanism was fired off.

Jul 15 10:15:31 sm-mta[12697]: 06FAFTbL012697: from=<spamvictim@billauer.co.il>, size=121, class=0, nrcpts=1, msgid=<202007151015.06FAFTbL012697@mx.mymailserver.com>, proto=ESMTP, daemon=IPv4-port-587, relay=46-117-33-227.bb.netvision.net.il
[46.117.33.227]
Jul 15 10:15:31 sm-mta[12698]: 06FAFTbL012697: to=<legal_address@billauer.co.il>, ctladdr=<spamvictim@billauer.co.il> (1010/500), delay=00:00:01, xdelay=00:00:00, mailer=local, pri=30456, dsn=2.0.0, stat=Sent
Jul 15 10:15:31 sm-mta[12698]: 06FAFTbL012697: 06FAFVbL012698: DSN: Return receipt
Jul 15 10:15:31 sm-mta[12698]: 06FAFVbL012698: to=<spamvictim@billauer.co.il>, delay=00:00:00, xdelay=00:00:00, mailer=local, pri=30000, dsn=2.0.0, stat=Sent

The receipt

Since I’m at it, this is what a receipt message for the sample session above looks like:

Received: from localhost (localhost)	by mx.mymailserver.com
 (8.14.4/8.14.4/Debian-8+deb8u2) id 06FAFVbL012698;	Wed, 15 Jul 2020
 10:15:31 GMT
Date: Wed, 15 Jul 2020 10:15:31 GMT
From: Mail Delivery Subsystem <MAILER-DAEMON@billauer.co.il>
Message-ID: <202007151015.06FAFVbL012698@mx.mymailserver.com>
To: <spamvictim@billauer.co.il>
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status;
 boundary="06FAFVbL012698.1594808131/mx.mymailserver.com"
Subject: Return receipt
Auto-Submitted: auto-generated (return-receipt)
X-Mail-Filter: main

This is a MIME-encapsulated message

--06FAFVbL012698.1594808131/mx.mymailserver.com

The original message was received at Wed, 15 Jul 2020 10:15:30 GMT
from 46-117-33-227.bb.netvision.net.il [46.117.33.227]

   ----- The following addresses had successful delivery notifications -----
<legal_address@billauer.co.il>  (successfully delivered to mailbox)

   ----- Transcript of session follows -----
<legal_address@billauer.co.il>... Successfully delivered

--06FAFVbL012698.1594808131/mx.mymailserver.com
Content-Type: message/delivery-status

Reporting-MTA: dns; mx.mymailserver.com
Received-From-MTA: DNS; 46-117-33-227.bb.netvision.net.il
Arrival-Date: Wed, 15 Jul 2020 10:15:30 GMT

Final-Recipient: RFC822; legal_address@billauer.co.il
Action: delivered (to mailbox)
Status: 2.1.5
Last-Attempt-Date: Wed, 15 Jul 2020 10:15:31 GMT

--06FAFVbL012698.1594808131/mx.mymailserver.com
Content-Type: text/rfc822-headers

Return-Path: <spamvictim@billauer.co.il>
Received: from localhost.localdomain (46-117-33-227.bb.netvision.net.il [46.117.33.227])
	by mx.mymailserver.com (8.14.4/8.14.4/Debian-8+deb8u2) with ESMTP id 06FAFTbL012697
	for <legal_address@billauer.co.il>; Wed, 15 Jul 2020 10:15:30 GMT
Date: Wed, 15 Jul 2020 10:15:29 GMT
Message-Id: <202007151015.06FAFTbL012697@mx.mymailserver.com>
MIME-Version: 1.0
From: spamvictim@billauer.co.il
To: legal_address@billauer.co.il
Subject: Testing email.

--06FAFVbL012698.1594808131/mx.mymailserver.com--

But note that if DSN is used by a spammer to trick our mail server, we will get the failure notice that results from sending this message to the other server. If we’re lucky enough to get anything at all: If the message is accepted, we’ll never know our server has been sending spam.

When umount says target is busy, but no process can be blamed

A short one: What to do if unmount is impossible with a

# umount /path/to/mount
umount: /path/to/mount: target is busy

but grepping the output of lsof for the said path yields nothing. In other words, the mount is busy, but no process can be blamed for accessing it (even as a home directory).

If this happens, odds are that it’s an NFS mount, held by some remote machine. The access might have been over long ago, but the mount is still considered busy. So the solution for this case is simple: Restart the NFS daemon. On Linux Mint 19 (and probably a lot of others) it’s simply

# systemctl restart nfs-server

and after this, umount is sucessful (hopefully…)

LG OLED with a Linux computer: Getting that pitch black

Introduction

So I got myself an LG OLED65B9. It’s huge and a really nice piece of electronics. I opted out the bells and whistles, and connected it via HDMI to my already existing media computer, running Linux Mint 18.1. All I wanted was a plain (yet very high quality) display.

However at some point I noticed that black wasn’t displayed as black. I opened GIMP, drew a huge black rectangle, and it displayed as dark grey. At first I thought that the screen was defective (or that I was overly optimistic expecting that black would be complete darkness), but then I tried an image from a USB stick, and reassured myself that black is displayed as complete darkness. As it should. Or why did I pay extra for an OLED?

Because I skipped the “play with the new toy” phase with this display, I’m 100% it’s with its factory settings. It’s not something I messed up.

I should mention that I use plain HD resolution of 1920x1080. The screen can do much better than that (see list of resolutions below), and defaults at 3840x2160 with my computer, but it’s quite pointless: Don’t know about you, I have nothing to show that goes higher than 1080p. And the computer’s graphics stutters at 4k UHD. So why push it?

I have a previous post on graphics modes, and one on the setup of the Brix media center computer involved.

So why is black displayed LCD style?

The truth is that I don’t know. But it seems to be a problem only with standard 16:9 graphics modes. When switching to modes that are typical for computers (5:4 and 4:3 aspect ratios), the image was stretched to the entire screen, and black areas showed as pitch black. I’m not sure about this conclusion, and even less do I have an idea why this would happen or why a properly designed display would “correct” a pixel that arrives as RGB all zeros to something brighter.

Also, the black level on text consoles (Ctrl-Shift-F1 style) is horrible. But I’m not sure about which resolution they use.

An idea that crossed my mind is that maybe the pixels are sent as YCbCr in some modes or maybe the computer goes “Hey, I’m a TV now, let’s do some color correction nobody asked for” when standard HDTV aspect ratios are used. If any, I would go for the second possibility. But xrandr’s verbose output implies that both brightness and gamma are set to 1.0 for the relevant HDMI output, even when black isn’t black.

The graphics adapter is Intel Celeron J3160′s on-chip “HD Graphics” processor (8086:22b1) so nothing fancy here.

The fix

This just worked for me, and I didn’t feel like playing with it further. So I can’t assure that this is a consistent solution, but it actually seems that way.

The idea is that since the problem arises with standard 16:9 modes, maybe make up a non-standard one?

Unlike the case with my previous TV, using cvt to calculate the timing parameters turned out to be a good idea.

$ cvt 1920 1080 60
# 1920x1080 59.96 Hz (CVT 2.07M9) hsync: 67.16 kHz; pclk: 173.00 MHz
Modeline "1920x1080_60.00"  173.00  1920 2048 2248 2576  1080 1083 1088 1120 -hsync +vsync
$ xrandr -d :0 --newmode "try" 173.00  1920 2048 2248 2576  1080 1083 1088 1120 -hsync +vsync
$ xrandr -d :0 --addmode HDMI3 try
$ xrandr -d :0 --output HDMI3 --mode try

At this point I got a proper 1920x1080 on the screen, with black pixels as dark as when the display is powered off. The output of xrandr after this was somewhat unexpected, yet functionally what I wanted:

$ xrandr -d :0 --verbose
  1280x720 (0x4b) 74.250MHz +HSync +VSync +preferred
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1920x1080 (0x141) 173.000MHz -HSync +VSync *current
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz
  1920x1080 (0x10c) 148.500MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.50KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  60.00Hz
 [ ... ]
  try (0x13e) 173.000MHz -HSync +VSync
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

So the mode in effect didn’t turn out the one I generated (“try”), but a replica of its parameters, marked as 0x141 (and 0x13a on another occasion). This mode wasn’t there before.

I’m don’t quite understand how this happened. Maybe Cinnamon’s machinery did this. It kind of gets in the way all the time, and at times it didn’t let me set just any mode I liked with xrandr, so maybe that. This whole thing with graphics modes is completely out of control.

I should mention that there is no problem with sound in this mode (or any other situation I tried). Not that there should be, but at some point I thought maybe there would be, because the mode implies a computer and not a TV-something. But no issues at all. Actually, the screen’s loudspeakers are remarkably good, with a surprisingly present bass, but that’s a different story.

As for making this special mode permanent, that turned out to be a problem in itself. This post shows how I eventually solved it.

List of graphics modes

Just in case this interests anyone, this is the output of a full resolution list:

$ xrandr -d :0 --verbose
[ ... ]
HDMI3 connected primary 3840x2160+0+0 (0x1ba) normal (normal left inverted right x axis y axis) 1600mm x 900mm
	Identifier: 0x48
	Timestamp:  -1469585217
	Subpixel:   unknown
	Gamma:      1.0:1.0:1.0
	Brightness: 1.0
	Clones:
	CRTC:       0
	CRTCs:      0
	Transform:  1.000000 0.000000 0.000000
	            0.000000 1.000000 0.000000
	            0.000000 0.000000 1.000000
	           filter:
	EDID:
		00ffffffffffff001e6da0c001010101
		011d010380a05a780aee91a3544c9926
		0f5054a1080031404540614071408180
		d1c00101010104740030f2705a80b058
		8a0040846300001e023a801871382d40
		582c450040846300001e000000fd0018
		781e871e000a202020202020000000fc
		004c472054560a20202020202020012d
		02035af1565f101f0413051403021220
		212215015d5e6263643f403209570715
		07505707016704033d1ec05f7e016e03
		0c001000b83c20008001020304e200cf
		e305c000e50e60616566eb0146d0002a
		1803257d76ace3060d01662150b05100
		1b304070360040846300001e00000000
		0000000000000000000000000000008b
	aspect ratio: Automatic
		supported: Automatic, 4:3, 16:9
	Broadcast RGB: Automatic
		supported: Automatic, Full, Limited 16:235
	audio: auto
		supported: force-dvi, off, auto, on
  3840x2160 (0x1ba) 297.000MHz +HSync +VSync *current +preferred
        h: width  3840 start 4016 end 4104 total 4400 skew    0 clock  67.50KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  30.00Hz
  4096x2160 (0x1bb) 297.000MHz +HSync +VSync
        h: width  4096 start 5116 end 5204 total 5500 skew    0 clock  54.00KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  24.00Hz
  4096x2160 (0x1bc) 296.703MHz +HSync +VSync
        h: width  4096 start 5116 end 5204 total 5500 skew    0 clock  53.95KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  23.98Hz
  3840x2160 (0x1bd) 297.000MHz +HSync +VSync
        h: width  3840 start 4896 end 4984 total 5280 skew    0 clock  56.25KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  25.00Hz
  3840x2160 (0x1be) 297.000MHz +HSync +VSync
        h: width  3840 start 5116 end 5204 total 5500 skew    0 clock  54.00KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  24.00Hz
  3840x2160 (0x1bf) 296.703MHz +HSync +VSync
        h: width  3840 start 4016 end 4104 total 4400 skew    0 clock  67.43KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  29.97Hz
  3840x2160 (0x1c0) 296.703MHz +HSync +VSync
        h: width  3840 start 5116 end 5204 total 5500 skew    0 clock  53.95KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  23.98Hz
  1920x1080 (0x1c1) 297.000MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock 135.00KHz
        v: height 1080 start 1084 end 1089 total 1125           clock 120.00Hz
  1920x1080 (0x1c2) 297.000MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock 112.50KHz
        v: height 1080 start 1084 end 1094 total 1125           clock 100.00Hz
  1920x1080 (0x1c3) 296.703MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock 134.87KHz
        v: height 1080 start 1084 end 1089 total 1125           clock 119.88Hz
  1920x1080 (0x16c) 148.500MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.50KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  60.00Hz
  1920x1080 (0x1c4) 148.500MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  56.25KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  50.00Hz
  1920x1080 (0x16d) 148.352MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.43KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  59.94Hz
  1920x1080i (0x10c) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  60.00Hz
  1920x1080i (0x10d) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  28.12KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  50.00Hz
  1920x1080 (0x1c5) 74.250MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  30.00Hz
  1920x1080 (0x1c6) 74.250MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  28.12KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  25.00Hz
  1920x1080 (0x1c7) 74.250MHz +HSync +VSync
        h: width  1920 start 2558 end 2602 total 2750 skew    0 clock  27.00KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  24.00Hz
  1920x1080i (0x10e) 74.176MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.72KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  59.94Hz
  1920x1080 (0x1c8) 74.176MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.72KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  29.97Hz
  1920x1080 (0x1c9) 74.176MHz +HSync +VSync
        h: width  1920 start 2558 end 2602 total 2750 skew    0 clock  26.97KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  23.98Hz
  1280x1024 (0x1b5) 108.000MHz +HSync +VSync
        h: width  1280 start 1328 end 1440 total 1688 skew    0 clock  63.98KHz
        v: height 1024 start 1025 end 1028 total 1066           clock  60.02Hz
  1360x768 (0x4b) 85.500MHz +HSync +VSync
        h: width  1360 start 1424 end 1536 total 1792 skew    0 clock  47.71KHz
        v: height  768 start  771 end  777 total  795           clock  60.02Hz
  1152x864 (0x1ca) 81.579MHz -HSync +VSync
        h: width  1152 start 1216 end 1336 total 1520 skew    0 clock  53.67KHz
        v: height  864 start  865 end  868 total  895           clock  59.97Hz
  1280x720 (0x110) 74.250MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1280x720 (0x111) 74.250MHz +HSync +VSync
        h: width  1280 start 1720 end 1760 total 1980 skew    0 clock  37.50KHz
        v: height  720 start  725 end  730 total  750           clock  50.00Hz
  1280x720 (0x112) 74.176MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  44.96KHz
        v: height  720 start  725 end  730 total  750           clock  59.94Hz
  1024x768 (0x113) 65.000MHz -HSync -VSync
        h: width  1024 start 1048 end 1184 total 1344 skew    0 clock  48.36KHz
        v: height  768 start  771 end  777 total  806           clock  60.00Hz
  800x600 (0x115) 40.000MHz +HSync +VSync
        h: width   800 start  840 end  968 total 1056 skew    0 clock  37.88KHz
        v: height  600 start  601 end  605 total  628           clock  60.32Hz
  720x576 (0x116) 27.000MHz -HSync -VSync
        h: width   720 start  732 end  796 total  864 skew    0 clock  31.25KHz
        v: height  576 start  581 end  586 total  625           clock  50.00Hz
  720x576i (0x117) 13.500MHz -HSync -VSync Interlace
        h: width   720 start  732 end  795 total  864 skew    0 clock  15.62KHz
        v: height  576 start  580 end  586 total  625           clock  50.00Hz
  720x480 (0x118) 27.027MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.50KHz
        v: height  480 start  489 end  495 total  525           clock  60.00Hz
  720x480 (0x119) 27.000MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.47KHz
        v: height  480 start  489 end  495 total  525           clock  59.94Hz
  640x480 (0x11c) 25.200MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.50KHz
        v: height  480 start  490 end  492 total  525           clock  60.00Hz
  640x480 (0x11d) 25.175MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.47KHz
        v: height  480 start  490 end  492 total  525           clock  59.94Hz
  720x400 (0x1cb) 28.320MHz -HSync +VSync
        h: width   720 start  738 end  846 total  900 skew    0 clock  31.47KHz
        v: height  400 start  412 end  414 total  449           clock  70.08Hz

So it even supports fallback mode with a 25.175 MHz clock if one really insists.

Linux Wine jots

General

These are just a few jots on Wine. I guess this post will evolve over time.

I’m running Wine version 4.0 on Linux Mint 19, running on an x86_64.

First run

Every time Wine is run on a blank (or absent) directory given by WINEPREFIX, it installs a Windows environment. Which Windows version an several other attributes can be set with Wine Configuration:

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/winecfg

It often suggests to install Wine Mono and Wine Gecko. I usually tend to agree.

This installation downloads three files into .cache/wine/: wine_gecko-2.47-x86_64.msi, wine_gecko-2.47-x86.msi and wine-mono-4.7.5.msi. This is why Wine doesn’t ask for permission to install these when setting up new Windows environments after the first time.

Install and use Winetricks

It’s a good idea in general, and it allows installation of Microsoft runtime environment easily:

# apt install winetricks
# apt install wine32-development

And now to install Virtual Studio 6 runtime environment, for example (solving some error message on not being able to import isskin.dll or isskinu.dll)

$ WINEPREFIX=/path/to/winedir winetricks vcrun6

Prevent browser popup

Wine has this thing that it opens a browser when so requested by the Windows application. That can be annoying at times, and get the program stuck when run inside a firejail. To prevent this altogether, just delete two files:

  • drive_c/windows/syswow64/winebrowser.exe
  • drive_c/windows/system32/winebrowser.exe

Open explorer

The simplest way to start: Open the file explorer:

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine explorer

DOS command line

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine cmd

This is better than expected: The command session is done directly in the console (no new window opened). Like invoking a shell.

Use with firejail

Windows equals viruses, and Wine doesn’t offer any protection against that. Since the entire filesystem is accessible from Z: (more on that below), it’s a good idea to run Wine from within a firejail mini-container. I have a separate post on firejail.

The execution of the program then looks something like (non-root user):

$ firejail --profile=~/my.profile --env=WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine 'C:\Program Files\Malsoft\Malsoft.exe' &

The my.profile file depends on what the Windows program is expected to do. I discuss that briefly in that post, however this is something that worked for me:

include /etc/firejail/disable-common.inc
include /etc/firejail/disable-passwdmgr.inc
private-tmp
private-dev

# All relevant directories are read-only by default, not /opt. So add it.
read-only /opt
#
# This whitelisting protects the entire home directory.
# .cache/wine is where the Gecko + Mono installation files are kept.
# They can't be downloaded, because of "net none" below
mkdir ~/sandboxed/
mkdir ~/.cache/wine
whitelist ~/sandboxed/
whitelist ~/.cache/wine

net none
nonewprivs
caps.drop all
noroot
# blacklist everything that can be harmed
#
blacklist /mnt
blacklist /cdrom
blacklist /media
blacklist /boot

Notes:

  • Note the “net none” part. Networking completely disabled. No access to the internet nor the local network.
  • Be sure to blacklist any system-specific mount, in particular those that are writable by the regular user. Do you have a /hugestorage mount? That one.
  • There’s a seccomp filter option that often appears in template profiles. It got a program in Wine completely stuck. It prevents certain system calls, so no doubt it adds safety, but it came in the way of something in my case.

Poor man’s sandboxing

If you’re too lazy to use firejail, you can remove some access to the local storage by virtue of Wine’s file system bindings. This is worth almost nothing, but almost nothing is more than nothing.

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/winecfg

In the “Drives” tab, remove Z:, and in the Desktop Integration tab, go through each of the folders and uncheck “Link to”.

This doesn’t prevent a Wine-aware Windows program to accessing the machine with plain Linux API with your user permissions just like any Linux program, and the root directory is still visible in Windows’ file browsing utilities. Yet, simple Windows programs expect any file system to be mapped to a drive letter, and these steps prevent that. Not much, but once again, better than nothing.

Firejail: Putting a program in its own little container

Introduction

Firejail is a lightweight security utility which ties the hands of running processes, somewhat like Apparmor and SELinux. However it takes the mission towards Linux kernel’s cgroups and namespaces. It’s in fact a bit of a container-style virtualization utility, which creates sandboxes for running specific programs: Instead of a container for an entire operating system, it makes one for each application (i.e. the main process and its children). Rather than disallowing access from files and directories by virtue of permissions, simply make sure they aren’t visible to the processes. Same goes for networking.

By virtue of Cgroups, several security restrictions are also put in place regardless if so desired. Certain syscalls can be prevented etc. But in the end of the day, think container virtualization. A sandbox is created, and everything happens inside it. It’s also easy to add processes to an existing sandbox (in particular, start a new shell). Not to mention the joy of shutting down a sandbox, that is, killing all processes inside it.

While the main use of Firejail to protect the file system from access and tampering by malicious or infected software, it also allows more or less everything that a container-style virtual machine does: Control of network traffic (volume, dedicated firewall, which physical interfaces are exposed) as well as activity (how many subprocesses, CPU and memory utilization etc.). And like a virtual machine, it also allows statistics on resource usage.

Plus spoofing the host name, restricting access to sound devices, X11 capabilities and a whole range of stuff.

And here’s the nice thing: It doesn’t require root privileges to run. Sort of. The firejail executable is run with setuid.

It’s however important to note that firejail doesn’t create a stand-alone container. Rather, it mixes and matches files from the real file system and overrides selected parts of the directory tree with temporary mounts. Or overlays. Or whiteouts.

In fact, compared with the accurate rules of a firewall, its behavior is quite loose and inaccurate. For a newbie, it’s a bit difficult to predict exactly what kind of sandbox it will set up given this or other setting. It throws in all kind of files of its own into the temporary directories it creates, which is very helpful to get things up and running quickly, but that doesn’t give a feeling of control.

Generally speaking, everything that isn’t explicitly handled by blacklisting or whitelisting (see below) is accessible in the sandbox just like outside it. In particular, it’s the user’s responsibility to hide away all those system-specific mounted filesystems (do you call them /mnt/storage?). If desired, of course.

Major disclaimer: This post is not authoritative in any way, and contains my jots as I get to know the beast. In particular, I may mislead you to think something is protected even though it’s not. You’re responsible to your own decisions.

The examples below are with firejail version 0.9.52 on a Linux Mint 19.

Install

# apt install firejail
# apt install firetools

By all means, go

$ man firejail

after installation. It’s also worth to look at /etc/firejail/ to get an idea on what protection measures are typically used.

Key commands

Launch FireTools, a GUI front end:

$ firetools &

And the “Tools” part has a nice listing of running sandboxes (right-click the ugly thing that comes up).

Now some command line examples. I name the sandboxes in these examples, but I’m not sure it’s worth bothering.

List existing sandboxes (or use FireTools, right-click the panel and choose Tools):

$ firejail --list

Assign a name to a sandbox when creating it

$ firejail --name=mysandbox firefox

Shut down a sandbox (kill all its processes, and clean up):

$ firejail --shutdown=mysandbox

If a name wasn’t assigned, the PID given in the list can be used instead.

Disallow the root user in the sandbox

$ firejail --noroot

Create overlay filesystem (mounts read/write, but changes are kept elsewhere)

$ firejail --overlay firefox

There’s also –overlay-tmpfs for overlaying tmpfs only, as well as –overlay-clean to clean the overlays, which are stored in $HOME/.firejail.

To create a completely new home directory (and /root) as temporary filesystems (private browsing style), so they are volatile:

$ firejail --private firefox

Better still,

$ firejail --private=/path/to/extra-homedir firefox

This uses the directory in the given path as a persistent home directory (some basic files are added automatically). This path can be anywhere in the filesystem, even in parts that are otherwise hidden (i.e. blacklisted) to the sandbox. So this is probably the most appealing choice in most scenarios.

Don’t get too excited, though: Other mounted filesystems remain unprotected (at different levels). This just protects the home directory.

By default, a whole bunch of security rules are loaded when firejail is invoked. To start the container without this:

$ firejail --noprofile

A profile can be selected with the –profile=filename flag.

Writing a profile

If you really want to have a sandbox that protects your computer with relation to a specific piece of software, you’ll probably have to write your own profile. It’s no big deal, except that it’s a bit of trial and error.

First read the manpage:

$ man firejail-profile

It’s easiest to start from a template: Launch FireTools from a shell, right-click the ugly thing that comes up, and pick “Configuration Wizard”, and create a custom security profile for one of the listed application — the one that resembles most the one for which the profile is set up.

Then launch the application from FireTools. The takeaway is that it writes out the configuration file to the console. Start with that.

Whilelisting and blacklisting

First and foremost: Always run a

$ df -h

inside the sandbox to get an idea of what is mounted. Blacklist anything that isn’t necessary. Doing so to entire mounts removes the related mount from the df -h list, which makes it easier to spot things that shouldn’t be there.

It’s also a good idea to start a sample bash session with the sandbox, and get into the File Manager in the Firetool’s “Tools” section for each sandbox.

But then, what is whitelisting and blacklisting, exactly? These two terms are used all over the docs, somehow assuming we know what they mean. So I’ll try to nail it down.

Whitelisting isn’t anywhere near what one would think it is: By whitelisting certain files and/or directories, the original files/directories appear in the sandbox but all other files in their vicinity are invisible. Also, changes in the same vicinity are temporary to the sandbox session. The idea seems to be that if files and/or directories are whitelisted, everything else close to it should be out of sight.

Or as put in the man page:

A temporary file system is mounted on the top directory, and the whitelisted files are mount-binded inside. Modifications to whitelisted files are persistent, everything else is discarded when the sandbox is closed. The top directory could be user home, /dev, /media, /mnt, /opt, /srv, /var, and /tmp.

So for example, if any file or directory in the home directory is whitelisted, the entire home directory becomes overridden by an almost empty home directory plus the specifically whitelisted items. For example, from my own home directory (which is populated with a lot of files):

$ firejail --noprofile --whitelist=/home/eli/this-directory
Parent pid 31560, child pid 31561
Child process initialized in 37.31 ms

$ find .
.
./.config
./.config/pulse
./.config/pulse/client.conf
./this-directory
./this-directory/this-file.txt
./.Xauthority
./.bashrc

So there’s just a few temporary files that firejail was kind enough to add for convenience. Changes made in this-directory/ are persistent since it’s bind-mounted into the temporary directory, but everything else is temporary.

Quite unfortunately, it’s not possible to whitelist a directory outside the specific list of hierarchies (unless bind mounting is used, but that requires root). So if the important stuff is one some /hugedisk, only a bind mount will help (or is this the punishment for not putting it has /mnt/hugedisk?).

But note that the –private= flag allows setting the home directory to anywhere on the filesystem (even inside a blacklisted region). This ad-hoc home directory is persistent, so it’s not like whitelisting, but even better is some scenarios.

Alternatively, it’s possible to blacklist everything but a certain part of a mount. That’s a bit tricky, because if a new directory appears after the rules are set, it remains unprotected. I’ll explain why below.

Or if that makes sense, make the entire directory tree read-only, with only a selected part read-write. That’s fine if there’s no issue with data leaking, just the possibility of malware sabotage.

So now to blacklisting: Firejail implements blacklisting by mounting an empty, read-only-by-root file or directory on top of the original file. And indeed,

$ firejail --blacklist=delme.txt
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 30288, child pid 30289
Child process initialized in 57.75 ms
$ ls -l
[ ... ]
-r--------  1 nobody nogroup     0 Jun  9 22:12 delme.txt
[ ... ]
$ less delme.txt
delme.txt: Permission denied

There are –noblacklist and –nowhitelist flags as well. However these merely cancel future or automatic black- or whitelistings. In particular, one can’t blacklist a directory and whitelist a subdirectory. It would have been very convenient, but since the parent directory is overridden with a whiteout directory, there is no access to the subdirectory. So each and every subdirectory must be blacklisted separately with a script or something, and even then if a new subdirectory pops up, it’s not protected at all.

There’s also a –read-only flag allows setting certain paths and files as read-only. There’s –read-write too, of course. When a directory or file is whitelisted, it must be flagged read-only separately if so desired (see man firejail).

Mini-strace

Trace all processes in the sandbox (in particular accesses to files and network). Much easier than using strace, when all we want is “which files are accessed?”

$ firejail --trace

And then just run any program to see what files and network sockets it accesses. And things of that sort.

Microsoft’s outlook.com servers and the art of delivering mails to them

Introduction

Still in 2020, it seems like Microsoft lives up to its reputation: Being arrogant, thinking that anyone in business must be a huge corporate, and in particular ending up completely ridiculous. Microsoft’s mail servers, which accept on behalf of Hotmail, MSN, Office 365, Outlook.com, or Live.com users are no exception. This also affects companies and other entities which use their own domain names, but use Microsoft’s services for handling mail.

This post summarizes my personal experience and accumulated knowledge with delivering mail to their servers. I use a simple Linux sendmail SMTP MTA on a virtual server for handling the delivery of my own private mails as well as a very low traffic of transactional mails from a web server. All in all, it’s about 100 mails / month coming out from that server to all destinations.

So one server, one IP address with a perfect reputation on all open spam reputation trackers, with SPF, DKIM and DMARC records all in place properly.

One may ask why I’m not relying on existing mail delivery services or my ISP. Answer is simple: Any commercial mail delivery server is likely to have its reputation contaminated by some spammer, no matter what protection measures they take. When that happens, odds are that emails will just disappear, because the ISP has little interest in forwarding the bounce message saying that delivery failed. On a good day, they will be handling the problem quickly, and yet the sender of the lost mail won’t be aware that the correspondence is broken.

For this reason, it’s quite likely that small businesses will go on keeping their own, small, email delivery servers, maintaining their own reputation. So when Outlook’s servers are nasty with a single-IP server, they’re not just arrogant, but they are causing delivery issues with small to medium businesses.

To do when setting up the server

For starter info, go here. Microsoft is pretty upfront about not being friendly to new IP addresses (see troubleshooting page for postmasters).

So it’s a very good idea to create a Microsoft account to log into their services, and then join their Smart Network Data Service (SDNS) and Junk Mail Reporting Program. This is the start page for both of these services.

SDNS allows the owner of a mail server to register its IP address range (“Request Access“), so its status can be monitored (“View IP Status”) over time. When all is fine, the IP Status page says “All of the specified IPs have normal status”, and when they don’t like this or other IP address, it’s more like this (click to enlarge):

Microsoft SDNS blocked IP

The Junk Mail Reporting Program (JMRP) allows the owner of the mail server to receive notifications (by email) when a mail message is delivered however deemed suspicious, either by an end-user (marking it as spam) or by automatic means. So it’s a good idea to create a special email address for this purpose and fill in the JMRP form. Even for the sake of claiming that you got no complaints when contacting support later on.

Note that this is important for delivery of mail to any institution relies on Microsoft’s mail infrastructure. A proper IP address blacklist delisting takes you from

Mar 11 20:18:23 sm-mta[5817]: x2BKIL2H005815: to=<xxxxxxx@mit.edu>, delay=00:00:02, xdelay=00:00:02, mailer=esmtp, pri=121914, relay=mit-edu.mail.protection.outlook.com. [104.47.42.36], dsn=5.7.606, stat=User unknown

(but the bounce message indicated that it’s not an unknown user, but a blacklisted IP number) to

Mar 11 21:15:12 sm-mta[6170]: x2BLF8rT006168: to=<xxxxxxx@mit.edu>, delay=00:00:03, xdelay=00:00:03, mailer=esmtp, pri=121915, relay=mit-edu.mail.protection.outlook.com. [104.47.42.36], dsn=2.0.0, stat=Sent (<5C86CFDC.6000206@example.com> [InternalId=11420318042095, Hostname=DM5PR01MB2345.prod.exchangelabs.com] 11012 bytes in 0.191, 56.057 KB/sec Queued mail for delivery)

Note that the session response said nothing about a blacklisted IP, however the bounce message (not shown here) did.

Finally, Microsoft suggest getting a certification from Return Path. A paid-for service, clearly intended for large companies and in particular mass mailers to get their spam delivered. Microsoftish irony at its best.

To do when things go wrong

First thing first, read the bounce message. If it says that it’s on Microsoft’s IP blacklist, go to the Office 365 Anti-Spam IP Delist Portal and delist it.

Then check the IP’s status (requires logging in). If you’re blocked, contact support. This doesn’t require a Microsoft login account, by the way. I’m not sure if this link to the support page is valid in the long run, so it’s on SNDS’ main page (“contact sender support”) as well as Troubleshooting page.

My own ridiculous experience

I kicked off my mail server a bit more than a year ago. There was some trouble in the beginning, but that was no surprise. Then things got settled and working for a year, and only then, suddenly & out of the blue, a mail to a Hotmail address bounced with:

Action: failed
Status: 5.7.1
Diagnostic-Code: SMTP; 550 5.7.1 Unfortunately, messages from [193.29.56.92] weren't sent. Please contact your Internet service provider since part of their network is on our block list (S3140). You can also refer your provider to http://mail.live.com/mail/troubleshooting.aspx#errors. [VE1EUR01FT021.eop-EUR01.prod.protection.outlook.com]

And indeed, checking the IP status indicated that is was blocked “because of user complaints or other evidence of spamming”.

So first I went to the mail logs. Low traffic. No indication that the server has been tricked into sending a lot of mails. No indication that it has been compromised in any way. And when a server has been compromised, you know it.

No chance that there were user complaints, because I got nothing from JMRP. So what the “evidence of spamming”?

My best guess: A handful transactional mail messages (at most) to their servers for authenticating email addresses that were marker suspicious by their super software. Putting these messages in quarantine for a few hours is the common solution when that happens. Spam is about volume. If all you got was 4-5 messages, how could that be a spam server? Only if you look at percentage. 100% suspicious. Silly or what?

So I filled in the contact support form, and soon enough I got a message saying a ticket has been opened, and 30 minutes later saying

We have completed reviewing the IP(s) you submitted. The following table contains the results of our investigation.

Not qualified for mitigation
193.29.56.92
Our investigation has determined that the above IP(s) do not qualify for mitigation. These IP(s) have previously received mitigations from deliverability support, and have failed to maintain patterns within our guidelines, so they are ineligible for additional mitigation at this time.

Cute, heh? And that is followed by a lot of general advice, basically copied from the website, recommending to join JMRP and SDNS. Which I had a year earlier, of course. The script that responded didn’t even bother to check that.

But it also said:

To have Deliverability Support investigate further, please reply to this email with a detailed description of the problem you are having, including specific error messages, and an agent will contact you.

And so I did. I wrote that I had joined those two programs a year ago, that the mail volume is low and so on. I doubt it really made a difference. After sending the reply, I got a somewhat automated response rather quickly, now with a more human touch:

Hello,

My name is Ayesha and I work with the Outlook.com Deliverability Support Team.

IP: 193.29.56.92

We will be looking into this issue along with the Escalations Team. We understand the urgency of this issue and will provide an update as soon as this is available. Rest assured that this ticket is being tracked and we will get back to you as soon as we have more information to offer.

Thank you for your patience.

Sincerely,
Ayesha

Outlook.com Deliverability Support

And then, a few days later, another mail:

Hello,

My name is Yaqub and I work with the Outlook.com Deliverability Support Team.

Recent activity coming from your IP(s): ( 193.29.56.92) has been flagged by our system as suspicious, causing your IP to become blocked. I have conducted an investigation into the emails originating from your IP space and have implemented mitigation for your deliverability problem. This process may take 24 – 48 hours to replicate completely throughout our system.

Please note that lifting the block does not guarantee that your email will be delivered to a user’s inbox. However, here are some things that can help you with delivery:

(and here came the same suggestions on JMRP and SDNS)

And about 24 hours later, the IP status went back to OK again. And my emails went through normally.

Well, almost. A few days even further down, I attempted to send an email to a live.co.uk destination, and once again, I got the same rejection message (in block list, S3140). The only difference was that the mail server on the other side was hotmail-com.olc.protection.outlook.com (residing in the US), and now eur.olc.protection.outlook.com (somewhere in Europe).

I checked the IP’s status in SDNS and it was fine. So updating the Europeans on the updated IP status takes a bit time, or what?

So I replied to last email I got from Microsoft’s support, saying it failed with live.co.uk. I didn’t get any reply, but a few hours later I tried again, and the mail went through. Coincidence or not.

This time I also caught the related messaged from the mail log. It’s

May 01 15:10:28 sm-mta[2239]: 041FASMh002237: to=<xxxxx@live.co.uk>, ctladdr=<eli@billauer.co.il> (510/500), delay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=121816, relay=eur.olc.protection.outlook.com. [104.47.1.33], dsn=5.0.0, stat=Service unavailable
May 01 15:10:28 sm-mta[2239]: 041FASMh002237: 041FASMh002239: DSN: Service unavailable

for a failure, and

May 02 06:23:00 sm-mta[4024]: 0426Mx1I004021: to=<xxxxx@live.co.uk>, ctladdr=<eli@billauer.co.il> (510/500), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=121808, relay=eur.olc.protection.outlook.com. [104.47.18.97], dsn=2.0.0, stat=Sent (<5EAD11C3.20105@billauer.co.il> [InternalId=21887153366859, Hostname=AM6EUR05HT060.eop-eur05.prod.protection.outlook.com] 10627 bytes in 0.246, 42.064 KB/sec Queued mail for delivery -> 250 2.1.5)

for success.

Lesson learned: Contact support and insist.

And the lesson to all those using Microsoft’s mail services: Your provider cuts off your email contacts arbitrarily. Because they are Microsoft.

Ftrace: The Linux kernel hacker’s swiss knife

Introduction

I ran into ftrace completely by chance, while trying to figure out why the call to usb_submit_urb() took so long time. In fact, it wasn’t. It was pr_info() that delayed the output. And it was ftrace that got me to realize that.

Whether you’re into dissecting existing kernel code, and want to know which function calls which, or if you need a close look on what your own code actually does (and when), ftrace is your companion. And it does a lot more.

And that’s maybe its problem: It offers so many different possibilities, that its documentation gets not so inviting to read. Add some confusing terminology and focus on advanced issues, and one gets the impression that starting to use it is a project in itself.

This is definitely not the case. This post consists of some simple & useful tasks. It’s not much about accuracy, doing it the right way nor showing the whole picture. It’s about about getting stuff done. If you’ll need to nail down something more specific, read the docs. It looks like they got virtually any useful scenario covered.

I’ll divert from keeping things simple in part on events at the end of this post. The concept of events is fairly simple, but the implementation, well, well. But the point is partly to demonstrate exactly that.

Does your kernel support ftrace?

Ftrace is often enabled in compiled kernels, but not always. Look for /sys/kernel/debug/tracing/ (as root) and in particular go

# cat available_tracers

If function_graph and function aren’t listed as available_tracers, the kernel needs to be recompiled with the correct options. Namely, CONFIG_FUNCTION_TRACER, CONFIG_FUNCTION_GRAPH_TRACER, CONFIG_STACK_TRACER and CONFIG_DYNAMIC_FTRACE.

References

These are some good resources. I would usually put them at the end of the post, but noone is expected to get there.

  • A gentle introduction by Alex Dzyoba.
  • /sys/kernel/debug/tracing/README — it’s like the -h flag on user space utilities
  • Detailed documentation from the kernel tree: Documentation/trace/ftrace.rst (and other files in the same directory)
  • A tutorial in LWN, part 1 and part 2.
  • Some good slides on the topic.

Getting around a bit

They’ll tell you that there’s a special tracefs to mount, but it’s easily available under /sys/kernel:

# cd /sys/kernel/debug/tracing/

For tracing invocations of functions, there are two formats: The simple “function” and it’s more sophisticated “function_graph”. It’s easiest to just try them out: Select the function graph tracer

# echo function_graph > current_tracer

Watch the tracing output

# less trace

There’s also trace_pipe with a “tail -f” flavor.

The output isn’t very helpful for now, as it shows every function invoked on all processors. Too much information. We’ll get to filtering just below.

Turn tracing off

# echo 0 > tracing_on

or on

# echo 1 > tracing_on

and clear (empty) the trace:

# echo > trace

Note that turning tracing off turns off everything, including trace_printk() discussed below. It’s a complete halt. To just stop one of function tracers, better go

# echo nop > current_tracer

The trace data is stored in circular buffers, so old entries are overwritten by newer ones if these buffers get full. The problem is that there’s a separate buffer for each CPU, so once this overwriting begins, the overall trace may miss out traces from only some CPUs on those time segments. Therefore I prefer turning off overwriting completely. At least the beginning of the trace reflects what actually happened (and the mess is left to the end):

# echo 0 > options/overwrite

Now to the “function” variant, just which function was invoked by which along with a timestamp. There is however no information on when the function returned (how much time it took), but the absolute timestamp can be matched with dmesg stamp. But this requires selecting ftrace’s clock as global:

# echo function > current_tracer
# echo global > trace_clock

It’s very important to note that printk (and its derivatives) can take up to a few milliseconds, so an apparent mismatch in the timestamps between a printk and a function called immediately after it may be a result of that.

Using trace_printk()

It’s quite common to use printk and its derivatives for debugging kernel code, which is fairly OK if there’s no problem if these slow down the execution — I’ve seen printk taking a few milliseconds (in process context, I should say).

So for quick and time-accurate printk-like debugging, definitely go for trace_printk(). Plus you get some extra info, such as the process / interrupt context and which CPU it ran on. It’s just a matter of adding something like

trace_printk("Submitting buffer ID=%d, len=%d\n", id, count);

in your kernel code, instead of pr_info(), dev_info() or whatever. No extra header file or anything of that sort needed, as far as I can tell.

I should mention that trace_printk() is intended for temporary debug messages, and these can’t be left in production code. If you want debug messages that stay there, go for events. Trickier, but the right way to go.

trace_printk() messages are logged by default, even when current_tracer is set to “nop”. However tracing must be on. In short, a newly booted system will show trace_printk()’s output in “trace”.

These messages will be interleaved with the other traces if current_tracer is in non-nop mode. This can be useful when the situation of interest occurs rarely — for example a code segments takes too long to run. Since the outcome is known only at the end, the simple solution is to run function tracing, and call trace_printk() when the time difference exceeds a threshold. This message will of course appear after the function sequence in the trace, but it’s easily looked up with plain text tools (e.g. “less”).

The system boots up with tracing on and in “nop” mode, but if you’ve been fiddling with these, bring them back with

# echo nop > current_tracer
# echo 1 > tracing_on

And then view the messages just like before (“trace”, “trace_pipe” or the CPU-individual counterparts).

The reason you don’t want to leave trace_printk() in production code is that the first call will leave a message of this form in the kernel log:

[ 1954.872204] **********************************************************
[ 1954.875230] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[ 1954.878242] **                                                      **
[ 1954.881257] ** trace_printk() being used. Allocating extra memory.  **
[ 1954.884281] **                                                      **
[ 1954.887345] ** This means that this is a DEBUG kernel and it is     **
[ 1954.890397] ** unsafe for production use.                           **
[ 1954.893470] **                                                      **
[ 1954.896496] ** If you see this message and you are not debugging    **
[ 1954.899527] ** the kernel, report this immediately to your vendor!  **
[ 1954.902539] **                                                      **
[ 1954.905598] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[ 1954.908713] **********************************************************

This is of course nothing to be alarmed about (if you’re the one who made the trace_printk() calls, that is). Seems like the purpose of this message is to gently convince programmers to use events in production code instead.

Tracing a specific function call

Ftrace comes with great filtering capabilities, which can be really complicated. So let’s take a simple usage case. Say that I want to see how much time elapses from the call to usb_submit_urb() in my driver and the callback function. Never mind the details. First, I might want to verify that the function is really traceable — it might not be if it has been optimized away by the compiler or if it’s a #define macro (which definitely isn’t the case for usb_submit_urb(), since it’s an exported function, but anyhow).

So first, look it up in the list of available functions (cwd is /sys/kernel/debug/tracing):

# grep submit_urb available_filter_functions
usb_hcd_submit_urb
usb_submit_urb

Yep, it’s there. So add it to the functions to filter, along with my driver’s callback function, which is bulk_out_completer():

# echo usb_submit_urb bulk_out_completer > set_ftrace_filter
# echo function > current_tracer
# echo > trace

Note that multiple functions can be given to set_ftrace_filter, delimited by plain whitespace. Wildcards can also be used, as shown below.

Now perform the operation (use dd command in my case) that makes the driver active, and harvest the output:

# head -30 trace
# tracer: function
#
# entries-in-buffer/entries-written: 64/64   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
              dd-847   [001] ....  1787.067325: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067335: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067342: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067348: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067353: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067358: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067363: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067369: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.068019: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.068080: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.068528: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.068557: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.069039: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.069062: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.069533: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.069556: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.070012: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.070036: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.070506: bulk_out_completer <-__usb_hcd_giveback_urb

Not sure how useful the data above is, but it demonstrates how a sequence of events can be analyzed easily.

Just to close this issue, let’s take a look on wildcards. This is easiest shown with an example:

# echo '*_submit_urb' > set_ftrace_filter
# cat set_ftrace_filter
usb_hcd_submit_urb
usb_submit_urb

So quite evidently, the wildcard was applied when the filter was set, and its usage is quite trivial.

Don’t forget to remove the filter when it’s not necessary anymore. These filters won’t go away otherwise, and tracing may appear to suddenly not work if you go on doing something else. So simply:

# echo > set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####

It’s not a coincidence that I didn’t use function_graph above — the functions that are called by the selected function(s) won’t show anyhow. It does show the duration of the call, which may be helpful. This brings us to

Which functions does function X call?

This is a great way to get an idea of who-calls-what. Let’s stick with usb_submit_urb(). What other functions does it call, specifically in my case?

Simple. Be sure to have removed any previous filters (see above) and then just go:

# echo usb_submit_urb > set_graph_function
# echo function_graph > current_tracer
# echo > trace

Then do whatever causes the function to be called, after which “trace” reads something like:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  usb_submit_urb() {
 0)   0.979 us    |    usb_urb_ep_type_check();
 0)               |    usb_hcd_submit_urb() {
 0)   0.427 us    |      usb_get_urb();
 0)               |      xhci_map_urb_for_dma [xhci_hcd]() {
 0)               |        usb_hcd_map_urb_for_dma() {
 0)   0.658 us    |          dma_direct_map_page();
 0)   1.739 us    |        }
 0)   2.642 us    |      }

[ ... ]

and it goes on.

This is a good place to remind that if all you wanted was to get the stack trace of calls at a certain point in your code, WARN() and WARN_ONCE() may be handier.

Tracing a segment: quick and dirty

This method doesn’t really trace a specific segment. It’s a bit of a dirty trick: Tracing is disabled at first, and then enabled before the function call, and disabled after it. All function calls that take place on all CPUs during that time are recorded. It may mean additional noise, but this noise may also be the explanation to why something went wrong (if it went wrong).

There is probably a more elegant solution. In particular, ftrace triggers have “traceon” and “traceoff” actions, so this is probably the classier way to go. But let’s do something quick, dirty and simple to understand.

So say that I want to understand what usb_submit_urb() is up to and what other functions it calls along. I could go, in my own driver:

tracing_on();
rc = usb_submit_urb(urb, GFP_KERNEL);
tracing_off();

With this driver loaded into the kernel, then go something like:

# cd /sys/kernel/debug/tracing/
# echo global > trace_clock
# echo 32768 > buffer_size_kb
# echo 0 > tracing_on
# echo function > current_tracer

This assumes that current_tracer was “nop” before this (it’s by default), so the trace buffer begins empty. And when it’s changed to “function” at the last command, nothing happens, because tracing was just turned off. The buffer size is set to 32 MB per CPU (which is quite a lot).

Then run whatever makes the relevant code execute, and then

# less trace

The problem is that trace output from several CPUs is interleaved in this output. But we know which CPU the relevant command ran on from the trace itself, so obtain a filtered version (e.g. for CPU #2):

# less per_cpu/cpu2/trace

Events

The term “events” is somewhat misleading. An event is just a piece of printk-like debug message that can be enabled or disabled in runtime. It’s a convenient way to leave the debugging messages in place even in production code, and make it possible even for end-users to harvest the information without needing to compile anything.

I’ll try to explain this by a (ehm-ehm) simple example. Let’s enable the kmalloc event:

# echo nop > current_tracer
# echo kmalloc > set_event

This turns off the function tracer, and enables the kmalloc event. Looking at “trace” we now have something like:

# tracer: nop
#
# entries-in-buffer/entries-written: 1047/1047   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            sshd-781   [003] ....  5317.446761: kmalloc: call_site=ffffffff8153b210 ptr=000000001f5ab582 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.446788: kmalloc: call_site=ffffffff8153cbf1 ptr=000000002525fcc0 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ....  5317.748476: kmalloc: call_site=ffffffff8153b210 ptr=0000000095781cbc bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.748501: kmalloc: call_site=ffffffff8153cbf1 ptr=00000000c9801d3d bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ....  5317.900662: kmalloc: call_site=ffffffff8153b210 ptr=000000008e7d4585 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.900687: kmalloc: call_site=ffffffff8153cbf1 ptr=0000000004406a83 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420356: kmalloc: call_site=ffffffff8119f2c0 ptr=00000000a6835237 bytes_req=184 bytes_alloc=192 gfp_flags=GFP_KERNEL_ACCOUNT|__GFP_ZERO
            bash-792   [000] ....  5318.420365: kmalloc: call_site=ffffffff8119f34d ptr=000000008e7d4585 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL_ACCOUNT|__GFP_ZERO
            bash-792   [000] ....  5318.420404: kmalloc: call_site=ffffffff8116feff ptr=00000000c30e90f8 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|__GFP_ZERO
            sshd-781   [003] ....  5318.420408: kmalloc: call_site=ffffffff8153b210 ptr=000000000b7b85e5 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420415: kmalloc: call_site=ffffffff811701e4 ptr=00000000f54127a0 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL|__GFP_ZERO
            bash-792   [000] ....  5318.420431: kmalloc: call_site=ffffffff812eb5e2 ptr=0000000084bbe3b4 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|__GFP_ZERO
            sshd-781   [003] ...1  5318.420435: kmalloc: call_site=ffffffff8153cbf1 ptr=00000000a6a3ac50 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420473: kmalloc: call_site=ffffffff811b1aac ptr=0000000095972087 bytes_req=56 bytes_alloc=64 gfp_flags=GFP_KERNEL_ACCOUNT

These are all kmalloc calls that were executed since the event was enabled. There are of course ways to filter the events to specific occasions, but this is really not something I’ve gotten into (yet?).

It’s however interesting to see how these messages came about. To do this, try the following (after enabling the kmalloc event as shown above):

# echo '*kmalloc*' > set_graph_function
# echo function_graph > current_tracer
# echo > trace

And then the trace output can be something like:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  finish_task_switch() {
 0)               |    _raw_spin_unlock_irq() {
 0)   0.794 us    |      do_raw_spin_unlock();
 0)   0.777 us    |      preempt_count_sub();
 0)   4.402 us    |    }
 0)   8.919 us    |  }
 0)               |  __kmalloc_reserve.isra.9() {
 0)               |    __kmalloc_track_caller() {
 0)   0.784 us    |      kmalloc_slab();
 0)   0.654 us    |      should_failslab();
 0)   0.711 us    |      check_irq_off();
 0)               |      cache_alloc_debugcheck_after() {
 0)   4.691 us    |        check_poison_obj();
 0)   0.685 us    |        poison_obj();
 0)   7.353 us    |      }
 0)               |      /* kmalloc: call_site=ffffffff81801a80 ptr=000000000493ffc2 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC */
 0) + 15.571 us   |    }
 0) + 17.452 us   |  }

And it goes on like this several times. So now we have the event output in the middle of the function graph, and we can also see what happened: __kmalloc_reserve(), which is defined in net/core/skbuff.c, calls kmalloc_node_track_caller(), which is translated into __kmalloc_track_caller() by virtue of a #define in slab.h. That function is defined in mm/slab.c, but it just redirects the call to __do_kmalloc_node(), which makes the calls visible in the trace, and eventually calls kmem_cache_alloc_node_trace(). This call isn’t registered, most likely because it was optimized away by the compiler. And it reads:

#ifdef CONFIG_TRACING
void *
kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
{
	void *ret;

	ret = slab_alloc(cachep, flags, _RET_IP_);

	ret = kasan_kmalloc(cachep, ret, size, flags);
	trace_kmalloc(_RET_IP_, ret,
		      size, cachep->size, flags);
	return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_trace);
#endif

So there we have it. But wait! Where is trace_malloc defined? The answer lies at the top of slab.c:

#include <trace/events/kmem.h>

which includes include/trace/events/kmem.h, the beginning of which reads

 /* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM kmem

#if !defined(_TRACE_KMEM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_KMEM_H

#include <linux/types.h>
#include <linux/tracepoint.h>
#include <trace/events/mmflags.h>

DECLARE_EVENT_CLASS(kmem_alloc,

	TP_PROTO(unsigned long call_site,
		 const void *ptr,
		 size_t bytes_req,
		 size_t bytes_alloc,
		 gfp_t gfp_flags),

	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags),

	TP_STRUCT__entry(
		__field(	unsigned long,	call_site	)
		__field(	const void *,	ptr		)
		__field(	size_t,		bytes_req	)
		__field(	size_t,		bytes_alloc	)
		__field(	gfp_t,		gfp_flags	)
	),

	TP_fast_assign(
		__entry->call_site	= call_site;
		__entry->ptr		= ptr;
		__entry->bytes_req	= bytes_req;
		__entry->bytes_alloc	= bytes_alloc;
		__entry->gfp_flags	= gfp_flags;
	),

	TP_printk("call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s",
		__entry->call_site,
		__entry->ptr,
		__entry->bytes_req,
		__entry->bytes_alloc,
		show_gfp_flags(__entry->gfp_flags))
);

DEFINE_EVENT(kmem_alloc, kmalloc,

	TP_PROTO(unsigned long call_site, const void *ptr,
		 size_t bytes_req, size_t bytes_alloc, gfp_t gfp_flags),

	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags)
);

Tired already? We’re almost there. I won’t get into the details of this (mostly because I don’t know), but note two crucial points: One is the TP_printk(), which matches the output format of the event’s output. The second part is the DEFINE_EVENT, which defines a function linked with the kmem_alloc class, named trace_kmalloc(). The define-macro magic takes place in include/linux/tracepoint.h, and essentially translates DEFINE_EVENT into a DECLARE_TRACE, which is then turned into a __DECLARE_TRACE, which begins with:

#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
	extern struct tracepoint __tracepoint_##name;			\
	static inline void trace_##name(proto)				\
	{								\
		if (static_key_false(&__tracepoint_##name.key))		\
			__DO_TRACE(&__tracepoint_##name,		\
				TP_PROTO(data_proto),			\
				TP_ARGS(data_args),			\
				TP_CONDITION(cond), 0);			\
		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
			rcu_read_lock_sched_notrace();			\
			rcu_dereference_sched(__tracepoint_##name.funcs);\
			rcu_read_unlock_sched_notrace();		\
		}							\
	}								\
[ ... ]

And then it goes on with several related functions. The point with this long story, except for the obvious masochism, was to show that events really are just some kind of printk, only not all that easy to use. It also explains why that scary message is needed to get people off trace_printk().

But since there’s some code to copy from, it shouldn’t be all that bad to set up new events.

Which process opens file X?

In case you want to to know what process opens which file, system-wide, this is quick recipe I’ve copied from these slides:

# echo 'p:open do_sys_open file=+0(%si):string' > kprobe_events
# echo 1 > events/kprobes/open/enable
# cat trace_pipe

Note that “cat” can be replaced with “grep” if you’re looking which process opens a specific file (or is there some kernel filter for this? Not sure it’s worth bother checking).

I haven’t taken the time to figure out exactly how and why it works (and I guess it’s possible to use filters here as well).

Linux driver: Creating device files for a hotpluggable device

Introduction

Most device drivers hook onto an already existing subsystem: Audio, networking, I2C, whatever. If it’s a specialized piece of hardware, a single device file is typically enough, and you get away with a misc device or, as shown in the kernel’s USB example (drivers/usb/usb-skeleton.c), with the USB framework’s usb_register_dev() call.

However in some cases, a number of dedicated device files are required, belonging to a dedicated class. Only in these cases does it make sense to generate the device files explicitly. A driver that does this for no good reason will be hard to get into the Linux kernel tree.

Hotpluggable devices are tricky in particular because of the two following reasons:

  • They may vanish at any time, so the driver must be sure to handle that properly. Namely, not to free any resources as long as they may be accessed by some other thread.
  • It’s impossible initialize the device driver once and for all against a known set of devices when the driver loads.

This post focuses on USB devices, and in particular on deleting the device files and releasing the kernel that support them when the USB device is unplugged.

Spoiler: There’s no problem deleting the device files themselves, but handling the cdev struct requires some attention.

Everything in this post relates to Linux kernel v5.3.

A reminder on device files

First, we shall look at the usual way to do it, which is in fact unsuitable for a hotpluggable device. But we have to start with something.

There are three players in this game:

  • The cdev struct: Makes the connection between a set of major/minors and a pointer to a struct file_operations, containing the pointers to the functions (methods) that implement open, release, read, write etc. The fops, in short.
  • The device files: Those files in /dev that are accessible by user space programs.
  • The class: Something that must be assigned to device files that are created, and assists in identifying their nature (in particular for the purpose of udev).

The class is typically a global variable in the driver module, and is created in its init routine:

static struct class *example_class;

static int __init example_init(void)
{
  example_class = class_create(THIS_MODULE, examplename);
  if (IS_ERR(example_class))
    return PTR_ERR(example_class);

  return 0;
}

Because a class is typically global in the module, and hence not accessible elsewhere, it’s impossible to create device files on behalf of another class without using their API and restrictions (for example, misc devices). On the other hand, if you try to push a driver which creates a new class into the kernel tree, odds are that you’ll have to explain why you need to add yet another class. Don’t expect a lot of sympathy on this matter.

The next player is the cdev struct. Its role is to connect between a major + a range of minors and file operations. It’s typically part of a larger struct which is allocated for each physical device. So it usually goes something like

struct example_device {
  struct device *dev;

  struct cdev cdev;

  int major;
  int lowest_minor; /* Highest minor = lowest_minor + num_devs - 1 */

  int num_devs;

  struct kref kref;

  /* Just some device related stuff */
  struct list_head my_list;
  __iomem void *registers;
  int fatal_error;
  wait_queue_head_t my_wait;
}

The only part that is relevant for this post is the struct cdev and the others marked in bold, but I left a few others that often appear in a IOMM device.

Note that the example_device struct contains the cdev struct itself, and not a pointer to it. This is the usual way, but that isn’t the correct way for a USB device. More on that below.

As mentioned above, the purpose of the cdev struct is to bind a major/minor set to a struct of fops. Something like

static const struct file_operations example_fops = {
  .owner      = THIS_MODULE,
  .read       = example_read,
  .write      = example_write,
  .open       = example_open,
  .flush      = example_flush,
  .release    = example_release,
  .llseek     = example_llseek,
  .poll       = example_poll,
};

The cdev is typically initialized and brought to life with something like

  struct example_device *mydevice;
  dev_t dev;

  rc = alloc_chrdev_region(&dev, 0, /* minor start */
			   mydevice->num_devs,
			   examplename);
  if (rc) {
    dev_warn(mydevice->dev, "Failed to obtain major/minors");
    return rc;
  }

  mydevice->major = major = MAJOR(dev);
  mydevice->lowest_minor = minor = MINOR(dev);

  cdev_init(&mydevice->cdev, &example_fops);
  mydevice->cdev.owner = THIS_MODULE;

  rc = cdev_add(&mydevice->cdev, MKDEV(major, minor),
		mydevice->num_channels);
  if (rc) {
    dev_warn(mydevice->dev, "Failed to add cdev. Aborting.\n");
    goto bummer;
  }

So there are a number of steps here: First, a major and a range of minors is allocated with the call to alloc_chrdev_region(), and the result is stored in the dev_t struct.

Then the cdev is initialized and assigned a pointer to the fops struct (i.e. the struct that assigns open, read, write release). It’s the call to cdev_add() that makes the module “live” as a device file handler by binding the fops to the set of major/minor set that was just assigned.

If there happens to exist files with the relevant major and minor in the file system, they can be used immediately to execute the methods in example_fops. This is however not likely in this case, since they were allocated dynamically. The very common procedure is hence to create them in the driver (which also triggers udev events, if such are defined). So there can be several calls to something like:

    device = device_create(example_class,
			   NULL,
			   MKDEV(major, i),
			   NULL,
			   "%s", devname);

Note that this is the only use of example_class. The class has nothing to do with the cdev.

And of course, all this must be reverted when the USB device is unplugged. So we’re finally getting to business.

Is it fine to remove device files on hot-unplugging?

Yes, but remember that the file operation methods may very well be called on behalf of file descriptors that were already opened.

So it’s completely OK to call device_destroy() on a device that is still opened by a process. There is no problem creating device files with the same names again, even while the said process still has the old file opened. It’s exactly like any inode, which is visible only by the process(es) that has a file handle on them. A device file is just a file. Remove it, and it’s really gone only when there are no more references to it.

In fact, for a simple “cat” process that held a deleted device, the entry in /proc went

# ls -l /proc/756/fd
total 0
lrwx------. 1 root root 64 Mar  3 14:12 0 -> /dev/pts/0
lrwx------. 1 root root 64 Mar  3 14:12 1 -> /dev/pts/0
lrwx------. 1 root root 64 Mar  3 14:12 2 -> /dev/pts/0
lr-x------. 1 root root 64 Mar  3 14:12 3 -> /dev/example_03 (deleted)

So no drama here. Really easy.

Also recall that removing the device files doesn’t mean all that much: It’s perfectly possible (albeit weird) to generate extra device files with mknod, and use them regardless. The call to device_destroy() won’t make any difference in this case. It just removes those convenience device files in /dev.

When to release the cdev struct

Or more precisely, the question is when to release the struct that contains the cdev struct. The kernel example’s suggestion (drivers/usb/usb-skeleton.c) is to maintain a reference counter on the enclosing struct (a kref). Then increment the reference count for each file opened, decrement for each file release, and also decrement it when the device is disconnected. This way, the device information (e.g. example_device struct above) sticks around until the device is disconnected and there are no open files. There is also an issue with locking, discussed at the end of this post.

But when cdev is part of this struct, that is not enough. cdev_del(), which is normally called in the device’s disconnect method, disables the accessibility of the fops for opening new file descriptors. But there’s much to the comment from fs/char_dev.c, above the definition of cdev_del(): “This guarantees that cdev device will no longer be able to be opened, however any cdevs already open will remain and their fops will still be callable even after cdev_del returns.”

So what’s the problem, one may ask. The kref keeps the cdev until the last release! (hopefully with proper locking, as discussed at the end of this post)

Well, that’s not good enough: It turns out that the struct cdev is accessed after the fops release method has been called, even for the last open file descriptor.

Namely, the issue is with __fput() (defined in fs/file_table.c), which is the function that calls the fops release method, and does a lot of other things that are related to the release of a file descriptor (getting it off the epoll lists, for example): If the released inode is a character device, it calls cdev_put() with the cdev struct after the release fops method has returned.

Which makes sense, after all. The cdev’s reference count must be reduced sometime, and it can’t be before calling the release, can it?

So cdev_put calls kobject_put() on the cdev’s kobject to reduce its reference count. And then module_put() on the owner of the cdev entry (the owning module, that is) as given in the @owner entry of struct cdev.

Therefore, there’s a nasty OOPS or a kernel panic if the struct cdev is on a memory segment that has been freed. Ironically enough, the call to cdev_put() brings the cdev’s reference count to zero if cdev_del() has been called previously. That, in turn, leads to a call to the kobject’s release method, which is cdev_default_release(). In other words, the oops is caused by the mechanism that is supposed to prevent the cdev (and the module) the exact problem that it ends up creating.

Ironic, but also the hint to the solution.

The lame solution

The simplest way is to have the cdev as a static global variable of the relevant module. Is this accepted practice? Most likely, as Greg K-H himself manipulated a whole lot of these in kernel commit 7e7654a. If this went through his hands, who am I to argue. However this goes along with allocating a fixed pool of minors for the cdev: The number of allocated minors is set when calling cdev_add().

The backside is that cdev_add() can only be called once, so the range of minors must be fixed. This is commonly solved by setting up a pool of minors in the module’s init routine (256 of them in usb-skeleton.c, but there are many other examples).

Even though it’s a common solution in the kernel tree, I always get a slight allergy to this concept. How many times have we heard that “when it was designed, it was considered a lot” thing?

The elegant solution

In short: Allocate the cdev dynamically. Instead of

struct example_device {
  struct device *dev;

  struct cdev cdev;
  [ ... ]
}

go

struct example_device {
  struct device *dev;

  struct cdev *cdev;
  [ ... ]
}

so the cdev struct is referred to with a pointer instead. And instead of the call to cdev_init(), go:

  mydevice->cdev = cdev_alloc();
  if (!mydevice->cdev)
    goto bummer;

  mydevice->cdev->ops = &example_fops;
  mydevice->cdev->owner = THIS_MODULE;

And from there go ahead as usual. The good part is that there’s no need to free a cdev that has been allocated this way. The kernel frees it automatically when its reference count goes down to zero (it starts at one, of course). So all in all, the kernel counts the references to cdev as files are opened and closed. In particular, it decrements it when cdev_del() is called. So it really vanishes only when it’s not needed anymore.

Note that cdev_init() isn’t called. Doing this will cause a kernel memory leak (which won’t affect the allocation of major and minors, by the way). See “Read the Source” below, which also shows the details on how this solves the problem.

Only note that if cdev_add() fails, the correct unwinding is:

  rc = cdev_add(&mydevice->cdev, MKDEV(major, minor),
		mydevice->num_channels);
  if (rc) {
    dev_warn(mydevice->dev, "Failed to add cdev. Aborting.\n");
    kobject_put(&mydevice->cdev->kobj);
    goto bummer2;
  }

In other words, don’t call cdev_del() if cdev_add() fails. It’s can’t be deleted if it hasn’t been added. Decrementing its reference count is the reverse operation. This is how it’s done by __register_chrdev(), defined in char_dev.c. That’s where cdev_add() and cdev_del() are defined, so they should know…

Know cdev’s reference count rules

Since cdev’s is wiped out by the kernel, it’s important to know how the kernel counts its reference count. So these are the rules:

  • cdev is assigned a reference count of 1 by the call to cdev_alloc() (by virtue of kobject_init). Same goes for cdev_init(), but that’s irrelevant (see code snippets below).
  • cdev’s reference count is not incremented by the call to cdev_add(). So it stays on 1, which is sensible.
  • cdev’s reference count is decremented on a call to cdev_del(). This makes sense, even though it kind-of breaks the symmetry with cdev_add(). But the latter takes a free ride on the ref count of cdev_alloc(), so that’s how it comes together.
  • A reference increment is done for each opened related file, and decremented on file release.
The bottom line is that if cdev_del() is called when there is no currently opened relevant device file, it will go away immediately.

For the extra pedantic, it may seem necessary to call kobject_get(&mydevice->cdev->kobj) immediately after cdev_alloc(), and then kobject_put() only after freeing the example_device struct, because it contains the pointer to the cdev. This is what reference counting means: Count the pointers to the resource. However since the cdev struct is typically only used for the cdev_del() call, nothing bad is likely to happen because of this pointer to nowhere after the cdev has been freed. It’s more a matter of formality.

This extra reference count manipulation can also be done with cdev_get() and cdev_put(), but will add an unnecessary and possibly confusing (albeit practically harmless) reference count to the module itself. Just be sure to set the cdev’s @owner entry before calling cdev_get() or things will get messy.

Read the Source

Finally, I’ll explain why using cdev_alloc() really helps. The answer lies in the kernel’s fs/char_dev.c.

Let’s start with cdev_init(). It’s short:

void cdev_init(struct cdev *cdev, const struct file_operations *fops)
{
  memset(cdev, 0, sizeof *cdev);
  INIT_LIST_HEAD(&cdev->list);
  kobject_init(&cdev->kobj, &ktype_cdev_default);
  cdev->ops = fops;
}

Noted that kobject_init? It initializes a kernel object, which is used for reference counting. And it’s of type ktype_cdev_default, which in this case only means that the release function is defined as

static struct kobj_type ktype_cdev_default = {
  .release	= cdev_default_release,
};

So when cdev->kobj’s reference count goes to zero, cdev_default_release() is called. Which is:

static void cdev_default_release(struct kobject *kobj)
{
  struct cdev *p = container_of(kobj, struct cdev, kobj);
  struct kobject *parent = kobj->parent;

  cdev_purge(p);
  kobject_put(parent);
}

Arrgghh! So there’s a release function! Why can’t it free the memory as well? It wouldn’t have been perfect. Well, a catastrophe, in fact. How could it free a memory segment within another enclosing struct?

But in fact, there is such a release function, with a not-so-surprising name:

static void cdev_dynamic_release(struct kobject *kobj)
{
  struct cdev *p = container_of(kobj, struct cdev, kobj);
  struct kobject *parent = kobj->parent;

  cdev_purge(p);
  kfree(p);
  kobject_put(parent);
}

Exactly the same, just with the kfree() in exactly the right spot. Backed up by

static struct kobj_type ktype_cdev_dynamic = {
  .release	= cdev_dynamic_release,
};

and guess which function uses it:

struct cdev *cdev_alloc(void)
{
  struct cdev *p = kzalloc(sizeof(struct cdev), GFP_KERNEL);
  if (p) {
    INIT_LIST_HEAD(&p->list);
    kobject_init(&p->kobj, &ktype_cdev_dynamic);
  }
  return p;
}

Now let’s compare it with cdev_init():

  • It allocates the cdev instead of using an existing one. Well, that’s the point, isn’t it?
  • It doesn’t call memset(), because the segment is already zero by virtue of kzalloc.
  • It doesn’t assign cdev->fops, because it doesn’t have that info. The driver is responsible for this now.
  • It sets the kernel object to have a release method that includes the kfree() part, of course.

This is why cdev_init() must not be called after cdev_alloc(): Even though it will do nothing harmless apparently, it will re-init the kernel object to ktype_cdev_default. That’s easily unnoticed, since the only thing that will happen is that kfree() won’t be called. Causing a very small, barely notable, kernel memory leak. No disaster, but people go to kernel-hell for less.

When and how to free example_device

Now back to the topic of maintaining a reference count on the device’s information (e.g. struct example_device). It should contain this struct kref, which allows keeping a track on when the struct itself should be kept in memory, and when it can be deleted. As mentioned earlier, the kref is automatically initialized with a reference count of 1, and is then incremented every time the open method is called for a related device file, decremented for every release of such, and once again decremented when the device itself is disconnected.

On the face of it, easy peasy: The struct goes away when there are no related open device files, and the device itself is away too. But what if there’s a race condition? What if a file is opened at the same time that the device is disconnected? This requires a mutex.

The practice for using kref is to decrement the struct’s reference count with something like

kref_put(&mydevice->kref, cleanup_dev);

where cleanup_dev is a function that is called if the reference count reached zero, with a pointer to the kdev. The function then uses container_of to find the address of the structure containing the kref, and frees the former. Something like

static void cleanup_dev(struct kref *kref)
{
  struct example_device *dev =
    container_of(kref, struct example_device, kref);

  kfree(dev);
}

The locking mechanism is relatively simple. All it needs to ensure is that the open method doesn’t try to access the example_device struct after it has been freed. But since the open method must do some kind of lookup to find which example_device struct is relevant, by checking if it covers the major/minor of the opened device file, the name of the game is to unlist the example_device before freeing its memory.

So if the driver implements a list of example_device structs, one for each connected USB device, all that is necessary is to protect the access to this list with a mutex, and to hold that mutex while kref_put() is called. Likewise, this mutex is taken by the open method before looking in the list, and is released only after incrementing the reference count with kref_get().

And then make sure that the list entry is removed in cleanup_dev.

Bonus: When is it OK to access the USB API’s structs?

Not directly related, but definitely worth mentioning: The memory chunk, to which struct usb_interface *interface points to (which is given to both probe and disconnect) is released after the call to the disconnect method returns. This means that if any other method holds a copy of the pointer and uses it, there must be some kind of lock that prevents the disconnect call to return as long as this pointer may be in use. And of course, prevents any other thread to start using this pointer after that. Otherwise even something as innocent as

dev_info(&interface->dev, "Everything is going just great!\n");

may cause a nasty crash. Sleeping briefly on the disconnect method is OK, and it solves this issue. Just be sure no other thread sleeps forever with that lock taken. Should not be an issue, because asynchronous operations on the USB API have no reason to block.

This is demonstrated well in the kernel’s own usb-skeleton.c, by virtue of io_mutex. In the disconnection method, it goes

mutex_lock(&dev->io_mutex);
dev->interface = NULL;
mutex_unlock(&dev->io_mutex);

and then, whenever the driver wants to touch anything related USB, it goes

mutex_lock(&dev->io_mutex);
if (!dev->interface) {
  mutex_unlock(&dev->io_mutex);
  retval = -ENODEV;
  goto error;
}

and keeps holding that mutex during all business with the kernel’s USB API. Once again, this is reasonable when using the asynchronous API, so no call blocks.

It’s however not possible to hold this mutex in URB completer callbacks, since they are executed in atomic context (an interrupt handler or tasklet). These callbacks routines are allowed to assume that the interface data is legit throughout their own execution, because the kernel’s USB subsystem makes sure to complete all URBs (with a -ENOENT status) before tearing the interface down.

This is true unless the soft_unbind flag is explicitly set by the device driver, which means “if set to 1, the USB core will not kill URBs and disable endpoints before calling the driver’s disconnect method.”

For example, in usb-skeleton.c, dev->interface->dev is used for an error message in the completion callbacks. Also, usb_unbind_interface() (in usb/core/driver.c) sets intf->condition to USB_INTERFACE_UNBINDING and then calls usb_disable_interface() to terminate all URBs before calling the disconnect method. So this ensures no new URBs are queued and the old ones are completed.

Linux kernel programming: Do I need a lock?

Introduction

Writing a device driver for Linux (or other kernel programming) always requires keeping parallel execution in mind. It’s often enough to follow common programming patterns, using spinlocks and mutexes in the same way that everyone else does. But somehow, I always find myself looking at my own code, and ask myself: Am I absolutely sure that this will work? Isn’t there any crazy race condition that will show up once in a while and make things fail horribly in the most elusive way?

Or even worse: Will this work properly if CPU 1 does this and CPU 2 does that? When can I just ignore the fact that one piece of code will run on one CPU and the other piece on another?

Maybe because my main expertise is logic design (FPGA), I have this thing about looking for by-definition guarantees for proper functionality. This is a good habit in programming in general, I believe, and in kernel programming in particular.

I’ve already written a post with a similar topic, but at the time the focus was on memory mapped I/O. This somewhat messy post it’s all about the programming model. Quite ironically, it’s difficult to find the right order of topics in a post discussing parallel execution.

Comments and corrections in particular are welcome in comments below. The main incentive to write this post is actually to validate by view on this topic.

The curse of awareness

One may do kernel programming happily until the concept of memory barriers come to mind. The awareness that both the compiler and the processor may reorder, optimize away or even add read and write operations to memory kind makes one wonder how any trivial piece of code even works. The notion that memory writes may not propagate as once might expect between CPUs in an SMP processor (they all are nowadays) undermines that basic, naïve, assurance that all variables have the values that were last assigned to them, even if that took place earlier by a different function. I mean, what if that function ran on a different processor? Did the changes all make it to the new one?

Memory barriers, by themselves, offer a disappointing level of assurance: An SMP write barrier (smp_wmb() ) just says that all writes until that barrier are guaranteed to have been propagated before the writes that came after it. The SMP read barrier (smp_rmb() ) likewise says that all reads before the barrier are guaranteed to appear to have been done before those that came after the barrier.

This isn’t very convincing, but at least is allows setting up a “valid” flag for some data. For example, one thread can fill a data structure, call smp_wmb() and then write to a flag in the data structure that says it’s OK for use. The other thread can busy-wait on that same flag, and once it’s set, call smp_rmb() and then access the data. Because of the use of both barriers, the data is guaranteed to be valid when it’s read by the second thread.

But who’s being thinking about memory barriers? What about all code I’ve written until now? Did it work just by chance?

As I’ve noted in that earlier post, the SMP memory barriers translate into nothing on an x86 platform. This is really bad news, because most of us develop on these platforms, so if the barrier jugglery is done incorrectly, it shows only when the code runs on a different platform. In other words, the complaint will come later and out of nowhere.

Read the kernel docs

Before diving deeper into the subtle issues, here’s a list of references. The related docs in the kernel’s sources (as of v5.3, these tend to move):

  • Documentation/memory-barriers.txt
  • tools/memory-model/Documentation/recipes.txt
  • tools/memory-model/Documentation/explanation.txt
  • Documentation/atomic_t.txt
  • Documentation/core-api/workqueue.rst

The simple rule

If you’re absolutely sure that a taking a spinlock or mutex will never wait for anything, don’t take it. Maybe consider a memory barrier.

Put otherwise: If there’s a code segment A, that is guaranteed to reach its end before code segment B begins its execution, it’s functionally guaranteed to behave as if they ran sequentially on the same processor — all data is updated at the beginning of segment B as at the end of segment A.

The keyword here is guaranteed: It means that some means of the kernel API is relied upon in this matter. There are several such means available, for example:

  • Kernel locks (spinlocks, semaphores, mutexes) or course. But that’s what we’re trying to avoid.
  • Mechanisms that ensure non-reentrance. For example, tasklets and work items are guaranteed to run on one CPU in the entire system at most.
  • Calls to synchronization functions: For example, after calling cancel_work_sync() or flush_workqueue(), the thread that called this functions has the same view as the last related work item (and possibly changes that occurred afterwards, of course).
  • Device drivers can rely on that disconnect() isn’t called along with probe(), and that the release fops method isn’t called before that last read or write method has returned.
  • … and there are tons of other, of course.

The rationale behind this rule is that if this data dependency rule wasn’t guaranteed, virtually nothing in the kernel would work. Synchronization of data is in no programmer’s mind until there’s an explicit possibility for parallel execution (including interrupts). And even when there’s parallel execution in mind, programmers use locks, and (correctly) assume that the ball is passed to the thread holding the lock, along with the entire memory view being synchronized as well.

In short: If you have the relevant lock, or are otherwise guaranteed by the kernel API that nobody’s currently fiddling with the exact same piece of memory, you have the right to ignore any consistency issues. Just because everyone else does, so the kernel API makes sure it actually works that way.

Waking up a waiting process

A special case is the wait queue: One thread calling one of the wait_event*() API functions to sleep until a condition occurs, and a second thread (often an interrupt) calls wake_up*(). If this relationship ensured exclusive execution (i.e. the thread that calls wait_event*() won’t run again until the second one is finished), then the second thread may assume that it’s synchronized with the first thread at the place that it made the wake_up call.

This is important in particular regarding the condition that is passed to the wait_event*() call. It’s however quite common to use a spinlock on the second thread when accessing the common data, since it can’t be guaranteed the the first thread won’t be invoked again (in particular if it’s an ISR).

Lock-this lock-that

So here’s the catch: It’s tempting to take the some kind of lock to feel safer. Something like:

spin_lock_irqsave(&x->spinlock, flags);
safe_value = x->value;
spin_unlock_irqrestore(&x->spinlock, flags);

and with this, have a comfy feeling about safe_value being correct somehow. After all, its value was obtained with protection gear on. But was that protection necessary?

A lock that is inserted today will be very difficult to remove in the future, be it because doing so requires recalling all considerations that are fresh in your memory today. And this unnecessary lock may cause difficulties, in particular for adding features that aren’t in sight today. So by all means, better safe than sorry is not an excuse for a lock. Think it through, for each and every one.

Access with no protection at all

So what happens if we read or write carelessly from a memory region that is possibly accessed by other threads? Well, the first thing to consider is that the compiler may play some nasty pranks with optimization. The classic example is this failing busy-wait loop

while (!x->ready);

which is supposed to wait until x->ready becomes true by virtue of some other thread changing it. Almost any optimizing compiler will interpret this as a single-thread program, hence x->ready won’t change by itself, and implement the equivalent of

if (!x->ready)
  while (1);

or in other words, an infinite loop. At the compiler level, this can be solved with the volatile attribute, however odds are that a patch containing that word will be rejected. It’s considered a red flags saying you’ve misunderstood something. Even though quite a few drivers in the kernel tree use volatile.

Rather, READ_ONCE() and WRITE_ONCE() should be used when an explicit read or write of memory should take place. These prevent the compiler’s optimization (typically by using the volatile keyword, actually) but also the compiler’s reordering of successive READ_ONCE() and WRITE_ONCE(). Note however that these don’t imply memory barriers — they tell the compiler what to do, but not (necessarily) the processor. In particular, the implementation of these for most processors leaves it free to reorder the accesses as long as the result is the same as a single-thread program. In other words: If another CPU fiddles with the same memory regions, this reordering can bite back.

So if memory barriers are used, are READ_ONCE() and WRITE_ONCE() needed? If memory barriers are set correctly along with plain C memory accesses to ensure correct ordering with respect to another CPU, and the compiler’s optimization stunts are neutralized anyhow, why bother? And if it’s for I/O, use readl() / writel() or similar functions.

recipes.txt explains that:

If there are multiple CPUs, accesses to shared variables should use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store tearing, load/store fusing, and invented loads and stores.

Say what? Load/store tearing? That means that, for example, a 32-bit processor executes a write of 32-bit word (an int) in a non-atomic manner? Splitting it into bytes? Huh?

This is discussed further in explanations.txt, in the section named THE READS-FROM RELATION, which also emphasizes:

It requires all accesses to be properly aligned and of the location’s actual size.

To make a long story short: No sane processor will split a natural, aligned access of a work into a non-atomic operation. But to be safe, use READ_ONCE and WRITE_ONCE when it’s important to nail down a memory access, and be sure that the address is aligned to the element’s size.

This discussion has another important implication: When using WRITE_ONCE() and READ_ONCE() with aligned addresses, one can rely on that the value that is read with READ_ONCE() is one of those that were written by some WRITE_ONCE() operation in the past, and never some kind of intermediate junk value, even when it’s read from another CPU. So even if we don’t know exactly when the value of WRITE_ONCE() will reach the other CPUs (not even if it’s before or after other WRITE_ONCE’s), we know it’s always seen with one of the sane values it was assigned with.

As a side note, it’s often tempting to ask oneself whether the WRITE_ONCE will be seen “immediately” by another thread. I just wrote the value, is it on the other CPU yet? And now? So well, it’s not a valid question. If the dependence between the two threads is important, a mechanism should be put in place to ensure synchronization. Locks, memory barriers, whatever. If it isn’t, the other thread should work perfectly fine reading the value before the WRITE_ONCE().

When to use atomic_t

Consider this:

spin_lock_irqsave(&x->spinlock, flags);
x->value++;
spin_unlock_irqrestore(&x->spinlock, flags);

Among others, this makes sure that no increment gets lost because two CPUs were doing it at the same time. But what if this is the only way x->value is changed? Is the spinlock necessary at all? Why not just this?

x->value++;

Like, find me a processor that doesn’t have an atomic increment opcode.

Well, the immediate answer is that the value after the increment may not have propagated to a neighboring processor, so if it also increments the same field more or less at the same time, both add 1 to the same value, and one increment is lost.

So the correct way for doing this is to define x->value of atomic_t type, and then

atomic_inc(&x->value);

possibly without any lock whatsoever. This guarantees that the increment is accounted for properly, even when two CPUs do this in parallel. Or as written in explanations.txt:

What does it mean to say that a read-modify-write (rmw) update, such as atomic_inc(&x), is atomic? It means that the memory location (x in this case) does not get altered between the read and the write events making up the atomic operation. In particular, if two CPUs perform atomic_inc(&x) concurrently, it must be guaranteed that the final value of x will be the initial value plus two.

As for ordering and memory barriers, some commands offer this and others don’t. Refer to atomic_t.txt for more information on that. For example, atomic_inc() gives no ordering protection whatsoever.

There are, of course, a whole range of other atomic operations supported. See atomic_t.txt.

As for plain reads and writes of atomic_c, there are dedicated functions: atomic_read() and atomic_set(), which are implemented with just READ_ONCE() and WRITE_ONCE(). Or as written in atomic_t.txt:

The non-RMW ops are (typically) regular LOADs and STOREs and are canonically implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all and are doing it wrong.

Simply put: atomic_t is for atomic modifications, like atomic_inc(). Don’t use it where READ_ONCE() and WRITE_ONCE() would do the job literally equally well.

Once again, it’s tempting to ask oneself when and how soon the other CPU sees the incremented value, and the answer remains the same: It should work just fine whether it sees the value before or after the increment, and if it doesn’t, you’re doing it wrong.

Can work items on the same workqueue run concurrently?

Or put the other way around, is sequential execution ensured? If work item A and B are queued on the same workqueue, do I need any locks to protect stuff that both access? Or can I rely on the kernel to prevent concurrent execution?

The truth is that I don’t have a definite answer on this one. Comments are welcome (below).

On one hand, the official kernel docs say

… the worker executes the functions associated with the work items one after the other.

The “one after the other” thing sounds quite reassuring. But does “after the other” relate to after executing the previous item, or after it has finished? That brings me to the other hand, which is reading the source. Namely process_one_work() in kernel/workqueue.c, which is called, among others, by process_scheduled_works() for the obvious purpose. Actually, it’s enough to read the comments associated with the former:

As long as context requirement is met, any worker can call this function to process a work.

and

A single work shouldn’t be executed concurrently by multiple workers on a single cpu. Check whether anyone is already processing the work. If so, defer the work to the currently executing one.

Actually, a single work shouldn’t be executed concurrently at all, as far as I know. But the point is that the test relates to a work item against itself, and it says nothing about other work items.

So my conclusion, for which comments are welcome, is as follows:

  • A lock is required to ensure synchronization between different work items, even if they are queued on the same workqueue…
  • … but things will most likely work perfectly fine without a lock in this scenario, because odds are that only a single worker thread will be allocated for the workqueue. In other words, the fact that it works without a lock doesn’t mean you’ve done it right.
  • Sleeping in a work item is likely to delay other work items in the same workqueue. The fact that a lock is needed to protect against concurrency doesn’t mean that the kernel will produce another work thread when you want it.

Bonus: A note on locks

I didn’t find a good place to push this in, but I still want this on this post.

From recipes.txt, on locking:

Any CPU that has acquired a given lock sees any changes previously seen or made by any CPU before it released that same lock.

Note that it says “seen or made” not just made. In other words, getting the lock gives the complete memory view of the CPU that released it, not just changes it made. Also note that it says “lock” — that means spinlocks as well as mutexes.

This is most likely how the kernel API ensures the synchronization between segments of code that are guaranteed not to run in parallel: There is always some kind of lock on some kind of data structure that says “X can now run, Y can’t”.

A few epoll jots

Just a few things I wrote down while getting the hang on Linux’ epoll working with a named pipe. There’s also a little test program at Github.

  • Be sure to read this and this.
  • An event list for a file descriptor can be added only once with epoll_ctl(…, EPOLL_CTL_ADD, …). Calling epoll_ctl for adding an event entry for a file descriptor which is already listed results with an EEXIST error (the manual says so, and hey, it also happens).
  • The @events member passed to epoll_ctl() is an OR of all events to watch. The @events member in the array of events returned by epoll_wait() are the events that are in effect.
  • It’s fine to register events that are unrelated (i.e. will never happen), not that there’s any reason to do so deliberately.
  • If several events are triggered for the same file descriptor, they are ORed in one array entry by epoll_wait().
  • Without the EPOLLET flag (edge-triggered), the same event keeps appearing endlessly until cleared by some I/O action.
  • In particular, EPOLLHUP is returned continuously on a FIFO (named pipe) opened for read with the other side unopened.
  • Same for EPOLLERR with a FIFO opened for write.
  • In edge-triggered mode (with EPOLLET) an event is generated each time new data is fed or drained on the other side, even if the previous data hasn’t been cleared. In this sense, it isn’t really edge-triggered. Probably the calls to wake_up() (in different variants) in the driver causes this.
  • As expected, if a FIFO is opened for read with O_NONBLOCK, there is no event whatsoever when the other side is opened — only when data arrives.
  • Important: If the counterpart side is closed and then reopened while there is no epoll_wait() blocking, this will go unnoticed. The solution is probably to have a tight loop only picking up events, and let some other thread take it from there.