systemd / DBus debugging starter pack
Introduction
Trying to solve a 90 second wait-on-some-start-job on each boot situation, I found that there’s little info on how to tackle a problem like this out there. Most web pages usually go “I did this, hurray it worked!”, but where do you start solving a problem like this if none of the do-this-do-that advice helps?
So this is a random pile of things to try out. Most of the things shown below didn’t solve my own issue, but these are the tools I collected in my toolbox.
I did this on a Debian 8 (Jessie), and the problem was that boot was stuck for a minute and a half on “A start job is running for sys-subsystem-net-devices-eth0.device”. It’s not directly relevant, except that it influences what I tried out, and hence listed below.
I have written a shorter version of this post which focuses on the specific problem. This post is more about the techniques for figuring out what’s going on.
PID 1 at your service
The tricky part of systemd is that much of the activity is done directly by the systemd process, having PID 1. Requests to start and stop services and other units are sent via DBus messages, i.e. over connections to UNIX sockets. To someone who is used to the good-old-systemV Linux, this is voodoo at its worst, but there are simple ways to keep track of this, as shown below.
In particular, don’t strace the “systemctl start” process — it just sends the request over DBus. Rather, attach strace to PID 1, also explained below. That’s where the fork to the actual job process takes place, if at all.
And don’t get confused by having /org/freedesktop/ appearing everywhere in the logs. It doesn’t necessarily have anything to do with the desktop (if such exists), and is likewise relevant to a non-graphical system. DBus’ started as a solution for desktop machines, and that’s the only reason “freedesktop” is everywhere.
First thing first
Read the man page, “man systemd.device” in my case. If there’s another computer with different configuration, see what happens there. What does it look like when it works?
journald -x
As mentioned on this page, if something went wrong during boot, check out the log to see why. The -x flag adds valuable info for solving issues of this sort.
For example,
# journald -x [ ... ] May 20 11:41:20 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out. May 20 11:41:20 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device. -- Subject: Unit sys-subsystem-net-devices-eth0.device has failed -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit sys-subsystem-net-devices-eth0.device has failed. -- -- The result is timeout. May 20 11:41:20 diskless systemd[1]: Dependency failed for ifup for eth0. -- Subject: Unit ifup@eth0.service has failed -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit ifup@eth0.service has failed. -- -- The result is dependency.
Now, how to read it: It’s clear that sys-subsystem-net-devices-eth0.device is the unit that didn’t manage to kick off. But the more important clue is that ifup@eth0.service failed, because it depends on the former.
It’s important, because it explains why an attempt to launch sys-subsystem-net-devices-eth0.device was done in the first place. A lot of “there I fixed it” pages on the web disable the latter service, and get rid of the problem, not necessarily understanding how and why.
So here’s why: On some relatively early systemd versions, *.device units simply won’t launch. It’s worked around by making sure that no other unit requests them. But then some software package isn’t aware of this, and requests a .device unit and there’s the deadlock. Or more precisely, waiting 90s for the timeout.
Kicking if off manually
Obviously, the unit is inactive:
# systemctl status sys-subsystem-net-devices-eth0.device ● sys-subsystem-net-devices-eth0.device Loaded: loaded Active: inactive (dead) May 20 12:47:07 diskless systemd[1]: Expecting device sys-subsystem-net-devices-eth0.device... May 20 12:48:37 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out. May 20 12:48:37 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.
Note that there’s also systemctl show for a detailed printout of the unit’s assignments.
Try to start it manually (this little session took 90 seconds, right?)
# systemctl start sys-subsystem-net-devices-eth0.device Job for sys-subsystem-net-devices-eth0.device timed out.
The important takeaway is that we can repeat the problem on a running system (as opposed to a booting one). This allows running some tools for looking at what happens.
On the other hand, I’m not all that sure .device units are supposed to be started or stopped with systemctl at all. Or more likely, that requesting a start or stop on device units means waiting for them to reach the desired state by themselves. This goes along with the observation I made with strace (below), showing that systemd does nothing meaningful until it times out. So most likely, it just looked up the state of the device unit, saw it wasn’t started, and then went to sleep, essentially waiting for a udev event to bring the unit to the desired state, and consequently return a success status to the start request.
In fact, when I tried “systemctl stop” on the eth0 device on another machine, on which it the device file was activated automatically, it got stuck exactly the same way as for starting it on Debian 8.
As far as I understand, these should become active and inactive by a systemd-udev event by virtue of udev labeling. They are there to trigger other units that depend on them, not to be controlled explicitly.
But here comes a major red herring: Curiously enough, during the 90 seconds of waiting, “systemctl starts” created a child process, “/bin/systemd-tty-ask-password-agent –watch”. One can easily be misled into thinking that it’s this child process that blocks the completion of the former command.
So first, let’s convince ourselves that it’s not the problem, because running
# systemctl --no-ask-password start sys-subsystem-net-devices-eth0.device
doesn’t create this second process, but is stuck nevertheless.
This systemd-tty-ask-password-agent process listens for system-wide requests for obtaining a password from the user (e.g. when opening a crypto disk), and does that job if necessary. systemctl launches it just in case, regardless of the unit requested for starting. This is the way to make sure passwords are collected, if so needed. This process is usually not visible, because systemctl commands typically don’t last very long. More about it here.
Actually, checking with strace, systemctl was blocking all those 90 seconds on a ppoll(), waiting for some response from the /run/systemd/private UNIX socket. That’s the DBus connection with process number 1, systemd. In other words, systemctl requested the start of the unit over DBus, and then waited for the result for 90 seconds, at which point it got the answer that the attempt timed out.
DBus
DBus is like a small client-server intranet that is local to the computer. It allows processes that are connected to DBus two features:
- A process can act as a client, and request a function call on another process that is connected to the same bus (acting as a server). When the function call finishes, its result value(s) are returned through the bus, so there’s a real function call look and feel. Both the function call and the result may involve multiple elements of data with various types (strings, booleans, arrays, you name it). DBus has adopted object-oriented teminology, so these function calls are called method calls, and the whole OO jargon (properties in particular) applies.
- A process can broadcast events on the bus, which are called signals, so other processes that are subscribed to this signal get a notification.
DBus was originally developed to allow GUI desktop utilities to talk with each other. Later on, this framework was adopted for more fundamental uses, in particular systemd, Pulseaudio, bluetoothd, cupsd etc. So the “original” Dbus is now called the “session bus”, and the latter is the “system bus”. Two completely independent networks, if you like.
For a more comprehensive explanation, see Wikipedia’s page. I show an example of making method calls on DBus using the command line on this post. For a simple example on how Dbus is used, I suggest this.
Listening to DBus
There’s are two utilities, dbus-monitor and “busctl monitor” for dumping DBus messages (an eavesdrop add-on may be required to allow system-wide message monitoring, but this was not the case on my system). I’ll present monitoring of the system Dbus here. For an example of monitoring a session’s messages, there’s another post of mine.
So on the invocation of
# systemctl start sys-subsystem-net-devices-eth0.device
the output of
# dbus-monitor --system
was
signal sender=org.freedesktop.DBus -> dest=:1.8 serial=2 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameAcquired string ":1.8" signal sender=:1.0 -> dest=(null destination) serial=127 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=UnitNew string "sys-subsystem-net-devices-eth0.device" object path "/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice" signal sender=:1.0 -> dest=(null destination) serial=128 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=JobNew uint32 173 object path "/org/freedesktop/systemd1/job/173" string "sys-subsystem-net-devices-eth0.device" signal sender=:1.0 -> dest=(null destination) serial=129 path=/org/freedesktop/systemd1/job/173; interface=org.freedesktop.DBus.Properties; member=PropertiesChanged string "org.freedesktop.systemd1.Job" array [ dict entry( string "State" variant string "running" ) ] array [ ]
and when the timeout occurs with a
Job for sys-subsystem-net-devices-eth0.device timed out.
the following output is captured on the DBus:
signal sender=:1.0 -> dest=(null destination) serial=141 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=JobRemoved uint32 173 object path "/org/freedesktop/systemd1/job/173" string "sys-subsystem-net-devices-eth0.device" string "timeout" signal sender=:1.0 -> dest=(null destination) serial=142 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=UnitRemoved string "sys-subsystem-net-devices-eth0.device" object path "/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice"
Clearly, everything was done by the systemd main process, and almost nothing by the process created from console.
The “sender=:1.0″ part means that the sender is the process number 1 (systemd). Try
$ busctl
(that’s short for “busctl list”) to get a mapping between these addresses and processes.
See the number 173 in the object paths all over in the dbus traffic? That’s the job number as listed in
# systemctl list-jobs JOB UNIT TYPE STATE 173 sys-subsystem-net-devices-eth0.device start running 1 jobs listed.
Note that these job numbers have absolutely nothing to do with the Linux PIDs.
Using strace
strace is often very useful for resolving OS problems. It’s however important to realize that the old-fashioned way of stracing the process created on command line will probably not yield much information, because this process only sends a request over DBus.
Instead, strace the process that does the actual work: PID 1, the Mother Of All Processes, the almighty systemd itself. I have to admit that I was first intimidated by the idea to attach strace to this process, but it turns out that it’s usually quite calm, and spits out relatively little unrelated mumbo-jumbo.
Bonus: It’s always the same command:
# strace -p 1 -s 128 -ff -o systemd-trace
This makes a file for each process systemd may fork into. If things went wrong because some process didn’t execute properly, this is how we catch it.
For example, when running the said “systemctl start sys-subsystem-net-devices-eth0.device” command, this was the output:
accept4(12, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 13 getsockopt(13, SOL_SOCKET, SO_PEERCRED, {pid=901, uid=0, gid=0}, [12]) = 0 open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18 read(18, "\270\305\231\206+&\262MI\313[\337y}\314V", 16) = 16 close(18) = 0 fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) fcntl(13, F_GETFD) = 0x1 (flags FD_CLOEXEC) fstat(13, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 setsockopt(13, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 setsockopt(13, SOL_SOCKET, 0x22 /* SO_??? */, [0], 4) = 0 getsockopt(13, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0 setsockopt(13, SOL_SOCKET, 0x21 /* SO_??? */, [8388608], 4) = 0 getsockopt(13, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0 setsockopt(13, SOL_SOCKET, 0x20 /* SO_??? */, [8388608], 4) = 0 getsockopt(13, SOL_SOCKET, SO_PEERCRED, {pid=901, uid=0, gid=0}, [12]) = 0 fstat(13, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 getsockopt(13, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0 getsockname(13, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, [23]) = 0 recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0AUTH EXTERNAL 30\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 256}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 45 epoll_ctl(4, EPOLL_CTL_ADD, 13, {0, {u32=2784593808, u64=94857137325968}}) = 0 open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18 read(18, "\10,\360z\t\363+\355D\2556NLkhL", 16) = 16 close(18) = 0 open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18 read(18, "$F\3302\215\326\320\251\261\240\217\232\224\1\346\205", 16) = 16 close(18) = 0 open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18 read(18, "|\313E\273R\264 \375\v\245\235\206h\247\30-", 16) = 16 close(18) = 0 open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18 read(18, "{\200\255\356\26\341b4V_P\225aHkO", 16) = 16 close(18) = 0 epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN|EPOLLOUT, {u32=2784593808, u64=94857137325968}}) = 0 timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2519, 883043000}}, NULL) = 0 epoll_wait(4, {{EPOLLOUT, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1 clock_gettime(CLOCK_BOOTTIME, {2508, 359675827}) = 0 timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2508, 883043000}}, NULL) = 0 epoll_wait(4, {{EPOLLOUT, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1 clock_gettime(CLOCK_BOOTTIME, {2508, 359719924}) = 0 sendmsg(13, {msg_name(0)=NULL, msg_iov(3)=[{"OK b8c599862b26424d89cb5bdf797dcc56\r\nAGREE_UNIX_FD\r\n", 52}, {NULL, 0}, {NULL, 0}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 52 epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=2784593808, u64=94857137325968}}) = 0 epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1 clock_gettime(CLOCK_BOOTTIME, {2508, 359793419}) = 0 epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1 clock_gettime(CLOCK_BOOTTIME, {2508, 359846496}) = 0 recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\0018\0\0\0\1\0\0\0\240\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\t\0\0\0StartUnit\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop."..., 208}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 208 getuid() = 0 sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1&\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1o\0\0", 32}, {"!\0\0\0/org/freedesktop/systemd1/job/242\0", 38}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 70 sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1H\0\0\0\2\0\0\0\206\0\0\0\1\1o\0!\0\0\0/org/freedesktop/systemd1/job/242\0\0\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChange"..., 152}, {"\34\0\0\0org.freedesktop.systemd1.Job\0\0\0\0\34\0\0\0\5\0\0\0State\0\1s\0\0\0\0\7\0\0\0running\0\0\0\0\0", 72}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 224 sendmsg(35, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1H\0\0\0\302\0\0\0\206\0\0\0\1\1o\0!\0\0\0/org/freedesktop/systemd1/job/242\0\0\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChange"..., 152}, {"\34\0\0\0org.freedesktop.systemd1.Job\0\0\0\0\34\0\0\0\5\0\0\0State\0\1s\0\0\0\0\7\0\0\0running\0\0\0\0\0", 72}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 224 epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1 clock_gettime(CLOCK_BOOTTIME, {2508, 360201428}) = 0 recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1*\0\0\0\2\0\0\0\227\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\7\0\0\0GetUnit\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1"..., 186}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 186 sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1S\0\0\0\3\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1o\0\0", 32}, {"N\0\0\0/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice\0", 83}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 115 epoll_wait(4, {}, 33, 0) = 0 clock_gettime(CLOCK_BOOTTIME, {2508, 360292582}) = 0 epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1 clock_gettime(CLOCK_BOOTTIME, {2508, 360319860}) = 0 recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\0019\0\0\0\3\0\0\0\300\0\0\0\1\1o\0N\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice\0\0\3\1s\0\3\0\0\0Get\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Pro"..., 241}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 241 lstat("/etc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/etc/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/etc/systemd/system", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/etc/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory) lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0 lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0 lstat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=120, ...}) = 0 lstat("/run/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory) lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0 lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0 lstat("/run/systemd/generator", {st_mode=S_IFDIR|0755, st_size=360, ...}) = 0 lstat("/run/systemd/generator/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory) lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/local", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0 lstat("/usr/local/lib", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0 lstat("/usr/local/lib/systemd", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory) lstat("/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/lib/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/lib/systemd/system", {st_mode=S_IFDIR|0755, st_size=36864, ...}) = 0 lstat("/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory) lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/lib/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/lib/systemd/system", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory) lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0 lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0 lstat("/run/systemd/generator.late", {st_mode=S_IFDIR|0755, st_size=440, ...}) = 0 lstat("/run/systemd/generator.late/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/etc/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/run/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/run/systemd/generator/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/usr/local/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/usr/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/run/systemd/generator.late/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1\10\0\0\0\4\0\0\0\17\0\0\0\5\1u\0\3\0\0\0\10\1g\0\1v\0\0", 32}, {"\1b\0\0\0\0\0\0", 8}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 40 epoll_wait(4, {{EPOLLIN, {u32=3, u64=3}}}, 33, -1) = 1 clock_gettime(CLOCK_BOOTTIME, {2508, 883466886}) = 0 read(29, "\1\0\0\0\0\0\0\0", 8) = 8 timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2509, 383043000}}, NULL) = 0 epoll_wait(4, {{EPOLLIN, {u32=3, u64=3}}}, 33, -1) = 1
So what we have here is the acceptance of the connection, sending and receiving messages like those captured with dbus-monitor above, but nothing meaningful was executed: There was no productive system call done, and systemctl didn’t fork. But we can also see which files (actually, directories) systemd was looking for, and didn’t find: It really wanted to find some sys-subsystem-net-devices-eth0.device.d in one of the famous paths. Not that it matters so much, though.
By contrast and for example, if “systemctl start atd” is launched and atd is not already running, systemd (as process 1) forks into another process and calls execve(“/usr/sbin/atd”) on the forked process (after a whole lot of cgroup stuff, closing files etc.). If the same systemctl command is called with the atd service already running, there is no such fork (not surprisingly, systemd does nothing when attempting to start an already started service).
For the record, the failed lookup of directories ins’t the problem: I had the luxury of trying exactly the same on a machine that doesn’t get stuck on starting sys-subsystem-net-devices-eth0.device, and the strace looked the same. Except that the systemd job was terminated immediately and successfully, rather than getting stuck.
On my own behalf, this was the moment I realized that this unit shouldn’t be started at all on the system it gets stuck on.
Checking udev
If a boot problem is related to a device, maybe something went wrong with the device’s bringup, which in turn prevented the relevant .device unit to become active, and then some other unit waits for it…?
So what is running when eth0 is detected?
# udevadm test /sys/class/net/eth0 calling: test version 215 This program is for debugging only, it does not run any program specified by a RUN key. It may show incorrect results, because some values may be different, or not available at a simulation run. load module index Network interface NamePolicy= disabled on kernel commandline, ignoring. timestamp of '/etc/systemd/network' changed timestamp of '/lib/systemd/network' changed Parsed configuration file /lib/systemd/network/99-default.link Created link configuration context. timestamp of '/etc/udev/rules.d' changed read rules file: /lib/udev/rules.d/42-usb-hid-pm.rules read rules file: /lib/udev/rules.d/50-bluetooth-hci-auto-poweron.rules read rules file: /lib/udev/rules.d/50-firmware.rules read rules file: /lib/udev/rules.d/50-udev-default.rules read rules file: /lib/udev/rules.d/55-dm.rules [ ... ] read rules file: /etc/udev/rules.d/90-local-imagedisk.rules read rules file: /lib/udev/rules.d/95-cd-devices.rules read rules file: /lib/udev/rules.d/95-udev-late.rules read rules file: /lib/udev/rules.d/97-hid2hci.rules read rules file: /lib/udev/rules.d/99-systemd.rules rules contain 393216 bytes tokens (32768 * 12 bytes), 23074 bytes strings 21081 strings (168928 bytes), 18407 de-duplicated (148529 bytes), 2675 trie nodes used NAME 'eth0' /etc/udev/rules.d/70-persistent-net.rules:2 IMPORT builtin 'net_id' /lib/udev/rules.d/75-net-description.rules:6 IMPORT builtin 'hwdb' /lib/udev/rules.d/75-net-description.rules:12 IMPORT builtin 'path_id' /lib/udev/rules.d/80-net-setup-link.rules:5 IMPORT builtin 'net_setup_link' /lib/udev/rules.d/80-net-setup-link.rules:11 Config file /lib/systemd/network/99-default.link applies to device eth0 RUN 'net.agent' /lib/udev/rules.d/80-networking.rules:1 RUN '/lib/systemd/systemd-sysctl --prefix=/proc/sys/net/ipv4/conf/$name --prefix=/proc/sys/net/ipv4/neigh/$name --prefix=/proc/sys/net/ipv6/conf/$name --prefix=/proc/sys/net/ipv6/neigh/$name' /lib/udev/rules.d/99-systemd.rules:61 ACTION=add DEVPATH=/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/net/eth0 ID_BUS=pci ID_MODEL_FROM_DATABASE=RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (Motherboard) ID_MODEL_ID=0x8168 ID_NET_DRIVER=r8169 ID_NET_NAME_MAC=enx408d5c4d1b15 ID_NET_NAME_PATH=enp1s0 ID_PATH=pci-0000:01:00.0 ID_PATH_TAG=pci-0000_01_00_0 ID_PCI_CLASS_FROM_DATABASE=Network controller ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller ID_VENDOR_FROM_DATABASE=Realtek Semiconductor Co., Ltd. ID_VENDOR_ID=0x10ec IFINDEX=2 INTERFACE=eth0 SUBSYSTEM=net SYSTEMD_ALIAS=/sys/subsystem/net/devices/eth0 TAGS=:systemd: USEC_INITIALIZED=20767 run: 'net.agent' run: '/lib/systemd/systemd-sysctl --prefix=/proc/sys/net/ipv4/conf/eth0 --prefix=/proc/sys/net/ipv4/neigh/eth0 --prefix=/proc/sys/net/ipv6/conf/eth0 --prefix=/proc/sys/net/ipv6/neigh/eth0' unload module index Unloaded link configuration context.
Note that the “systemd” tag is in place, and so is the SYSTEMD_ALIAS assignment. So there’s probably no reason udev-wise why there was no .device unit activated.
hwdb.bin
Update the file /etc/udev/hwdb.bin:
# udevadm hwdb --update
Note that it doesn’t touch /lib/udev/hwdb.bin, and I’m unclear how they interact, if at all (“wild” guess: /etc/udev/hwdb.bin overrules the one in /lib/udev, if it exists).
On newer systems it appears to be “systemd-hwdb update”.