When Bluetooth goes poof
Having rebooted my computer after a few months of continuous operation, I suddenly failed to use my Bluetooth headphones. It took some time to figure out that the problem wasn’t with the Cinnamon 3.8.9 fancy stuff, nor the DBus interface, which produced error messages. There was simply no Bluetooth device in the system to talk to.
Prior to this mishap, my Atheros QCA6174 had worked flawlessly and reliably for several months, both as a Wifi adapter and a Bluetooth adapter.
For the record, I have a Linux Mint 19 Tara machine with 4.15.0-20-generic kernel on a X299 AORUS Gaming 7 motherboard, running in 64 bit mode of course.
I’ll jump to the spoiler: If you happen to have a Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter on your machine, never just reboot the machine: Shut down the computer completely, and disconnect main power for a minute or so with the power supply’s switch. Just powering off the computer the fine way isn’t enough. The device probably continues to get power from the motherboard when in computer is off by virtue of its own power control.
Powering off the computer this way is what solved it for me. However there are also some rumors on the web, which I can’t confirm, about Bluetooth coming back to life after loading Windows on the same computer. Or turning Bluetooth off and on again with the BIOS. My guess is that due to a bug, the chip sometimes needs some kind of tickle on shutdown or when starting, or Bluetooth is lost. Something that is worked around with a hush-hush fix in the driver for Windows, but the Linux driver doesn’t do the same.
This post goes down to the gory details, partly for the sake of quick diagnostics in the future, and partly because Bluetooth tends to be a mystery thing. So I’m trying to give an idea on what’s going on.
Why it’s confusing
Here’s the thing: The Qualcomm QCA6174 connects to the motherboard as a PCIe device as a Wifi adapter, and to the USB bus as a Bluetooth device. Sounds weird, but that’s the way it is. Bluetooth has been wonky for 20 years, and that’s probably its destiny.
So there are wires going from the QCA6174 to the PCIe bus and other wires from the same device going to one of the ports of the motherboard’s USB root hub (see details below). On my specific motherboard, the Bluetooth interface of the QCA6174 is connected to port 13 of the root hub, that facilitates the motherboard’s physical USB connector at the back of the computer. So while designing the board, they wired some of the D+/D- wires to the physical ports at the back, and a couple of those go to the QCA6174. I’m saying this over and over again, because it’s so counterintuitive.
Counterintuituve, but it seems like it’s quite common. Intel’s AC 7260 Wifi / Bluetooth combo seems to do exactly the same thing.
As a PCIe device, QCA6174 has Vendor / Product IDs 168c:003e. As as USB device, it’s 0cf3:e300. Confusing? It won’t surprise me if the Wifi and Bluetooth interfaces are two independent units on the same chip, that happen to share an antenna.
Apparently, when the QCA6174 has a bad day, the PCIe interface wakes up properly, and USB doesn’t. The result is that the Wifi works fine, but the Bluetooth is absent.
To add some confusion, the kernel source’s drivers/net/wireless/ath/ath10k/usb.c matches USB device 13b1:0042, which is indeed a Linksys device (the comment in the code says Linksys WUSB6100M). Not clear why it’s there.
On the other hand, drivers/bluetooth/btusb.c, matches a whole range of Atheros USB devices, among others 0cf3:e300, calling it “QCA ROME” in the comments. So it’s the btusb module that takes care of QCA6174′s Bluetooth interface, not anything in ath/ath10k. Cute, isn’t it?
What it looks like when it works
When trying to figure out what’s wrong, it helps knowing that it looks like when it’s OK. So below is a lot of info that was collected when I got the Bluetooth up and running.
When it failed, everything looked exactly the same in relation to the device’s PCIe interface, but there was absolutely nothing related to USB and Bluetooth: No entry for the device in lsusb, hcicontrol nor rfkill, as shown below.
Kernel log output on behalf of the device, as connected to the PCIe bus. Note that the exact same logs appeared when the Bluetooth device was absent. Exactly-exactly. Down to the single character, I’ve compared them. So this isn’t really relevant, but anyhow:
[ 0.126428] pci 0000:03:00.0: [168c:003e] type 00 class 0x028000
[ 0.126456] pci 0000:03:00.0: reg 0x10: [mem 0x92800000-0x929fffff 64bit]
[ 0.126555] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold
[ ... ]
[ 17.616738] ath10k_pci 0000:03:00.0: enabling device (0000 -> 0002)
[ 17.617514] ath10k_pci 0000:03:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ ... ]
[ 17.915091] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:03:00.0.bin failed with error -2
[ 17.915109] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/cal-pci-0000:03:00.0.bin failed with error -2
[ 17.926172] ath10k_pci 0000:03:00.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:1535
[ 17.926173] ath10k_pci 0000:03:00.0: kconfig debug 0 debugfs 1 tracing 1 dfs 0 testmode 0
[ 17.926505] ath10k_pci 0000:03:00.0: firmware ver WLAN.RM.4.4.1-00124-QCARMSWPZ-1 api 6 features wowlan,ignore-otp crc32 d8fe1bac
[ 18.078191] ath10k_pci 0000:03:00.0: board_file api 2 bmi_id N/A crc32 506ce037
[ ... ]
[ 18.642461] ath10k_pci 0000:03:00.0: Unknown eventid: 3
[ 18.658195] ath10k_pci 0000:03:00.0: Unknown eventid: 118809
[ 18.661096] ath10k_pci 0000:03:00.0: Unknown eventid: 90118
[ 18.661772] ath10k_pci 0000:03:00.0: htt-ver 3.56 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1
Note that two attempts to load firmware failed, but apparently the third went OK. Don’t let these error messages mislead you: The kernel messages in this respect were the same when the Bluetooth appeared and when it didn’t.
The “Unknown eventid” may appear more than once.
Its entry with plain lspci (unrelated entries removed):
$ lspci
03:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
And now to the parts that were missing completely when the Bluetooth device didn’t appear: The logs on behalf of the device, connected to the USB bus:
[ 3.764868] usb 1-13: new full-speed USB device number 12 using xhci_hcd
[ 3.913930] usb 1-13: New USB device found, idVendor=0cf3, idProduct=e300
[ 3.915610] usb 1-13: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Plain lsusb:
$ lsusb
[ ... ]
Bus 001 Device 012: ID 0cf3:e300 Atheros Communications, Inc.
[ ... ]
lsusb, tree view (a lot of irrelevant stuff excluded):
$ lsusb -t
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
|__ Port 13: Dev 12, If 1, Class=Wireless, Driver=btusb, 12M
|__ Port 13: Dev 12, If 0, Class=Wireless, Driver=btusb, 12M
More in detail for the device:
# lsusb -v -d 0cf3:e300
Bus 001 Device 012: ID 0cf3:e300 Atheros Communications, Inc.
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 2.01
bDeviceClass 224 Wireless
bDeviceSubClass 1 Radio Frequency
bDeviceProtocol 1 Bluetooth
bMaxPacketSize0 64
idVendor 0x0cf3 Atheros Communications, Inc.
idProduct 0xe300
[ ... ]
Several modules appear in lsmod, but the point is that there are dependencies on bluetooth module, in particular the btusb module. Irrelevant modules deleted from list below:
$ lsusb
rfcomm 77824 16
bnep 20480 2
btusb 45056 0
btrtl 16384 1 btusb
btbcm 16384 1 btusb
btintel 16384 1 btusb
bluetooth 548864 43 btrtl,btintel,bnep,btbcm,rfcomm,btusb
How to check if a Bluetooth device is present
There is no device file for Bluetooth interface, exactly as there’s none for network. Like there’s eth0 for Ethernet, there’s hci0 for Bluetooth.
hcicontrol grabs the info by opening a socket. As in the relevant strace:
socket(PF_BLUETOOTH, SOCK_RAW, 1)
So this is what it looked like with the Bluetooth device present (without it, hciconfig simply prints nothing):
$ hciconfig
hci0: Type: Primary Bus: USB
BD Address: xx:xx:xx:xx:xx:xx ACL MTU: 1024:8 SCO MTU: 50:8
UP RUNNING PSCAN ISCAN
RX bytes:1386 acl:0 sco:0 events:94 errors:0
TX bytes:5494 acl:0 sco:0 commands:94 errors:0
Real hex numbers appear instead of the xx’s above. Use hciconfig -a for more verbose output.
And the device appears in the rfkill list, and shouldn’t be blocked.
$ rfkill list
0: hci0: Bluetooth
Soft blocked: no
Hard blocked: no
1: phy0: Wireless LAN
Soft blocked: no
Hard blocked: no
These two show that the kernel supplies a Bluetooth device to the higher software levels. If Bluetooth doesn’t work, there are other reasons…
TL;DR
Bought an Orico PCIe adapter for USB 3.0 for testing a USB device I’m developing (PVU3-5O2I). It has the VL805 chipset (1106/3483) which isn’t xHCI compliant. So it works only with the vendor’s own drivers for Windows, which you’ll have to struggle a bit to install.
Update: Be sure to read the comments at the bottom of this post. It appears like some people have had different experiences.
Attempt with Linux
That the device is detected by its class (xHCI), and not by its Vendor / Product IDs.
The following was found in the kernel log while booting:
[ 0.227014] pci 0000:03:00.0: [1106:3483] type 00 class 0x0c0330
[ 0.227042] pci 0000:03:00.0: reg 0x10: [mem 0xdf000000-0xdf000fff 64bit]
[ 0.227104] pci 0000:03:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 0.227182] pci 0000:03:00.0: System wakeup disabled by ACPI
and
[ 0.325254] pci 0000:03:00.0: xHCI HW did not halt within 16000 usec status = 0x14
and then
[ 1.474178] xhci_hcd 0000:03:00.0: xHCI Host Controller
[ 1.474421] xhci_hcd 0000:03:00.0: new USB bus registered, assigned bus number 3
[ 1.505919] xhci_hcd 0000:03:00.0: Host not halted after 16000 microseconds.
[ 1.506066] xhci_hcd 0000:03:00.0: can't setup: -110
[ 1.506241] xhci_hcd 0000:03:00.0: USB bus 3 deregistered
[ 1.506494] xhci_hcd 0000:03:00.0: init 0000:03:00.0 fail, -110
[ 1.506640] xhci_hcd: probe of 0000:03:00.0 failed with error -110
The error message comes from xhci_halt() defined in drivers/usb/host/xhci.c, and doesn’t seem to indicate anything special, except that the hardware doesn’t behave as expected.
Update firmware, maybe?
The idea was to try updating the firmware on the card. Maybe that will help?
So I downloaded the driver from the manufacturer and the firmware fix tool from Station Drivers.
Ran the firmware fix tool before installing the driver on Windows. It went smooth. Then recycled power completely and booted Linux again (the instructions require that). Exactly the same error as above.
Went for Windows again, ran the firmware update tool, and this time read the firmware revision. It was indeed 013704, as it should be. So this doesn’t help.
Install driver on Windows 10
Checking in the Device Manager, the card was found as an xHCI controller, but with the “device cannot start (Code 10)”. In other words, Windows’ xHCI driver didn’t like it either.
Attempted installation of the driver. Failed with “Sorry, the install wizard can’t find the proper component for the current platform. Please press OK to terminate the install Wizard”. What it actually means is that the installation software (just downloaded from the hardware vendor) hasn’t heard about Windows 10, and could therefore not find an appropriate driver.
So I found the directory to which the files were extracted, somewhere under C:\Users\{myuser}\AppData\Local\Temp\is-VJVK5.tmp, and copied the USetup directory from there. Then selected xhcdrv.inf for driver installation. It’s intended for Windows 7, but it so happends, that generally drivers for Windows 7 and Windows 10 are the same. It’s the installer that was unnecessarily fussy.
After installing this driver, a “VIA USB eXtensible Host Controller” entry appeared in the USB devices list of the Device Manager, and it said it works properly.
After a reboot, there was “xHCI Root Hub 0″ under “Other Devices” of the Device Manager, with the error message “The drivers for this device are not installed”. It was available under the same USetup directory (ViaHub3.inf).
This added “VIA USB 2 Hub” and “VIA USB 3 Root Hub” to the list of USB devices, and believe it or not, the card started working.
Bottom line: It does work with its own very special drivers for Windows, with a very broken setup procedure.
Apr 2021 update: It turns out that the driver can be downloaded from Gigabyte’s site, as the device is apparently on their GA-78LMT-USB3 motherboard. The file name is mb_driver_via-usb3.exe, and this is the only one I managed to install, and the card works great with it. Other driver packages failed with the installer saying that it “can’t find the proper component”, whatever that meant.
Instead of using my scanner, I put my cell phone on some kind of stand, and shot a lot of paper documents (voice activation is a blessing). But then the files are unnecessarily large. Don’t need all that resolution. So
$ for i in * ; do convert "$i" -scale '33%' -quality 75 "smaller/scan_$i" ; done
And the files are 100-200k each with enough resolution to see the fine print.
Crop fail
This drove me craze for an hour: I wanted to crop the image, so that only the paper part would be left. Selecting the region in GIMP and then using it as in
$ convert in.jpg -crop 2632x3712+152+960 -scale '33%' -quality 75 out.jpg
Failed miserably, giving me an image if wrong dimensions and showing the wrong place, even though I correctly picked width x height + offsetx + offsety, with reference to the topleft corner. Exactly the output that GIMP displayed when I selected the desired area.
The problem turned out to be that that the cell phone wrote the orientation data into the JPG file (as it should). So the coordinates showed in GIMP were after rotating the image, and had therefore nothing to do with those given as the -crop argument.
Solution: In order to obtain the reference image to obtain coordinates with GIMP, go
$ convert in.jpg -strip for-gimp.jpg
And then use those coordinates for cropping. I suppose adding -strip to the conversion command is a good idea as well, to prevent further confusion, in particular if a -rotate is added to get the rotation correctly:
$ convert in.jpg -strip -crop 3712x2632+952+200 -scale '33%' -quality 75 out.jpg
The reason this is so confusing is that ImageMagick retained the orientation information, so both the input and output images were displayed in the same orientation, even though the coordinated used by the -crop flag related to the unrotated image.
Introduction
While implementing Xillybus‘ USB 3.0 general purpose IP core for FPGAs, I found the USB Implementers Forum’s compliance tool handy, yet somewhat quirky, for verifying I got things right. It was USB3CV version 2.1.12.1, running on Windows 10 @32 bit. The 64 bit version works the same (I’ve tested it as well).
A GPLed open-source version for Linux is something one could have wished for (and would probably improve things considerably), but that’s probably too much to expect when Microsoft is all over the USB standard.
These are my notes as I went along.
Obtaining and installing
Download USB3CV from this page. Installation went smooth on Windows 10 @32 bits (and 64 bits as well).
As suggested by the utility’s author, disable UAC completely on the system by invoking regedit, setting the EnableLUA to 0 on the following path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System. And restart Windows. This seems to be a matter of convenience, and not something to do on anything else but an internal computer intended for testing only, as it’s a security hole.
A usbif.json file can be obtained from USB-IF for use by USB3CV. This only allows looking up the Vendor ID, so there’s no problem working without it. There will be two red lines in the log, but the relevant tests passes without this file as well.
Installing on Windows 7 (32 bit) failed on installing the Microsoft Visual C++ 2017 Redistributable (vc_redist.x86.exe) prerequisite. Despite removing everything installed on that computer (under Programs and Features), the problem remained — it’s probably because SP1 isn’t installed on that machine.
The hijack
When invoked, the USB3CV program replaces the original xHCI driver in Window’s I/O stack with one it uses for testing, and returns the original one when exiting. This means that all USB devices that are connected to the USB controller (possibly all USB devices on a motherboard) become effectively disconnected. That includes USB 2.0 and USB 1.1 devices. To work around this, either use good-old PS/2 keyboard and mouse, employ a remote session, or plug in an extra USB board (possibly as low as USB 1.1) and plug the USB mouse and keyboard there. Otherwise, well, no mouse and keyboard on a Windows system.
Also, be sure to close USB3CV before shutting down Windows, or it may not have enough time to bring the original driver back. See more notes on this issue below.
USB3CV is kind enough to prompt for which USB controller to take over, and it’s also fine if you accidentally knock out your own USB mouse and keyboard (a second dialog box requires confirming the takeover with a mouse click, or it times out and reverts it).
The immediate difference when plugging in a device when USB3CV is running (and has taken over the relevant controller) is that nothing happens — there is no enumeration nor descriptor fetching, as there would usually be. This happens only later on, when requesting tests, in which case the controller is scanned for devices. Several times, actually.
Another significant difference is that the test xHCI driver doesn’t have these small workaround features that a usual xHCI driver has for getting unstuck from protocol bug deadlocks, and it doesn’t attempt to smooth protocol errors. Which makes sense: A regular xHCI driver’s goal is to make the device work. The test driver is there to expose errors, so it should get stuck when things are done wrong. Hence it may expose bugs that were smoothed out when the device was connected in a regular manner to a computer. For example, a bug in the device’s LTSSM implementation may be smoothed out by a regular driver by issuing a warm reset and starting all over again, without any log message, but USB3CV’s driver will just fail.
So if weird stuff happens with USB3CV, check your own implementation, and don’t look for bugs in USB3CV. Reject the immediate “but it worked before” instinct to blame something else than your own design.
Hands on
Double-click the USB 3 Gen X CV icon. A dialog box with a list of USB controller(s) opens. Select the one that the device is attached to. All other USB devices on that controller, of all speeds, will be disconnected from the computer. Then the “Command Verifier” asks to verify this choice. If you just disabled your own mouse and keyboard you can’t click “Continue”, and that dialog box will time out, and the hijacked USB controller is released.
Then the main windows opens. Select “Chapter 9 Tests (USB 3 Gen X devices)”, Compliance Test (or Debug for individual tests), and click Run.
This is when USB3CV tries to find devices on the hijacked USB controller (it’s complete silence on wires until then). Sometimes this fails for no apparent reason — see below. A dialog box asking to select the device to work with appears, and is then followed by three dialog boxes asking for the number of retimers. If you don’t know what it’s about, you don’t have any, so select zero on all three.
When tests fail, the error messages may be misleading. For example, a problem with the device’s LTSSM made the GetDecriptor() test fail, spitting out the paragraphs in the spec that aren’t met, but on the wires there was no related SETUP packet sent (because the link wasn’t up, it turned out eventually). However the SET_SEL test went through OK nevertheless. So it may be really confusing. This is easily mistaken for a bug in USB3CV.
Even worse, it seems like a test failure can lead to all kind of unexpected and unrelated errors in following tests.
The tool also complains when the device declares itself as USB 3.0 in the device descriptor rather than USB 3.2, considering it to be a test failure. What about USB 3.0 devices, not supporting anything related to SuperSpeedPlus? Why should they even mention USB 3.2?
When things get stuck
If weird things happen (in particular if the device isn’t found by USB3CV and/or otherwise) re-run USB3CV and exit it, so the original xHCI controller is brought back upon exit. That’s what USB3CV expects to see on invocation, and it doesn’t work properly otherwise. So just run the tool and exit immediately, and then run it again.
It’s better to start USB3CV with the device already plugged in. Moving it to another plug while USB3CV is running often helps.
Crashes
Unfortunately, USB3CV crashes quite a lot (mostly in relation with test failures, in particular failing tests related to low power states). The “Command Verifier Tool has stopped working” dialog box may appear. A rather ironic workaround seems to work: Clicking “Abort” as the tests run (towards the end, actually), and then clicking “No”, a bit before the end of the tests, in the dialog box asking if you really want to abort (so the test isn’t aborted, after all). Sometimes the enumeration test is done, sometimes it isn’t (and fails with some ugly error), so maybe that’s related.
Sometimes USB3CV just gets stuck in a test, and attempting abort the test doesn’t help. Closing the USB3CV windows brings up “Wait for a stack switch” after which Windows crashes with a “Your PC ran into a problem and needs to restart. You can restart”. Which probably means that it’s OK to recycle power (no other possibility left). Windows suggest searching online for “WDF_VIOLATION” too.
Whether the power recycle took place or not, USB3CV didn’t have the opportunity to return the original xHCI driver as it usually does upon a normal exit. Therefore, be sure to invoke and exit USB3CV as mentioned above to get the system back to its original state.
On a test computer, I use /dev/sda1 to contain whatever operating system I need for the moment. At some point, I installed Linux Mint 19.1 properly on that partition, and then I wanted to return to Windows 10. After writing the Windows 10 image to /dev/sda1, I got a message from grub saying it didn’t detect the filesystem.
Hmmm… So the MBR was overwritten by GRUB, and now I need to get it back to Windowish. One can use Microsoft’s rescue disk-on-key, or the quick hack: Download ms-sys, compile with plain “make”, don’t bother to install, and just go from the bin/ directory:
# ./ms-sys -7 /dev/sda
and Windows 10 boots like a charm.
Introduction
FPGA and Linux and all that hi-tech stuff is nice, but nothing compares to the self pride of getting a simple plumbing job done right. So this time it was about installing a pressure gauge under a bathroom sink, between the water outlet for the faucet’s cold water and the faucet itself.
No need to tell me the reading is valid only if the taps are all closed. Anyhow, this is a simple task if you happen to have a Tee fitting that happens to match the stuff that is supposed to connect to it. If not, go find adapters. Or another Tee. If you’re a plumber, you probably have a large box with stuff to try around. As I’m not (the pipelines I usually deal with are digital register pipelines), the trick is to define the exact parts needed, and find them on AliExpress or Ebay. Or maybe even at the hardware store. The latter option turned out pretty difficult, as these parts are cheap, the motivation to help is accordingly, and if I can’t define what I need exactly, it’s a lost battle. So I ordered the stuff from AliExpress eventually. And I got it right.
So here are the notes to myself for the next time I’ll need to do something similar.
The standards
Spoiler: In Israel, everything follows BSP. A former British colony, after all.
Pluming is a local thing, performed by local people, depending on their local hardware stores, with a “give me that thingy” kind of communication. It’s therefore quite difficult to find exact definitions for plumbing fittings. It goes by “see if it fits”.
For threaded fittings, terms like ½” and ¾” are often used, but they refer to nothing measured on the piece of metal itself. These figures used to tell the inner diameter of the pipe itself, but that doesn’t work anymore. So if you want to measure a fitting and tell what it’s called in the market, you need to go to the standard tables.
And here comes the real fun. There are mainly two standards for pipe sizes, which detail the dimensions for the pipes and threads. It seems like the most common ones in Israel (and Europe) follow the British Standard (BSP), but there’s also American National Standard Pipe Thread standards (Often referred to as Nominal Pipe Size, NPS or National Pipe Taper / Thread, NPT).
Sometimes IPS (Iron Pipe Size) is mentioned, but it usually means NPS.
The two standards are incompatible, despite similar terminology and measures. In particular, the thread pitch doesn’t match between the two standards. But there’s also the thread form: American goes with Sellers, which has sharp edges, and British with Whitworth thread form, which is a sine wave shape. Not that I can tell the difference just by looking. So if you try to mix British with American fittings, screwing will probably be difficult, and it won’t hold pressure well, if at all.
Either way, for historical reasons, the inch number used in these standard matches none of the measured diameters of the pipe: Neither the inner or outer. The OD, outer diameter, is the easiest to measure, and should be compared with the standard.
So the main headache is BSP vs. NTP (or NPS, IPS, MIP, FIP and all other abbreviations meaning “American”).
And then we have this thing with DN sizes. For example, DN15 means ½”, and DN20 means ¾”. Even though these were coined for the American standard (and listed on Wikipedia’s page for NPS) it seems like they’re also used in context of BSP. So if a product is listed with a DN number, it probably means nothing on which BSP vs. NTP.
Actual measurements
This is what I measured on my own stuff.
- The tap for my washing machine is a ¾” according to the manual, I measured 0.97″ outer diameter (1.05″ per standard). Apparently washing machines go BSP.
- A typical shower head has a ½” fitting, not clear if American or British (for this size, it seems like NPS and BSP are roughly the same).
- My supply stop valves (those wall taps for bathroom faucets) are 3/8″ (measured 0.64″ outer diameter). In Israel, these wall-mounted angle valves are called “NIL taps (ברז ניל)”, which most likely refers to the German company NIL, and therefore conforms to BSP.
- The pressure gauge is an ¼” (measured outer diameter 0.5″).
Iron and brass
It’s pretty well known, that if brass fittings are used on iron pipes, or if these two metals are mixed in any other way, the iron will corrode rapidly (within a few years), as they work together as a battery. So the material is crucial.
In Israel, all valves, taps and faucets are made of chrome or nickel plated brass, and is therefore OK for use with brass Tees and adapters.
“Teflon” tape (or PTFE)
When there’s no rubber ring sealing, teflon tape is applied on the thread. It works better when there’s a hard end to the screwing, as the force of the end works on the thread and the teflon applied to it. But it can work well otherwise.
It’s 20 rounds around, or it won’t seal. Apply evenly on the thread with slight tension. The direction is as for the turn direction while fitting, i.e. the circular motion will tighten the “teflon” even more (and not unwind it). Don’t cover the few first threads (the end of the pipe) for easier fitting. If the fitting torque is easy all the way, it’s not going to seal.
Introduction
Trying to solve a 90 second wait-on-some-start-job on each boot situation, I found that there’s little info on how to tackle a problem like this out there. Most web pages usually go “I did this, hurray it worked!”, but where do you start solving a problem like this if none of the do-this-do-that advice helps?
So this is a random pile of things to try out. Most of the things shown below didn’t solve my own issue, but these are the tools I collected in my toolbox.
I did this on a Debian 8 (Jessie), and the problem was that boot was stuck for a minute and a half on “A start job is running for sys-subsystem-net-devices-eth0.device”. It’s not directly relevant, except that it influences what I tried out, and hence listed below.
I have written a shorter version of this post which focuses on the specific problem. This post is more about the techniques for figuring out what’s going on.
PID 1 at your service
The tricky part of systemd is that much of the activity is done directly by the systemd process, having PID 1. Requests to start and stop services and other units are sent via DBus messages, i.e. over connections to UNIX sockets. To someone who is used to the good-old-systemV Linux, this is voodoo at its worst, but there are simple ways to keep track of this, as shown below.
In particular, don’t strace the “systemctl start” process — it just sends the request over DBus. Rather, attach strace to PID 1, also explained below. That’s where the fork to the actual job process takes place, if at all.
And don’t get confused by having /org/freedesktop/ appearing everywhere in the logs. It doesn’t necessarily have anything to do with the desktop (if such exists), and is likewise relevant to a non-graphical system. DBus’ started as a solution for desktop machines, and that’s the only reason “freedesktop” is everywhere.
First thing first
Read the man page, “man systemd.device” in my case. If there’s another computer with different configuration, see what happens there. What does it look like when it works?
journald -x
As mentioned on this page, if something went wrong during boot, check out the log to see why. The -x flag adds valuable info for solving issues of this sort.
For example,
# journald -x
[ ... ]
May 20 11:41:20 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 11:41:20 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.
-- Subject: Unit sys-subsystem-net-devices-eth0.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit sys-subsystem-net-devices-eth0.device has failed.
--
-- The result is timeout.
May 20 11:41:20 diskless systemd[1]: Dependency failed for ifup for eth0.
-- Subject: Unit ifup@eth0.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ifup@eth0.service has failed.
--
-- The result is dependency.
Now, how to read it: It’s clear that sys-subsystem-net-devices-eth0.device is the unit that didn’t manage to kick off. But the more important clue is that ifup@eth0.service failed, because it depends on the former.
It’s important, because it explains why an attempt to launch sys-subsystem-net-devices-eth0.device was done in the first place. A lot of “there I fixed it” pages on the web disable the latter service, and get rid of the problem, not necessarily understanding how and why.
So here’s why: On some relatively early systemd versions, *.device units simply won’t launch. It’s worked around by making sure that no other unit requests them. But then some software package isn’t aware of this, and requests a .device unit and there’s the deadlock. Or more precisely, waiting 90s for the timeout.
Kicking if off manually
Obviously, the unit is inactive:
# systemctl status sys-subsystem-net-devices-eth0.device
● sys-subsystem-net-devices-eth0.device
Loaded: loaded
Active: inactive (dead)
May 20 12:47:07 diskless systemd[1]: Expecting device sys-subsystem-net-devices-eth0.device...
May 20 12:48:37 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 12:48:37 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.
Note that there’s also systemctl show for a detailed printout of the unit’s assignments.
Try to start it manually (this little session took 90 seconds, right?)
# systemctl start sys-subsystem-net-devices-eth0.device
Job for sys-subsystem-net-devices-eth0.device timed out.
The important takeaway is that we can repeat the problem on a running system (as opposed to a booting one). This allows running some tools for looking at what happens.
On the other hand, I’m not all that sure .device units are supposed to be started or stopped with systemctl at all. Or more likely, that requesting a start or stop on device units means waiting for them to reach the desired state by themselves. This goes along with the observation I made with strace (below), showing that systemd does nothing meaningful until it times out. So most likely, it just looked up the state of the device unit, saw it wasn’t started, and then went to sleep, essentially waiting for a udev event to bring the unit to the desired state, and consequently return a success status to the start request.
In fact, when I tried “systemctl stop” on the eth0 device on another machine, on which it the device file was activated automatically, it got stuck exactly the same way as for starting it on Debian 8.
As far as I understand, these should become active and inactive by a systemd-udev event by virtue of udev labeling. They are there to trigger other units that depend on them, not to be controlled explicitly.
But here comes a major red herring: Curiously enough, during the 90 seconds of waiting, “systemctl starts” created a child process, “/bin/systemd-tty-ask-password-agent –watch”. One can easily be misled into thinking that it’s this child process that blocks the completion of the former command.
So first, let’s convince ourselves that it’s not the problem, because running
# systemctl --no-ask-password start sys-subsystem-net-devices-eth0.device
doesn’t create this second process, but is stuck nevertheless.
This systemd-tty-ask-password-agent process listens for system-wide requests for obtaining a password from the user (e.g. when opening a crypto disk), and does that job if necessary. systemctl launches it just in case, regardless of the unit requested for starting. This is the way to make sure passwords are collected, if so needed. This process is usually not visible, because systemctl commands typically don’t last very long. More about it here.
Actually, checking with strace, systemctl was blocking all those 90 seconds on a ppoll(), waiting for some response from the /run/systemd/private UNIX socket. That’s the DBus connection with process number 1, systemd. In other words, systemctl requested the start of the unit over DBus, and then waited for the result for 90 seconds, at which point it got the answer that the attempt timed out.
DBus
DBus is like a small client-server intranet that is local to the computer. It allows processes that are connected to DBus two features:
- A process can act as a client, and request a function call on another process that is connected to the same bus (acting as a server). When the function call finishes, its result value(s) are returned through the bus, so there’s a real function call look and feel. Both the function call and the result may involve multiple elements of data with various types (strings, booleans, arrays, you name it). DBus has adopted object-oriented teminology, so these function calls are called method calls, and the whole OO jargon (properties in particular) applies.
- A process can broadcast events on the bus, which are called signals, so other processes that are subscribed to this signal get a notification.
DBus was originally developed to allow GUI desktop utilities to talk with each other. Later on, this framework was adopted for more fundamental uses, in particular systemd, Pulseaudio, bluetoothd, cupsd etc. So the “original” Dbus is now called the “session bus”, and the latter is the “system bus”. Two completely independent networks, if you like.
For a more comprehensive explanation, see Wikipedia’s page. I show an example of making method calls on DBus using the command line on this post. For a simple example on how Dbus is used, I suggest this.
Listening to DBus
There’s are two utilities, dbus-monitor and “busctl monitor” for dumping DBus messages (an eavesdrop add-on may be required to allow system-wide message monitoring, but this was not the case on my system). I’ll present monitoring of the system Dbus here. For an example of monitoring a session’s messages, there’s another post of mine.
So on the invocation of
# systemctl start sys-subsystem-net-devices-eth0.device
the output of
# dbus-monitor --system
was
signal sender=org.freedesktop.DBus -> dest=:1.8 serial=2 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameAcquired
string ":1.8"
signal sender=:1.0 -> dest=(null destination) serial=127 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=UnitNew
string "sys-subsystem-net-devices-eth0.device"
object path "/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice"
signal sender=:1.0 -> dest=(null destination) serial=128 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=JobNew
uint32 173
object path "/org/freedesktop/systemd1/job/173"
string "sys-subsystem-net-devices-eth0.device"
signal sender=:1.0 -> dest=(null destination) serial=129 path=/org/freedesktop/systemd1/job/173; interface=org.freedesktop.DBus.Properties; member=PropertiesChanged
string "org.freedesktop.systemd1.Job"
array [
dict entry(
string "State"
variant string "running"
)
]
array [
]
and when the timeout occurs with a
Job for sys-subsystem-net-devices-eth0.device timed out.
the following output is captured on the DBus:
signal sender=:1.0 -> dest=(null destination) serial=141 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=JobRemoved
uint32 173
object path "/org/freedesktop/systemd1/job/173"
string "sys-subsystem-net-devices-eth0.device"
string "timeout"
signal sender=:1.0 -> dest=(null destination) serial=142 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=UnitRemoved
string "sys-subsystem-net-devices-eth0.device"
object path "/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice"
Clearly, everything was done by the systemd main process, and almost nothing by the process created from console.
The “sender=:1.0″ part means that the sender is the process number 1 (systemd). Try
$ busctl
(that’s short for “busctl list”) to get a mapping between these addresses and processes.
See the number 173 in the object paths all over in the dbus traffic? That’s the job number as listed in
# systemctl list-jobs
JOB UNIT TYPE STATE
173 sys-subsystem-net-devices-eth0.device start running
1 jobs listed.
Note that these job numbers have absolutely nothing to do with the Linux PIDs.
Using strace
strace is often very useful for resolving OS problems. It’s however important to realize that the old-fashioned way of stracing the process created on command line will probably not yield much information, because this process only sends a request over DBus.
Instead, strace the process that does the actual work: PID 1, the Mother Of All Processes, the almighty systemd itself. I have to admit that I was first intimidated by the idea to attach strace to this process, but it turns out that it’s usually quite calm, and spits out relatively little unrelated mumbo-jumbo.
Bonus: It’s always the same command:
# strace -p 1 -s 128 -ff -o systemd-trace
This makes a file for each process systemd may fork into. If things went wrong because some process didn’t execute properly, this is how we catch it.
For example, when running the said “systemctl start sys-subsystem-net-devices-eth0.device” command, this was the output:
accept4(12, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 13
getsockopt(13, SOL_SOCKET, SO_PEERCRED, {pid=901, uid=0, gid=0}, [12]) = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "\270\305\231\206+&\262MI\313[\337y}\314V", 16) = 16
close(18) = 0
fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(13, F_GETFD) = 0x1 (flags FD_CLOEXEC)
fstat(13, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
setsockopt(13, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
setsockopt(13, SOL_SOCKET, 0x22 /* SO_??? */, [0], 4) = 0
getsockopt(13, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(13, SOL_SOCKET, 0x21 /* SO_??? */, [8388608], 4) = 0
getsockopt(13, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
setsockopt(13, SOL_SOCKET, 0x20 /* SO_??? */, [8388608], 4) = 0
getsockopt(13, SOL_SOCKET, SO_PEERCRED, {pid=901, uid=0, gid=0}, [12]) = 0
fstat(13, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
getsockopt(13, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
getsockname(13, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, [23]) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0AUTH EXTERNAL 30\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 256}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 45
epoll_ctl(4, EPOLL_CTL_ADD, 13, {0, {u32=2784593808, u64=94857137325968}}) = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "\10,\360z\t\363+\355D\2556NLkhL", 16) = 16
close(18) = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "$F\3302\215\326\320\251\261\240\217\232\224\1\346\205", 16) = 16
close(18) = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "|\313E\273R\264 \375\v\245\235\206h\247\30-", 16) = 16
close(18) = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "{\200\255\356\26\341b4V_P\225aHkO", 16) = 16
close(18) = 0
epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN|EPOLLOUT, {u32=2784593808, u64=94857137325968}}) = 0
timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2519, 883043000}}, NULL) = 0
epoll_wait(4, {{EPOLLOUT, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359675827}) = 0
timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2508, 883043000}}, NULL) = 0
epoll_wait(4, {{EPOLLOUT, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359719924}) = 0
sendmsg(13, {msg_name(0)=NULL, msg_iov(3)=[{"OK b8c599862b26424d89cb5bdf797dcc56\r\nAGREE_UNIX_FD\r\n", 52}, {NULL, 0}, {NULL, 0}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 52
epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=2784593808, u64=94857137325968}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359793419}) = 0
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359846496}) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\0018\0\0\0\1\0\0\0\240\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\t\0\0\0StartUnit\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop."..., 208}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 208
getuid() = 0
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1&\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1o\0\0", 32}, {"!\0\0\0/org/freedesktop/systemd1/job/242\0", 38}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 70
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1H\0\0\0\2\0\0\0\206\0\0\0\1\1o\0!\0\0\0/org/freedesktop/systemd1/job/242\0\0\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChange"..., 152}, {"\34\0\0\0org.freedesktop.systemd1.Job\0\0\0\0\34\0\0\0\5\0\0\0State\0\1s\0\0\0\0\7\0\0\0running\0\0\0\0\0", 72}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 224
sendmsg(35, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1H\0\0\0\302\0\0\0\206\0\0\0\1\1o\0!\0\0\0/org/freedesktop/systemd1/job/242\0\0\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChange"..., 152}, {"\34\0\0\0org.freedesktop.systemd1.Job\0\0\0\0\34\0\0\0\5\0\0\0State\0\1s\0\0\0\0\7\0\0\0running\0\0\0\0\0", 72}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 224
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 360201428}) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1*\0\0\0\2\0\0\0\227\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\7\0\0\0GetUnit\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1"..., 186}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 186
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1S\0\0\0\3\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1o\0\0", 32}, {"N\0\0\0/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice\0", 83}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 115
epoll_wait(4, {}, 33, 0) = 0
clock_gettime(CLOCK_BOOTTIME, {2508, 360292582}) = 0
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 360319860}) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\0019\0\0\0\3\0\0\0\300\0\0\0\1\1o\0N\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice\0\0\3\1s\0\3\0\0\0Get\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Pro"..., 241}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 241
lstat("/etc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/etc/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/etc/systemd/system", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/etc/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0
lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0
lstat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=120, ...}) = 0
lstat("/run/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0
lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0
lstat("/run/systemd/generator", {st_mode=S_IFDIR|0755, st_size=360, ...}) = 0
lstat("/run/systemd/generator/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/local", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
lstat("/usr/local/lib", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
lstat("/usr/local/lib/systemd", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/lib/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/lib/systemd/system", {st_mode=S_IFDIR|0755, st_size=36864, ...}) = 0
lstat("/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/lib/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/lib/systemd/system", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0
lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0
lstat("/run/systemd/generator.late", {st_mode=S_IFDIR|0755, st_size=440, ...}) = 0
lstat("/run/systemd/generator.late/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/run/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/run/systemd/generator/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/local/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/run/systemd/generator.late/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1\10\0\0\0\4\0\0\0\17\0\0\0\5\1u\0\3\0\0\0\10\1g\0\1v\0\0", 32}, {"\1b\0\0\0\0\0\0", 8}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 40
epoll_wait(4, {{EPOLLIN, {u32=3, u64=3}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 883466886}) = 0
read(29, "\1\0\0\0\0\0\0\0", 8) = 8
timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2509, 383043000}}, NULL) = 0
epoll_wait(4, {{EPOLLIN, {u32=3, u64=3}}}, 33, -1) = 1
So what we have here is the acceptance of the connection, sending and receiving messages like those captured with dbus-monitor above, but nothing meaningful was executed: There was no productive system call done, and systemctl didn’t fork. But we can also see which files (actually, directories) systemd was looking for, and didn’t find: It really wanted to find some sys-subsystem-net-devices-eth0.device.d in one of the famous paths. Not that it matters so much, though.
By contrast and for example, if “systemctl start atd” is launched and atd is not already running, systemd (as process 1) forks into another process and calls execve(“/usr/sbin/atd”) on the forked process (after a whole lot of cgroup stuff, closing files etc.). If the same systemctl command is called with the atd service already running, there is no such fork (not surprisingly, systemd does nothing when attempting to start an already started service).
For the record, the failed lookup of directories ins’t the problem: I had the luxury of trying exactly the same on a machine that doesn’t get stuck on starting sys-subsystem-net-devices-eth0.device, and the strace looked the same. Except that the systemd job was terminated immediately and successfully, rather than getting stuck.
On my own behalf, this was the moment I realized that this unit shouldn’t be started at all on the system it gets stuck on.
Checking udev
If a boot problem is related to a device, maybe something went wrong with the device’s bringup, which in turn prevented the relevant .device unit to become active, and then some other unit waits for it…?
So what is running when eth0 is detected?
# udevadm test /sys/class/net/eth0
calling: test
version 215
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.
load module index
Network interface NamePolicy= disabled on kernel commandline, ignoring.
timestamp of '/etc/systemd/network' changed
timestamp of '/lib/systemd/network' changed
Parsed configuration file /lib/systemd/network/99-default.link
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
read rules file: /lib/udev/rules.d/42-usb-hid-pm.rules
read rules file: /lib/udev/rules.d/50-bluetooth-hci-auto-poweron.rules
read rules file: /lib/udev/rules.d/50-firmware.rules
read rules file: /lib/udev/rules.d/50-udev-default.rules
read rules file: /lib/udev/rules.d/55-dm.rules
[ ... ]
read rules file: /etc/udev/rules.d/90-local-imagedisk.rules
read rules file: /lib/udev/rules.d/95-cd-devices.rules
read rules file: /lib/udev/rules.d/95-udev-late.rules
read rules file: /lib/udev/rules.d/97-hid2hci.rules
read rules file: /lib/udev/rules.d/99-systemd.rules
rules contain 393216 bytes tokens (32768 * 12 bytes), 23074 bytes strings
21081 strings (168928 bytes), 18407 de-duplicated (148529 bytes), 2675 trie nodes used
NAME 'eth0' /etc/udev/rules.d/70-persistent-net.rules:2
IMPORT builtin 'net_id' /lib/udev/rules.d/75-net-description.rules:6
IMPORT builtin 'hwdb' /lib/udev/rules.d/75-net-description.rules:12
IMPORT builtin 'path_id' /lib/udev/rules.d/80-net-setup-link.rules:5
IMPORT builtin 'net_setup_link' /lib/udev/rules.d/80-net-setup-link.rules:11
Config file /lib/systemd/network/99-default.link applies to device eth0
RUN 'net.agent' /lib/udev/rules.d/80-networking.rules:1
RUN '/lib/systemd/systemd-sysctl --prefix=/proc/sys/net/ipv4/conf/$name --prefix=/proc/sys/net/ipv4/neigh/$name --prefix=/proc/sys/net/ipv6/conf/$name --prefix=/proc/sys/net/ipv6/neigh/$name' /lib/udev/rules.d/99-systemd.rules:61
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/net/eth0
ID_BUS=pci
ID_MODEL_FROM_DATABASE=RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (Motherboard)
ID_MODEL_ID=0x8168
ID_NET_DRIVER=r8169
ID_NET_NAME_MAC=enx408d5c4d1b15
ID_NET_NAME_PATH=enp1s0
ID_PATH=pci-0000:01:00.0
ID_PATH_TAG=pci-0000_01_00_0
ID_PCI_CLASS_FROM_DATABASE=Network controller
ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
ID_VENDOR_FROM_DATABASE=Realtek Semiconductor Co., Ltd.
ID_VENDOR_ID=0x10ec
IFINDEX=2
INTERFACE=eth0
SUBSYSTEM=net
SYSTEMD_ALIAS=/sys/subsystem/net/devices/eth0
TAGS=:systemd:
USEC_INITIALIZED=20767
run: 'net.agent'
run: '/lib/systemd/systemd-sysctl --prefix=/proc/sys/net/ipv4/conf/eth0 --prefix=/proc/sys/net/ipv4/neigh/eth0 --prefix=/proc/sys/net/ipv6/conf/eth0 --prefix=/proc/sys/net/ipv6/neigh/eth0'
unload module index
Unloaded link configuration context.
Note that the “systemd” tag is in place, and so is the SYSTEMD_ALIAS assignment. So there’s probably no reason udev-wise why there was no .device unit activated.
hwdb.bin
Update the file /etc/udev/hwdb.bin:
# udevadm hwdb --update
Note that it doesn’t touch /lib/udev/hwdb.bin, and I’m unclear how they interact, if at all (“wild” guess: /etc/udev/hwdb.bin overrules the one in /lib/udev, if it exists).
On newer systems it appears to be “systemd-hwdb update”.
Introduction
After installing wireshark (and tons of packages it depends on) on a rather fresh and bare-boned Debian 8 (Jessie), I got the “A start job is running for sys-subsystem-net-devices-eth0.device” message for a minute and half on every boot.
It was exceptionally difficult to find the reason, because so many packages were installed along with wireshark.
This is the short version of how this was solved. For the entire battery of stuff I tried out, I’ve written a separate post.
Bad omens
# systemctl status sys-subsystem-net-devices-eth0.device
● sys-subsystem-net-devices-eth0.device
Loaded: loaded
Active: inactive (dead)
May 20 12:47:07 diskless systemd[1]: Expecting device sys-subsystem-net-devices-eth0.device...
May 20 12:48:37 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 12:48:37 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.
OK. Not surprising it’s not active. So start manually…?
# systemctl start sys-subsystem-net-devices-eth0.device
Job for sys-subsystem-net-devices-eth0.device timed out.
The second line appeared after a minute and a half, of course.
So I went to another, more recent machine (Mint 19) and went
$ systemctl status sys-subsystem-net-devices-eth0.device
● sys-subsystem-net-devices-eth0.device - Killer E2500 Gigabit Ethernet Controll
Loaded: loaded
Active: active (plugged) since Wed 2019-02-20 14:48:54 IST; 2 months 30 days
Device: /sys/devices/pci0000:00/0000:00:1c.2/0000:04:00.0/net/eth0
And then comparing the outputs of just
$ systemctl
it became evident that *.device units are listed on the Mint 19 machine, but not on Debian 8.
Which led me to the conclusion that sys-subsystem-net-devices-eth0.device isn’t meant to be on Debian 8. That the problem isn’t that it’s not starting when commanded to do so, but that it’s not supposed to be started that way. The problem is that some other unit requests it.
As far as I understand, these .device units should become active and inactive by a systemd-udev event by virtue of udev labeling. They are there to trigger other units that depend on them, not to be controlled explicitly. For some reason they aren’t activated on the Debian 8 machine, despite udev rules being roughly the same as on the Mint 19 machine.
In the lack of proper docs (?), I’m left to guess that requesting a start or stop on device units means waiting for them to reach the desired state by themselves. This goes along with an observation I’ve made with strace, showing that systemd does nothing meaningful until it times out. So most likely, it just looked up the state of the device unit, saw it wasn’t started, and then went to sleep, essentially waiting for a udev event to bring the unit to the desired state, and consequently return a success status to the start request.
In fact, when I tried “systemctl stop” on the eth0 device on Mint 19 (i.e. the machine on which it was already loaded) it got stuck exactly the same way as for starting it on Debian 8. So that command probably meant “wait until eth0 goes away”.
Closing in
The trick is now to find which unit causes the attempt to kick off sys-subsystem-net-devices-eth0.device.
# journald -x
[ ... ]
May 20 11:41:20 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 11:41:20 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.
-- Subject: Unit sys-subsystem-net-devices-eth0.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit sys-subsystem-net-devices-eth0.device has failed.
--
-- The result is timeout.
May 20 11:41:20 diskless systemd[1]: Dependency failed for ifup for eth0.
-- Subject: Unit ifup@eth0.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ifup@eth0.service has failed.
--
-- The result is dependency.
In English: It’s clear that sys-subsystem-net-devices-eth0.device is the unit that didn’t manage to kick off. But the more important clue is that ifup@eth0.service failed, because it depends on the former. The easier solution lies in the latter.
But frankly, I don’t really understand what happened here. If eth0 was detected by systemd, why wasn’t the relevant device unit activated? Or if it wasn’t, why was ifup@eth0.service kicked off? The relevant unit file is a wildcard service, not naming any specific device name.
Solution
The textbook solution is to find why .device files aren’t generated at all on my Debian 8 system, fix that, and then there won’t be any delay. The correct solution in some cases is to manipulate the udev rules, adding a “TAG+=”systemd”" rule to the device, so the device unit is started automatically by systemd (man systemd.device). In my case this tag was already there, so it’s probably some issue with the service that’s supposed to respond to the udev event. So that’s a dead end.
So go the clumsy way: Remove the unit file that requests the device unit (or maybe I should have masked it by adding a file in /etc?). In this case, it’s /lib/systemd/system/ifup@.service, which said:
[Unit]
Description=ifup for %I
After=local-fs.target network-pre.target networking.service systemd-sysctl.service
Before=network.target
BindsTo=sys-subsystem-net-devices-%i.device
After=sys-subsystem-net-devices-%i.device
ConditionPathIsDirectory=/run/network
DefaultDependencies=no
[Service]
ExecStart=/sbin/ifup --allow=hotplug %I
ExecStop=/sbin/ifdown %I
RemainAfterExit=true
and then make sure this had no adverse side effects (none found so far). Actually, removing this file can’t be worse than it was when it took 90 seconds to boot, because this service wasn’t launched anyhow, as its precondition never started.
The incident
All of the sudden, playing videos with Mplayer opened a black window. Sometimes going fullscreen helped, sometimes it didn’t, sometimes with video playing but without OSD. ffplay worked, but somewhat limping.
Setting: Linux Mint 19 on an x86_64, with a couple of fanless GeForce GT 1030 graphics cards and Cinnamon 3.8.9.
Mplayer’s output in this situation:
Playing IHS_1235.MOV.
libavformat version 57.83.100 (external)
libavformat file format detected.
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f858e2362a0]Protocol name not provided, cannot determine if input is local or a network protocol, buffers and access patterns cannot be configured optimally without knowing the protocol
[lavf] stream 0: video (h264), -vid 0
[lavf] stream 1: audio (pcm_s16le), -aid 0, -alang eng
VIDEO: [H264] 1920x1080 24bpp 59.940 fps 36067.5 kbps (4402.8 kbyte/s)
==========================================================================
Opening video decoder: [ffmpeg] FFmpeg's libavcodec codec family
libavcodec version 57.107.100 (external)
Selected video codec: [ffh264] vfm: ffmpeg (FFmpeg H.264)
==========================================================================
Opening audio decoder: [pcm] Uncompressed PCM audio decoder
AUDIO: 48000 Hz, 2 ch, s16le, 1536.0 kbit/100.00% (ratio: 192000->192000)
Selected audio codec: [pcm] afm: pcm (Uncompressed PCM)
==========================================================================
AO: [pulse] 48000Hz 2ch s16le (2 bytes per sample)
Starting playback...
Movie-Aspect is undefined - no prescaling applied.
VO: [vdpau] 1920x1080 => 1920x1080 Planar YV12
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_presentation_queue_block_until_surface_idle: An invalid handle value was provided.
[vdpau] Error when calling vdp_video_mixer_render: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_display: An invalid handle value was provided.
A: 0.2 V: 0.0 A-V: 0.216 ct: 0.000 0/ 0 ??% ??% ??,?% 0 0
[vdpau] Error when calling vdp_presentation_queue_block_until_surface_idle: An invalid handle value was provided.
[vdpau] Error when calling vdp_video_mixer_render: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_block_until_surface_idle: An invalid handle value was provided.
[vdpau] Error when calling vdp_video_mixer_render: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_display: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_display: An invalid handle value was provided.
And a lot of error messages, with “invalid handle value was provided” all over the place.
What does the graphics card have to say?
Opening Nvidia’s graphical control panel (Nvidia X Server Settings), it turns out that “User Dedicated Memory” stands at 1864 MB out of 1998 MB (93%). No wonder things don’t work.
OK, so who’s eating up all RAM? I have a wild guess, but nothing like getting it black on white:
$ nvidia-smi
Sun Apr 14 14:39:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 1030 Off | 00000000:17:00.0 Off | N/A |
| 0% 41C P8 N/A / 30W | 1MiB / 2001MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GT 1030 Off | 00000000:65:00.0 On | N/A |
| 0% 51C P8 N/A / 30W | 1914MiB / 1998MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 1803 G /usr/lib/xorg/Xorg 433MiB |
| 1 2373 G cinnamon 1310MiB |
| 1 54180 G ...uest-channel-token=14764917277860092693 165MiB |
| 1 68188 G /usr/bin/nvidia-settings 0MiB |
+-----------------------------------------------------------------------------+
(The memory consumptions are at the far right on each line. Scroll to see them)
At that very moment it had slurped quite some CPU RAM as well: 5.7 GB virtual memory allocated and 1.3 GB resident (real RAM). So leaking memory everywhere. That’s after running two months.
The other hog is Google Chrome, by the way, (165 MiB), also after running continuously for two months.
Solution
The solution is surprisingly simple and harmless: Restart Cinammon. Yes, you can do this even if there are a lot of windows open, spread out in different workspaces. They will remain in place, don’t worry.
The somewhat scary option is to request to switch user on the desktop’s GUI, and login as the same user again. Scary, because I get a blank screen with only a cursor pointer. To get out of that situation, Ctrl-Alt-F7 got me a proper login screen. And after logging in, all is fine. I’m not all that sure I’ll get that login screen every time, though.
The safer way is to reset Cinnamon without logging out, but the tabs will be reordered within each workspace. To do this (as I mentioned on another post):
Press ALT-F2, type “r” and Enter. Look away for a few seconds, because what happens next looks like a sudden reboot, but it isn’t. All comes back.
Except that the order of the Windows tabs is reordered. It seems to help to do this before the restart: Go through each of the workspaces, minimize all windows and then open them from the tabs, going from left to right. Helps partially.
And a lot of memory has been freed. Resident CPU RAM went down from 1.3 GB to 256 MB, but even more important:
$ nvidia-smi
Sun Apr 14 14:49:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 1030 Off | 00000000:17:00.0 Off | N/A |
| 0% 41C P8 N/A / 30W | 1MiB / 2001MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GT 1030 Off | 00000000:65:00.0 On | N/A |
| 0% 52C P0 N/A / 30W | 701MiB / 1998MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 1803 G /usr/lib/xorg/Xorg 498MiB |
| 1 2373 G cinnamon 17MiB |
| 1 54180 G ...uest-channel-token=14764917277860092693 177MiB |
+-----------------------------------------------------------------------------+
That’s a crash diet until the next time. Once a month, I guess.
As I implemented a MiG controller for KC705′s on-board SODIMM, the controller failed to calibrate at first. Despite that I’ve copied the instantiation and port connections from the example design. As for the pin placement, this was taken care of by the core itself, by virtue of memctl/memctl/user_design/constraints/memctl.xdc within MiG’s dedicated directory (which I called memctl).
And yet init_calib_complete remained low, indicating calibration had failed.
Actually, I had followed Xilinx’ XTP196 slides, except that I didn’t make an example design — I had my own.
Having ruled out holding the MiG controller in reset or a faulty pinout, it turned out that a constraint needs to be added to the application XDC file, namely
set_property slave_banks {32 34} [get_iobanks 33]
What it says (see UG912), is that I/O banks 32 and 34 should calibrate their on-chip terminations (DCI, Digitally Controller Impedance) based upon the reference resistors connected to the dedicated pins on bank 33. Without this constraints, on-chip termination on banks 32 and 34 doesn’t work, and the signal integrity on the relevant I/Os goes down the toilet. No wonder it didn’t calibrate.
The hint for this is on page 37 of XTP196, “Modifications to Example Design”, which tells us to overwrite the example design created by Vivado with a ZIP file Xilinx supplies. On the following page it lists the changes made, among others “Added DCI Cascade constraints to XDC”.