5 GHz Wifi access point on Linux Mint 19 / Atheros

+ how to just compile an Ubuntu distribution kernel without too much messing around.

Introduction

It’s not a real computer installation if the Wifi works out of the box. So these are my notes to self for setting up an access point on a 5 GHz channel. I’ll need them anyhow, because each kernel upgrade will require tweaking the kernel module again.

The machine is running Linux Mint 19 (Tara, based upon Ubuntu Bionic, with kernel 4.15.0-20-generic). The NIC is Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter, Vendor/Product IDs 168c:003e.

Installing hostapd

# apt install hostapd

Set up /etc/hostapd/hostapd.conf as follows:

macaddr_acl=0
auth_algs=1
ignore_broadcast_ssid=0

#Support older EAPOL authentication (version 1)
eapol_version=1

# Uncomment these for base WPA & WPA2 support with a pre-shared key
wpa=3
wpa_key_mgmt=WPA-PSK
wpa_pairwise=TKIP
rsn_pairwise=CCMP

wpa_passphrase=mysecret

# Customize these for your local configuration...
interface=wlan0
hw_mode=a
channel=52
ssid=mywifi
country_code=GD

and /etc/default/hostapd as follows:

DAEMON_CONF="/etc/hostapd/hostapd.conf"
DAEMON_OPTS=""

Then unmask the service by deleting /etc/systemd/system/hostapd.service (it’s a symbolic link to /dev/null). Attempting to start the service then led to the errors described in the next section.
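
By the way, systemctl can do the unmasking itself, which amounts to removing the same symlink:

# systemctl unmask hostapd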

5 GHz is not for plain people

Nov 27 21:14:04 hostapd[6793]: wlan0: IEEE 802.11 Configured channel (52) not found from the channel list of current mode (2) IEEE 802.11a
Nov 27 21:14:04 hostapd[6793]: wlan0: IEEE 802.11 Hardware does not support configured channel

What do you mean it’s not supported? It’s on the list!

# iw list
[ ... ]
	Band 2:
[ ... ]
          Frequencies:
			* 5180 MHz [36] (17.0 dBm) (no IR)
			* 5200 MHz [40] (17.0 dBm) (no IR)
			* 5220 MHz [44] (17.0 dBm) (no IR)
			* 5240 MHz [48] (17.0 dBm) (no IR)
			* 5260 MHz [52] (24.0 dBm) (no IR, radar detection)
			* 5280 MHz [56] (24.0 dBm) (no IR, radar detection)
			* 5300 MHz [60] (24.0 dBm) (no IR, radar detection)
			* 5320 MHz [64] (24.0 dBm) (no IR, radar detection)
			* 5500 MHz [100] (24.0 dBm) (no IR, radar detection)
			* 5520 MHz [104] (24.0 dBm) (no IR, radar detection)
			* 5540 MHz [108] (24.0 dBm) (no IR, radar detection)
			* 5560 MHz [112] (24.0 dBm) (no IR, radar detection)
			* 5580 MHz [116] (24.0 dBm) (no IR, radar detection)
			* 5600 MHz [120] (24.0 dBm) (no IR, radar detection)
			* 5620 MHz [124] (24.0 dBm) (no IR, radar detection)
			* 5640 MHz [128] (24.0 dBm) (no IR, radar detection)
			* 5660 MHz [132] (24.0 dBm) (no IR, radar detection)
			* 5680 MHz [136] (24.0 dBm) (no IR, radar detection)
			* 5700 MHz [140] (24.0 dBm) (no IR, radar detection)
			* 5720 MHz [144] (24.0 dBm) (no IR, radar detection)
			* 5745 MHz [149] (30.0 dBm) (no IR)
			* 5765 MHz [153] (30.0 dBm) (no IR)
			* 5785 MHz [157] (30.0 dBm) (no IR)
			* 5805 MHz [161] (30.0 dBm) (no IR)
			* 5825 MHz [165] (30.0 dBm) (no IR)
			* 5845 MHz [169] (disabled)
[ ... ]

The problem becomes evident when executing hostapd with the -dd flag (set via /etc/default/hostapd), in which case it lists the allowed channels. None of the 5 GHz channels is listed. The underlying reason is the “no IR” part in the “iw list” output, meaning no Initiating Radiation, hence no access point allowed.
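
For the record, turning on that debug output amounts to temporarily setting this in /etc/default/hostapd (my assumption of the obvious way; -dd is hostapd’s debug flag):

DAEMON_OPTS="-dd"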

It’s very cute that the driver makes sure I won’t break the regulations, but it so happens that these frequencies are allowed in Israel for indoor use. My computer is indoors.

The way to work around this is to edit one of the driver’s sources, and use it instead.

Note that the typical error message when starting hostapd as a systemd service is quite misleading:

hostapd[735]: wlan0: IEEE 802.11 Configured channel (52) not found from the channel list of current mode (2) IEEE 802.11a
hostapd[735]: wlan0: IEEE 802.11 Hardware does not support configured channel
hostapd[735]: wlan0: IEEE 802.11 Configured channel (52) not found from the channel list of current mode (2) IEEE 802.11a
hostapd[735]: Could not select hw_mode and channel. (-3)
hostapd[735]: wlan0: interface state UNINITIALIZED->DISABLED
hostapd[735]: wlan0: AP-DISABLED
hostapd[735]: wlan0: Unable to setup interface.
hostapd[735]: wlan0: interface state DISABLED->DISABLED
hostapd[735]: wlan0: AP-DISABLED
hostapd[735]: hostapd_free_hapd_data: Interface wlan0 wasn't started
hostapd[735]: nl80211: deinit ifname=wlan0 disabled_11b_rates=0
hostapd[735]: wlan0: IEEE 802.11 Hardware does not support configured channel

[ ... ]

systemd[1]: hostapd.service: Control process exited, code=exited status=1
systemd[1]: hostapd.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Advanced IEEE 802.11 AP and IEEE 802.1X/WPA/WPA2/EAP Authenticator.

Not only is the message marked in red (as it also appears with journalctl itself) unrelated to the real reason, which is given a few rows earlier (configured channel not found), but these important log lines don’t appear in the output of “systemctl status hostapd” at all, as they’re cut off.

Preparing for kernel compilation

In theory, I could have compiled the driver only, and replaced the files in the /lib/modules directory. But I’m in for a minimal change, and minimal brain effort. So the technique is to download the entire kernel, compile things that don’t really need compilation. Then pinpoint the correction, and recompile only that.

Unfortunately, Ubuntu’s view on kernel compilation seems to be that it can only be desired for preparing a deb package. After all, who wants to do anything else? So it gets a bit off the regular kernel compilation routine.

OK, so first I had to install some stuff:

# apt install libssl-dev
# apt install libelf-dev

Download the kernel (took me 25 minutes):

$ time git clone git://kernel.ubuntu.com/ubuntu/ubuntu-bionic.git

Compilation

The trick is to make the modules of a kernel that is identical to the running one (so there won’t be any bugs due to mismatches) and also match the kernel version string exactly (or the module won’t load).

Check out tag Ubuntu-4.15.0-20.21 (in my case, for 4.15.0-20-generic). This matches the kernel definition at the beginning of dmesg (and also the compilation date).

Follow this post to prevent the “+” at the end of the kernel version.
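
A hedged sketch of these two steps, run inside the cloned tree (the empty .scmversion file is one common way to keep scripts/setlocalversion from appending the “+”; the linked post may suggest something else):

$ git checkout Ubuntu-4.15.0-20.21
$ touch .scmversion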

Change directory to the kernel tree’s root, and copy the config file:

$ cp /boot/config-`uname -r` .config

Make sure the configuration is in sync:

$ make oldconfig

There will be some output, but no configuration question should be asked; if one is, it’s a sign that the wrong kernel revision has been checked out. In fact,

$ diff /boot/config-`uname -r` .config

should only output the difference in one comment line (the file’s header).

And then run the magic command:

$ fakeroot debian/rules clean

Don’t ask me what it’s for (I took it from this page), but among other things, it does

cp debian/scripts/retpoline-extract-one scripts/ubuntu-retpoline-extract-one

and without it one gets the following error:

/bin/sh: ./scripts/ubuntu-retpoline-extract-one: No such file or directory

Ready to go, then. Compile only the modules. The kernel image itself is of no interest:

$ time make KERNELVERSION=`uname -r` -j 12 modules && echo Success

The -j 12 flag means running 12 processes in parallel. Pick your own favorite, depending on the CPU’s core count. Took 13 minutes on my machine.

Alternatively, compile just the relevant subdirectory. Quicker, no reason it shouldn’t work, but this is not how I did it myself:

$ make prepare scripts
$ time make KERNELVERSION=`uname -r` -j 12 M=drivers/net/wireless/ath/ && echo Success

And then use the same command when repeating the compilation below, of course.

Modify the ath module (regd.c)

Following this post (more or less), edit drivers/net/wireless/ath/regd.c and neutralize the following functions with a “return” immediately after the variable declarations, or replace them with functions that just return immediately:

  • ath_reg_apply_beaconing_flags()
  • ath_reg_apply_ir_flags()
  • ath_reg_apply_radar_flags()

Also add a “return 0” in ath_regd_init_wiphy() just before the call to wiphy_apply_custom_regulatory(), so the three calls to the apply-something functions are skipped. In said post, the entire init function was disabled, but I found that unnecessarily aggressive (and probably breaking something).

Note that there are also functions like __ath_reg_apply_beaconing_flags() (with a double underscore prefix). These are not the ones to edit.
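
To locate the relevant functions (and tell them apart from their double-underscore namesakes), a quick grep over the checked-out tree does the job:

$ grep -nE 'ath_reg_apply_|ath_regd_init_wiphy' drivers/net/wireless/ath/regd.c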

And then recompile:

$ make KERNELVERSION=`uname -r` modules && echo Success

This recompiles regd.c and ath.c, and then generates ath.ko. Never mind that the new file is huge (2.6 MB) in comparison with the original one (40 kB). Once in the kernel, they occupy the same amount of space.

As root, rename the existing ath.ko in /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath/ to something else (with a non-ko extension, or it remains in the dependency files), and copy the new one (from drivers/net/wireless/ath/) to the same place.
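
Something along these lines (the path to the cloned ubuntu-bionic tree is of course whatever you picked):

# cd /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath/
# mv ath.ko ath.ko.orig
# cp /path/to/ubuntu-bionic/drivers/net/wireless/ath/ath.ko .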

Unload modules from kernel:

# rmmod ath10k_pci && rmmod ath10k_core && rmmod ath

and reload:

# modprobe ath10k_pci

And check the result (yay):

# iw list
[ ... ]
		Frequencies:
			* 5180 MHz [36] (30.0 dBm)
			* 5200 MHz [40] (30.0 dBm)
			* 5220 MHz [44] (30.0 dBm)
			* 5240 MHz [48] (30.0 dBm)
			* 5260 MHz [52] (30.0 dBm)
			* 5280 MHz [56] (30.0 dBm)
			* 5300 MHz [60] (30.0 dBm)
			* 5320 MHz [64] (30.0 dBm)
			* 5500 MHz [100] (30.0 dBm)
			* 5520 MHz [104] (30.0 dBm)
			* 5540 MHz [108] (30.0 dBm)
			* 5560 MHz [112] (30.0 dBm)
			* 5580 MHz [116] (30.0 dBm)
			* 5600 MHz [120] (30.0 dBm)
			* 5620 MHz [124] (30.0 dBm)
			* 5640 MHz [128] (30.0 dBm)
			* 5660 MHz [132] (30.0 dBm)
			* 5680 MHz [136] (30.0 dBm)
			* 5700 MHz [140] (30.0 dBm)
			* 5720 MHz [144] (30.0 dBm)
			* 5745 MHz [149] (30.0 dBm)
			* 5765 MHz [153] (30.0 dBm)
			* 5785 MHz [157] (30.0 dBm)
			* 5805 MHz [161] (30.0 dBm)
			* 5825 MHz [165] (30.0 dBm)
			* 5845 MHz [169] (30.0 dBm)
[ ... ]

The no-IR marks are gone, and hostapd now happily uses these channels.

Probably not: Upgrading firmware

As I first thought that the problem was an old firmware version, as discussed in this forum post, I went for upgrading it. These are my notes on that. Spoiler: It was probably unnecessary, but I’ll never know, and neither will you.

From the dmesg output:

[   16.152377] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:03:00.0.bin failed with error -2
[   16.152387] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/cal-pci-0000:03:00.0.bin failed with error -2
[   16.201636] ath10k_pci 0000:03:00.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:1535
[   16.201638] ath10k_pci 0000:03:00.0: kconfig debug 0 debugfs 1 tracing 1 dfs 0 testmode 0
[   16.201968] ath10k_pci 0000:03:00.0: firmware ver WLAN.RM.4.4.1-00079-QCARMSWPZ-1 api 6 features wowlan,ignore-otp crc32 fd869beb
[   16.386440] ath10k_pci 0000:03:00.0: board_file api 2 bmi_id N/A crc32 20d869c3

I was first misled to think the firmware wasn’t loaded, but the later lines indicate it was actually OK.

Listing the firmware files used by the kernel module:

$ modinfo ath10k_pci
filename:       /lib/modules/4.15.0-20-generic/kernel/drivers/net/wireless/ath/ath10k/ath10k_pci.ko
firmware:       ath10k/QCA9377/hw1.0/board.bin
firmware:       ath10k/QCA9377/hw1.0/firmware-5.bin
firmware:       ath10k/QCA6174/hw3.0/board-2.bin
firmware:       ath10k/QCA6174/hw3.0/board.bin
firmware:       ath10k/QCA6174/hw3.0/firmware-6.bin
firmware:       ath10k/QCA6174/hw3.0/firmware-5.bin
firmware:       ath10k/QCA6174/hw3.0/firmware-4.bin
firmware:       ath10k/QCA6174/hw2.1/board-2.bin
firmware:       ath10k/QCA6174/hw2.1/board.bin
firmware:       ath10k/QCA6174/hw2.1/firmware-5.bin
firmware:       ath10k/QCA6174/hw2.1/firmware-4.bin
firmware:       ath10k/QCA9887/hw1.0/board-2.bin
firmware:       ath10k/QCA9887/hw1.0/board.bin
firmware:       ath10k/QCA9887/hw1.0/firmware-5.bin
firmware:       ath10k/QCA988X/hw2.0/board-2.bin
firmware:       ath10k/QCA988X/hw2.0/board.bin
firmware:       ath10k/QCA988X/hw2.0/firmware-5.bin
firmware:       ath10k/QCA988X/hw2.0/firmware-4.bin
firmware:       ath10k/QCA988X/hw2.0/firmware-3.bin
firmware:       ath10k/QCA988X/hw2.0/firmware-2.bin

So which firmware file did it load? Well, there’s a firmware git repo for Atheros 10k:

$ git clone https://github.com/kvalo/ath10k-firmware.git

I’m not very happy running firmware found just somewhere, but the author of this Git repo is Kalle Valo, who works at Qualcomm. The Github account has been active since 2010, and the files shipped with the Linux kernel are included there. So it looks legit.

Comparing files with the ones in the Git repo, which states the full version names, the files loaded were hw3.0/firmware-6.bin and another one (board-2.bin, I guess). The former went into the repo on December 18, 2017, which is more than a year after the problem in the forum post was solved. My firmware is hence fairly up to date.
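
For reference, the files the kernel module loads live under /lib/firmware/ath10k/ (per the modinfo listing above), so an upgrade boils down to copying the repo’s files over them, roughly like this (the repo-side paths are illustrative; check the repository’s actual layout and file names):

# cp ath10k-firmware/QCA6174/hw3.0/<chosen firmware-6 file> /lib/firmware/ath10k/QCA6174/hw3.0/firmware-6.bin
# cp ath10k-firmware/QCA6174/hw3.0/board-2.bin /lib/firmware/ath10k/QCA6174/hw3.0/board-2.bin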

Nevertheless, I upgraded to the ones added to the git firmware repo on November 13, 2018, and re-generated initramfs (not that it should matter — using lsinitramfs it’s clear that none of these firmware files are there). Did it help? As expected, no. But hey, now I have the latest and shiniest firmware:

[   16.498706] ath10k_pci 0000:03:00.0: firmware ver WLAN.RM.4.4.1-00124-QCARMSWPZ-1 api 6 features wowlan,ignore-otp crc32 d8fe1bac
[   16.677095] ath10k_pci 0000:03:00.0: board_file api 2 bmi_id N/A crc32 506ce037

Exciting! Not.

Better than netstat: lsof tells us who is listening to what

Be sure to read the first comment below, where I’m told netstat can actually do the job. Even though I have to admit that I still find lsof’s output more readable.

OK, so we have netstat to tell us which ports are opened for listening:

$ netstat -n -a | grep "LISTEN "

Thanks, that’s nice, but what process is listening on these ports? For TCP sockets, it’s (as root):

# lsof -n -P -i tcp 2>/dev/null | grep LISTEN

The -P flag disables conversion of port numbers to port names, and -n prevents resolving host names.
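
And for the record, as the comment mentioned above points out, netstat’s -p flag gets the listening process as well; something along these lines:

# netstat -ltnp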

Upgrading to Linux Mint 19, running the old system in a chroot

Background

Archaeological findings have revealed that prehistoric humans buried their forefathers under the floor of their huts. Fast forward to 2018, yours truly decided to continue running the (ancient) Fedora 12 as a chroot when migrating to Linux Mint 19. That’s an eight-year difference.

While a lot of Linux users are happy to just install the new system and migrate everything “automatically”, this isn’t a good idea if you’re into more than plain tasks. Upgrading is supposed to be smooth, but small changes in the default behavior, API or whatever always make things that worked before fail, and sometimes with significant damage. Of the sort of not receiving emails, backup jobs not really working as before etc. Or just a new bug.

I’ve talked with quite a few sysadmins who were responsible for computers that actually needed to work continuously and reliably, and it’s not long before the apology for their ancient Linux distribution arrived. There’s no need to apologize: Upgrading is not good for keeping the system running smoothly. If it ain’t broke, don’t fix it.

But after some time, the hardware gets old and it becomes difficult to install new software. So I had this idea to keep running the old computer, with all of its properly running services and cronjobs, as a virtual machine. And then I thought, maybe go VPS-style. And then I realized I don’t need the VPS isolation at all. So the idea is to keep the old system as a chroot inside the new one.

Some services (httpd, mail handling, dhcpd) will keep running in the chroot, and others (the desktop in particular, with new shiny GUI programs) running natively. Old and new on the same machine.

The trick is making sure one doesn’t stamp on the feet of the other. These are my insights as I managed to get this up and running.

The basics

The idea is to place the old root filesystem (only) into somewhere in the new system, and chroot into it for the sake of running services and oldschool programs:

  • The old root is placed as e.g. /oldy-root/ in the new filesystem (note that oldy is a legit alternative spelling for oldie…).
  • bind-mounts are used for a unified view of home directories and those containing data.
  • Some services are executed from within the chroot environment. How to run them from Mint 19 (hence using systemd) is described below.
  • Running old programs is also possible by chrooting from shell. This is also discussed below.

Don’t put the old root on a filesystem that contains useful data, because odds are that such a file system will be bind-mounted into the chrooted filesystem, which will cause a directory tree loop. Then try to calculate disk space or make a backup with tar. So pick a separate filesystem (i.e. a separate partition or LVM volume), or possibly a subdirectory of the same filesystem as the “real” root.

Bind mounting

This is where the tricky choices are made. The point is to make the old and new systems see more or less the same application data, and also allow software to communicate over /tmp. So this is the relevant part in my /etc/fstab:

# Bind mounts for oldy root: system essentials
/dev                        /oldy-root/dev none bind                0       2
/dev/pts                    /oldy-root/dev/pts none bind            0       2
/dev/shm                    /oldy-root/dev/shm none bind            0       2
/sys                        /oldy-root/sys none bind                0       2
/proc                       /oldy-root/proc none bind               0       2

# Bind mounts for oldy root: Storage
/home                       /oldy-root/home none bind               0       2
/storage                    /oldy-root/storage none bind            0       2
/tmp                        /oldy-root/tmp  none bind               0       2
/mnt                        /oldy-root/mnt  none bind               0       2
/media                      /oldy-root/media none bind              0       2

Most notable are /mnt and /media. Bind-mounting these allows temporary mounts to be visible at both sides. /tmp is required for the UNIX domain socket used for playing sound from the old system. And other sockets, I suppose.

Note that /run isn’t bind-mounted. The reason is that the tree structure has changed, so it’s quite pointless (the mount point used to be /var/run, and the place of the runtime files tends to change with time). The motivation for bind mounting would have been to let software from the old and new systems interact, and indeed, there are a few UNIX sockets there, most notably the DBus domain UNIX socket.

But DBus is a good example of how hopeless it is to bind-mount /run: Old software attempting to talk with the Console Kit on the new DBus server fails completely at the protocol level (or namespace? I didn’t really dig into that).

So just copy the old /var/run into the root filesystem and that’s it. CUPS ran smoothly, GUI programs run fairly OK, and sound is done through a UNIX domain socket as suggested in the comments of this post.

I opted out of bind-mounting /lib/modules and /usr/src. This makes manipulations of kernel modules (as needed by VMware, for example) impossible from the old system. But gcc is too outdated for compiling under the new Linux kernel build system, so there was little point.

/root isn’t bind-mounted either. I wasn’t so sure about that, but in the end, it’s not a very useful directory. Keeping them separate makes the shell history for the root user distinct, and that’s actually a good thing.

Make /dev/log for real

Almost all service programs (and others) send messages to the system log by writing to the UNIX domain socket /dev/log. It’s actually a misnomer, because /dev/log is not a device file. But you don’t break tradition.

WARNING: If the logging server doesn’t work properly, Linux will fail to boot, dropping you into a tiny busybox rescue shell. So before playing with this, reboot to verify all is fine, and then make the changes. Be sure to prepare yourself for reverting your changes with plain command-line utilities (cp, mv, cat) and reboot to make sure all is fine.

In Mint 19 (and forever on), logging is handled by systemd-journald, which is a godsend. However for some reason (does anyone know why? Kindly comment below), the UNIX domain socket it creates is placed at /run/systemd/journal/dev-log, and /dev/log is a symlink to it. There are a few bug reports out there on software refusing to log into a symlink.

But that’s small potatoes: Since I decided not to bind-mount /run, there’s no access to this socket from the old system.

The solution is to swap the two: Make /dev/log the UNIX socket (as it was before), and /run/systemd/journal/dev-log the symlink (I wonder if the latter is necessary). To achieve this, copy /lib/systemd/system/systemd-journald-dev-log.socket into /etc/systemd/system/systemd-journald-dev-log.socket. This will make the latter override the former (keep the file name accurate), and make the change survive possible upgrades — the file in /lib can be overwritten by apt, the one in /etc won’t be by convention.
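
In other words, something like this (edit the copy as shown next; a daemon-reload or a reboot makes it take effect):

# cp /lib/systemd/system/systemd-journald-dev-log.socket /etc/systemd/system/
# systemctl daemon-reload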

Edit the file in /etc, in the part saying:

[Socket]
Service=systemd-journald.service
ListenDatagram=/run/systemd/journal/dev-log
Symlinks=/dev/log
SocketMode=0666
PassCredentials=yes
PassSecurity=yes

and swap the files, making it

ListenDatagram=/dev/log
Symlinks=/run/systemd/journal/dev-log

instead.

All in all this works perfectly. Old programs work well (try the “logger” command line utility on both sides). This could cause problems if a program expects “the real thing” on /run/systemd/journal/dev-log, but that’s quite unlikely.

As a side note, I had this idea to make journald listen to two UNIX domain sockets: Dropping the Symlinks assignment in the original .socket file, and copying it into a new .socket file, setting ListenDatagram to /dev/log. Two .socket files, two UNIX sockets. Sounded like a good idea, only it failed with an error message saying “Too many /dev/log sockets passed”.

Running old services

systemd’s take on sysV-style services (i.e. those init.d, rcN.d scripts) is that when systemctl is called with reference to a service, it first tries with its native services, and if none is found, it looks for a service of that name in /etc/init.d.

In order to run old services, I wrote a catch-all init.d script, /etc/init.d/oldy-chrooter. It’s intended to be symlinked to, so it deduces which service it should run from the name it was called by, then chroots, and executes the script inside the old system. And guess what, systemd plays along with this.

The script follows. Note that it’s written in Perl, but it has the standard info header, which is required on init scripts. String manipulations are easier this way.

#!/usr/bin/perl
### BEGIN INIT INFO
# Required-Start:    $local_fs $remote_fs $syslog
# Required-Stop:     $local_fs $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# X-Interactive:     false
# Short-Description: Oldy root wrapper service
# Description:       Start a service within the oldy root
### END INIT INFO

use warnings;
use strict;

my $targetroot = '/oldy-root';

my ($realcmd) = ($0 =~ /\/oldy-([^\/]+)$/);

die("oldy chroot delegation script called with non-oldy command \"$0\"\n")
  unless (defined $realcmd);

chroot $targetroot or die("Failed to chroot to $targetroot\n");

exec("/etc/init.d/$realcmd", @ARGV) or
  die("Failed to execute \"/etc/init.d/$realcmd\" in oldy chroot\n");

To expose the chroot’s httpd service, make a symlink in init.d:

# cd /etc/init.d/
# ln -s oldy-chrooter oldy-httpd

And then enable with

# systemctl enable oldy-httpd
oldy-httpd.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable oldy-httpd

which indeed runs /lib/systemd/systemd-sysv-install, a shell script, which in turn runs /usr/sbin/update-rc.d with the same arguments. The latter is a Perl script, which analyzes the init.d file and, among other things, parses the INFO header.

The result is the SysV-style generation of S01/K01 symbolic links into /etc/rcN.d. Consequently, it’s possible to start and stop the service as usual. If the service isn’t enabled (or disabled) with systemctl first, attempting to start and stop the service will result in an error message saying the service isn’t found.

It’s a good idea to install the same services on the “main” system and disable them afterwards. There’s no risk of overwriting the old root’s installation, and this allows installation and execution of programs that depend on these services (or they would complain, based upon the software package database).
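
For example (apache2 here is just an illustration of a package whose old counterpart, httpd, keeps running inside the chroot):

# apt install apache2
# systemctl disable --now apache2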

Running programs

Running stuff inside the chroot should be quick and easy. For this reason, I wrote a small C program, which opens a shell within the chroot when called without an argument. With one argument, it executes that command within the chroot. It can be called by a non-root user, and the same user is applied in the chroot.

This is compiled with

$ gcc oldy.c -o oldy -Wall -O3

and placed in /usr/local/bin with setuid root:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <pwd.h>

int main(int argc, char *argv[]) {
  const char jail[] = "/oldy-root/";
  const char newhome[] = "/oldy-root/home/eli/";
  struct passwd *pwd;

  if ((argc!=2) && (argc!=1)){
    printf("Usage: %s [ command ]\n", argv[0]);
    exit(1);
  }

  pwd = getpwuid(getuid());
  if (!pwd) {
    perror("Failed to obtain user name for current user(?!)");
    exit(1);
  }

  // It's necessary to set the ID to 0, or su asks for password despite the
  // root setuid flag of the executable

  if (setuid(0)) {
    perror("Failed to change user");
    exit(1);
  }

  if (chdir(newhome)) {
    perror("Failed to change directory");
    exit(1);
  }

  if (chroot(jail)) {
    perror("Failed to chroot");
    exit(1);
  }

  // oldycmd and oldyshell won't appear, as they're overridden by su

  if (argc == 1)
    execl("/bin/su", "oldyshell", "-", pwd->pw_name, (char *) NULL);
  else
    execl("/bin/su", "oldycmd", "-", pwd->pw_name, "-c", argv[1], (char *) NULL);
  perror("Execution failed");
  exit(1);
}

Notes:

  • Using setuid root is a prime candidate for security holes. I’m not sure I would have this thing on a computer used by strangers.
  • getpwuid() gets the real user ID (not the effective one, as set by setuid), so the call to “su” is made with the original user (even if it’s root, of course). It will fail if that user doesn’t exist.
  • … but note that the user in the chroot system is then the one having the same user name as in the original one, not the same uid. There should be no difference, but watch it if there is (security holes…?)
  • I used “su -” and not just executing bash for the sake of su’s “-” flag, which sets up the environment. Otherwise, it’s a mess.

It’s perfectly OK to run GUI programs with this trick. However it becomes extremely confusing with a command-line shell: is this shell prompt on the old or the new system? To fix this, edit /etc/bashrc in the chroot system only, so as to change the prompt. I went for changing the line saying

[ "$PS1" = "\\s-\\v\\\$ " ] && PS1="[\u@\h \W]\\$ "

to

[ "$PS1" = "\\s-\\v\\\$ " ] && PS1="\[\e[44m\][\u@chroot \W]\[\e[m\]\\$ "

so the “\h” part, which turns into the host’s name, now appears as “chroot”. But more importantly, the text background of the shell prompt is changed to blue (as opposed to nothing), so it’s easy to tell where I am.

If you’re into playing with the colors, I warmly recommend looking at this.

Lifting the user processes limit

At some point (it took a few months), I started to have failures of this sort:

$ oldy
oldyshell: /bin/bash: Resource temporarily unavailable

and even worse, some of the chroot-based utilities also failed sporadically.

Checking with ulimit -a, it turned out that the number of processes owned by my “regular” user was limited to 1024. Checking with ps, I had only about 510 processes belonging to that UID, so it’s not clear why I hit the limit. In the non-chroot environment, the limit is significantly higher.

So edit /etc/security/limits.d/90-nproc.conf (the one inside the jail), changing the line saying

*          soft    nproc     1024

to

*          soft    nproc     65536

There’s no need for any reboot or anything of that sort, but already running processes keep the old limit.

Desktop icons and wallpaper messup

This is a seemingly small, but annoying thing: When Nautilus is launched from within the old system, it restores the old wallpaper and puts all the icons on the desktop. There are suggestions on how to fix it, but they rely on gsettings, which came after Fedora 12. I haven’t tested this, but the common suggestion is:

$ gsettings set org.gnome.desktop.background show-desktop-icons false

So for old systems as mine, first, check the current value:

$ gconftool-2 --get /apps/nautilus/preferences/show_desktop

and if it’s “true”, fix it:

$ gconftool-2 --type bool --set /apps/nautilus/preferences/show_desktop false

The settings are stored in ~/.gconf/apps/nautilus/preferences/%gconf.xml.

Setting title in gnome-terminal

So someone thought that the ability to set the title of the Terminal window, directly from the GUI, is unnecessary. That happens to be one of the most useful features, if you ask me. I’d really like to know why they dropped it. Or maybe not.

After some wandering around, and reading suggestions on how to do it in various other ways, I went for the old-new solution: Run the old executable in the new system. Namely:

# cd /usr/bin
# mv gnome-terminal new-gnome-terminal
# ln -s /oldy-root/usr/bin/gnome-terminal

It was also necessary to install some library stuff:

# apt install libvte9

But then it complained that it couldn’t find some terminal.xml file. So

# cd /usr/share/
# ln -s /oldy-root/usr/share/gnome-terminal

And then I needed to set up the keystroke shortcuts (Copy, Paste, New Tab etc.) but that’s really no bother.

Other things to keep in mind

  • Some users and groups must be migrated from the old system to the new one manually. I always do this when installing a new computer, to make NFS work properly etc., but in this case, some service-related users and groups need to be in sync as well.
  • Not directly related, but if the IP address of the host changes (which it usually does), set the updated IP address in /etc/sendmail.mc, and recompile. Or get an error saying “opendaemonsocket: daemon MTA: cannot bind: Cannot assign requested address”.

fsck errors after shrinking an unmounted ext4 with resize2fs

Motivation

I’m using resize2fs a lot when backing up into a USB stick. The procedure is to create an image of an encrypted ext4 file system, and raw-write it into the USB flash device. To save time writing to the USB stick, the image is shrunk to its minimal size with resize2fs -M.

Uh-oh

This has been working great for years with my oldie resize2fs 1.41.9, but after upgrading my computer (Linux Mint 19), and starting to use 1.44.1, things began to go wrong:

# e2fsck -f /dev/mapper/temporary_18395
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/temporary_18395: 1078201/7815168 files (0.1% non-contiguous), 27434779/31249871 blocks

# resize2fs -M -p /dev/mapper/temporary_18395
resize2fs 1.44.1 (24-Mar-2018)
Resizing the filesystem on /dev/mapper/temporary_18395 to 27999634 (4k) blocks.
Begin pass 2 (max = 1280208)
Relocating blocks             XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 954)
Scanning inode table          XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 4 (max = 89142)
Updating inode references     XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/mapper/temporary_18395 is now 27999634 (4k) blocks long.

# e2fsck -f /dev/mapper/temporary_18395
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Inode 85354 extent block passes checks, but checksum does not match extent
	(logical block 237568, physical block 11929600, len 24454)
Fix<y>? yes
Inode 85942 extent block passes checks, but checksum does not match extent
	(logical block 129024, physical block 12890112, len 7954)
Fix<y>? yes
Inode 117693 extent block passes checks, but checksum does not match extent
	(logical block 53248, physical block 391168, len 8310)
Fix<y>? yes
Inode 122577 extent block passes checks, but checksum does not match extent
	(logical block 61440, physical block 399478, len 607)
Fix<y>? yes
Inode 129597 extent block passes checks, but checksum does not match extent
	(logical block 409600, physical block 14016512, len 12918)
Fix<y>? yes
Inode 129599 extent block passes checks, but checksum does not match extent
	(logical block 274432, physical block 13640964, len 1570)
Fix<y>? yes
Inode 129600 extent block passes checks, but checksum does not match extent
	(logical block 120832, physical block 14653440, len 13287)
Fix<y>? yes
Inode 129606 extent block passes checks, but checksum does not match extent
	(logical block 133120, physical block 14870528, len 16556)
Fix<y>? yes
Inode 129613 extent block passes checks, but checksum does not match extent
	(logical block 75776, physical block 15054848, len 23962)
Fix<y>? yes
Inode 129617 extent block passes checks, but checksum does not match extent
	(logical block 284672, physical block 15716352, len 7504)
Fix ('a' enables 'yes' to all) <y>? yes
Inode 129622 extent block passes checks, but checksum does not match extent
	(logical block 86016, physical block 15532032, len 18477)
Fix ('a' enables 'yes' to all) <y>? yes
Inode 129626 extent block passes checks, but checksum does not match extent
	(logical block 145408, physical block 16967680, len 5536)
Fix ('a' enables 'yes' to all) <y>? yes
Inode 129630 extent block passes checks, but checksum does not match extent
	(logical block 165888, physical block 17125376, len 29036)
Fix ('a' enables 'yes' to all) <y>? yes
Inode 129677 extent block passes checks, but checksum does not match extent
	(logical block 126976, physical block 17100800, len 24239)
Fix<y>? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/temporary_18395: 1078201/7004160 files (0.1% non-contiguous), 27383882/27999634 blocks

Not the end of the world

This bug has been reported and fixed. Judging by the change made, it was only about the checksums, so while the bug caused fsck to detect (and properly fix) errors, there’s no loss of data (I encountered the same problem when shrinking a 5.7 TB partition by 40 GB — fsck errors, but I checked every single file, a total of ~3 TB, and all was fine).

I beg to differ on the commit message saying it’s a “relatively rare case” as it happened to me every single time in two completely different settings, none of which were special in any way. However we all use journaled filesystems, so fsck checks have become rare, which can explain how this has gone unnoticed: Unless resize2fs officially failed somehow, it leaves the filesystem marked as clean. Only “e2fsck -f ” will reveal the problem.

I would speculate that the reason for this bug is this commit (end of 2014), which speeds up the checksum rewrite after moving an inode. It’s somewhat worrying that a program of this sensitive type isn’t tested properly before being released for everyone’s use.

My own remedy was to compile an updated revision (1.44.4) from the repository, commit ID 75da66777937dc16629e4aea0b436e4cffaa866e. Actually, I first tried to revert to resize2fs 1.41.9, but that one failed shrinking a 128 GB filesystem with only 8 GB left, saying it had run out of space.

Conclusion

It’s almost 2019, the word is that shrinking an ext4 filesystem is dangerous, and guess what, it’s probably a bit true. One could wish it wasn’t, but unfortunately the utilities don’t seem to be maintained with the level of care that one could hope for, given the damage they can cause.

tar --one-file-system diving into a bind-mounted directory

Using tar -c --one-file-system a lot for backing up large parts of my disk, I was surprised to note that it went straight into a large chunk that was bind-mounted into a subdirectory of the part I was backing up.

To put it clearly: tar --one-file-system doesn’t (always) detect bind mounts.

Why? Let’s look, for example, at the source code (tar version 1.30), src/incremen.c, line 561:

if (one_file_system_option && st->parent
 && stat_data->st_dev != st->parent->stat.st_dev)
 {
   [ ... ]
 }

So tar detects mount points by comparing the ID of the device containing the directory which is a candidate for diving into with its parent’s. This st_dev entry in the stat structure is a concatenation of the device’s major and minor numbers (historically 16 bits), so it identifies the underlying physical device (or a pseudo device with a zero major for /proc, /sys etc). On a plain “stat filename” at the command prompt, this appears as “Device”. For example,

$ stat /
 File: `/'
 Size: 4096          Blocks: 8          IO Block: 4096   directory
Device: fd03h/64771d    Inode: 2           Links: 29
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-11-24 15:37:17.072075635 +0200
Modify: 2018-09-17 03:55:01.871469999 +0300
Change: 2018-09-17 03:55:01.871469999 +0300

With “real” mounts, the underlying device is different, so tar detects that correctly. But with a bind mount from the same physical device, tar considers it to be the same filesystem.
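
This is easy to see directly with stat’s %D format (the device number in hex): a bind mount preserves it, so the bound directory reports the same device as its source. A hypothetical example:

# mkdir -p /mnt/also-home
# mount --bind /home /mnt/also-home
# stat -c '%D  %n' /home /mnt/also-home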

Which is, in a way, correct. The bind-mounted part does, after all, belong to the same filesystem, and this is exactly what --one-file-system promises. It’s only us, lazy humans, who expect --one-file-system not to dive into a mounted directory.

Unrelated, but still

Whatever you do, don’t press CTRL-C while the extraction goes on. If tar quits in the middle, file ownerships and permissions will be left unset, and symlinks will be left as zero-length files too. It wrecks the entire backup, even in places far away from where tar was working when it was stopped.

Installing Linux Mint 19.1 with UEFI boot, RAID, encryption and LVM

Introduction

These are my notes as I attempted to install Linux Mint 19.1 (Tessa) on a machine with software RAID, full disk encryption (boot partitions excluded) and LVM. The thing is that the year is 2018, and the old MBR booting method is still available but not a good idea for a system that’s supposed to last. So UEFI it is. And that caused some issues.

For the RAID / encryption part, I had to set up the disks manually, which I’m completely fine with, as I merely repeated something I’d already done several years earlier, and then I thought the installer would get the hint.

But this wasn’t that simple at all. I believe I’ve run the installer some 20 times until I got it right. This reminded me of a Windows installation: It’s simple as long as the installation is mainstream. Otherwise, you’re cooked.

And if this post seems a bit long, it’s because I spent two whole days shaving this yak.

And as a 2025 edit, there’s also a later post on installing Mint 22.2 with some additional insights.

Rule #1

This is a bit of a forward reference, but important enough for breaking the order: Whenever manipulating anything related to boot loading, be sure that the machine is already booted in UEFI mode. In particular, when booting from a Live USB stick, the computer might have chosen MBR mode and then the installation will be a mess.

The easiest way to check is with

# efibootmgr
EFI variables are not supported on this system.

If the error message above appears, it’s bad. Reboot the system, and pick the UEFI boot alternative from the BIOS’ boot menu. If that doesn’t help, look in the kernel log for a reason UEFI isn’t activated. It might be a driver issue (even though that’s not the likely case).
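
For example, a quick (and admittedly crude) way to look for EFI-related hints in the kernel log:

# dmesg | grep -i efi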

When it’s fine, you’ll get something like this:

# efibootmgr
BootCurrent: 0003
Timeout: 1 seconds
BootOrder: 0000,0003,0004,0002
Boot0000* ubuntu
Boot0002* Hard Drive
Boot0003* UEFI: SanDisk Cruzer Switch 1.27
Boot0004* UEFI: SanDisk Cruzer Switch 1.27, Partition 2

Alternatively, check for the existence of /sys/firmware/efi/. If the “efi” directory is present, it’s most likely fine.

GPT in brief

The GUID partition table is the replacement for the (good old?) MBR-based one. It supports much larger disks, the old head-cylinder-sector terminology is gone forever, and it allows for many more partitions than you’ll ever need. In particular since we’ve got LVM. And instead of those plain numbers for each partition, they are now assigned long GUID identifiers, so there’s more mumbo-jumbo to print out.

GPT is often related to UEFI boot, but I’m not sure there’s any necessary connection. It’s nevertheless a good choice unless you’re a fan of dinosaurs.

UEFI in brief

UEFI / EFI is the boot process which replaces the not-so-good old MBR boot. The old MBR method involved reading a snippet of machine code from the MBR sector and executing it. That little piece of code would then load another chunk of code into memory from some sectors on the disk, and so on. All in all, a tiny bootloader loaded a small bootloader which loaded GRUB or LILO, and that eventually loaded Linux.

Confused with the MBR thingy? That’s because the MBR sector contains the partition information as well as the first stage boot loader. Homework: Can you do MBR boot on GPT? Can you do UEFI on an MBR partition?

Aside from the complicated boot process, this also required keeping track of those hidden sectors, so they wouldn’t be overwritten by files. After all, the boot loader had to sit somewhere, and that was usually on sectors belonging to the main filesystem.

So it was messy.

EFI (and later UEFI) is a simple concept. Let the BIOS read the bootloader from a dedicated EFI partition in FAT format: When the computer is powered up, the BIOS scans this partition (or partitions) for boot binary candidates (files with .efi extension, containing the bootloader’s executable, in specific parts of the hierarchy), and lists them on its boot menu. Note that it may (and probably will) add good old MBR boot possibilities, if such exist, to the menu, even though they have nothing to do with UEFI.

And then the BIOS selects one boot option, possibly after asking the user. In our case, it’s preferably the one belonging to GRUB. Which turns out to be one of /EFI/BOOT/BOOTX64.EFI, /EFI/ubuntu/fwupx64.efi and /EFI/ubuntu/grubx64.efi (don’t ask me why GRUB generates three of them).

A lengthy guide to UEFI can be found here.

UEFI summarized

  • The entire boot process is based upon plain files only. No “active boot partition”, no hidden sectors. Easy to backup, restore, even reverting to a previous setting by replacing the file content of two partitions.
  • … but there’s now a need for a special EFI boot partition in FAT format.
  • The BIOS doesn’t just list devices to boot from, but possibly several boot options from each device.

Two partitions just to boot?

In the good old days, GRUB hid somewhere on the disk, and the kernel / initramfs image could be on the root partition. So one could run Linux on a single partition (swap excluded, if any).

But the EFI partition is of FAT format (preferably FAT32), and then we have a little GRUB convention thing: The kernel and the initramfs image are placed in /boot. The EFI partition is on /boot/efi. So in theory, it’s possible to load the kernel and initramfs from the EFI partition, but the files won’t be where they usually are, and now have fun playing with GRUB’s configuration.

Now, even though it seems possible to have GRUB open both RAID and an encrypted filesystem, I’m not into this level of trickery. Hence /boot can’t be placed on the RAID’s filesystem, as it won’t be visible before the kernel has booted. So /boot has to be in a partition of its own. Actually, this is what is usually done with any software RAID / full disk encryption setting.

This is barely an issue in a RAID setting, because if one disk has a partition for booting purposes, it makes sense to allocate the same non-RAID partition on the others. So put the EFI partition on one disk, and /boot on another.

Remember to back up the files in these two partitions. If something goes wrong, just restore the files from the backup tarball. Just don’t forget when recovering, that the EFI partition is FAT.

Finally: Does a partition need to be assigned EFI type to be detected as such? Probably not, but it’s a good idea to set it so.

Installing: The Wrong Way

What I did initially, was to boot from the Live USB stick, set up the RAID and encrypted /dev/md0, and happily click the “Install Ubuntu” icon. Then I went for a “something else” installation, picked the relevant LVM partitions, and kicked it off.

The installation failed with a popup saying “The ‘grub-efi-amd64-signed’ package failed to install into /target/” and then warned me that without the GRUB package the installed system won’t boot (which is sadly correct, but only partly: I was thrown into a GRUB shell). Looking into /var/log/syslog, it said on behalf of grub-install: “Cannot find EFI directory.”

This was the case regardless of whether I selected /dev/sda or /dev/sda1 as the device to write bootloader into.

Different attempts to generate an EFI partition and then run the installer failed as well.

Installation (the right way)

Boot the system from a Live USB stick, and verify that you follow Rule #1 above. That is: Check that “efibootmgr” returns something other than an error.

Then set up RAID + LUKS + LVM as described in this old post of mine. 8 years later, nothing has changed (except for the format of /etc/crypttab, actually). Only Mint wasn’t as smooth about installing on top of this setup.

The EFI partition should be FAT32, and selected as “use as EFI partition” in the installer’s partitioning tool. Set the partition type of /dev/sda1 (only) to EFI (number 1 in GPT) and format it as FAT32. Ubiquity didn’t do this for me, for some reason. So manually:

# mkfs.fat -v -F 32 /dev/sda1
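
As for the partition type itself, something like this should do the job (my assumption; parted’s esp flag on GPT, or gdisk’s EF00 type code, mark the partition as an EFI System Partition):

# parted /dev/sda set 1 esp on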

/dev/sdb1 will be used for /boot. /dev/sdc1 remains unused, most likely a place to keep the backups of the two boot related partitions.

So now to the installation itself.

Inspired by this guide, the trick is to skip the installation of the bootloader, and then do it manually. So kick off the RAID with mdadm, open the encrypted partition, and verify that the LVM device files are in place in /dev/mapper. When opening the encrypted disk, assign the /dev/mapper name that you want to stay with; otherwise you’ll have to reboot later to fix this.
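
Something along these lines (a sketch; the device names and the “luks-disk” mapper name simply follow the rest of this post):

# mdadm --assemble --scan
# cryptsetup luksOpen /dev/md0 luks-disk
# vgscan && vgchange -ay
# ls /dev/mapper/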

Then use the -b flag in the invocation of ubiquity to run a full installation, just without the bootloader.

# ubiquity -b

Go for a “something else” installation type, select to mount / in the dedicated encrypted LVM partition, and /boot in /dev/sdb1 (or any other non-RAID, non-encrypted partition). Make sure /dev/sda1 is detected as an EFI partition, and that it’s intended for EFI boot.

Once it finishes (takes 50 minutes or so, all in all), an “Installation Complete” popup will suggest “Continue Testing” or “Restart Now”. So pick “Continue Testing”. There’s no bootloader yet.

The new operating system will still be mounted as /target. So bind-mount some necessities, and chroot into the new installation:

# for i in /dev /dev/pts /sys /proc /run ; do mount --bind $i /target/$i ; done
# chroot /target

All that follows below is within the new root.

First, mount /boot and /boot/efi with

# mount -a

This should work, as /etc/fstab should have been set up properly during the installation.

Then, (re)install RAID support:

# apt-get install mdadm

It may seem peculiar to install mdadm again, as it was necessary to run exactly the same apt-get command before assembling the RAID in order to get this far. However mdadm isn’t installed on the new system, and without that, there will be no RAID support in the to-be initramfs. Without that, the RAID won’t be assembled on boot, and hence boot will fail.

Set up /etc/crypttab, so it refers to the encrypted partition. Otherwise, there will be no attempt to open it during boot. Find the UUID with

# cryptsetup luksUUID /dev/md0
201b318f-3ffd-47fc-9e00-0356747e3a73

and then /etc/crypttab should say something like

luks-disk UUID=201b318f-3ffd-47fc-9e00-0356747e3a73 none luks

Note that “luks-disk” is just an arbitrary name, which will appear in /dev/mapper. This name should match the one currently found in /dev/mapper, or the inclusion of the crypttab’s info in the new initramfs is likely to fail (with a warning from cryptsetup).

Next, edit /etc/default/grub, making changes as desired (I went for setting GRUB_TIMEOUT_STYLE to “menu”, to always get a GRUB menu, and also removed “quiet splash” from the kernel command line). There is no need for anything related to the use of RAID or encryption.

Install the GRUB EFI package:

# apt-get install grub-efi-amd64

It might be a good idea to make sure that the initramfs is in sync:

# update-initramfs -u

Then install GRUB:

# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.15.0-20-generic
Found initrd image: /boot/initrd.img-4.15.0-20-generic
grub-probe: error: cannot find a GRUB drive for /dev/sdd1.  Check your device.map.
Adding boot menu entry for EFI firmware configuration
done
# grub-install
Installing for x86_64-efi platform.
Installation finished. No error reported.

It seems like the apt-get command also led to the execution of the initramfs update and GRUB installation. However I ran these commands nevertheless.

Don’t worry about the error on not finding anything for /dev/sdd1. It’s the USB stick. Indeed, it doesn’t belong.

That’s it. Cross fingers, and reboot. You should be prompted for the passphrase.

Epilogue: How does GRUB executable know where to go next?

Recall that GRUB is packaged as a chunk of code in an .efi file, which is loaded from a dedicated partition. The images are elsewhere. How does it know where to look for them?

So I don’t know exactly how, but it’s clearly fused into GRUB’s bootloader binary:

# strings -n 8 /boot/efi/EFI/ubuntu/grubx64.efi | tail -2
search.fs_uuid f573c12a-c7e4-41e4-99ef-5fda4a595873 root hd1,gpt1
set prefix=($root)'/grub'

and it so happens that hd1,gpt1 is exactly /dev/sdb1, where the /boot partition is kept, and that the UUID matches the one given as “UUID=” for that partition by the “blkid” utility.

So moving /boot most likely requires reinstalling GRUB. Which isn’t a great surprise. See another post of mine for more about GRUB internals.

Conclusion

It’s a bit unfortunate that in 2018 Linux Mint’s Ubiquity didn’t manage to land on its feet, and even worse, didn’t warn the user that it was about to fail colossally. It could even have suggested not to install the bootloader…?

And maybe that’s the way it is: If you want a professional Linux system, better be professional yourself…

Linux + APC Smart UPS 750 notes (apcupsd and other stuff)

Introduction

These are my somewhat messy jots while setting up an APC Smart UPS 750 (SMT750I) with a Linux Mint 19 machine, for a clean shutdown on power failure. Failures and mistakes shown as well.

Even though I had issues with receiving a broken UPS at first, and waiting two months for a replacement (ridiculous support by the Israeli Serv Pro support company), the bottom line is that it seems like a good choice: The UPS and its apcupsd driver handle the events in a sensible way, in particular when power returns after the shutdown process has begun (this is where UPSes tend to mess up).

As for battery replacement, two standard 12V / 7 AH batteries can be used, as shown in this video. Otherwise, good luck finding vendor-specific parts ten years from now in Israel. For my lengthy notes on battery replacement, see this separate post.

Turning off the UPS manually (and losing power to computer): Press and hold the power button until the second beep. The first beep confirms pressing the button, the second says releasing the button will shut down the UPS.

Turning off beeping when the UPS is on battery: Press the ESC button for a second or so.

Basic installation

Driver for UPS:

# apt-get install apcupsd

Settings

Edit /etc/apcupsd/apcupsd.conf, and remove the line saying

DEVICE /dev/ttyS0

Change TIMEOUT, so the system is shut down after 10 minutes of not having power. There’s no point in emptying the batteries: I may want to fetch a file from the computer while the network power is down. This timeout applies also if the computer was started in an on-battery state.

TIMEOUT 600

Don’t annoy anyone to log off. There is nobody to annoy except myself:

ANNOY 0

No need for a net info server. A security hole at best in my case.

NETSERVER off

Wrong. Keep the server, or apcaccess won’t work.

Stop “wall” messages

This is really unnecessary on a single-user computer (is it ever a good idea?). If power goes out, it’s dark and the UPS beeps. No need to get all shell consoles cluttered. The events are logged in /var/log/apcupsd.events as well as the syslog, so nothing is lost by silencing the “wall” messages.

Edit /etc/apcupsd/apccontrol, changing

WALL=wall

to

WALL=cat

The bad news is that an apt-get upgrade on apcupsd is likely to revert this change.

Hello, world

Possibly as non-root:

$ apcaccess status
APC      : 001,027,0656
DATE     : 2018-10-28 21:36:29 +0200
HOSTNAME : preruhe
VERSION  : 3.14.14 (31 May 2016) debian
UPSNAME  : preruhe
CABLE    : USB Cable
DRIVER   : USB UPS Driver
UPSMODE  : Stand Alone
STARTTIME: 2018-10-28 21:36:27 +0200
MODEL    : Smart-UPS 750
STATUS   : ONLINE
BCHARGE  : 100.0 Percent
TIMELEFT : 48.0 Minutes
MBATTCHG : 5 Percent
MINTIMEL : 3 Minutes
MAXTIME  : 0 Seconds
ALARMDEL : 30 Seconds
BATTV    : 27.0 Volts
NUMXFERS : 0
TONBATT  : 0 Seconds
CUMONBATT: 0 Seconds
XOFFBATT : N/A
STATFLAG : 0x05000008
MANDATE  : 2018-05-22
SERIALNO : AS182158746
NOMBATTV : 24.0 Volts
FIRMWARE : UPS 09.3 / ID=18
END APC  : 2018-10-28 21:36:47 +0200

Shutting down UPS on computer shutdown

By default, the computer puts the UPS in “hibernation” mode at a late stage of its own shutdown. This turns the power down (saving battery), and resumes power when the network power returns. The trick is that apcupsd creates a /etc/apcupsd/powerfail file before shutting down the computer due to a power failure, and /lib/systemd/system-shutdown/apcupsd_shutdown handles the rest:

#!/bin/sh
# apcupsd: kill power via UPS (if powerfail situation)
# (Originally from Fedora.)

# See if this is a powerfail situation.
faildir=$(grep -e^PWRFAILDIR /etc/apcupsd/apcupsd.conf)
faildir="${faildir#PWRFAILDIR }"

if [ -f "${faildir:=/etc/apcupsd}/powerfail" ]; then
  echo
  echo "APCUPSD will now power off the UPS"
  echo
  /etc/apcupsd/apccontrol killpower
fi

Note that the powerfail file is created before the shutdown, not when the UPS goes on battery.

So just for fun, try

# touch /etc/apcupsd/powerfail

and then shut down the computer normally. This shows the behavior of a shutdown forced by the UPS daemon. As expected, this file is deleted after booting the system (most likely by apcupsd itself).

What happens on shutdown

The UPS powers down after 90 seconds (after displaying a countdown on its small screen), regardless of whether power has returned or not. This is followed by a “stayoff” of 60 seconds, after which it will power on again when power returns. During the UPS hibernation, the four LEDs are doing a disco pattern.

I want the UPS to stay off until I turn it on manually. The nature of power failures is that the power can come and go, and I don’t want the UPS to turn back on and empty the battery on these fluctuations.

To make a full poweroff instead of a hibernation, edit (or create) /etc/apcupsd/killpower, so it says:

#!/bin/bash
#

APCUPSD=/sbin/apcupsd

echo "Apccontrol doing: ${APCUPSD} --power-off on UPS ${2}"
sleep 10
${APCUPSD} --power-off
echo "Apccontrol has done: ${APCUPSD} --power-off on UPS ${2}"

exit 99

This is more or less a replica of /etc/apcupsd/apccontrol’s handler for the “killpower” command, only with apcupsd called with the --power-off flag instead of --killpower. The latter “hibernates” the UPS, so it wakes up when power returns. That’s the thing I didn’t want.

The “exit 99” at the end inhibits apccontrol’s original handler.

So now there’s a “UPS TurnOff” countdown of 60 seconds, after which the UPS is shut down until powered on manually.

Manual fixes

Set menus to advanced, if they’re not already. Then:

  • Configuration > Auto Self Test, set to Startup Only: I tried to yank the battery’s plug on the UPS’ rear during a self test, and the computer’s power went down. So I presume that a failing self test will drop the power to the computer. Not clear what the point is.
  • Configuration > Config Main Group Outlets > Turn Off Delay set to 10 seconds, to prevent an attempt to reboot the computer when the UPS is about to power down. Surprisingly enough, this works when hibernating the UPS, but when enabling the power-off script above, the delay is 60 seconds, despite this change. I haven’t figured out how to change this.

Maybe the source code tells something

I dug into the sources for the reason the UPS shuts down after 60 seconds, even though I set the “Turn Off Delay” to 10 seconds directly with the UPS’ control buttons.

The relevant files in the apcupsd tarball:

  • src/apcupsd.c: The actual daemon and main executable. Surprisingly readable.
  • src/drivers/apcsmart/smartoper.c: The actual handlers of the operations (shutdown and power kill, with functions having obvious names). Written quite well, in terms of the persistence in carrying out sensible operations when things don’t go as expected. Also see drivers/apcsmart/apcsmart.h.

So looking at smartoper.c, it turns out that the kill_power() method (which is used for “hibernation” of the UPS) sends a “soft” shutdown with an “S” command to the UPS, and doesn’t tell it the delay time. Hence the UPS decides the delay by itself (which is what I selected with the buttons).

The shutdown() method, on the other hand, calls apcsmart_ups_get_shutdown_delay(), which accepts an argument saying what the delay is. The name of this function is however misleading, as it just sends a shutdown command to the UPS, without telling it the delay. The delay figure is used only in the log messages. The UPS just gets a “K” command, and nothing else. Basically, it works the same as kill_power(), only with a different command.

Trying NUT (actually, don’t)

What tempted me into trying out NUT was this page, which implied that it has something related to shutdown.stayoff. And keeping the UPS off is exactly what I wanted. But it seems like apcupsd is a much better choice.

Note my older post on NUT.

Since I went through this rubbish, here’s a quick runthrough. First install nut (which automatically ditches apcupsd; uninstalls it completely, it seems):

# apt-get install nut

The relevant part in /etc/nut/ups.conf for the ups named “smarter”:

[smarter]
        driver = usbhid-ups
        port = auto
        vendorid = 051d

I’m under the impression that the “port” assignment is ignored altogether. Don’t try it with other drivers — you’ll get “no such file”, for good reasons. Possibly usbhid-ups is the only way to utilize a USB connection.

And then in /etc/nut/upsmon.conf, I added the line

MONITOR smarter@localhost 1 upsmon pass master

The truth is that I messed around a bit without taking much notice of what I did, so I might have missed something. Anyhow, a reboot was required, after which the UPS was visible:

# upsc smarter
Init SSL without certificate database
battery.charge: 100
battery.charge.low: 10
battery.charge.warning: 50
battery.runtime: 3000
battery.runtime.low: 120
battery.type: PbAc
battery.voltage: 26.8
battery.voltage.nominal: 24.0
device.mfr: American Power Conversion
device.model: Smart-UPS 750
device.serial: AS1821351109
device.type: ups
driver.name: usbhid-ups
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.synchronous: no
driver.parameter.vendorid: 051d
driver.version: 2.7.4
driver.version.data: APC HID 0.96
driver.version.internal: 0.41
ups.beeper.status: enabled
ups.delay.shutdown: 20
ups.firmware: UPS 09.3 / ID=18
ups.mfr: American Power Conversion
ups.mfr.date: 2018/05/22
ups.model: Smart-UPS 750
ups.productid: 0003
ups.serial: AS1821351109
ups.status: OL
ups.timer.reboot: -1
ups.timer.shutdown: -1
ups.vendorid: 051d

and getting the list of commands:

# upscmd -l smarter
Instant commands supported on UPS [smarter]:

beeper.disable - Disable the UPS beeper
beeper.enable - Enable the UPS beeper
beeper.mute - Temporarily mute the UPS beeper
beeper.off - Obsolete (use beeper.disable or beeper.mute)
beeper.on - Obsolete (use beeper.enable)
load.off - Turn off the load immediately
load.off.delay - Turn off the load with a delay (seconds)
shutdown.reboot - Shut down the load briefly while rebooting the UPS
shutdown.stop - Stop a shutdown in progress

So it didn’t really help.

Download a 3D-printable spacer / leg for Z-Turn Lite + IO Cape

Z-Turn Lite + IO Cape with 3D-printed spacer

Even though I find Myir’s Z-Turn Lite + its IO Cape combination of cards useful and well designed, there’s a small and annoying detail about them: The spacers that arrive with the boards don’t allow setting them up steadily on a flat surface, because the Z-Turn board is elevated over the IO Cape board. As a result, the former board’s far edge has no support, which makes the two boards wiggle. And a little careless movement is all it takes to have these boards disconnected from each other.

So I made a simple 3D design of a plastic leg (or spacer, if you like) for supporting the Z-Turn Lite board. See the small white thing holding the board to the left of the picture above? Or the one in the picture below? That’s the one.

3D-printed spacer attached to Z-Turn Lite board

If you’d like to print your own, just click here to download a zip file containing the Blender v2.76 model file as well as a ready-to-print STL file. It’s hereby released to the public domain under Creative Commons’ CC0 license.

The 3D model of the spacer, in Blender

The units of this model are millimeters. You’ll need this little piece of info when uploading it for printing.

I printed mine at Hubs (they were 3D Hubs at the time). Because I bundled this with another, more bulky piece of work, the technique used was FDM at 200 μm, with standard ABS as material. If you’re into 3D printing, you surely just read “cheapest”. And indeed, printing four of these should cost no more than one USD. But then there’s the set-up cost and shipping, which will most likely be much more than the printing itself. So print a bunch of them, even though only two are needed. It’s going to be a few dollars anyhow.

Even though these spacers aren’t very pretty, and have zero mechanical sophistication, they do the job. At least those I got require just a little bit of force to get into the holes, and they stay there (thanks to the pin diameter of 3.2 mm, which matches the holes’ diameter exactly). And because it’s such a dirt-simple design, this model should be printable with any technique and any rigid material.

Wrapping up, here’s a picture of three printed spacers + two of the spacers that arrived with the boards. Just for comparison.

3D-printed spacers compared with Myir's spacers

Blender notes to self: 3D Printing

As I use Blender only occasionally, I’ve written down quite a few hints to myself for getting back to business. If this helps anyone else, so much better.

I’ve also written two similar posts on this matter: A general post on Blender and a post on rendering and animation.

Printing methods

See a summary chart on this page.

  • Fused deposition modeling (FDM/FFF): Melted plastic (ABS/PLA/Nylon) coming out from a nozzle. Layer thickness ~0.2mm. Cheap, but the geometry is limited to self-supported models, or the result literally drops. Also relatively limited accuracy and minimal thickness.
  • Selective laser sintering (SLS): Laser sinters or melts a powdered material (typically nylon/polyamide). Layer thickness ~ 0.1mm. No restriction on geometry, but the printed parts have surface porosity. The only plastic-like material out there is nylon, typically PA12 (or PA11), coming in dull colors. PA2200 is a very common (and good) powder in use, which produces PA12.
  • Stereolithography (SLA/SL/DLP): Based upon curing of a photopolymer resin with a UV laser. Layer thickness ~ 0.05mm. High quality but expensive manufacturing.

Preparing for printing

  • When the object has fine details on a larger object (e.g. a funnel made to match a certain geometry at its top), consider setting up the larger structure first, creating a dense mesh with subdivision surface, and doing the adaptations on the final mesh (or a partly subdivided one?), possibly with a modifier (e.g. Curve). It’s otherwise extremely difficult to get a sane mesh, and it bites back with overlapping faces and whatnot.
  • The mesh must be manifold = no holes. Also, it should have no vertices, edges or faces that don’t enclose a volume, no intersection of bodies, and no overlapping of edges or faces. Double vertices and edges are not good, but since the mesh is translated into STL, they go unnoticed as long as the duplicates are accurate. If they’re not, they cause warnings that can be ignored, but this may lead to missing the important warnings.
  • Watch the model with Flat shading (click button in Tools) at the toolshelf to the left. Smooth shading is misleading.
  • When resizing in Object Mode, be sure to apply (Object > Apply > Scale), so that the measurements in Edit Mode (and otherwise) are correct. Same goes for applying rotation and possibly location.
  • The exported result is like what’s shown at rendering: bends done by bones are exported.
  • Export to .stl, which is a format consisting of just a list of triangles. The file doesn’t include units, which is why it’s required to state units when uploading a file.
  • In properties / Scene (third icon from the left), set the Units to Metric and Scale to 0.001 for millimeters (these units will go to the STL file, which is unitless).
  • Also, in the “View” part of the properties pane (keystroke “n”), under Clip, make sure “End” is significantly larger than the objects involved, or there will be weird cut-out effects as the view is rotated and moved around. This property sets the “global cube”. What’s outside this cube becomes invisible — faces become partially cut.
  • In the same pane, under Mesh Display, consider enabling Length for “Edge Info”, which displays real-life measures of each edge. Only in Edit mode, only for selected edges. These lengths are subject to scaling, so they’re wrong if the object has been scaled and the scale not applied.
  • Consider locking the scaling of relevant objects to unity, to prevent confusion.
  • The 3D printing add-on should be enabled. At the left bar, there will be a 3D Printing tab, allowing for a volume calculation.
  • Before uploading, do some cleanup: Mesh > Vertices > Remove Doubles, as well as Cleanup/Isolated and Cleanup/Non-Manifold in the 3D printing toolbox. See the script sketch after this list for automating this step.
  • If the 3D toolbox spins forever when pressing the “Volume” button, it’s not a good omen, obviously.
  • Once uploaded, odds are that a lot of warnings on non-manifold edges and intersected faces will show up. These can be checked with Blender’s 3D Printing Toolbox. In particular note that in Edit Mode, there’s a button saying “Intersected Face” which selects the faces marked as intersected. The underlying reason can be the use of the Boolean modifier, which may create a lot of double edges (two adjacent faces have separate edges instead of sharing one). These double edges occur a lot more often than the warnings issued by these tools suggest; the warnings probably appear only when there’s some difference between the two edges. If this is the reason for the warnings, there’s no problem going ahead with printing (saying this from first-hand experience).
  • Pay attention to the “Infill” percentage, which determines how much of the internal volume is filled with plastic vs. air cubes by the printing software. The layer height also influences the precision and finish.
  • Matching parts: If one part is supposed to go into another, there is no need for an air gap, but there will be friction (my experience with a 2 mm blade going into a groove of exactly that width, ABS 200 μm printing).
  • Checks and export into STL include the active object in Blender only. No need to remove supporting objects before exporting.
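
For what it’s worth, here’s a minimal bpy sketch (run from Blender’s Python console or as a script) of the apply-scale / remove-doubles / export-STL routine described above. The file path is just a placeholder, and it assumes the object to print is the active, selected object:

import bpy

obj = bpy.context.active_object   # the object that goes to printing

# Apply rotation and scale, so Edit-mode measurements match reality
bpy.ops.object.transform_apply(rotation=True, scale=True)

# Remove double vertices and fix flipped normals
bpy.ops.object.mode_set(mode='EDIT')
bpy.ops.mesh.select_all(action='SELECT')
bpy.ops.mesh.remove_doubles(threshold=0.0001)
bpy.ops.mesh.normals_make_consistent(inside=False)
bpy.ops.object.mode_set(mode='OBJECT')

# Export only the selected object to STL (just triangles, no units)
bpy.ops.export_mesh.stl(filepath="/tmp/part.stl", use_selection=True)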

Printing online

Online printing is an ugly business, and it’s a bit difficult to blame the service providers. People upload inherently flawed models, made with modeling software that produces output files with geometrical ambiguities, end up disappointed and then blame the printing shop. Those running these services get used to angry customers who write bad reviews about them everywhere, and eventually adopt a strategy of “the customer is always wrong”. It’s impossible to work on a good reputation when a lot of people get angry at them, and there’s nothing they can do about it.

But some companies take this to extreme. See my notes on Shapeways below.

Another thing is that 3D printing is unrepeatable. There is always some human intervention in the process for achieving the best results, as perceived by the operator. With FDM printing, there are several parameters that affect how the layering is done. With SLS printing, this may involve rotating the 3D model, in particular if the same model is printed several times. Such rotation can allow more pieces in each round. And with any technique, human intervention is often there to fix flaws in the mesh, if such exist. Each time, these fixes may be different, or not made at all.

So printing is a bit of a roulette game. The best strategy is therefore to minimize the risk. Simply put, go for the service provider that appears legit and offers the lowest total cost. That’s the only parameter you really have control over.

Printing services charging higher aren’t necessarily better. There is very little indication of who will provide you with the result you wanted. If they ship with UPS, it’s not a good sign, as this courier provides lower prices and has less emphasis on how happy the receiver of the package will be. It seems like serious vendors turn to DHL. But this doesn’t help much, as most vendors work with UPS.

If larger quantities are required, test printing is a good idea to spot problems with the model. However if the design is sensitive to tolerances below 0.5 mm (even with SLS), testing doesn’t guarantee anything. In particular, the transition from a small amount to a larger one can make the same printing service provider choose another machine, a different orientation, or a different post processing chain.

There is no way around this gamble. Neither does it help much to stay with a specific vendor. It makes perfect sense to make the test printing with one provider, and the larger amounts with another. Stick to the one that gives the better price for each phase. Try to remain with the same material, and on a really good day, with the same machinery.

In short: 3D printing isn’t a long-term relationship. The service providers would of course rather have you stay, but they will typically do nothing but some sweet talk for that purpose.

Where to look for print shops

This is the market situation as I understand it as of January 2024. It’s a dynamic market.

Today’s bazaar of printing companies seems to be Treatstock, which allows uploading a design and getting offers from a lot of companies. The situation with Printelize is not clear, as their website doesn’t seem to work.

Treatstock’s web site user interface is horrible, and it’s easy to miss the best offer in terms of price. For example, if you select a color that the most economic company doesn’t offer, that company won’t be listed. Too bad if you don’t really care about the color. And for some reason, it was exceptionally difficult to get a by-price sorted list of offers for Polycarbonate. Eventually, the method was to pick one of the vendors, upload the files and then choose to view other vendors (plus enable international vendors, or else nothing would show up). And then start fine-tuning the color and exact material.

Generally speaking, this website is vague about the printing materials; however it’s possible to deduce the exact specification by visiting the vendor’s information page and choosing the “3D printing” tab. The printers that the vendor uses are listed there, and if you look up a printer, there’s a good chance that the vendor uses the filaments that are specific to it. So the datasheet can be obtained from the printer’s manufacturer. There’s always a risk that they use replacement filaments, but 3D printing is an uncertain business to begin with…

Treatstock doesn’t allow selecting the printing method, and offers service providers that only have FDM even for a job that is way too complicated for that method. So it’s suitable only for low-end projects.

And a final note about Treatstock: If a vendor refuses to do the job, cancel it immediately. Otherwise, another vendor may pick it up and print it before you know. The problem is that this new vendor might use a different printer, and hence a different filament that merely matches the material’s description and color. Not a big deal if everything is plastic to you, but it’s a real showstopper if you actually picked a specific material according to its datasheet. I had an incident of this sort, which fortunately ended with a full refund even though the thingy was already printed. Maybe because I paid with Paypal (which is more refund-friendly than credit card companies).

Quite recently, I found Craftcloud, which is a bit vague about materials and processes, even though it’s more-or-less written in the offers. And I got the impression that more companies from China are represented there.

In the past, my recommended place to find a print shop was 3D Hubs. However they’ve changed to offering a single deal, and other places seem to give better offers. Their tools for analyzing a model are still great (X-Ray view and graphically highlighting problematic places), so they’re still worth a visit. Even though some of their complaints about my designs were clearly false alarms.

And their minimum order is $35. Lower orders are simply raised to that sum.

I’ve had good experience with 3D Print UK, who performed professional SLS printing of 110 small pieces with PA2200: Their price is unbeatable, their black dye is really nice, and the polish finish (plain type, done for free) is the smoothest I’ve seen so far. They did however rotate the printing orientation without asking, which I wasn’t all that happy with. So it’s recommended to request locking the orientation if that makes any difference. But even if I had to reprint everything because of this, and pay for it fully — it would still have been cheaper than any alternative I had.

My experience with Shapeways

I’ve had a good experience with them regarding a non-professional SLS job (“Versatile plastic”, which is a nice name for PA2200, a powder for Nylon 12), which didn’t require much accuracy. The 3D model they got was flawless, so they printed it, sent it, and all was fine.

And then I needed some professional printing with the same process and material. When I say professional, I mean a Kit-Kat sized plastic part with a groove into which a PCB is pushed. If it’s too narrow, it won’t go in. Too wide, and it won’t hold the PCB in its position. Plus some 2mm holes fitting an M2 screw and matching nut. And there were also issues with mechanical strength and flexibility. In short, every 0.1 mm counts.

As I was under the illusion of printing repeatability at the time, I made a round of test prints of my model. I won’t get into the technical details of how the printing results differed from the 3D model, because the crucial inaccuracy turned out to be inevitable, as I learned later by experience. And it was also quite easy to fix on the 3D model. However Shapeways’ response to my complaints (“the customer is always wrong” mixed with “the customer always measures wrong”) made me abandon them temporarily for Sculpteo, which turned out even worse (see below).

So after the Sculpteo detour, I went back to Shapeways with a model I hoped would work. It turned out that they don’t reduce the price at all for larger amounts, not even for 110 pieces. But I was ready to put up with that, as I had reasons to believe that the result would turn out OK.

But then they made the move which surely broke some kind of record, and I’m not sure if it was obnoxiousness or plain stupidity. It was all about what to write on the packet containing a $379 print job: Immediately after issuing the order, I asked that the “Sold to” on the UPS waybill be the same as the “Bill to” in the order and invoice. This has to do with taxes, customs and in particular who officially owns the goods. It matters if it’s a company making the order. Anyhow, this is what they should have done by themselves, since the shipping address is just where the package goes physically. The buyer is whoever pays.

I got the answer that it’s impossible, and that the waybill on the parcel is printed automatically. So I asked to delay the shipping until this issue was sorted out, and got the answer (from more than one person) that it can’t be stopped. Indeed they shipped it, it got the wrong customs declaration, and it went down to some trashcan in UPS’ offices. I couldn’t use any of its content the way it was declared (company bureaucracy).

So this is something one must know about Shapeways: Once you’ve placed the order, the train can’t be stopped. No human is in control anymore. It just happens by itself.

Tactically speaking, they were right: Once they send the parcel in my direction (more or less), the credit card company can’t cancel the deal. But if it doesn’t arrive in time, I do have a case with the credit card company. So they did the right thing, given that their underlying attitude was to grab the money, and tomorrow doesn’t exist.

My experience with Sculpteo

I made quite a few different test 3D models and sent them for printing with different materials and finishes. It ended up a complete fiasco. Here is a partial list of things that went wrong.

  • One of the test models was printed twice, instead of another one, which wasn’t printed at all.
  • Some models were requested with color, some without. They got it mixed up, and dyed the wrong models.
  • Tons of residual powder inside the plastic.
  • There was this 1-2 mm thick cavity for containing a PCB. It was filled with stuck plastic, so the PCB couldn’t be inserted. I wasn’t able to clean that up.
  • The screw holes ended up too narrow to fit the screws.
  • Plastic parts with 0.3 mm spacing between them melted together on the test with 60µ printing, but not with the 100µ-120µ printing (yes, the finer printing is the one that failed).
  • They put several pieces from different models in one bag.

The worst part was of course that plastic parts had melted together. The warnings you get from the web tools are always about small details which might break (don’t worry, some did, but I accept that since I was warned about it) but nothing about melting. And that’s on a gap of 0.3mm relative to 100µ-120µ printing.

Their response read as follows (early March 2020, this isn’t Covid affected yet):

I would like to point out that Sculpteo prints hundreds of objects daily and your objects look very similar and can easily be mixed up between the models, This is also why the incorrect objects were dyed.
For the space between parts like the stems that are fused together need a spacing of 0.5 mm and not 0.3 mm as you have made, this is why the objects are fused together, normally you can use a Stanley knife to cut these areas.
You have stated that the holes are incorrect, I would like to remind you that there is an average tolerance of +/- 0.3 mm. this is why there could be a slight issue but it is part of 3D printing technology.
I can have the parts that were incorrectly dyed and the parts that were dyed when they should not have been reprinted and sent to you as soon as possible.
[ ... ]
In regards to the objects being filled with powder, it is due to the 1 to 1 ratio (example: 1 mm in width for 1 mm in depth). Our high-pressure air jet is not able to remove the powder in an enclosed area and it can be easily be removed with a paperclip.
We regret if you think that we are not able to provide a professional service but we do give lots of information that our customer can read before placing their order.

Bottom line: Sculpteo may be nice for playing a bit with 3d printing, but if you have professional intentions, you probably want to stay away from them. One may think that the ±0.3 mm tolerance is a general statement to keep bottoms covered, but then it happens in actual 60µ printing.

Blender notes to self: Rendering and animation related

As I use Blender only occasionally, I’ve written down quite a few hints to myself for getting back to business. If this helps anyone else, so much better.

I’ve also written two similar posts on this matter: A general post on Blender and a post on 3D printing.

Bones

  • For a simple beginner’s use example, see this page.
  • A bone is simply a handle on which one can do the Grab / Rotate / Size trio. It has a pivot point and a handle. The manipulations on the bone apply to all vertices in the bone’s Vertex Group, relative to the bone’s pivot point, and in proportion to their weight for that group.
  • The Vertex Groups are listed under the object’s properties, under Object Data (icon is an upside down triangle of dots). In Weight Paint mode, this is where the group to paint weights for is selected.
  • The Vertex Groups’ names are taken from the bones’ when weights are assigned automatically.
  • The Armature modifier is added (automatically) to the object subject to the bones. Be sure that it’s the first modifier (uppermost in the stack), in particular before Subdivision Surface. It’s the original mesh we want to move, not tear pieces of the rounded one. Corollary: The bones’ deformations can be applied, like any modifier.
  • Always check the bones’ motion alignment with the parent bone, and set the bones’ Roll parameter (in the bones’ properties, icon with a bone) if necessary (in particular if the previous segment has been resized). This sets the axis in space around which the bone rotates, and has to be done manually in Edit mode. It controls the direction the bone rotates w.r.t. its origin, which is crucial for intuitive motion; otherwise the bones seem to move right, but just a little off the desired direction. Just align the square of the bone symbol with the previous segment’s direction.
  • The automatic weights aren’t all that good. In the end, there’s no way out but to assign the weights manually.
  • And the Weight painting is good for getting a picture of what’s going on. But assigning weights with it is really bad. In particular as it’s easy to mistakenly paint a completely unrelated vertex, leading to weird things happening.
  • Instead, set the weight manually under the Object data tab (just mentioned). Select the vertices in Edit Mode, write the desired weight in the dedicated place under the Object data tab, and click “Assign”. See the sketch after this list for doing the same from a script.
  • The Armature must be a parent of the object to be distorted. Extruded bones are children of the bones they’re extruded from.
  • To move around the bones (in particular, rotate them), enter Pose mode (or just click “Pose” for the relevant armature in the Object Outliner).
  • Zero the pose: Change to Pose mode, select all (A) and Pose > Clear Transform > All
  • The bones’ influence is disabled only in Edit mode (unless enabled in the Armature modifier).
  • When an object controlled by bones is duplicated, the vertex groups are duplicated as well, but not the bones. So both objects are controlled by the same bones, in a non-natural way (center of rotation on the previous bones etc.)
  • If a vertex belongs only to one group, the weight is meaningless: If it belongs to the group, it will move 100% anyhow.
  • If a vertex belongs to more than one vertex group, the weights are normalized to a total of 1.0. So it’s fine to have an overlap on the joints, but be careful with pushing it too far. Note that the bone after the joint is moved by virtue of parenting, so there’s no reason to assign weights after the joint. But doing so will weaken the effect of the bone that is supposed to move that part.
  • Note that vertex groups that have no bone don’t count for proportional motion of the vertex. For a vertex that moves less than 100% on a single bone, also assign a second vertex group that belongs to a bone that will never move. This is good for a transition with a fixed part.
  • Rotate bones with Individual Origins pivot point.
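
As mentioned a few bullets above, this is roughly what assigning weights manually looks like from a script. It’s only a sketch: the group name and vertex indices are placeholders, and the group is assumed to be named after its bone, as Blender does with automatic weights:

import bpy

obj = bpy.context.active_object          # the mesh deformed by the armature

# Fetch (or create) the vertex group named after the bone
vg = obj.vertex_groups.get("LowerArm")
if vg is None:
    vg = obj.vertex_groups.new(name="LowerArm")

# Assign a weight of 0.7 to a few vertices (indices are placeholders)
vg.add([0, 1, 2, 3], 0.7, 'REPLACE')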

Textures etc.

  • Each face is related to a material. The first material is assigned to all faces. Additional ones need to be assigned.
  • Once a material is selected in the Material property button-tab, the Texture tab relates to it.
  • Projecting an image: First mark a seam in Edit mode. Select a set of edges and Mesh > Edges > Mark Seam. Then select the faces to project (possibly all) and go Mesh > UV Unwrap… > Unwrap (or possibly Project from view or some other choice). See the sketch after this list.
  • When using UV projection, the Type is “Image or Movie”, the Source is the file, and under “Mapping” it says Coordinates: UV (otherwise the mapping in Material view will be wrong).
  • UV/Image Editor: Maps pieces of the image into faces. Use side-by-side with a 3D view in Edit mode. Enable “Keep UV and Edit mode mesh selection in sync” for easy selection (somewhere in the middle of the bottom bar). The mouse’s middle button + move mouse moves the image view (instead of Shift-scroll or something)
  • Multiple images can be sources for a single object, by virtue of generating multiple materials and assigning them to different faces. Each material is then linked to separate textures, each based upon a different image.
  • Texture paint: A little GIMP, just in 3D. The changes are updated in the source image(s). The big upper box is the brush selector. Most notable is “Clone”, which works like GIMP’s, with CTRL-click to select the source. Excellent for hiding seams.
  • Careful with overlapping UV mappings on a single image with Texture Paint: One stroke will affect all mapped regions.
  • Texture paint may manipulate several images in a single stroke, if this stroke covers regions sourced from different images.
  • If texture paint is responding slowly and eating a lot of CPU, try reducing the Subdivision Surface level, if used. Too many faces aren’t good.
  • Don’t forget to save the 2D images in the end!
  • For copying a 3D shape from a 2D image, use Global Mapping on the texture, along with a Top Orthographic view. The texture remains in place no matter how the object is twisted and turned, so it’s fairly easy to drag it along the image’s edges.
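
And the scripted equivalent of the seam-and-unwrap step mentioned above, as a rough sketch (it assumes the edges for the seam are already selected on the active mesh object):

import bpy

bpy.ops.object.mode_set(mode='EDIT')

# Mark the currently selected edges as a seam
bpy.ops.mesh.mark_seam(clear=False)

# Select all faces and unwrap; the result shows up in the UV/Image Editor
bpy.ops.mesh.select_all(action='SELECT')
bpy.ops.uv.unwrap(method='ANGLE_BASED', margin=0.001)

bpy.ops.object.mode_set(mode='OBJECT')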

Rendering

  • F12: Render Image (“Quick Render”). Also from top menu Render > Render Image. Return to 3D view with F11.
  • Shading Smooth / Flat at the Tool shelf doesn’t change the shape, but only the way light is reflected.
  • If the rendering result suffers from weird shadows, and/or unexplained edge lines on a surface that’s supposed to be smooth, try Mesh > Normals > Recalculate Outside in Edit Mode, which may fix normals that have been messed up by edits.

Cycles: How it works

If a realistic rendering result is desired, forget about Blender’s native render engine. It’s a lost battle of dirty tricks. The obvious way to reach a natural appearance is to simulate the light rays, which is what the Cycles engine does.

This is a very simplistic description of Cycles. In reality, it’s by far more clever and efficient, so the results on the real engine are better than you would expect from the description below.

For each sample (i.e. an iteration of improving the rendered image), and for each pixel to be generated on the rendered image, the render engine traces the light ray, backwards. That is, from the camera to the source of light.

The initial leg is simple, as the angle of view is known and deterministic. If this ray hits nothing, we get black. If it hits a face, it examines its material data. By hitting something, I mean the first intersection between the ray’s line and some face in the mesh.

When hitting a face, the face’s material’s shader is activated. If it’s a pure emission of light, that’s the final station, and the pixel’s value can be calculated. If it’s any other shader, it will tell the render engine at what angle to continue, and how to modify the light source, once reached. This modification is the material’s color or the texture at the specific point that was hit.

And so it goes on, until a ray hitting nothing is reached, or a pure emitting light source. Once the final station has been reached, the aggregation of color modifications is applied, and there’s the final pixel value.

So why is there randomness involved?

A diffusing surface collects light from all directions, and reflects it towards the camera. Since the light tracing can only follow one direction, it’s picked at random by the shader. So each sample consists of one such ray trace for each pixel. Each time a diffusing surface is reached, there’s a lottery. Hence the randomness. Except for pass-through and purely reflective shaders (i.e. Glossy with Roughness 0), which have a deterministic ray bending pattern.
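
To make the description above concrete, here’s a Python-flavored sketch of tracing one sample of one pixel. The scene and material objects are hypothetical, just to show the backward loop and where the randomness kicks in; the real engine is of course far more sophisticated:

import random

def trace_sample(camera_ray, scene, max_bounces=8):
    # Follow the light backwards: from the camera towards a light source
    tint = [1.0, 1.0, 1.0]                   # accumulated color modifications
    ray = camera_ray
    for _ in range(max_bounces):
        hit = scene.first_intersection(ray)  # hypothetical scene API
        if hit is None:
            return [0.0, 0.0, 0.0]           # hit nothing: black
        mat = hit.material
        if mat.is_emitter:
            # Final station: apply the accumulated tint to the emitted light
            return [t * e for t, e in zip(tint, mat.emission)]
        # Any other shader: tint by the color/texture at the hit point, and
        # let the shader pick the next direction (random for diffuse surfaces)
        tint = [t * c for t, c in zip(tint, mat.color_at(hit.point))]
        ray = mat.bounce(ray, hit, random.random())
    return [0.0, 0.0, 0.0]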

When the “Mix” shader is used, the mix rate is a real mix: Each shader gets its go, and the result is mixed. Try to mix an emission shader with a black diffusion.

So God may not play dice, but Cycles surely does.

Light is Everything

  • DON’T use Blender’s Lamps unless you want everything to look like plastic. There’s a huge difference between lamps and objects (e.g. planes) with an emission shader (both in results and render time). Use the latter for a realistic look.
  • In particular, a skin texture will never look right with lamp light. See below.
  • Creating an invisible light source: Create any object, set its shader to Emission, and go to the “Object” properties (the icon is a yellow cube). At the bottom, there’s “Cycles Settings”. Disable the “Camera” checkbox in the Ray Visibility section. See the sketch after this list.
  • To avoid seeing these emission objects when editing (they get in the way all the time), put them in a different layer. Use Ctrl-click on the relevant layer to view it along with the current one when switching to render view.
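
A small bpy sketch of the invisible light source mentioned above (with Cycles as the render engine; the plane’s location and emission strength are arbitrary):

import bpy

# An emitting plane used as a studio light
bpy.ops.mesh.primitive_plane_add(location=(0.0, 0.0, 3.0))
light = bpy.context.active_object

mat = bpy.data.materials.new("InvisibleEmitter")
mat.use_nodes = True
nodes = mat.node_tree.nodes
links = mat.node_tree.links
nodes.clear()

emission = nodes.new('ShaderNodeEmission')
emission.inputs['Strength'].default_value = 5.0
output = nodes.new('ShaderNodeOutputMaterial')
links.new(emission.outputs['Emission'], output.inputs['Surface'])
light.data.materials.append(mat)

# The "Camera" checkbox under Cycles Settings / Ray Visibility
light.cycles_visibility.camera = False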

Node Editor

  • For a texture image: Add an Image Texture node, and open the file. Then do the UV mapping (nothing will be visible before that). If there are multiple texture files, they are all mapped with the same UV map by default (or at all?).
  • Bump map: Image Texture > Bump (input Height) > Diffuse BSDF (input Normal) > Material Output (input Surface). Displaces the position along the normal; “Distance” says how much. With “Invert” unchecked, a high image value means outwards. See the node sketch after this list.
  • Use an image’s transparency: Generate a Transparent BSDF shader, and connect it to a Mix Shader’s upper input. The lower input goes to the regular (Diffuse BSDF?) shader. The Image Texture’s Color goes as usual to the regular shader, but its Alpha output to the Mix shader’s Fac.
  • Glossy BSDF: Mirror-like reflection when Roughness is set to zero, otherwise it’s diffusing the reflection.
  • Velvet BSDF: Low angles between incident and reflection yield low reflection, so it emphasizes smooth contours. Good for combination with Diffuse shader for simulating human skin (compensate for too dark edges of the latter).
  • Emission: Not just as a light source, but also a way to fake fill light.
  • Color Ramp: Useful to turn an image into a one-dimensional range of colors, including Alpha, instead of manipulating the texture’s range.
  • The Geometry input supplies Normal (which is after smoothing; pick True Normal for without) and Incoming (which is the direction of the light ray). Along with Converter > Vector Math set to Dot Product or Cross Product, the value output with these two combined depends on the angle between the incident ray and the normal. Together with Color Ramp, this allows an arbitrary reflection pattern (use for Fac on some Mix shader).
  • The Voronoi texture (using “Cells”) is great for simulating an uneven, grainy surface.
  • To get a generally misty atmosphere, go to the World tab in Properties, and under Volume select Volume scatter with white color and Density of 0.1 to 0.2. Anisotropy should be 0.
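
A bpy sketch of the bump map chain from the list above (Image Texture > Bump > Diffuse BSDF > Material Output). The image path is a placeholder, and it assumes the active object already has a UV map:

import bpy

mat = bpy.data.materials.new("BumpySurface")
mat.use_nodes = True
nodes = mat.node_tree.nodes
links = mat.node_tree.links
nodes.clear()

tex = nodes.new('ShaderNodeTexImage')
tex.image = bpy.data.images.load("/tmp/bump.png")   # placeholder path

bump = nodes.new('ShaderNodeBump')
diffuse = nodes.new('ShaderNodeBsdfDiffuse')
output = nodes.new('ShaderNodeOutputMaterial')

# Image value -> height -> perturbed normal -> diffuse shader -> output
links.new(tex.outputs['Color'], bump.inputs['Height'])
links.new(bump.outputs['Normal'], diffuse.inputs['Normal'])
links.new(diffuse.outputs['BSDF'], output.inputs['Surface'])

bpy.context.active_object.data.materials.append(mat)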

Achieving human skin appearance

Making a model look human and alive is the worst struggle of all. I’ve seen a lot of crazy attempts to add complicated shaders and stuff to reach a natural skin appearance. Even though I haven’t managed to get a face look natural (good luck with that), these are the insights I have reached.

  • Rule zero: Use Cycles. Should be obvious.
  • Rule number one: DO NOT USE LAMPS. All generation of light should be done with objects (most likely flat planes) with (white) emission shaders. Any inclusion of lamp objects makes everything look like plastic. Rendering convergence is indeed faster with lamps, but the result is disastrous, even when lamps are used for just fill light. In short, create real studio lighting.
  • There’s no need for subsurface scattering and all those crazy shaders. These are a result of the impossible attempt to tweak the reflection to get something realistic in response to the plastic feel of lamp light. When the light is done properly, plain shaders are enough. Actually, Subsurface Scattering makes a marginal difference, and for the worse (deepens shadows, while actual skin somehow reflects in all directions).
  • The Glossy part of flat skin (e.g. a leg) should be GGX (default) with roughness ~ 0.5. Diffuse with roughness 0.4 (doesn’t matter so much), mixed 50/50. Use the texture’s color for the Glossy shader as well (or mix partly with white). See the node sketch after this list.
  • And here’s the really important part: Natural skin is full of small bruises and other uneven coloring that we barely notice when watching with the naked eye. It’s when this uneven coloring is gone (a woman wearing tons of makeup or a 3D model) that it looks like plastic. Therefore, the texture applied on the skin area (i.e. the coloring of the faces) should be aggressively uneven, with speckles and also wide areas of slight discoloring. Adding a leathery bump texture and/or wrinkles adds to the realistic look, but won’t get rid of the plastic feel unless the lighting is done right and the texture is alive.
  • For the depth pattern of the skin, either use the Voronoi texture (see this page) on leather, or consider looking for images of elephant skin or something (the cell texture is similar). This is mainly relevant if closeups are made on the skin.
  • Realistic eye: Be sure to add a cornea to the eye, mixing 90% transparent and 10% glossy shaders. The cornea’s ball should be 66% of the size of the eyeball, and brought to cover a little more than the iris. The reflection of the cornea brings the eye to life.
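
For the 50/50 Diffuse/Glossy mix mentioned above, a minimal bpy node sketch (in practice the skin texture’s color should feed both shaders’ Color inputs; that part is omitted here for brevity):

import bpy

mat = bpy.data.materials.new("SkinSketch")
mat.use_nodes = True
nodes = mat.node_tree.nodes
links = mat.node_tree.links
nodes.clear()

diffuse = nodes.new('ShaderNodeBsdfDiffuse')
diffuse.inputs['Roughness'].default_value = 0.4

glossy = nodes.new('ShaderNodeBsdfGlossy')
glossy.distribution = 'GGX'
glossy.inputs['Roughness'].default_value = 0.5

mix = nodes.new('ShaderNodeMixShader')
mix.inputs['Fac'].default_value = 0.5        # 50/50 mix
output = nodes.new('ShaderNodeOutputMaterial')

links.new(diffuse.outputs['BSDF'], mix.inputs[1])
links.new(glossy.outputs['BSDF'], mix.inputs[2])
links.new(mix.outputs['Shader'], output.inputs['Surface'])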

Animation

  • Animation adds an Animation object to the controlled object’s hierarchy (with an ArmatureAction sub-object for Armatures).
  • Key = Nailing down some properties of some object for a given frame.
  • Don’t expect to change the pose and have all changes recorded.
  • Rather, in the Timeline Editor, select the desired bones in the armature for keying (all bones of the armature, probably), pick which properties are being keyed (possibly just Rotation for plain motion) and click the key icon (“Insert Keyframe”).
  • Keying Set = The set of objects whose properties are being keyed.
  • Dope Sheet: Accurate, concise and gives control. Each channel is a property, each diamond is a key for that property. Thick lines between diamonds show that they haven’t changed along that time.
  • Selection of keys: With right-click. Selecting the top diamond (“Dope Sheet Summary”) selects all keys of a frame (the Armature’s diamond selects all keys of an armature etc.)
  • It’s possible to Copy-Paste keys with the clipboard icons at the bottom (or simply CTRL-C, CTRL-V). “Copy” relates to just the selected keys.
  • In the Dope Sheet, use Shift-D and then G (grab) to copy all keys to another frame. Also possible to just Grab keys to adjust the timing etc.
  • “Insert Keyframe” = store the properties of the current pose in the current positions. In Timeline Editor, this adds diamonds in the channels that correspond to the selected bones (or adds these channels). It doesn’t change or delete keys for bones not selected.
  • Work flow: First, select the properties that are going to be involved (all bones of an armature?), and create a key for them in the Timeline Editor. The rest of the work is done in the Dope Sheet: Scrub to the desired frame, change the pose, and Key > Insert Keyframe > All Channels (or with I). Or possibly just selected channels, to leave the other channels interpolating as before. See the sketch after this list for the scripted equivalent.
  • Note the difference between how the Timeline and the Dope Sheet store the pose: The Timeline stores the properties of the selected bone only, while the Dope Sheet allows storing “All Channels”. Assuming that all relevant properties have channels in the Dope Sheet (it’s a good idea), “All Channels” captures the entire pose (and marks those that haven’t changed).
  • Careful with jumping in time by accidentally clicking in the Timeline / Dope Sheet: It overrides all changes in the pose. To avoid this, “save your work” by “Inserting Keyframes” often.
  • Don’t forget to move to a new frame before working on the next pose. If you do, copy the current frame’s keyframes into the clipboard, create a new keyframe with the current pose, and paste the previous keyframes into a slightly earlier frame. And then move (grab) the keys in time to their correct places.
  • It’s possible (but usually pointless) to set the interpolation mode in the Dope Sheet (Key > Interpolation Mode). This controls the interpolation of the selected key until the next one. The default (set in User Preferences > Editing) is Bezier, which gives a natural feel.
  • However the “Constant” interpolation can be useful for camera properties, when it’s desired to hold it still and then jump to other parameters (i.e. a “cut”).
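
And the scripted counterpart of “Insert Keyframe” on rotation for all bones, as a rough sketch (it assumes the active object is the armature, posed as desired):

import bpy

arm = bpy.context.active_object          # the armature
frame = bpy.context.scene.frame_current

# Key the rotation of every pose bone at the current frame, roughly what
# "Insert Keyframe" with a Rotation keying set does for all bones
for pbone in arm.pose.bones:
    pbone.keyframe_insert(data_path="rotation_quaternion", frame=frame)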

Simulation

  • Plain Physics fluid (simple example): It’s a 3D-grid based simulation running in a limited space, which is enclosed by the object to which the Physics > Fluid physics is attached with the “Domain” type (it’s the walls of the container as well as the limits of the simulated region). The Physics properties of this object are those determining the simulation (in particular the time scale in seconds via the End time, and the real-life size in meters). And the baking is done on this object. Other objects that have Physics > Fluid attached will participate according to their Type, e.g. Fluid (the object will turn into fluid) or Obstacle (which limits the motion of the fluid).