systemd: Reacting to USB NIC hotplugging (post-up scripting)

The problem

Using Linux Mint 19, I have a network device that needs DHCP address allocation connected to a USB network dongle. When I plug it in, the device appears, but the DHCP daemon ignored eth2 (the assigned network device name) and didn’t respond to its DHCP discovery packets. But restarting the DHCP server well after plugging in the USB network card solved the issue.

I should mention that I use a vintage DHCP server for this or other reason (not necessarily a good one). There’s a good chance that a systemd-aware DHCP daemon will resynchronize itself following a network hotplug event. It’s evident that avahi-daemon, hostapd, systemd-timesyncd and vmnet-natd trigger some activity as a result of the new network device.

Most notable is systemd-timesyncd, which goes

Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.

twice, once when the new device appears, and a second time when it is configured. See sample kernel log below.

It’s not clear to me how these daemons get their notification on the new network device. I could have dug deeper into this, but ended up with a rather ugly solution. I’m sure this can be done better, but I’ve wasted enough time on this — please comment below if you know how.

Setting up a systemd service

The very systemd way to run a script when a networking device appears is to add a service. Namely, add this file as /etc/systemd/system/eth2-up.service:

[Unit]
Description=Restart dhcp when eth2 is up

[Service]
ExecStart=/bin/sleep 10 ; /bin/systemctl restart my-dhcpd
Type=oneshot

[Install]
WantedBy=sys-subsystem-net-devices-eth2.device

And then activate the service:

# systemctl daemon-reload
# systemctl enable eth2-up

The concept is simple: A on-shot service depends on the relevant device. When it’s up, what’s on ExecStart is run, the DHCP server is restarted, end of story.

I promised ugly, didn’t I: Note the 10 second sleep before kicking off the daemon restart. This is required because the service is launched when the networking device appears, and not when it’s fully configured. So starting the DHCP daemon right away misses the point (or simply put: It doesn’t work).

I guess the DHCP daemon will be restarted one time extra on boot due to this extra service. In that sense, the 10 seconds delay is possible better than restarting it soon after or while it being started by systemd in general.

So with the service activated, this is what the log looks like (the restarting of the DHCP server not included):

Nov 11 11:25:54 kernel: usb 1-12: new high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:54 kernel: usb 1-12: New USB device found, idVendor=0bda, idProduct=8153
Nov 11 11:25:54 kernel: usb 1-12: New USB device strings: Mfr=1, Product=2, SerialNumber=6
Nov 11 11:25:54 kernel: usb 1-12: Product: USB 10/100/1000 LAN
Nov 11 11:25:54 kernel: usb 1-12: Manufacturer: Realtek
Nov 11 11:25:54 kernel: usb 1-12: SerialNumber: 001000001
Nov 11 11:25:55 kernel: usb 1-12: reset high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 kernel: r8152 1-12:1.0 eth2: v1.09.9
Nov 11 11:25:55 mtp-probe[59372]: checking bus 1, device 125: "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-12"
Nov 11 11:25:55 mtp-probe[59372]: bus: 1, device: 125 was not an MTP device
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12
Nov 11 11:25:55 systemd-networkd[65515]: ppp0: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet8: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet1: Link is not managed by us
Nov 11 11:25:55 networkd-dispatcher[1140]: WARNING:Unknown index 848 seen, reloading interface list
Nov 11 11:25:55 systemd-networkd[65515]: lo: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: eth2: IPv6 successfully enabled
Nov 11 11:25:55 systemd[1]: Starting Restart dhcp when eth2 is up...
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 vmnetBridge[1620]: Removing interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12/1-12:1.0
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:55 kernel: userif-3: sent link down event.
Nov 11 11:25:55 kernel: userif-3: sent link up event.
Nov 11 11:25:57 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:57 systemd-networkd[65515]: eth2: Gained carrier
Nov 11 11:25:57 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:57 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv4 with address 10.20.30.1.
Nov 11 11:25:57 avahi-daemon[1115]: New relevant interface eth2.IPv4 for mDNS.
Nov 11 11:25:57 avahi-daemon[1115]: Registering new address record for 10.20.30.1 on eth2.IPv4.
Nov 11 11:25:57 kernel: r8152 1-12:1.0 eth2: carrier on
Nov 11 11:25:57 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWADDR: index:848, addr:10.20.30.1
Nov 11 11:25:57 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:58 kernel: userif-3: sent link down event.
Nov 11 11:25:58 kernel: userif-3: sent link up event.
Nov 11 11:25:59 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv6 with address fe80::2e0:4cff:fe68:71d.
Nov 11 11:25:59 avahi-daemon[1115]: New relevant interface eth2.IPv6 for mDNS.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Gained IPv6LL
Nov 11 11:25:59 avahi-daemon[1115]: Registering new address record for fe80::2e0:4cff:fe68:71d on eth2.*.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Configured
Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:59 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).

As emphasized in bold above, there are 4 seconds between the activation of the script and systemd-networkd’s declaration that it’s finished with it.

It would have been much nicer to kick off the script where systemd-timesyncd detects the change for the second time. It would have been much wiser had WantedBy=sys-subsystem-net-devices-eth2.device meant that the target is reached when it’s actually configured. Once again, if someone has an idea, please comment below.

A udev rule instead

The truth is that I started off with a udev rule first, ran into the problem with the DHCP server being restarted too early, and tried to solve it with systemd as shown above, hoping that it would work better. The bottom line is that it’s effectively the same. So here’s the udev rule, which I kept as /etc/udev/rules.d/99-network-dongle.rules:

SUBSYSTEM=="net", ACTION=="add", KERNEL=="eth2", RUN+="/bin/sleep 10 ; /bin/systemctl restart my-dhcpd"

Note that I nail down the device by its name (eth2). It would have been nicer to do it based upon the USB device’s Vendor / Product IDs, however that failed for me. Somehow, it didn’t match for me when using SUBSYSTEM==”usb”. How I ensure the repeatable eth2 name is explained on this post.

Also worth noting that these two commands for getting the udev rule together:

# udevadm test -a add /sys/class/net/eth2
# udevadm info -a /sys/class/net/eth2

A VoIP phone at home: The tech details on leaving your phone company

Introduction

This is some information and hard-learned wisdom I collected while setting up an Israeli phone number for use with a VoIP phone, so I can accept calls with a regular Israeli phone number and also make outbound calls. Important to note is that I did this without the local ISPs that provide this service. In particular, I did this for the sake of leaving Israeli Netvision’s service, which required a certain arrangement between the phone adapter and the ADSL line.

First and foremost: Setting up a VoIP line for a regular phone number is a time consuming task, which requires quite a bit of technical understanding in computer networks. Both service providers and hardware vendors behave as if only phone experts deal with them, and that’s the kind of support and documentation to expect. Unless you’re familiar with some internet protocols and know how to configure a firewall and work with a sniffer, you’re in for a big time frustration.

The biggest, ehm, alternative truth, related to VoIP is “phone line within minutes”. Expect it to take a few days at best. On a good day, this will be because of your own learning curve. And if you’re not the computer geek type to enjoy exploring a technical topic, odds are that it’s going to be extremely annoying.

The VoIP trinity

There are basically three components that need to be set up for this to work:

  • A phone number must be allocated or ported. This is referred to as a Direct Inward Dial or Direct Dial In service (DID / DDI). This means that some phone company knows it should relay phone calls of that number to a VoIP service provider, rather than to its own internal phone network.
  • An SIP trunk for connecting calls between yourself and the VoIP supplier over the internet. Or something of a similar nature.
  • A VoIP client: A program running on your computer / smartphone or a physical VoIP phone. Just remember that those innocent-looking handsets are actually small Linux computers with a web interface for configuring the network connection (SIP details and credentials, connection method to the Internet and whatnot). Most of the disappointed reviews on these phones are from people who expected a plug and play experience.

Typically (or always?), the first two items are a package deal, supplied by a single service provider. So you’ll sign up for the phone number, and once it has been set up, you’ll set up the VoIP connection with the same company’s servers.

Note that the DID phone number doesn’t have to be in your own country. As a matter of fact, your geographic location doesn’t matter. So if you want to supply local number to dial into for customers worldwide, this is the way. However if you want to be in contact with friends and family abroad, this is way too much work to set up.

And since I’m at it — a DID provider doesn’t necessarily relay to a VoIP network. The service might very well be plain forwarding to another regular phone number. This might be the optimal solution just to create an international presence, but without the VoIP headache.

SIP trunks…?

This is the most confusing issue. All I want is a single phone line, and suddenly I get the word “SIP trunk” everywhere. Do I really need it? I want a glass of milk, and I get the whole cow? The answer is, well, yes. But no fear, this cow supplies exactly the glass of milk you’ll need.

So first, let’s understand why it’s called an SIP trunk: For the sake of argument, say that you have an office with a number of phone lines connected to an in-office phone relay, and all internal phones can now connect with each other. For an outside world connection, a pool of voice circuits is set up, traditionally (that is, 1990-ish) through a single high-speed (~2 Mb/s) digital link to the local telephone company. This digital link is called a “trunk”, and phone calls in and out are allocated in TDM-multiplexed slots on the digital link as these calls are initiated. The telephony company allocates phone numbers, and when someone calls one of these, the trunk is used to carry both the signaling (which phone number is called) and voice into the office.

The “SIP trunk” uses an existing internet connection for the same thing. Instead of a fixed wire, UDP packets carry both signaling and voice. Instead of the regular phone company, an SIP server makes the connection between the VoIP link and real phone numbers. The in-office telephone relay registers itself on the SIP server, tells it what phone number it covers (so incoming calls are relayed to it) and proves it’s authenticity with some passphrase.

After registration, the in-office phone relay can initiate outgoing calls as required by someone in the office calling out, or accept inbound calls, in which case it will ring one of the phones in the office.

And here’s the point: In the end of the day, that VoIP handset does the same as that in-office phone relay: Requests outgoing calls and accepts incoming calls, with the same SIP protocol. That’s why some kind of trunk is set up (a SIP trunk or some other type) with the capacity of typically one phone number.

That also explains why the whole thing gets complicated: Setting up that little IP phone, you get into the shoes of a small business’ phone technician (of the rather high-end type, actually). So there are a few technical details to understand, and the service providers are somewhat adapted to work with people that do this for a living. You’re supposed to know what you’re doing.

Setting up a simple software SIP client

The protocol for maintaining the telephone signalling and session is called SIP. Hence either a piece of software running on the computer or some dedicated hardware phone can do the job. Or a combination of both.

After some looking around, I went for Linphone for this purpose. It’s simple and to the point.

# apt install linphone

How to set up an SIP connection on Linphone: Dismiss the setup assistant. Instead, go for Options > Preferences. Leave the Network settings with their default values: SIP (UDP) at port 5600, Direct connection to the Internet (even though I have iptables doing both firewall and NAT, however it allow established and related connections). In Multimedia settings, leave the echo canceling on (not that it helps much).

Then go to the “Manage SIP Accounts” tab, and go for the ” + Add” button (not the Wizard). Set Your SIP Identity to e.g. “sip:123456@core-sip-qts.avoxi.com” and SIP Proxy address to “sip:core-sip-qts.avoxi.com”. The “123456″ should be replaced with the DID phone number (which is also the first part of the credentials given by the service provider). Leave “Route” empty and registration duration is set automatically to 3600 secs, which is fine. I use Avoxi’s SIP server in this example, but see below on my choice of service provider.

Why the proxy address is used twice is beyond me (once in the identity and then as the proxy address), but this is commonly seen.

Immediately on clicking OK, Linphone prompts for the password. The password, also given by the service provider, is prompted for only on the first attempt to connect. A couple of seconds after supplying the password, the status line at the bottom of Linphone’s main windows reads “Registration on <core-sip-qts.avoxi.com> succesful”. It better be.

A debug log is available with Help > Show debug window. It shows, among a lot of other stuff, the SIP protocol exchange.

Keep in mind that linphone keeps running even after closing the main window. To really quit it, do so in the icon on the desktop’s toolbar.

A phone adapter

If you want to keep your regular phone, a SIP adapter will do the job. Even though I didn’t go this path, I considered a PAP2T from AliExpress at $20 or something. No experience with it however.

A real IP phone: Grandstream GXP1610

First, some general words: There are several SIP phones out there, and the reviews on this specific one are mixed, and there’s a good reason for that. It’s generally OK, with a lot of specified featured, but at the same time it misses on the small details. For example, it has a wall mount option, but the handset will fall off if you really try that. It can’t be flat on the table either, because the plugs are in the back. So only the table upright position is an option.

The ring tone options are rather poor for a machine that is effectively a computer. Setting it up to allow simple dialing of local phone numbers is a riddle. And the documentation is pretty lacking. There’s a lot of detail on esoteric issues, much less how to get started with the obvious stuff. Once can imagine the computer geeks adding more and more software features, but with nobody looking at the overall usage experience.

That said, it’s fine once set up, in particular if it’s intended for sporadic use, and it’s low-cost. It makes sense in an office, where there’s an IT person to handle the installation and setup.

So now to how to set it up for simple use.

First, some documentation: Download the User Guide from Grandstream’s resource page for the simple use. For configuration, download the Administration Guide. There’s also a Security Manual, which goes through a few security options with the phone.

Plug in the power supply and wait for a few seconds. The phone says “Booting” and then it boots for a while (it takes a minute, like almost exactly 60 seconds). Plug a the phone’s “LAN” Ethernet jack to the local LAN.

This is the time to mention, that the phone’s web configuration interface is on the same port as the voice communication. It’s quite common to do the configuration on a separate Ethernet port, but this is not the case. This means that anyone with access to phone from the LAN (or web?) can fiddle with its configuration. Maybe a good idea in an office with an IT department handling the phones. So setting up a firewall to prevent intrusion from outside is a good idea (if possible). There’s a separate “PC” Ethernet port, but it’s not clear what it is for.

The phone functions as a DHCP client by default, so it gets its address and displays it on the LCD (or press NextScr). Address 0.0.0.0 means that no address has been obtained with DHCP and static IP is disabled.

If DHCP isn’t enabled, press the button in the middle of the four arrows, and navigate: System > Network > IPv4 Settings, select DHCP. The menu returns to IPv4 Settings, meaning it has accepted the selection. Pressing “back” makes the phone ask if we want to reboot, so yes.

To get started, open a browser and type the IP address of the phone. The web interface asks for username and password, and it’s admin/admin, not surprisingly. The web app forces a password change if these are used. Note to self: Look for the phone-login-password.txt file.

To set up a SIP account, go to Accounts > Account 1 > General Settings and fill in the Account Name, SIP Server, SIP User ID (without the “@” and proxy server) and Authenticate Password. Click “Save and Apply”. Then check with Status > Account Status. SIP Registration should say Yes, meaning that the phone is functional (and there’s an icon on the LCD screen, see below).

Then enter Accounts > Account 1 > Audio Settings and set the Preferred Vocoder – choice 1 to G.722, then PCMA, then PCMU. These sounded best in my tests.

There’s also the Accounts > Account 1 > Call Settings which allows setting up local area codes and restrictions, but I didn’t bother — it looks like a riddle in regular expressions. I don’t expect to call out a lot from this phone, so I’ll use the full international number when necessary.

The ring tone: There are four ring tones to choose from, available on the LCD menu under Preferences > Ring Tone. Aside from the “default ring tone” there are three not-so-impressive choices.

The “default ring tone” can be configured through the web interface, but only as a composition of two frequencies. This is set as the “System Ring Tone” on Settings > Preferences > Ring Tone. The default is a plain dual tone going on and off (defined with the string “f1=440,f2=480,c=200/400;”). I changed it to something lighter with “f1=440,f2=480,c=10/30-30/170;”, which is one 100 ms tone, 300 ms pause, then 300 ms tone and 1700 ms pause. The ring volume can be adjusted with the arrow keys: Preferences > Ring Volume.

Finally, General Settings > Preferences > Date and Time (in the web GUI) for setting the correct time zone. Don’t expect it to get the daylight saving time correctly (at least not in Israel).

The registration status is indicated on the icon at the LCD screen’s upper left:

  • A filled T shape: Properly registered ready for phone calls
  • Same, but hollow icon (the “T” is absent): A LAN connection is present but no valid registration is in effect (possibly because of a rejection by remote SIP server)
  • No icon: No connection to the LAN port.

A DID / DDI provider

A Direct Inward Dial or Direct Dial In service is required to relay incoming phone calls to over an VoIP link. There are a lot of providers to choose from. I looked for one that could port an Israeli number to their service, and with a low monthly price.

To make a (very) long story short, I had two finalists: DIDWW and Avoxi. Spoiler: I went for DIDWW. But the way there was interesting.

I first started checking with DIDWW, mainly because their monthly fee was lowest (setup fee of $2.50 and then a monthly fee of $2.50 for all phone numbers, with $0.01 per minute fee, including incoming calls). But there were some worrying signs. First, I had to create an account and log in, just to see their prices. And even worse, at the very first order, one is required to top up a balance of $50, just to begin. One gets aware of this only after registering and at the last stage of making the order. So it looked like a bit of an ugly sales trick. Their “Terms and Conditions” explicitly says that this isn’t reimbursable.

So do they charge $50 upfront to make me a hostage, or is it a sign of serious intentions? I decided to check up Avoxi. $4.50 monthly at lowest plan, $0.04 / minute. More expensive, but no upfront payments. Actually, they offered a free phone line to try for a while.

This is where I’ve deleted quite a few lines describing a lot of good intentions but not so much competence by Avoxi’s support, that eventually pushed me back to DIDWW. I put the $50 on the table, and soon enough I had a working phone line to test. From there I went through the porting procedure with them. It was no fraud, it turned out. They have a pdf document with prices — it’s not something they fiddle with. And their support it quick and to the point, at least so far.

Setting up a DIDWW account

DIDWW has a proper web interface for helping yourself, but this is VoIP, and you’re supposed to behave like this is what you’ve been doing all your life. So this is a short survival guide to follow after having a phone number allocated on your account.

So here it comes: The DID number is linked with a “Phone Systems” trunk, which is apparently a powerful tool for routing phone calls between queues, DTMF menus, voice mail, fax and well, human response (see user manual). It’s a bit of a cow when needing a glass of milk (a recurring motive in this post), but this is how DIDWW offers a SIP phone connection to their numbers. Plus a lot of features to add on later.

Sign up for the “Phone Systems” product at the bottom of the dashboard to the left, “Lifetime Free” plan, which covers exactly one circuit, and launch the service in web interface. Follow their tutorial on setting up an SIP account with Phone Systems.

After this, there’s a “Phone Systems” trunk in the list of Voice IN trunks (or add one). The CLI possibilities merely allow selecting how Calling ID is displayed, so it’s not that important. Check “Map all DIDs” so the phone numbers are related to this trunk.

Then to the DIDWW Phone System’s interface: Click the menu icon at the top right, pick Add a New Contact. Add the name and then pick “Add a New Contact Method”. Pick “SIP Account”. Enable “Enable outbound calls” and select the External caller ID to use. Don’t enable “Allowed IPs”.

After clicking “Save”, pick the SIP details drop-down to see the SIP access information. And then pick “Finish”.

Now some graph games to make this phone line live. Exit the right-side settings, and drag-drop “Phone number” into the canvas. Select the desired phone number, and a name to appear on the icon on the canvas. Click Save.

Add a “Ring Group” the same way, and add a single ring destination, namely the contact you set up before. And connect a wire between the two. And that’s it! It should look something like this:

DIDWW Phone Systems GUI screenshot(click to enlarge)

At “My DID Numbers”, the relevant number should appear, with the Trunk set to the Phone Systems trunk set up for this phone number. Capacity should be 100 with a green dot. If it isn’t, click the dot and set up “Pay Per Minute”.

RTP packets came from 46.19.210.34, which is located in Ireland with a 84 ms ping from Israel, so it’s relatively fine.

The parameters are something like:

Username: h8rn7gkb0p
Password: ymqhn2f50k
Domain: sip.phone.systems

Note that if you’re using Linphone, the actual user name given to the SIP phone is sip:h8rn7gkb0p@sip.phone.systems. But when setting the same thing on a Grandstream IP phone, it’s just h8rn7gkb0p. You should know this. You’ve been doing this all your life.

Porting a phone number to DIDWW

It’s a matter of filling in web form with the desired number. Then upload a Letter of Authorization (short thing), some kind of ID (passport in my case, so it’s in English) and the latest invoice from the previous supplier. On the next business day I got an email confirming that the phone number is portable, asking me to confirm the porting fee and the operation in general on the web interface (on the “porting” tab). Clicking on the link that came with the email, I got the message “You have no Portable items”, and instead the phone number was under “In Progress”. Which makes sense, because the porting fee is zero (in Israel). So the email was somewhat misleading — there was no action necessary from my side.

Actually, there was no more to do from my side. 10 days after I submitted the application (which was on a Saturday) I got an email saying that the porting was finished. That was one day after the date given as the target during the process, but well within the official lead time. So it was really quick and painless. Phone number is up and running.

Troubleshooting

So what if it just doesn’t work? You go to the sniffer. But what should it look like? So here’s an example of packets on wire.

First, registration. Rule number one is that if the SIP server doesn’t like the user ID and/or domain of the REGISTER request, it responds with a 403 Forbidden response. It doesn’t ignore the request. So if there’s no response at all, it’s not a matter or user identity or anything of that sort. Odds are you’re talking with the wrong server.

Now to a session between the Grandstream phone and the SIP server that works with DIDWW. Only packet content is shown below. The user name shown below is not valid anymore. The phone’s IP address on the LAN it works on is supposedly 10.11.12.13. Not clear why this is exposed in the SIP session.

So the phone says hello with

REGISTER sip:sip.phone.systems SIP/2.0
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK1124928411;rport
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2221 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>"
Authorization: Digest username="h8rn7gkb0p", realm="sip.phone.systems", nonce="XchKwF3ISZRG4VSl3lYx9geIizu9DZUGlS27c4A=", uri="sip:sip.phone.systems", response="fb33a3645a58de7ae1fab14d23196de8", algorithm=MD5
Max-Forwards: 70
User-Agent: Grandstream GXP1610 1.0.4.128
Supported: path
Expires: 3600
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

to which the server says “nice try, but you have to prove me your love first”:

SIP/2.0 401 Unauthorized
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK1124928411;rport=5060;received=109.186.90.35
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>;tag=3cbf29d5022d29bd5eb970c4fa286be5.83a2
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2221 REGISTER
WWW-Authenticate: Digest realm="sip.phone.systems", nonce="XchOhl3ITVpwj9Uu+eGasPntV1gjDNiVlT6+ZYA="
Server: hedgehog v7p0
Content-Length: 0

Scary, huh? It says Unauthorized. It makes it look like an error. It isn’t. Some just say “no” to begin with.

So the phone says “I know the answer to your challenge” (using the password to produce a digest):

REGISTER sip:sip.phone.systems SIP/2.0
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK489633282;rport
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2222 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>"
Authorization: Digest username="h8rn7gkb0p", realm="sip.phone.systems", nonce="XchOhl3ITVpwj9Uu+eGasPntV1gjDNiVlT6+ZYA=", uri="sip:sip.phone.systems", response="9ff65eb2e0c784af08cd11cc1a7a489f", algorithm=MD5
Max-Forwards: 70
User-Agent: Grandstream GXP1610 1.0.4.128
Supported: path
Expires: 3600
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

The server is impressed by the persistence, and opens its doors:

SIP/2.0 200 OK
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK489633282;rport=5060;received=109.186.90.35
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>;tag=9ac8b8c8d68bc095abf326021301853f-0b5b
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2222 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;expires=1800;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>";reg-id=1
Server: hedgehog v7p0
Content-Length: 0

This concludes the registration.

And then, every 30 seconds, the server asks (this is specific to Phone Systems):

OPTIONS sip:h8rn7gkb0p@10.11.12.13:5060 SIP/2.0
Via: SIP/2.0/UDP 46.19.209.28:5060;branch=z9hG4bK5106755
From: sip:keepalive@sip.phone.systems;tag=uloc-18-5dbada3b-30f7-092741-1997ebd9-80e8b223
To: sip:h8rn7gkb0p@10.11.12.13:5060
Call-ID: 51e402d-38b1e157-4dd7e13@46.19.209.28
CSeq: 1 OPTIONS
Content-Length: 0

and the phone responds with

SIP/2.0 200 OK
Via: SIP/2.0/UDP 46.19.209.28:5060;branch=z9hG4bK5106755
From: <sip:keepalive@sip.phone.systems>;tag=uloc-18-5dbada3b-30f7-092741-1997ebd9-80e8b223
To: <sip:h8rn7gkb0p@10.11.12.13:5060>;tag=394229760
Call-ID: 51e402d-38b1e157-4dd7e13@46.19.209.28
CSeq: 1 OPTIONS
Supported: replaces, path, timer
User-Agent: Grandstream GXP1610 1.0.4.128
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

The purpose of this eternal nagging is most likely to refresh any firewall’s memory on the existence of a UDP link, in particular if there’s NAT involved (more on this below). Maybe also for checking that there’s still a phone on the other end (not sure if it’s so important, from a server’s perspective).

With Linphone connected to Avoxi, it was the client that kept the UDP link alive with some short dummy UDP packets. Looks like this is down to each phone service.

Does it work with NAT?

The short answer: It does for me, out of the box. No need for any special firewall rule or something. With plain Linux iptables NAT, that is. But if your phone is behind NAT or a firewall, be sure to check that some kind of keepalive UDP packets are exchanged every minute or so. Otherwise, the firewall might forget the UDP connection and not let through an incoming call.

This page explains a typical handshake over a NAT router.

I found interest in this after failing to receive inbound calls with Linphone despite having no issues whatsoever with outbound calls. In other words, no problem dialing from the computer, but unavailable when trying to dial to it.

This had nothing to do with NAT and firewalls. The INVITE requests that start off a phone conversation are sent through the UDP link that is constantly maintained with keepalive packets. Hence the server knows at which IP address it should find the SIP client, and the NAT / Firewall remembers the UDP link. So the rule is that if the registration went through fine, there are no excuses. If sound doesn’t come through after the phone is picked up, that’s another story, however iptables should handle this well if it set to allow related connections (and it should).

One thing that surprised me was that the audio UDP (RTP) packets start streaming as soon as the phone starts ringing on the other side. This is the common practice with cellular phones, and still. Even more surprising was that even though they came from a completely different server, using a UDP port that is unrelated to anything before. How did the NAT know how to forward this?

The answer lies in a UDP packet sent from the “regular” SIP host, saying (example with Avoxi server, some numbers xxx’ed):

183 Session Progress
Via: SIP/2.0/UDP 10.1.1.22:5060;received=109.186.xx.xx;branch=z9hG4bK278239419;rport=5060
Record-Route: <sip:199.244.96.39:5060;transport=udp;lr>
Contact: sip:199.244.96.46:5070
To: <sip:9724xxxxxx@core-sip-qts.avoxi.com>;tag=rvguhtm5fjvevrnd.i
From: <sip:9723xxxxxxx@core-sip-qts.avoxi.com>;tag=680205676
Call-ID: 1519032647
CSeq: 21 INVITE
Allow: INVITE, ACK, BYE, CANCEL, INFO, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS, UPDATE
Content-Type: application/sdp
Server: Sippy
Content-Length: 240

v=0
o=Sippy 219713410032301436 1 IN IP4 199.244.96.46
s=SIP Media Capabilities
t=0 0
m=audio 49610 RTP/AVP 0 101
c=IN IP4 199.244.96.46
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-15
a=sendrecv
a=ptime:20

This is the last stage in the Session Description Protocol (SIP/SDP) session, which started with the INVITE request.

And then RTP/UDP packets started arrive from IP address 199.244.96.46′s port 49610 (to destination port 7078, but that doesn’t matter). So obviously this is how the NAT got prepared to let through the related link.

xhci_hcd WARN Event TRB for slot x ep y with no TDs queued

What’s this?

There’s a chance that you’re reading this because the message in the title appeared (or flooded) your kernel log. This post attempts to clarify what to do about it, depending on how much you want to get involved in the matter.

So the short answer: The said warning message is a bug related to the xHCI USB controller, but a rather harmless one: Except for the message in the kernel log, everything is fine. This was fixed in Linux kernel v4.15, so upgrading the kernel is one way out. Alternatively, patch your kernel with commit e4ec40ec4b from December 2017, which is the fix made on v4.15. Even editing away the said xhci_warn() call is fairly sensible, if the patch doesn’t apply.

Note that the xHCI controller is used for any port that is USB 3.x capable, even when a lower version USB device is connected to it (e.g. USB 2.0). So if your computer has a few USB 2.0-only ports, moving a device to such port might be your quick fix. Look for USB plugs with black plastic instead of blue.

The rest of this post dives deeply into the whereabouts of this accident. It matters if you’re developing a USB device. Note that this post is written in increasing detail level, which makes it a bit disorganized. Apologies if it’s confusing.

Why this?

The explanation is a bit tangled, and is related to the way the xHCI driver organizes the buffers for a USB transfer.

To make a long story short(er), a software transfer request (URB in Linux) is conveyed to the hardware controller as a TD (Transfer Descriptor) which is presented to the hardware in the form of one or more Transfer Request Blocks (TRBs). How many? There’s no advantage in chopping the TD into more than one TRB for a transfer request, however there are certain constraints to an TRB. Among others, each TRB points at a chunk of the buffer in continuous physical memory. The xHCI specification also requires in its section 6.4.1 that the data buffers shall not cross 64 kiB boundaries in physical memory.

So if the transfer request’s data buffer does cross 64 kiB boundaries, the TD is split into several TRBs, each pointing at a another part of the buffer supplied in the software’s transfer request. For example, see xhci_queue_bulk_tx() in the Linux kernel’s drivers/usb/host/xhci-ring.c. This is fine, and should cause no problems.

But as it turns out, if a BULK IN transfer request is terminated by the device with a short packet, and the TD consists of more than one TRB, the xHCI hardware produces more than one notification event on this matter. This is also fine and legal, however it causes the kernel (before the patch) to issue the said warning. As already said, the warning is a bug, but otherwise there’s no problem.

So for this bug to show up, there needs to be a combination of a pre-4.15 kernel, short packets and transfer requests that have data buffers that span 64 kiB boundaries.

Huh? Short packets?

A “short packet” is a USB packet that is shorter than the maximal size that is allowed for the relevant USB version (that is, 512 bytes for USB 2.0 and 1024 bytes on USB 3.x).

The idea is that the software can request a transfer of a data chunk of any size (with some maximum, but surely several kilobytes). On a BULK IN endpoint, the device is in principle expected to send that amount of data, divided into packets, which all have the maximal size, except for the last one.

However it’s perfectly legal and fine that the device sends less than requested by the software in the transfer request. In that case, the device sends a packet that is shorter than the maximal packet size, and by doing so, it marks the end of the transfer. It can even be a packet with zero bytes, if the previous packet was of maximal length. This is called a short packet, and once again, it’s not reason for any alarm. It just means that the software must check how much data it actually got when the transfer is completed.

Since the hardware and its driver often have some dedicated protocol to coordinate their data exchange, this short packet mechanism is often unnecessary and not used.

Avoiding these warnings

Linux kernel v4.15 was released in January 2018, so when writing this it’s quite expected that older kernels are still ubiquitous.

If short packets are expected from the device, and pre-patch kernels are expected to be in use, it’s wise to make sure that the buffers don’t cross the said boundaries. Keeping them small, surely below 4 kiB is a good start, as larger buffers surely span more than one page. Also, different pages can be anywhere in physical memory, causing the need to divide the buffer into several TRBs. And then there’s the 64 kiB boundaries issue. However this isn’t practical in a high-bandwidth application, as a 4 kB buffer is exhausted in 10 μs at a rate of 400 MB/s. The software has no chance to continuously generate TDs fast enough to keep up with their completion, which will result in data transmission stops, waiting for TDs to be queued.

When writing a device driver in the kernel, it’s relatively easy to control the position of the buffer in physical memory. In particular, the kernel’s __get_free_pages() function allows this. This is however not the common practice, in particular as the buffers are typically much smaller than a page, so using __get_free_pages() would have seemed to be a waste of memory. So existing drivers are subject to this problem, and there’s not much to do about it (except for applying the patch, as suggested above).

When libusb is used to access the device (through usbfs), there is a solution, assuming that the kernel is v4.6 (released May 2016) and later + libusb version 1.0.21 and later. The trick is to use the libusb_dev_mem_alloc() function to allocate memory (i.e. the zerocopy feature), which implements a shared memory mapping with the usbfs driver, so that the data buffer that is accessed from user space is directly accessed by the xHCI hardware controller. It also speeds up things slightly by avoiding memory allocation and copying of buffers on x86 platforms, which ensure cache coherency on DMA transfers. Not sure on what happens on ARM platforms, for which cache coherency needs to be maintained explicitly.

Note that physical memory continuity is only ensured within 4 kiB pages, as the mmap() call doesn’t ensure physical memory continuity. Since virtual to physical memory translation never touches the lower 12 bits of the address, staying within 4 kiB alignment in virtual alignment ensures no 4 kiB boundaries are crossed — but this doesn’t help regarding 64 kiB boundaries.

Without libusb_dev_mem_alloc(), the usbfs framework allocates a buffer with kmalloc() for each transfer, and eventually copies the data from the DMA buffer into it. No control whatsoever on how the buffer is aligned. The position of the buffer that is supplied by the user-space software makes no difference.

Zero-copy buffers: Gory details (too much information)

Introduced in Linux kernel v4.6 (commit f7d34b445, February 2016), mmap() on the usbfs file descriptor allows the memory buffer provided by the user-space program to be used directly by the hardware xHCI controller. The patch refers to this as “zerocopy”, as there is no copying of the data, however the more important part is that the buffers don’t need to be allocated and freed each time.

The idea is quite simple: The user-space software allocates a memory buffer by calling mmap() on the file descriptor of the relevant USB device. When a URB is submitted with the data buffer pointer directed to inside the mmap’ed region, the usbfs driver detects this fact, and skips memory allocation and copying of data. Instead, it uses the memory buffer directly.

This is implemented in proc_submiturb() (drivers/usb/core/devio.c) by first checking if the buffer is in an mmap’ed segment:

as->usbm = find_memory_area(ps, uurb);

“uurb” in the context of this code is the user-space copy of the URB. find_memory_area scans the list of mmap’ed regions for one that contains the address uurb->buffer (and checks that uurb->buffer_length doesn’t exceed the region). It returns NULL if no such buffer was found (which is the common case, when mmap() isn’t used at all). Moving on, we have

if (as->usbm) {
  unsigned long uurb_start = (unsigned long)uurb->buffer;

  as->urb->transfer_buffer = as->usbm->mem + (uurb_start - as->usbm->vm_start);
 } else {
  as->urb->transfer_buffer = kmalloc(uurb->buffer_length, GFP_KERNEL);
  if (!as->urb->transfer_buffer) {
    ret = -ENOMEM;
    goto error;
  }
  if (!is_in) {
    if (copy_from_user(as->urb->transfer_buffer,
		       uurb->buffer,
		       uurb->buffer_length)) {
      ret = -EFAULT;
      goto error;
    }

So if a memory mapped buffer exists, the buffer that is used against hardware is calculated from the relative position in the mmap’ed buffer.

Otherwise, it’s kmalloc’ed, and in the case of an OUT transaction, there’s a copy_from_user() call following to populate the buffer with data.

So to use this feature, the user-space software should mmap() a memory segment that is large enough to contain all data buffers, and then manage this segment, so that each transfer URB has its own chunk of memory in this segment.

In order to do this from libusb, there’s the libusb_dev_mem_alloc() API function call, defined in core.c. It calls usbi_backend.dev_mem_alloc(), which for Linux is op_dev_mem_alloc() in os/linux_usbfs.c. It’s short and concise, so here it is:

static unsigned char *op_dev_mem_alloc(struct libusb_device_handle *handle,
	size_t len)
{
	struct linux_device_handle_priv *hpriv = _device_handle_priv(handle);
	unsigned char *buffer = (unsigned char *)mmap(NULL, len,
		PROT_READ | PROT_WRITE, MAP_SHARED, hpriv->fd, 0);
	if (buffer == MAP_FAILED) {
		usbi_err(HANDLE_CTX(handle), "alloc dev mem failed errno %d",
			errno);
		return NULL;
	}
	return buffer;
}

This capability was added to libusb in its version 1.0.21 (as the docs say) with git commit a283c3b5a, also in February 2016, and by coincidence, by Steinar H. Gunderson, who also submitted the Linux kernel patch. Conspiracy at its best.

libusb: From API call to ioctl()

This is some libusb sources dissection notes. Not clear why this should interest anyone.

From the libusb sources: A simple submitted bulk transfer (libusb_bulk_transfer(), defined in sync.c) calls do_sync_bulk_transfer(). The latter function wraps an async transfer, and then calls the API’s function for that purpose, libusb_submit_transfer() (defined in io.c), which in turn calls usbi_backend.submit_transfer(). For Linux, this data structure is populated in os/linux_usbfs.c, with a pointer to the function op_submit_transfer(), which calls submit_bulk_transfer().

submit_bulk_transfer() ends up making a ioctl() call to submit one or more URBs to Linux’ usbfs interface. The code used for this ioctl is IOCTL_USBFS_SUBMITURB, which is a libusb-specific defined as

#define IOCTL_USBFS_SUBMITURB	_IOR('U', 10, struct usbfs_urb)

in linux_usbfs.h, which matches

#define USBDEVFS_SUBMITURB         _IOR('U', 10, struct usbdevfs_urb)

in the Linux kernel source’s include/uapi/linux/usbdevice_fs.h.

The ioctl() call ends up with proc_submiturb() defined in drivers/usb/core/devio.c, which moves the call on to proc_do_submiturb() in the same file. The latter function splits the transfer into scatter-gather buffers if it’s larger than 16 kiB (where did that number come from?).

And then we have the issue with memory mapping, as mentioned above.

Extra kernel gory details: URB handling

This is the traversal of a URB that is submitted via usbfs in the kernel, details and error handling dropped for the sake of the larger picture.

proc_do_submiturb() sets up a usbfs-specific struct async container for housekeeping data and the URB (as a struct urb), puts it on a locally accessible list and then submits the URB into the USB framework with a

ret = usb_submit_urb(as->urb, GFP_KERNEL);

This finishes the usbfs-specific handling of the URB submission request. usb_submit_urb() (defined in drivers/usb/core/urb.c) makes a lot of sanity checks, and eventually calls usb_hcd_submit_urb() (defined in drivers/usb/core/hcd.c). Unless the URB is directed to the root hub itself, the function registered as hcd->driver->urb_enqueue is called with the URB. The hcd is obtained with

hcd = bus_to_hcd(udev->bus);

which merely fetches the usb_hcd structure from the usb_bus structure.

Anyhow for xHCI the list of methods is mapped in drivers/usb/host/xhci.c, where urb_enqueue is assigned with xhci_urb_enqueue() (no surprises here). This function allocates a private data structure with kzalloc() and assigns urb->hcpriv with a pointer to it, and then calls xhci_queue_bulk_tx() (defined in drivers/usb/host/xhci-ring.c) for a bulk transfer URB.

xhci_queue_bulk_tx() calls queue_trb() which actually puts 16 bytes of TRB data into the Transfer Ring (per section 6.4.1 in the xHCI spec), and calls inc_enq() to move the enqueue pointer.

Once the hardware xHCI controller finishes handling the TRB (for better or worse), it queues an entry in the event ring, and issues an interrupt, which is handled by xhci_irq() (also in xhci-ring.c). After several sanity checks, this ISR calls xhci_handle_event() until all events have been handled, and then does the housekeeping tasks for confirming the events with the hardware.

The interesting part in xhci_handle_event() is that it calls handle_tx_event() if the event was a finished transmission URB. This happens to be the function that emits the warning in the title under some conditions. After a lot of complicated stuff, it calls process_bulk_intr_td() for a BULK endpoint TD. Which in turn calls finish_td(), which returns the TD back to the USB subsystem: It calls xhci_td_cleanup(), which checks if all TDs of the relevant URB have been finished. If so, xhci_giveback_urb_in_irq() is called, which in turn calls usb_hcd_giveback_urb() (defined in drivers/usb/core/hcd.c). That function launches a tasklet that completes the URB.

USB 3.0 bandwidth efficiency: Looking at real-life DATA bursts

Introduction

This post looks at the DATA and ACK packet exchange between a device and an xHCI USB 3.0 controller for the sake of explaining the actual, measured bandwidth that is observed on a BULK IN endpoint. A certain level of knowledge of the Superspeed packet protocol is assumed.

Superspeed data flow

For the sake of bandwidth efficiency, the USB 3.x spec allows (and encourages) bursts of DATA packets. This is implemented by virtue of the NumP field in the ACK packets that are sent by the receiver of DATA packets and in response to them.

The NumP field is a number saying how many packets the receiver is capable of accepting immediately after sending the ACK packet that carries it. This gives the sender of the DATA packets a go-ahead to send several packets in response to this ACK packet. In fact, an infinite flow of DATA packets is theoretically possible if the receiver keeps sending ACK packets with a sufficiently high NumP, and there’s enough data to send.

The rules that govern the data flow are rather complicated, and involve several factors. For example, due to the inherent delay of the physical bit stream, there’s a chance that when an ACK packet arrives, its NumP field is somewhat outdated because DATA packets have already been sent on the expense of a previous ACK’s NumP. The sender of DATA packets needs to compensate for these gaps.

USB remains a control freak

Even though USB 3.0 introduces a more relaxed framework for data transport (compared with USB 2.0), the concept that the host has full control over the data flow remains. In particular, any data transfer on the bus is a direct result of an outstanding request to the xHCI controller.

More precisely, any USB data transfer begins with the USB device driver setting up a data buffer and a transfer descriptor (TD) which is a data structure that contains the information on the requested data transfer. The device driver passes on this request to the USB controller (xHCI) driver, which adds it to a queue that is directly accessible by the hardware USB controller (usually after some chewing and chopping, however this isn’t relevant here). The latter performs the necessary operations to fulfill the request, and eventually reports back to the xHCI driver when the request is completed (or failed). The USB device driver is notified, and takes relevant action. For example, consuming the data that arrived from an IN endpoint.

The exchange of TDs and data between the software and hardware is asynchronous. The xHCI controller allows queuing several TDs for each endpoint, and activity on the bus on behalf of each endpoint takes place only in the presence of TDs on its queue. If there are no TDs queued for a specific endpoint, no data transfer occurs on its behalf, whether the device is ready or not.

And this is the important conclusion: For a high-bandwidth application, the software should ensure that a number of TDs are queued for the endpoint all the time. Failing to do so slows down the data flow due to momentary data flow halts while no TDs are queued.

Actual behavior

These are some anecdotal tests on a an Intel B150 chipset’s USB 3.0 xHCI controller (8086:a12f) and a Renesas Technology Corp. uPD720202 (1912:0015). These were fed with at least four TDs (BULK IN, 4096 bytes each) to handle the data flow that was monitored before the device became ready with its data, so the hardware’s optimal behavior is observed.

This is a typical sequence for the Intel USB controller:

     513.048 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.736 ACK  seq=1 nump=3
       1.400 DATA seq=3 len=1024
       0.720 ACK  seq=2 nump=2
       2.160 ACK  seq=3 nump=1
       2.144 ACK  seq=4 nump=0
       2.008 ACK  seq=4 nump=4
       0.040 DATA seq=4 len=1024
       0.032 DATA seq=5 len=1024
       2.080 DATA seq=6 len=1024
       0.736 ACK  seq=5 nump=3
       1.384 DATA seq=7 len=1024
       0.736 ACK  seq=6 nump=2
       2.160 ACK  seq=7 nump=1
       2.144 ACK  seq=8 nump=0
       1.736 ACK  seq=8 nump=4
       0.040 DATA seq=8 len=1024

DATA packets are sent by device, and ACK by host. The numbers at the beginning of each line are the time difference with the previous line, in microseconds, measured inside the device’s logic. The timing for DATA is of the internal request for a packet (in the device’s logic), not the actual transmission, and the internal queue for such requests is two entries deep — that’s why two DATA packets are fired off right after the ACK packet’s arrival.

A DATA packet with 1024 bytes’ payload consists of a DPH (4 bytes start pattern + 16 bytes) and a DPP (4 bytes start pattern + 1024 bytes payload + 4 bytes CRC + 4 bytes end pattern), all in all 1056 bytes, which take 2.112 μs on wire. The theoretical efficiency limit is hence 1024/1056 ≈ 97%, or ~485 MB/s.

From the log above, it’s evident that there’s a turnaround time of ~2.85 μs from DATA to ACK, which is just ~ 0.74 μs beyond the time it takes to transmit the packet.

Note that the host separates the bursts for each TD: The NumP starts at 4 and counts down to 0 in the ACKs packets, so exactly 4096 bytes (the size of the TD) are transmitted in a burst. The following ACK packet, which starts off a new burst with a nump=4 is transmitted only 2 μs later, indicating that it took the USB controller some time to figure out that it has more to do with the same endpoint. In theory, it could have looked ahead for the next TD and realized that there’s enough TDs to continue the burst practically forever, but this optimization isn’t implemented.

It’s interesting to calculate the time no DATA was transmitted due to the burst stop and restart. The size of the gap isn’t easily calculated, as the times on the DATA packets are when they’re queued. To work around this, one can assume that the last byte of the 4th packet was sent 0.74 μs before the first ACK on its behalf was timed. The gap is hence 0.74 + 2.008 = 2.748 μs (the latter is the difference between the two ACKs for seq=4, the first concluding the burst, and the second starting a new one).

The actual efficiency is hence (4 * 2.112) / ((4 * 2.112) + 2.748) ≈ 75.4% or ~377 MB/s. The actual speed measurement was 358 MB/s. The difference is most likely attributed to momentary shortages of TDs that are observed as occasional longer gaps (seen only in extensive traffic traces).

The same test with the Renesas USB host controller:

       4.808 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.712 ACK  seq=1 nump=3
       1.416 DATA seq=3 len=1024
       0.704 ACK  seq=2 nump=2
       2.152 ACK  seq=3 nump=1
       2.144 ACK  seq=4 nump=0
       5.488 ACK  seq=4 nump=4
       0.040 DATA seq=4 len=1024
       0.032 DATA seq=5 len=1024
       2.080 DATA seq=6 len=1024
       0.704 ACK  seq=5 nump=3
       1.448 DATA seq=7 len=1024
       0.712 ACK  seq=6 nump=2
       2.144 ACK  seq=7 nump=1
       2.152 ACK  seq=8 nump=0
       5.552 ACK  seq=8 nump=4

The turnaround for an ACK is similarly ~2.82μs from DATA to ACK, which is ~ 0.71 μs beyond the time it takes to transmit the packet. Almost the same as the previous result.

However the time between the two ACKs that make the gap in the data flow is 0.71 + 5.488 = 6.20 μs, significantly worse than the Intel chipset.

The actual efficiency is hence (4 * 2.112) / ((4 * 2.112) + 5.488) ≈60.6% or ~303 MB/s. The actual speed measurement was 262 MB/s.

Larger TD buffers

Since the bandwidth efficiency took a hit because of the short bursts, the next step is to assign larger buffers to each TD, hoping that the USB controller will produce longer bursts. Hence the software prepared 31 TDs (the maximum that the Linux controller was ready to accept in advance), each with 512 kiB of data.

The Intel controller’s response:

     501.944 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.736 ACK  seq=1 nump=3
       1.400 DATA seq=3 len=1024
       0.712 ACK  seq=2 nump=3
       1.432 DATA seq=4 len=1024
       0.728 ACK  seq=3 nump=3
       1.432 DATA seq=5 len=1024
       0.728 ACK  seq=4 nump=3
       1.432 DATA seq=6 len=1024
       0.736 ACK  seq=5 nump=3
       1.416 DATA seq=7 len=1024
       0.736 ACK  seq=6 nump=3
       1.416 DATA seq=8 len=1024
       0.736 ACK  seq=7 nump=3
       1.432 DATA seq=9 len=1024
       0.720 ACK  seq=8 nump=3
       1.432 DATA seq=10 len=1024
       0.728 ACK  seq=9 nump=3
       1.432 DATA seq=11 len=1024
[ ... ]

this goes on like a machine until the TD is filled:

       0.736 ACK  seq=29 nump=3
       1.432 DATA seq=31 len=1024
       0.728 ACK  seq=30 nump=2
       2.152 ACK  seq=31 nump=1
       2.160 ACK  seq=0 nump=0
       1.736 ACK  seq=0 nump=4
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.080 DATA seq=2 len=1024
       0.736 ACK  seq=1 nump=3

Even though the device announces a maximal burst length of 8, and each TD can take much more than that, Intel’s controller chooses to limit itself to a NumP of 4. Since the DATA to ACK bus turnaround is ~2.85 μs (see above), which is less than the time for transmitting two DATA packets of maximal length, this limitation has no performance impact. Note that the DATA packets are queued a significant time after the arrival of ACK packets, indicating that the device wasn’t waiting for them. This is quite expected, as there are two DATA packets in flight all the time, and ACKs arriving with nump=3, so all in all there’s always another DATA packet allowed. Intel got their timing correct here.

The measured bandwidth on this run was 471 MB/s.

Renesas host controller doesn’t behave as elegant however:

       5.808 ACK  seq=2 nump=8
       0.040 DATA seq=2 len=1024
       0.032 DATA seq=3 len=1024
       2.080 DATA seq=4 len=1024
       0.704 ACK  seq=3 nump=7
       1.416 DATA seq=5 len=1024
       0.704 ACK  seq=4 nump=6
       1.464 DATA seq=6 len=1024
       0.688 ACK  seq=5 nump=5
       1.464 DATA seq=7 len=1024
       0.688 ACK  seq=6 nump=4
       1.464 DATA seq=8 len=1024
       0.696 ACK  seq=7 nump=3
       1.464 DATA seq=9 len=1024
       0.696 ACK  seq=8 nump=2
       2.160 ACK  seq=9 nump=1
       2.168 ACK  seq=10 nump=7
       0.040 DATA seq=10 len=1024
       0.032 DATA seq=11 len=1024
       2.080 DATA seq=12 len=1024
       0.704 ACK  seq=11 nump=6
       1.432 DATA seq=13 len=1024
       0.688 ACK  seq=12 nump=5
       1.464 DATA seq=14 len=1024
       0.688 ACK  seq=13 nump=4
       1.464 DATA seq=15 len=1024
       0.696 ACK  seq=14 nump=3
       1.464 DATA seq=16 len=1024
       0.688 ACK  seq=15 nump=2
       2.152 ACK  seq=16 nump=1
       2.208 ACK  seq=17 nump=7
       0.040 DATA seq=17 len=1024
       0.032 DATA seq=18 len=1024
       2.080 DATA seq=19 len=1024
       0.712 ACK  seq=18 nump=6
       1.416 DATA seq=20 len=1024
       0.704 ACK  seq=19 nump=5
       1.448 DATA seq=21 len=1024
       0.696 ACK  seq=20 nump=4
       1.464 DATA seq=22 len=1024
       0.696 ACK  seq=21 nump=3
       1.464 DATA seq=23 len=1024
       0.696 ACK  seq=22 nump=2
       2.160 ACK  seq=23 nump=1
       2.168 ACK  seq=24 nump=6
       0.040 DATA seq=24 len=1024
       0.032 DATA seq=25 len=1024
       2.080 DATA seq=26 len=1024
       0.704 ACK  seq=25 nump=5
       1.432 DATA seq=27 len=1024
       0.688 ACK  seq=26 nump=4
       1.464 DATA seq=28 len=1024
       0.688 ACK  seq=27 nump=3
       1.464 DATA seq=29 len=1024
       0.696 ACK  seq=28 nump=2

The sequence doesn’t repeat itself, so the short sequence above doesn’t show all that went on. It’s not so clear what this host controller is up to. NumP is decremented, sometimes down to 1, sometimes down to 0, and then returns to a seemingly random number (8 after nump=0, but quite random after nump=1). It seems to be a combination of attempting to make bursts of 8 DATA packets (the maximal burst length announced by the device) and a bandwidth limitation between the USB controller and the host (PCIe Gen2 x 1, which doesn’t leave a lot of spare bandwidth compared with the USB 3.0 link).

The measured bandwidth was 401 MB/s, which seems to confirm that the bottleneck is the PCIe link.

Not directly related, on termination of the test with the Renesas controller, the following line appeared in the kernel log: (Linux v5.3.0):

xhci_hcd 0000:03:00.0: WARN Successful completion on short TX for slot 1 ep 2: needs XHCI_TRUST_TX_LENGTH quirk?

Conclusion

It’s quite clear that there’s a difference between the behavior performance of USB 3.0 controllers. While both controllers work perfectly within the requirements of the spec, the handling of TDs and the creation of bursts differs dramatically.

Large buffer TDs are definitely required for a good bandwidth utilization. This was shown above at the hardware level, but is also true with regards to the software’s ability to keep the TD queue populated.

The ultimate C printf debug line

The simplest way to add printf statements that say “I was here”:

printf("File %s, Line %d\n", __FILE__, __LINE__);

The output is then simply e.g.

File bulktest.c, Line 120

So make sure everything is committed into the git repo, and bombard the code with these lines. Compile and run.

MySQL, OOM killer, overcommitting and other memory related issues

It started with an error message

This post is a bit of a coredump of myself attempting to resolve a sudden web server failure. And even more important, understand why it happened (check on that) and try avoiding it from happening in the future (not as lucky there).

I’ve noticed that there are many threads in the Internet on why mysqld died suddenly, so to make a long story short: mysqld has the exact profile that the OOM killer is looking for: Lots of resident RAM, and it’s not a system process. Apache gets killed every now and then for the same reason.

This post relates to a  VPS hosted Debian 8, kernel 3.10.0, x86_64. The MySQL server is a 5.5.62-0+deb8u1 (Debian).

As always, it started with a mail notification from some cronjob complaining about something. Soon enough it was evident that the MySQL server was down. And as usual, the deeper I investigated this issue, the more I realized that this was just the tip of the iceberg (the kind that doesn’t melt due to global warming).

The crash

So first, it was clear that the MySQL had restarted itself a couple of days before disaster:

191007  9:25:17 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
191007  9:25:17 [Note] Plugin 'FEDERATED' is disabled.
191007  9:25:17 InnoDB: The InnoDB memory heap is disabled
191007  9:25:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191007  9:25:17 InnoDB: Compressed tables use zlib 1.2.8
191007  9:25:17 InnoDB: Using Linux native AIO
191007  9:25:17 InnoDB: Initializing buffer pool, size = 128.0M
191007  9:25:17 InnoDB: Completed initialization of buffer pool
191007  9:25:17 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
191007  9:25:17  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
191007  9:25:19  InnoDB: Waiting for the background threads to start
191007  9:25:20 InnoDB: 5.5.62 started; log sequence number 1427184442
191007  9:25:20 [Note] Server hostname (bind-address): '127.0.0.1'; port: 3306
191007  9:25:20 [Note]   - '127.0.0.1' resolves to '127.0.0.1';
191007  9:25:20 [Note] Server socket created on IP: '127.0.0.1'.
191007  9:25:21 [Note] Event Scheduler: Loaded 0 events
191007  9:25:21 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.62-0+deb8u1'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  (Debian)
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_options' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_options'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_posts' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_posts'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_term_taxonomy' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_term_taxonomy'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_term_relationships' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_term_relationships'

And then, two days layer, it crashed for real. Or actually, got killed. From the syslog:

Oct 09 05:30:16 kernel: OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB

and

191009  5:30:17 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
191009  5:30:17 [Note] Plugin 'FEDERATED' is disabled.
191009  5:30:17 InnoDB: The InnoDB memory heap is disabled
191009  5:30:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191009  5:30:17 InnoDB: Compressed tables use zlib 1.2.8
191009  5:30:17 InnoDB: Using Linux native AIO
191009  5:30:17 InnoDB: Initializing buffer pool, size = 128.0M
InnoDB: mmap(137363456 bytes) failed; errno 12
191009  5:30:17 InnoDB: Completed initialization of buffer pool
191009  5:30:17 InnoDB: Fatal error: cannot allocate memory for the buffer pool
191009  5:30:17 [ERROR] Plugin 'InnoDB' init function returned error.
191009  5:30:17 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
191009  5:30:17 [ERROR] Unknown/unsupported storage engine: InnoDB
191009  5:30:17 [ERROR] Aborting

191009  5:30:17 [Note] /usr/sbin/mysqld: Shutdown complete

The mmap() is most likely anonymous (i.e. not related to a file), as I couldn’t find any memory mapped file that is related to the mysql processes (except for the obvious mappings of shared libraries).

The smoking gun

But here comes the good part: It turns out that the OOM killer had been active several times before. It just so happen that the processes are being newborn every time this happens. It was the relaunch that failed this time — otherwise I wouldn’t have noticed this was going on.

This is the output of plain “dmesg”. All OOM entries but the last one were not available with journalctl, as old entries had been deleted.

[3634197.152028] OOM killed process 776 (mysqld) total-vm:2332652kB, anon-rss:153508kB, file-rss:0kB
[3634197.273914] OOM killed process 71 (systemd-journal) total-vm:99756kB, anon-rss:68592kB, file-rss:4kB
[4487991.904510] OOM killed process 3817 (mysqld) total-vm:2324456kB, anon-rss:135752kB, file-rss:0kB
[4835006.413510] OOM killed process 23267 (mysqld) total-vm:2653112kB, anon-rss:131272kB, file-rss:4404kB
[4835006.767112] OOM killed process 32758 (apache2) total-vm:282528kB, anon-rss:11732kB, file-rss:52kB
[4884915.371805] OOM killed process 825 (mysqld) total-vm:2850312kB, anon-rss:121164kB, file-rss:5028kB
[4884915.509686] OOM killed process 17611 (apache2) total-vm:282668kB, anon-rss:11736kB, file-rss:444kB
[5096265.088151] OOM killed process 23782 (mysqld) total-vm:4822232kB, anon-rss:105972kB, file-rss:3784kB
[5845437.591031] OOM killed process 24642 (mysqld) total-vm:2455744kB, anon-rss:137784kB, file-rss:0kB
[5845437.608682] OOM killed process 3802 (systemd-journal) total-vm:82548kB, anon-rss:51412kB, file-rss:28kB
[6896254.741732] OOM killed process 11551 (mysqld) total-vm:2718652kB, anon-rss:144116kB, file-rss:220kB
[7054957.856153] OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB

Or, after calculating the time stamps (using the last OOM message as a reference):

Fri Aug 30 15:17:36 2019 OOM killed process 776 (mysqld) total-vm:2332652kB, anon-rss:153508kB, file-rss:0kB
Fri Aug 30 15:17:36 2019 OOM killed process 71 (systemd-journal) total-vm:99756kB, anon-rss:68592kB, file-rss:4kB
Mon Sep  9 12:27:30 2019 OOM killed process 3817 (mysqld) total-vm:2324456kB, anon-rss:135752kB, file-rss:0kB
Fri Sep 13 12:51:05 2019 OOM killed process 23267 (mysqld) total-vm:2653112kB, anon-rss:131272kB, file-rss:4404kB
Fri Sep 13 12:51:05 2019 OOM killed process 32758 (apache2) total-vm:282528kB, anon-rss:11732kB, file-rss:52kB
Sat Sep 14 02:42:54 2019 OOM killed process 825 (mysqld) total-vm:2850312kB, anon-rss:121164kB, file-rss:5028kB
Sat Sep 14 02:42:54 2019 OOM killed process 17611 (apache2) total-vm:282668kB, anon-rss:11736kB, file-rss:444kB
Mon Sep 16 13:25:24 2019 OOM killed process 23782 (mysqld) total-vm:4822232kB, anon-rss:105972kB, file-rss:3784kB
Wed Sep 25 05:31:36 2019 OOM killed process 24642 (mysqld) total-vm:2455744kB, anon-rss:137784kB, file-rss:0kB
Wed Sep 25 05:31:36 2019 OOM killed process 3802 (systemd-journal) total-vm:82548kB, anon-rss:51412kB, file-rss:28kB
Mon Oct  7 09:25:13 2019 OOM killed process 11551 (mysqld) total-vm:2718652kB, anon-rss:144116kB, file-rss:220kB
Wed Oct  9 05:30:16 2019 OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB

anon-rss is the resident RAM consumed by the process itself (anonymous = not memory mapped to a file or something like that).

total-vm is the total size of the Virtual Memory in use. This isn’t very relevant (I think), as it involves shared libraries, memory mapped files and other segments that don’t consume any actual RAM or other valuable resources.

So now it’s clear what happened. Next, to some finer resolution.

The MySQL keepaliver

The MySQL daemon is executed by virtue of an SysV init script, which launches /usr/bin/mysqld_safe, a patch-on-patch script to keep the daemon alive, no matter what. It restarts the mysqld daemon if it dies for any or no reason, and should also produce log messages. On my system, it’s executed as

/usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306

The script issues log messages when something unexpected happens, but they don’t appear in /var/log/mysql/error.log or anywhere else, even though the file exists, is owned by the mysql user, and has quite a few messages from the mysql daemon itself.

Changing

/usr/bin/mysqld_safe > /dev/null 2>&1 &

to

/usr/bin/mysqld_safe --syslog > /dev/null 2>&1 &

Frankly speaking, I don’t think this made any difference. I’ve seen nothing new in the logs.

It would have been nicer having the messages in mysql/error.log, but at least they are visible with journalctl this way.

Shrinking the InnoDB buffer pool

As the actual failure was on attempting to map memory for the buffer pool, maybe make it smaller…?

Launch MySQL as the root user:

$ mysql -u root --password

and check the InnoDB status, as suggested on this page:

mysql> SHOW ENGINE INNODB STATUS;

[ ... ]

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 1100748
Buffer pool size   8192
Free buffers       6263
Database pages     1912
Old database pages 725
Modified db pages  0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1894, created 18, written 1013
0.00 reads/s, 0.00 creates/s, 0.26 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 1912, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]

I’m really not an expert, but if “Free buffers” is 75% of the total allocated space, I’ve probably allocated too much. So I reduced it to 32 MB — it’s not like I’m running a high-end server. I added /etc/mysql/conf.d/innodb_pool_size.cnf (owned by root, 0644) reading:

# Reduce InnoDB buffer size from default 128 MB to 32 MB
[mysqld]
innodb_buffer_pool_size=32M

Restarting the daemon, it says:

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 34340864; in additional pool allocated 0
Dictionary memory allocated 1100748
Buffer pool size   2047
Free buffers       856
Database pages     1189
Old database pages 458

And finally, repair the tables

Remember those warnings that the tables were marked as crashed? That’s the easy part:

$ mysqlcheck -A --auto-repair

That went smoothly, with no complaints. After all, it wasn’t really a crash.

Some general words on OOM

This whole idea that the kernel should do Roman Empire style decimation of processes is widely criticized by many, but it’s probably not such a bad idea. The root cause lies in the fact that the kernel agrees to allocate more RAM than it actually has. This is even possible because the kernel doesn’t really allocate RAM when a process asks for memory with a brk() call, but it only allocates the memory space segment. The actual RAM is allocated only when the process attempts to access a page that hasn’t been RAM allocated yet. The access attempt causes a page fault, the kernel quickly fixes some RAM and returns from the page fault interrupt as if nothing happened.

So when the kernel responds with an -ENOMEM, it’s not because it doesn’t have any RAM, but because it doesn’t want to.

More precisely, the kernel keeps account on how much memory it has given away (system-wise and/or cgroup-wise) and make a decision. The common policy is to overcommit to some extent — that is, to allow the total allocated RAM allocated to exceed the total physical RAM. Even, and in particular, if there’s no swap.

The common figure is to overcommit by 50%: For a 64 GiB RAM computer, there might be 96 GiB or promised RAM. This may seem awfully stupid thing to do, but hey, it works. If that concept worries you, modern banking (with real money, that is) might worry you even more.

The problem rises when the processes run to the bank. That is, when the processes access the RAM they’ve been promised, and at some point the kernel has nowhere to take memory from. Let’s assume there’s no swap, all disk buffers have been flushed, all rabbits have been pulled. There’s a process waiting for memory, and it can’t go back running until the problem has been resolved.

Linux’ solution to this situation is to select a process with a lot of RAM and little importance. How the kernel does that judgement is documented everywhere. The important point is that it’s not necessarily the process that triggered the event, and that it will usually be the same victim over and over again. In my case, mysqld is the favorite. Big, fat, and not a system process.

Thinking about it, the OOM is a good solution to get out of a tricky situation. The alternative would have been to deny memory to processes just launched, including the administrator’s attempt to rescue the system. Or an attempt to shut it down with some dignity. So sacrificing a large and hopefully not-so-important process isn’t such a bad idea.

Why did the OOM kick in?

This all took place on a VPS virtual machine with 1 GB leased RAM. With the stuff running on that machine, there’s no reason in the world that the total actual RAM consumption would reach that limit. This is a system that typically has 70% of its memory marked as “cached” (i.e. used by disk cache). This should be taken with a grain of salt, as “top” displays data from some bogus /proc/vmstat, and still.

As can be seen in the dmesg logs above, the amount of resident RAM of the killed mysqld process was 120-150 MB or so. Together with the other memory hog, apache2, they reach 300 MB. That’s it. No reason for anything drastic.

Having said that, it’s remarkable that the total-vm stood at 2.4-4.3 GB when killed. This is much higher than the typical 900 MB visible usually. So maybe there’s some kind of memory leak, even if it’s harmless? Looking at mysql over time, its virtual memory allocation tends to grow.

VPS machines do have a physical memory limit imposed, by virtue of the relevant cgroup’s memory.high and memory.max limits. In particular the latter — if the cgroup’s total consumption exceeds memory.max, OOM kicks in. This is how the illusion of an independent RAM segment is made on a VPS machine. Plus faking some /proc files.

But there’s another explanation: Say that a VPS service provider takes a computer with 16 GB RAM, and places 16 VPS machines with 1 GB leased RAM each. What will the overall actual RAM consumption be? I would expect it to be much lower than 16 GB. So why not add a few more VPS machines, and make some good use of the hardware? It’s where the profit comes from.

Most of the time, there will be no problem. But occasionally, this will cause RAM shortages, in which case the kernel’s global OOM looks for a victim. I suppose there’s no significance to cgroups in this matter. In other words, the kernel sees all processes in the system the same, regardless of which cgroup (and hence VPS machine) they belong to. Which means that the process killed doesn’t necessarily belong to the VPS that triggered the problem. The processes of one VPS may suddenly demand their memory, but some other VPS will have its processes killed.

Conclusion

  • Shrinking the buffer pool of mysqld was probably a good idea, in particular if a computer-wide OOM killed the process — odds are that it will kill some other mysqld instead this way.
  • Possibly restart mysql with a cronjob every day to keep its memory consumption in control. But this might create problems of its own.
  • It’s high time to replace the VPS guest with KVM or similar.

—————————————————————-

Rambling epilogue: Some thoughts about overcomitting

The details for how overcomitting is accounted for is given on the kernel tree’s Documentation/vm/overcommit-accounting. But to make a long story short, it’s done in a sensible way. In particular, if a piece of memory is shared by threads and processes, it’s only accounted for once.

Relevant files: /proc/meminfo and /proc/vmstat

It seems like CommitLimit and Committed_AS are not available on a VPS guest system. But the OOM killer probably knows these values (or was it because /proc/sys/vm/overcommit_memory was set to 1 on my system, meaning “Always overcommit”?).

To get a list of the current memory hogs, run “top” and press shift-M as it’s running.

To get an idea on how a process behaves, use pmap -x. For example, looking at a mysqld process (run as root, or no memory map will be shown):

# pmap -x 14817
14817:   /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306
Address           Kbytes     RSS   Dirty Mode  Mapping
000055c5617ac000   10476    6204       0 r-x-- mysqld
000055c5623e6000     452     452     452 r---- mysqld
000055c562457000     668     412     284 rw--- mysqld
000055c5624fe000     172     172     172 rw---   [ anon ]
000055c563e9b000    6592    6448    6448 rw---   [ anon ]
00007f819c000000    2296     320     320 rw---   [ anon ]
00007f819c23e000   63240       0       0 -----   [ anon ]
00007f81a0000000    3160     608     608 rw---   [ anon ]
00007f81a0316000   62376       0       0 -----   [ anon ]
00007f81a4000000    9688    7220    7220 rw---   [ anon ]
00007f81a4976000   55848       0       0 -----   [ anon ]
00007f81a8000000     132       8       8 rw---   [ anon ]
00007f81a8021000   65404       0       0 -----   [ anon ]
00007f81ac000000     132       4       4 rw---   [ anon ]
00007f81ac021000   65404       0       0 -----   [ anon ]
00007f81b1220000       4       0       0 -----   [ anon ]
00007f81b1221000    8192       8       8 rw---   [ anon ]
00007f81b1a21000       4       0       0 -----   [ anon ]
00007f81b1a22000    8192       8       8 rw---   [ anon ]
00007f81b2222000       4       0       0 -----   [ anon ]
00007f81b2223000    8192       8       8 rw---   [ anon ]
00007f81b2a23000       4       0       0 -----   [ anon ]
00007f81b2a24000    8192      20      20 rw---   [ anon ]
00007f81b3224000       4       0       0 -----   [ anon ]
00007f81b3225000    8192       8       8 rw---   [ anon ]
00007f81b3a25000       4       0       0 -----   [ anon ]
00007f81b3a26000    8192       8       8 rw---   [ anon ]
00007f81b4226000       4       0       0 -----   [ anon ]
00007f81b4227000    8192       8       8 rw---   [ anon ]
00007f81b4a27000       4       0       0 -----   [ anon ]
00007f81b4a28000    8192       8       8 rw---   [ anon ]
00007f81b5228000       4       0       0 -----   [ anon ]
00007f81b5229000    8192       8       8 rw---   [ anon ]
00007f81b5a29000       4       0       0 -----   [ anon ]
00007f81b5a2a000    8192       8       8 rw---   [ anon ]
00007f81b622a000       4       0       0 -----   [ anon ]
00007f81b622b000    8192      12      12 rw---   [ anon ]
00007f81b6a2b000       4       0       0 -----   [ anon ]
00007f81b6a2c000    8192       8       8 rw---   [ anon ]
00007f81b722c000       4       0       0 -----   [ anon ]
00007f81b722d000   79692   57740   57740 rw---   [ anon ]
00007f81bc000000     132      76      76 rw---   [ anon ]
00007f81bc021000   65404       0       0 -----   [ anon ]
00007f81c002f000    2068    2052    2052 rw---   [ anon ]
00007f81c03f9000       4       0       0 -----   [ anon ]
00007f81c03fa000     192      52      52 rw---   [ anon ]
00007f81c042a000       4       0       0 -----   [ anon ]
00007f81c042b000     192      52      52 rw---   [ anon ]
00007f81c045b000       4       0       0 -----   [ anon ]
00007f81c045c000     192      64      64 rw---   [ anon ]
00007f81c048c000       4       0       0 -----   [ anon ]
00007f81c048d000     736     552     552 rw---   [ anon ]
00007f81c0545000      20       4       0 rw-s- [aio] (deleted)
00007f81c054a000      20       4       0 rw-s- [aio] (deleted)
00007f81c054f000    3364    3364    3364 rw---   [ anon ]
00007f81c0898000      44      12       0 r-x-- libnss_files-2.19.so
00007f81c08a3000    2044       0       0 ----- libnss_files-2.19.so
00007f81c0aa2000       4       4       4 r---- libnss_files-2.19.so
00007f81c0aa3000       4       4       4 rw--- libnss_files-2.19.so
00007f81c0aa4000      40      20       0 r-x-- libnss_nis-2.19.so
00007f81c0aae000    2044       0       0 ----- libnss_nis-2.19.so
00007f81c0cad000       4       4       4 r---- libnss_nis-2.19.so
00007f81c0cae000       4       4       4 rw--- libnss_nis-2.19.so
00007f81c0caf000      28      20       0 r-x-- libnss_compat-2.19.so
00007f81c0cb6000    2044       0       0 ----- libnss_compat-2.19.so
00007f81c0eb5000       4       4       4 r---- libnss_compat-2.19.so
00007f81c0eb6000       4       4       4 rw--- libnss_compat-2.19.so
00007f81c0eb7000       4       0       0 -----   [ anon ]
00007f81c0eb8000    8192       8       8 rw---   [ anon ]
00007f81c16b8000      84      20       0 r-x-- libnsl-2.19.so
00007f81c16cd000    2044       0       0 ----- libnsl-2.19.so
00007f81c18cc000       4       4       4 r---- libnsl-2.19.so
00007f81c18cd000       4       4       4 rw--- libnsl-2.19.so
00007f81c18ce000       8       0       0 rw---   [ anon ]
00007f81c18d0000    1668     656       0 r-x-- libc-2.19.so
00007f81c1a71000    2048       0       0 ----- libc-2.19.so
00007f81c1c71000      16      16      16 r---- libc-2.19.so
00007f81c1c75000       8       8       8 rw--- libc-2.19.so
00007f81c1c77000      16      16      16 rw---   [ anon ]
00007f81c1c7b000      88      44       0 r-x-- libgcc_s.so.1
00007f81c1c91000    2044       0       0 ----- libgcc_s.so.1
00007f81c1e90000       4       4       4 rw--- libgcc_s.so.1
00007f81c1e91000    1024     128       0 r-x-- libm-2.19.so
00007f81c1f91000    2044       0       0 ----- libm-2.19.so
00007f81c2190000       4       4       4 r---- libm-2.19.so
00007f81c2191000       4       4       4 rw--- libm-2.19.so
00007f81c2192000     944     368       0 r-x-- libstdc++.so.6.0.20
00007f81c227e000    2048       0       0 ----- libstdc++.so.6.0.20
00007f81c247e000      32      32      32 r---- libstdc++.so.6.0.20
00007f81c2486000       8       8       8 rw--- libstdc++.so.6.0.20
00007f81c2488000      84       8       8 rw---   [ anon ]
00007f81c249d000      12       8       0 r-x-- libdl-2.19.so
00007f81c24a0000    2044       0       0 ----- libdl-2.19.so
00007f81c269f000       4       4       4 r---- libdl-2.19.so
00007f81c26a0000       4       4       4 rw--- libdl-2.19.so
00007f81c26a1000      32       4       0 r-x-- libcrypt-2.19.so
00007f81c26a9000    2044       0       0 ----- libcrypt-2.19.so
00007f81c28a8000       4       4       4 r---- libcrypt-2.19.so
00007f81c28a9000       4       4       4 rw--- libcrypt-2.19.so
00007f81c28aa000     184       0       0 rw---   [ anon ]
00007f81c28d8000      36      28       0 r-x-- libwrap.so.0.7.6
00007f81c28e1000    2044       0       0 ----- libwrap.so.0.7.6
00007f81c2ae0000       4       4       4 r---- libwrap.so.0.7.6
00007f81c2ae1000       4       4       4 rw--- libwrap.so.0.7.6
00007f81c2ae2000       4       4       4 rw---   [ anon ]
00007f81c2ae3000     104      12       0 r-x-- libz.so.1.2.8
00007f81c2afd000    2044       0       0 ----- libz.so.1.2.8
00007f81c2cfc000       4       4       4 r---- libz.so.1.2.8
00007f81c2cfd000       4       4       4 rw--- libz.so.1.2.8
00007f81c2cfe000       4       4       0 r-x-- libaio.so.1.0.1
00007f81c2cff000    2044       0       0 ----- libaio.so.1.0.1
00007f81c2efe000       4       4       4 r---- libaio.so.1.0.1
00007f81c2eff000       4       4       4 rw--- libaio.so.1.0.1
00007f81c2f00000      96      84       0 r-x-- libpthread-2.19.so
00007f81c2f18000    2044       0       0 ----- libpthread-2.19.so
00007f81c3117000       4       4       4 r---- libpthread-2.19.so
00007f81c3118000       4       4       4 rw--- libpthread-2.19.so
00007f81c3119000      16       4       4 rw---   [ anon ]
00007f81c311d000     132     112       0 r-x-- ld-2.19.so
00007f81c313e000       8       0       0 rw---   [ anon ]
00007f81c3140000      20       4       0 rw-s- [aio] (deleted)
00007f81c3145000      20       4       0 rw-s- [aio] (deleted)
00007f81c314a000      20       4       0 rw-s- [aio] (deleted)
00007f81c314f000      20       4       0 rw-s- [aio] (deleted)
00007f81c3154000      20       4       0 rw-s- [aio] (deleted)
00007f81c3159000      20       4       0 rw-s- [aio] (deleted)
00007f81c315e000      20       4       0 rw-s- [aio] (deleted)
00007f81c3163000      20       4       0 rw-s- [aio] (deleted)
00007f81c3168000    1840    1840    1840 rw---   [ anon ]
00007f81c3334000       8       0       0 rw-s- [aio] (deleted)
00007f81c3336000       4       0       0 rw-s- [aio] (deleted)
00007f81c3337000      24      12      12 rw---   [ anon ]
00007f81c333d000       4       4       4 r---- ld-2.19.so
00007f81c333e000       4       4       4 rw--- ld-2.19.so
00007f81c333f000       4       4       4 rw---   [ anon ]
00007ffd2d68b000     132      68      68 rw---   [ stack ]
00007ffd2d7ad000       8       4       0 r-x--   [ anon ]
ffffffffff600000       4       0       0 r-x--   [ anon ]
---------------- ------- ------- -------
total kB          640460   89604   81708

The KBytes and RSS column’s Total at the bottom matches the VIRT and RSS figures shown by “top”.

I should emphasize that this a freshly started mysqld process. Give it a few days to run, and some extra 100 MB of virtual space is added (not clear why) plus some real RAM, depending on the setting.

I’ve marked six anonymous segments that are completely virtual (no resident memory at all) summing up to ~360 MB. This means that they are counted in as 360 MB at least once — and that’s for a process that only uses 90 MB for real.

My own anecdotal test on another machine with a 4.4.0 kernel showed that putting /proc/sys/vm/overcommit_ratio below what was actually committed (making /proc/meminfo’s CommitLimit smaller than Committed_AS) didn’t have any effect unless /proc/sys/vm/overcommit_memory was set to 2. And when I did that, the OOM wasn’t called, but instead I had a hard time running new commands:

# echo 2 > /proc/sys/vm/overcommit_memory
# cat /proc/meminfo
bash: fork: Cannot allocate memory

So this is what it looks like when memory runs out and the system refuses to play ball.

AudioCodes MP202B hacking notes (Netvision)

This is a no-go

These are notes during my failed attempts to turn an old (2011?) Israeli ISP provider’s SIP adapter (MP202B, provided by Netvision) into one that works with any SIP provider. The first step is to get access of its configuration, and that’s exactly where it stopped.

Netvision has taken action to prevent any manipulations, in particular by keeping the access password secret (or maybe there is no password at all?). Some will call it customer retention.

So here is my war story. Sometimes you win, sometimes you lose.

Trying the nice way

Plugged the adapter’s LAN Ethernet jack to my laptop through the USB Ethernet adapter. The adapter served as a DHCP server, and gave the laptop the address 10.100.102.2. The default gateway as well as nameserver were set to 10.100.102.1.

So far so good. I opened the browser, and attempted to open the http://mp202.home, as the manual says. The browser complained that page can’t be opened, and indeed, a DNS query was made, giving the A address records 10.100.102.1 and 10.192.0.1 for that domain, however the attempt to open a TCP connection on port 80 of either of these addresses was responded with a flat TCP RST.

But it does work with http://mp202.home:8080 instead.

Telnet also worked, prompting for user and password.

However I failed to login, both from telnet and via the web interface. The default admin/admin pair didn’t work, and neither did the user/wel-012! pair mentioned a lot in forums, specifically for the adapter from Netvision. Word has it in old forums that Netvision changes the password all the time to prevent users from fiddling with their equipment. Peculiarly, when typing “user” at telnet’s prompt, it doesn’t ask for password, but goes on asking for username again.

So reset to factory defaults. The instructions in the Manual were to hold the RST pin for 7 seconds and up, but that did nothing. So I powered the device off, disconnected the network cable, and repowered with the RST pin held for a few seconds. And that clearly made a difference — in particular, the device stopped attempting to contact the SIP server as it did before.

Also: The LAN server now answered to port 80 for http://mp202.home.

But none of the user/password combinations I could think about worked.

Opening the MP202B

The screws are under the small rubber pieces that hold the adapter, four of them. There’s just a small PCB, with one interesting pin header, labeled JP1. Three pins. As this is obviously a Linux system, there’s almost a UART header. And this is the case.

So JP1 is UART, 3.3V, plain 115200 8N1. Plain USB adapters work. Connection:

  • Pin 1: Tx (driven by adapter). White wire
  • Pin 2: GND
  • Pin 3: Rx (pulled up by device). Green wire.

For some reason, the adapter didn’t boot up when pin 3 was connected to the green wire. So disconnect, let it boot (the boot messages are visible), connect the green wire and reboot by a quick power cycle.

As one can guess from the introduction, this didn’t help much either: Except for a lot of boot messages (see below), the serial port requires a login, exactly the same as the telnet.

The one thing I could possibly have done is to tftp an image of the mounted file system to my computer, mount it there, change the /etc/passwd file and then write it back to the flash. Maybe that would have worked.

However this is where it stopped to be fun, and my time is worth more than that.

Speaking of fun, a serial port session log follows. Note how clean and elegant the boot messages are. Or: I don’t know why it always ends up like this on closed hood systems.

Session log

Flash found: MX29LV640DBTC
Press ESC for monitor... 22
(ac494bl) help
reboot            version           info              fa
printenv          setenv            unsetenv          defragenv
fmt               boot              boot_recovery     dm
oclk              help              tftp              memtest           

(ac494bl) boot_recovery

Booting...
Uncompressing Linux.................................................................................................................Ok
Booting the kernel at 0x94020790

LINUX started...
CPU revision is: 00018448
Primary instruction cache 16kb, linesize 16 bytes (4 ways)
Primary data cache 16kb, linesize 16 bytes (4 ways)
Linux version 2.4.21openrg-rmk1 #5 Mon Sep 7 13:40:37 IDT 2009

 leds re-enabled @@@

Determined physical RAM map:
 memory: 14000000 @ 00000000 (reserved)
 memory: 00020000 @ 14000000 (ROM data)
 memory: 00f60000 @ 14020000 (usable)
On node 0 totalpages: 3968
zone(0): 3968 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: console=ttyS0,115200 root=/dev/ram0 rw
calculating r4koff... 00098968(625000)
CPU frequency 125.00 MHz
Calibrating delay loop... 124.92 BogoMIPS
Freeing Adam2 reserved memory [0x14001000,0x0001f000]
Memory: 11972k/15872k available (1131k kernel code, 3900k reserved, 2388k data, 4k init)
Dentry cache hash table entries: 2048 (order: 2, 16384 bytes)
Inode cache hash table entries: 1024 (order: 1, 8192 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer-cache hash table entries: 1024 (order: 0, 4096 bytes)
Page-cache hash table entries: 4096 (order: 2, 16384 bytes)
Checking for 'wait' instruction...  available.
POSIX conformance testing by UNIFIX
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd

Random: 0x323bdaca
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with no serial options enabled
ttyS00 at 0xa8610e00 (irq = 15) is a 16550A
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 512 bind 1024)
IPv4 over IPv4 tunneling driver
GRE over IPv4 tunneling driver
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
All bugs added by David S. Miller <davem@redhat.com>
VFS: Mounted root (tmpfs filesystem).
Mount of initial cramfs successful
Freeing init cramfs memory: 64k freed
Freeing unused kernel memory: 4k freed
Mounting cramfs image at 'cramfs'
Algorithmics/MIPS FPU Emulator v1.5
insmod: add-symbol-file PATH/kos_lib.o 0xc0000060 -s .data 0xc000ddc0 -s .bss 0xc000eb90
insmod: add-symbol-file PATH/avaavalanche flash device: 0x800000 at 0x10000000.
lanche_flash_mod-----> [jedec_probe_chip:669]mfr=0x00C2, id=0x22CB, base=0x0000, osf=0x0002.
.o 0xc0011060 -sFound: Macronix MX29LV640DBTC
 .data 0xc001196Physically mapped flash:0: Found 1 x16 devices at 0x0 in 16-bit mode
0 -s .bss 0xc001 Amd/Fujitsu Extended Query Table vÿ.ÿ at 0x0000
1a80
number of CFI chips: 1
avalanche flash device: 0x800000 at 0x1d000000.
-----> [jedec_probe_chip:669]mfr=0x0000, id=0x0000, base=0x0000, osf=0x0002.
-----> [jedec_probe_chip:669]mfr=0x0000, id=0x0000, base=0x0000, osf=0x0002.
-----> [jedec_probe_chip:669]mfr=0x9090, id=0x9090, base=0x0000, osf=0x0002.
-----> [jedec_probe_chip:669]mfr=0x9090, id=0x9090, base=0x0000, osf=0x0002.
-----> [jedec_probe_chip:669]mfr=0x9090, id=0x9090, base=0x0000, osf=0x0004.
-----> [jedec_probe_chip:669]mfr=0x9090, id=0x9090, base=0x0000, osf=0x0004.
JEDEC: Found no Physically mapped flash:1 device at location zero
registering mtd failed
failed initializing  flash dev 1
Looking for mtd device :mtd0:
Found a mtd0 image (0x10000), with size (0x560000).
Creating 1 MTD partitions on "Physically mapped flash:0":
0x00010000-0x00570000 : "mtd0"
Looking for mtd device :mtd1:
insmod: add-symbol-file PATH/avalanche_cpmac.o 0xc0013060 -s .data 0xc0024240 -s .bss 0xc0024bf0
loading license key: AUDIOCODES
Press ESC to enter BOOT MENU mode.
[mt_rg_conf_compat:2333] saved_version=4.10.4.5.2, rg_version=4.1.7.3.7, saved_ext_version=3.0.1_p041_build_19, external_version=2.6.2_p10_build_2 .
[mt_rg_conf_compat:2335] UPDATING SERIAN NUMBER .....
ACL_GetSerialNumberFromMAC: mac_str = 00:90:8F:2F:9B:92
ACL_GetSerialNumberFromMAC: iserial_num=3120018 .
[mt_rg_conf_compat:2350] check_branch=-1!!!
[mt_rg_conf_compat:2353] restoring default configuration - new ACL branch !!!

Going to reboot!
Restarting system.

Flash found: MX29LV640DBTC
Press ESC for monitor... 221
(ac494bl) help
reboot            version           info              fa
printenv          setenv            unsetenv          defragenv
fmt               boot              boot_recovery     dm
oclk              help              tftp              memtest           

(ac494bl) version

PSPBoot 2.3.0.4
Compiled gcc rev: 3.3.2 [Aug  5 2008 12:29:56]
Built for MP202 board in Little Endian mode.

Optional modules included (+) or not (-):
 +tibinary +elf -gzip -ffs +tftp -ftp -dhcp -pcapp
(ac494bl) info

CHIP ID: TNETV1050 (0x7), REV: 0x2

MIPS Processor   : 4KEc rev: 2.2.0
Cache mode       : write-back, write-allocate.
Instruction cache: Associativity: 4, Line size: 16, Total size: 16KB
Data cache       : Associativity: 4, Line size: 16, Total size: 16KB

Last reset cause: Software reset (memory controller also reset)

EMIF is running at the same speed of the processor.
Processor running in little endian mode.
Processor clock is synchronous to internal bus clock.

(ac494bl) fa
Current Flash Allocation:

section :   PSBL, base : 0xb0000000, size :      50368 bytes
section :    ENV, base : 0xb07f0000, size :      65536 bytes

unallocated Space Start: 0xb0010000
unallocated Space End  : 0xb07f0000

(ac494bl) printenv

HWA_0        	00:01:02:03:04:05
HWA_ESWITCH  	00:90:8F:04:C0:8F
MAXLDRVER    	2.3.0.4
MODETTY0     	115200,n,8,1,hw
MODETTY1     	115200,n,8,1,hw
CPUFREQ      	125000000
SYSFREQ      	125000000
PROMPT       	(ac494bl)
mtd2         	0x90000000,0x90010000
mtd0         	0x90010000,0x90570000
mtd3         	0x907f0000,0x90800000
mtd4         	0x90570000,0x907f0000
BOOTRECOVERY 	m:f:"mtd4"
BOOTCFG      	m:f:"mtd0"
IPA          	192.168.1.1
IPA_GATEWAY  	192.168.1.254
SUBNET_MASK  	255.255.255.0
IPA_SVR      	192.168.1.2
MEMSZ        	0x00f80000
FLASHSZ      	0x00800000
BANNER       	1
FACTORY_DFT  	0

(ac494bl) boot
[boot 84] attempt (0) trying to boot from (BOOTCFG)

Booting...

sizeof(char): 00000001
sizeof(short): 00000002
sizeof(int): 00000004
sizeof(size_t): 00000004
sizeof(long): 00000004
sizeof(addr_t): 00000004
CONFIG_MTD_PHYSMAP_START: 90000000
CONFIG_MTD_PHYSMAP_LEN: 00800000
KERNEL_START: 94020000
virt_to_program(KERNEL_START): 94020000
HEAD_ADDR: 94C00000
CRAMFS_TARGET_ADDR: 943C2000
CRAMFS_LEN: 00300000
head_start_addr: 94C0000C
start_addr: 90010018
program_to_phys(start_addr): 90010018
offset: FB41000C
&_compressed_kernel_start: 94C02170
&_compressed_kernel_end: 94CDD602
compressed_kernel_start: 9001217C
compressed_kernel_len: 000DB492
Uncompressing Linux...
from: 9001217C
output_data: 94020000
output_len: 00BE0000
..........................................................................................................res: 00358000
Done
&_rofs_start: 94CDE000
&_rofs_end: 94FDE000
test: 90010018
_PA(test): 90010018
_PA(CONFIG_MTD_PHYSMAP_START): 90000000
_PA(CONFIG_MTD_PHYSMAP_START) + CONFIG_MTD_PHYSMAP_LEN: 90800000
test: 90010018
CramFS in Flash
Check CRC ... ||\-/|\-/|\-/177EC94A OK
cramfs_start_src: 00000000
CramFS dump (offsets in words):
+0: 28CD3D45
+1: 00300000
+2: 0000A003
kernel: 94020790
Kernel dump:
+0: 3C1C9415
+1: 279C2000
+2: 27881FE0
Booting the kernel

LINUX started...
rofs_addr: 900ee00c (100ee00c), len: 300000, flash_start_addr: 90000000 (10000000) flash_len: 800000
CramFS in Flash (prom)
rofs_start_addr: 00000000 len: 0
CPU revision is: 00018448
Primary instruction cache 16kB, physically tagged, 4-way, linesize 16 bytes.
Primary data cache 16kB 4-way, linesize 16 bytes.
Linux version 2.4.21openrg-rmk1 #122 Wed Dec 15 18:44:20 IST 2010

 leds re-enabled @@@

Determined physical RAM map:
 memory: 14000000 @ 00000000 (reserved)
 memory: 00020000 @ 14000000 (ROM data)
 memory: 00f60000 @ 14020000 (usable)
Using memsize of f80000
On node 0 totalpages: 3968
zone(0): 3968 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: cramfs=flash mtdparts=phys_mapped_flash:0x800000@0(Gateway),0x300000@0x000ee00c(mainfs) slram=initfs,0x14368000,+65536 console=ttyS0,115200 root=/dev/ram0 rw
calculating r4koff... 00098968(625000)
CPU frequency 125.00 MHz
Calibrating delay loop... 124.51 BogoMIPS
Freeing Adam2 reserved memory [0x14001000,0x0001f000]
Memory: 11964k/15872k available (1216k kernel code, 3908k reserved, 2184k data, 4k init)
Dentry cache hash table entries: 2048 (order: 2, 16384 bytes)
Inode cache hash table entries: 1024 (order: 1, 8192 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer-cache hash table entries: 1024 (order: 0, 4096 bytes)
Page-cache hash table entries: 4096 (order: 2, 16384 bytes)
Checking for 'wait' instruction...  available.
POSIX conformance testing by UNIFIX
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd
/proc/fs/cramfs_block_uncompressed created
/proc/fs/cramfs_location created

Random: 0x665ef687
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with no serial options enabled
ttyS00 at 0xa8610e00 (irq = 15) is a 16550A
AVALANCHE flash device: 800000 at 10000000
-----> [jedec_probe_chip:669]mfr=0x00C2, id=0x22CB, base=0x0000, osf=0x0002.
Found: Macronix MX29LV640DBTC
Avalanche Physically mapped flash: Found 1 x16 devices at 0x0 in 16-bit mode
 Amd/Fujitsu Extended Query Table vÿ.ÿ at 0x0000
number of CFI chips: 1
Using command line partition definition
Creating 2 MTD partitions on "Avalanche Physically mapped flash":
0x00000000-0x00800000 : "Gateway"
0x000ee00c-0x003ee00c : "mainfs"
mtd: partition "mainfs" doesn't start on an erase block boundary -- force read-only
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 512 bind 1024)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
All bugs added by David S. Miller <davem@redhat.com>
VFS: Mounted root (tmpfs filesystem).
making node /dev/mtdblock3 (initfs)
Mount of initial cramfs successful
Error making node /dev/mtdblock2
making node /dev/mtdblock2 (mainfs)
Freeing unused kernel memory: 4k freed
AVALANCHE flash device: 800000 at 10000000
-----> [jedec_probe_chip:669]mfr=0x00C2, id=0x22CB, base=0x0000, osf=0x0002.
Found: Macronix MX29LV640DBTC
Avalanche Physically mapped flash: Found 1 x16 devices at 0x0 in 16-bit mode
 Amd/Fujitsu Extended Query Table vÿ.ÿ at 0x0000
number of CFI chips: 1
Using command line partition definition
Creating 2 MTD partitions on "Avalanche Physically mapped flash":
0x00000000-0x00800000 : "Gateway"
0x000ee00c-0x003ee00c : "mainfs"
mtd: partition "mainfs" doesn't start on an erase block boundary -- force read-only
Algorithmics/MIPS FPU Emulator v1.5
Version: 3.0.1_p041_build_19
Platform: AudioCodes MP20xB Board With SILABS Slic
Compilation Time: 15-Dec-10 16:38:49
Tag: Ntag-4_10_4_X_X
Compilation Flags: CONFIG_RG_VOIP_RV_MGCP=n CONFIG_RG_VOIP_RV_SIP=y BUILD=build.REV_B_SIP DIST=MP20XB_SILABS_VGW LIC=../jpkg_audiocodes.lic
User Information: audiocodes@localhost.localdomain /home/audiocodes/work/mp20x_4.10/amil_MP20x_4.10_dev6/CMBU/tulip/rg

insmod: add-symbol-file build/debug/one_module.o 0xc0011060 -s .main_flow 0xc00543e0 -s .data 0xc005f8b0 -s .bss 0xc0061760
Loading license 28fddd6dd30365cff7d05e4bb8a26f22278422545c33a022d98c810b0bb80b20e41b129a325365e4f9c05b80ddd328a75e03274f4deaca09dbd709b6c122b06cd49fd349748a7384cffce9e3297ce45d.Audiocodes
loading license key: Audiocodes

insmod: add-symbol-file build/debug/avalanche_cpmac.o 0xc009c060 -s .data 0xc00ad2b0 -s .bss 0xc00adc60
loading license key: Audiocodes
Press ESC to enter BOOT MENU mode.
ata_pages_init,52 --
[mt_rg_conf_compat:3778] saved_version=4.10.4.5.2, rg_version=4.10.4.5.2, saved_ext_version=3.0.1_p041_build_19, external_version=3.0.1_p041_build_19 .
[mt_rg_conf_compat:3779] UPDATING SERIAL NUMBER .
ACL_GetSerialNumberFromMAC: mac_str = 00:90:8F:2F:9B:92
ACL_GetSerialNumberFromMAC: iserial_num=3120018 .
[mt_rg_conf_compat:3793] updating remote access path
MAPS:
00400000-0090d000 r-xp 00000000 1f:02 158032     /mnt/cramfs/bin/openrg
10000000-1009c000 rw-p 0050d000 1f:02 158032     /mnt/cramfs/bin/openrg
1009c000-10158000 rwxp 00000000 00:00 0
7ffInitializing env_params module
f7000-7fff8000 rwxp 00000000 00:00 0

insmod: add-symbol-file build/debug/ac49x_env_mod.o 0xc00c0060 -s .data 0xc00c0300 -s .bss 0xc00c0360

insmod: add-symbol-file build/debug/ac49x_led_mod.o 0xc00c2060 -s .data 0xc00c3f30 -s .bss 0xc00c3f70

insmod: add-symbol-file build/debug/log_chardev.o 0xc00c6060 -s .data 0xc00c6960 -s .bss 0xc00c69e0

insmod: add-symbol-file build/debug/igmp_proxy_mod.o 0xc00c8060 -s .main_flow 0xc00cdd64 -s .data 0xc00ce2a0 -s .bss 0xc00ce3a0

insmod: add-symbol-file build/debug/tcp_mss.o 0xc00d0060 -s .data 0xc00d06e0 -s .bss 0xc00d0710

insmod: add-symbIPV4 device driver registered
ol-file build/debug/rg_ipv4.o 0xc00d2060 -s .data 0xc00d2290 -s .bss 0xc00d22e0

insmod: add-symbol-file build/debug/pppoe_relay.o 0xc00d4060 -s .data 0xc00d6db0 -s .bss 0xc00d6db0

insmod: add-symbol-file build/debug/rg_pppoe_relay.o 0xc00eb060 -s .data 0xc00eb7e0 -s .bss 0xc00eb870

insmod: add-symbol-file build/debug/ac49x_dsp_mod.o 0xc00ed060 -s .data 0xc0106450 -s .bss 0xc0185bf0
dsp memory at block 14c00000
starting vlynq_vdsp configuration...
Failed to initialize the vlynq @ 0xa8611c00.
error msg = Error in configuring clocks..
failed on AcVlynq_Init_SoC_VlynqHi__to__AC498_VlynqLow possibly no vlynq
putting vlynq back in reset...

insmod: add-symbol-file build/deDSP-TELE: SPCR1 reg value = 0x10800FF .
bug/ac49x_mpi_moDSP-TELE: Telephony Interface driver installed.
d.o 0xc018b060 -s .data 0xc018bc10 -s .bss 0xc018bc60

insmod: add-symbol-file build/debug/silabs_daa_3050_drv_mod.o 0xDAA Driver init_module !!!
c018d060 -s .data 0xc0190190 -s .bss 0xc0190280

insmod: add-symbol-file build/debug/silabs_fxs_3215_drv_mod.o 0xc0192060 -s .data 0xc0197fe0 -s .bss 0xc0198e70

insmod: add-symbol-file build/debug/rtp.o 0xc019b060 -s .data 0xc019e1c0 -s .bss 0xc019e1f0

insmod: add-symbol-file build/debug/phone_mod.o 0xc01a0060 -s .data 0xc01a1090 -s .bss 0xc01a10e0

insmod: add-symbol-file build/debug/ac494_mod.o 0xc01a3060 -s .data 0xc01ab120 -s .bss 0xc02277d0

****** AC49x init_module *****

insmod: add-symbol-file build/debug/qos_ingress.o 0xc0238060 -s .main_flow 0xc02Freeing modfs memory: 2036k freed
38980 -s .data 0xc0238b50 -s .bss 0xc0238bf0
factory dft = 0
mt_rg_conf_bad_reboot_verify(). Reboot reason 1
 1 consecutive reboots
getMp20XProdName:116 Need to check g_rg_prod_name (MP20X)
[init:234]
starting voip task watchdog
[AutoDialer_Start: 821]->
[init: 781]->
[init_connection_list: 273]-> num of object list(3)
[init_connection_list: 292]allocating (2) object
[init_connection_obj: 248]dialer type is DHCP
[init_connection_obj: 249]dialer retries(4)
[init_connection_obj: 250]dialer retries timeout(5)
[init_connection_list: 292]allocating (1) object
[init_connection_obj: 220]dialer UserName(fprod-13@INetvision)
[init_connection_obj: 221]dialer Password(827456)
[init_connection_obj: 248]dialer type is PPPOE
[init_connection_obj: 249]dialer retries(4)
[init_connection_obj: 250]dialer retries timeout(5)
[init_connection_list: 292]allocating (0) object
[init_connection_obj: 220]dialer UserName(fprod-12)
[init_connection_obj: 221]dialer Password(676261)
[init_connection_obj: 236]dialer ServerIp(cable.netvision.net.il)
[init_connection_obj: 248]dialer type is L2TP
[init_connection_obj: 249]dialer retries(4)
[init_connection_obj: 250]dialer retries timeout(5)
[init_connection_list: 309]<-
[test_status_init: 325]mgt_int_conn_diag_t(0x10194110) size(464)
[test_status_init: 337]mgt_icd_status_t(0x10193190) size(20) test_id(1)
[test_status_init: 337]mgt_icd_status_t(0x10193428) size(20) test_id(2)
[test_status_init: 337]mgt_icd_status_t(0x10193440) size(20) test_id(4)
[test_status_init: 337]mgt_icd_status_t(0x10193458) size(20) test_id(16)
[init: 812]JL need to add led init here!!!
[init: 813]<-
[AutoDialer_Start: 833]<-
VoIP task started
[test_phys_conn_cb: 757]->
[test_start: 480]->
[test_start: 482]<-
[get_test_results: 494]->
[get_test_results: 498]physical link test is in progress
[get_test_results: 502]
tests results are: status(ICD_FAILURE) spesific(ICD_PHYS_LINK_ETH)
[get_test_results: 504]<-
[test_phys_conn_cb: 761]Failed detecting physical connection
[controller: 516]->
[controller: 528]Auto dialer detection failed. restarting process. 0 retries left
[reset_dialer_list_parameters: 427]->
[reset_dialer_list_parameters: 435]<-
[controller: 532]<-
[test_phys_conn_cb: 769]<-
[test_phys_conn_cb: 757]->
[test_start: 480]->
[test_start: 482]<-
[get_test_results: 494]->
[get_test_results: 498]physical link test is in progress
[get_test_results: 502]
tests results are: status(ICD_FAILURE) spesific(ICD_PHYS_LINK_ETH)
[get_test_results: 504]<-
[test_phys_conn_cb: 761]Failed detecting physical connection
[controller: 516]->
[controller: 536]Auto dialer detection process failed
[controller: 538]JL need to change led status to slow red flash and set dialer to dhcp

Auto Dialer Detect Statistics Summery
-------------------------------------

dialer name: L2TP
status: not tested
time spent on dialer connection: 0 sec

dialer name: PPPOE
status: not tested
time spent on dialer connection: 0 sec

dialer name: DHCP
status: not tested
time spent on dialer connection: 0 sec

total running time: 0 seconds
[uninit: 402]->
[uninit_connection_list: 376]->
[uninit_connection_list: 391]<-
[uninit: 415]<-
[controller: 546]<-
[test_phys_conn_cb: 769]<-
[voip.c:1507] checking SLICIOC_R_FXS_READ_INIT_DONE for dev /dev/ac494_slic1
[voip.c:1526] init_done = 0.
[voip.c:1535] setup a new timer .
Username: Failed detecting physical connection

Failed detecting physical connection

Auto dialer detection process failed

[mt_voip_ipc 870] voip is ready for ipc
voip: new configuration parameters loaded
[voip_conf_get_number_of_dsp_devices:3342] iNumO----------i = DSP_REGIONAL_SETTINGS_CALL_PROGRESS_CALL_WAITING_TONE ----------
fDspDevices=1, iac494_configure_dsp: about to configure time slots, iNumOfAnalogChannels=2.
First=1.
ac494_configure_dsp: THE END !!!!
TimerInit: pTimerObj->ticksPerSecond=0x64 .
CountryName=USA
ACL_Si321x_GetCountryIndex: Found a match (country=USA, index=0) .
Si321x_ZsynthSetup
Si3215-16 Rev C
CountryName=USA
ACL_Si321x_GetCountryIndex: Found a match (country=USA, index=0) .
Si321x_ZsynthSetup
Si3215-16 Rev C
[phone_api.c:60] slic_status_polling ...
[phone_api.c 74] polling on slic status(1)

[phone_api.c 106] passed slic_status_polling
[voip.c:1507] checking SLICIOC_R_FXS_READ_INIT_DSilabs drv: SLICIOC_W_FXS_WRITE_CALIBRATION_RESULTS_PRECONFIG (channel=0, scFXSCalibrationWriteCounter=1) .
ONE for dev /dev Reg98=16, Reg99=17, Reg100=17, Reg101=17, Reg102=5, Reg103=54, Reg105=216, Reg107=8
/ac494_slic1
[vslic_ioctl:cmd BLOCKED=4004eea1 runing=8004ee90
oip.c:1526] initerror reading sys_ioctl
_done = 1.
[voip.c:1543] FXS Calibration is notSilabs drv: SLICIOC_W_FXS_WRITE_CALIBRATION_RESULTS_PRECONFIG (channel=1, scFXSCalibrationWriteCounter=2) .
 required, use pslic_ioctl:cmd BLOCKED=4004eea1 runing=8004ee90
reconfiured paraerror reading sys_ioctl
ms.
 [voip.c 1 Reg98=15, Reg99=15, Reg100=17, Reg101=14, Reg102=7, Reg103=87, Reg105=43, Reg107=7
393] mt_voip_slic_calibration_done_parameters_poll: Calibration Results:
 Reg98=16, Reg99=17, Reg100=17, Reg101=17, Reg102=5, Reg103=54, Reg105=216, Reg107=8
 [voip.c 1393] mt_voip_slic_calibration_done_parameters_poll: Calibration Results:
 Reg98=15, Reg99=15, Reg100=17, Reg101=14, Reg102=7, Reg103=87, Reg105=43, Reg107=7
RV conf using localAddress 10.100.102.1
RV conf using userDomain 10.100.102.1
calling rvIppSipStackInitialize
rvIppSipStackInitialize returned OK
**** mt_voip_ext_conf_poll: num_of_fxos_lines=2, num_of_fxo_lines=0 

 acl_rg_conf_voip_ext_commit
memory_watchdog: interval=10000, memory_margin=3500 .
rmt_cfg_start>>
automatic dialer detection process done!
automatic dialer detection process failed!

The boot loader is PSP Boot, of course.

Add-on icons vanished after Firefox upgrade

Don’t get me wrong: I didn’t upgrade Firefox. It upgraded itself suddenly into Firefox 68.o Quantum. How did I notice? Not because it became any better, but because suddenly things stopped working. And I got logged out from several sites.

For example, my Adblocker and other add-on suddenly vanished. Checking the add-on settings, the relevant add-ons were installed and enabled, so what?

Well, it turns out that I had to manually add the icons to the toolbar. As simple as that: Right-click the toolbar where the icons usually were, select Customize… and drag the icons where they should be.

Why Firefox upgraded itself is beyond me. Whenever possible, I always opt this out (doesn’t look like Firefox has such option). So if any of the Firefox maintainers ever read this: If it was a security issue, patch it. Don’t push a whole upgrade I didn’t ask for. A want my computer as stable and repeatable as you want your own car. In case you care about it, that is.

Intel FPGA’s Stratix 10: My impressions and notes

Introduction

These are a few random things I wrote down as I worked with the Stratix 10 Development Kit, with focus on its PCIe interface. Quite obviously, it’s mostly about things I found noteworthy about this specific FPGA and its board, compared with previous hardware I’ve encountered.

Generally speaking, Stratix 10 is not for the faint-hearted: It has quite a few special issues that require attention when designing with it (some detailed below), and it’s clearly designed with the assumption that if you’re working with this king-sized beast, you’re most likely part of some high-end project, being far from a novice in the FPGA field.

Some National Geographic

Even though I discuss the development kit further below, I’ll start with a couple of images of the board’s front and back. This 200W piece of logic has a liquid cooler and an exceptionally noisy fan — none of which are shown in Intel’s official images I’ve seen. In other words, it’s not as innocent as it may appear from the official pics.

There are no earplugs in the kit itself, so it’s recommended to buy something of that sort along with it. One could only wish for a temperature controlled fan. I mean, measuring the temperature of the liquid would probably have done the job. Some silence when the device isn’t working hard.

So here’s what the board looks like out of the box (in particular DIP switches in the default positions). Click images to enlarge.

Front side of Stratix 10 Development Kit

Front side of Stratix 10 Development Kit

Back side of Stratix 10 Development Kit

Back side of Stratix 10 Development Kit

 

“Hyperflex”

The logic on the Stratix 10 FPGAs has been given this rather promising name, implying that there’s something groundbreaking about it. However synthesizing a real-life design for Stratix 10, I experienced no advantage over Cyclone 10: All of the hyper-something phases got their moment of glory during the project implementation (Quartus Pro 19.2), but frankly speaking, when the design got the slightest heavy (5% of the FPGA resources, but still a 256-bit wide bus everywhere on a 250 MHz clock), timing failed exactly as it would on a Cyclone 10.

Comparing with Xilinx, it feels a bit like Kintex-7 (mainline speed grade -2), in terms of the logic’s timing performance. Maybe if the logic design is tuned to fit the architecture, there’s a difference.

Assuming that this Hyperflex thing is more than just a marketing buzz, I imagine that the features of this architecture are taken advantage of in Intel’s own IP cores for certain tasks (with extensive pipelining?). Just don’t expect anything hyper to happen when implementing your own plain design.

PCIe, Transceivers and Tiles

It’s quite common to use the term “tiles” in the FPGA industry to describe sections on the silicon die that belong to a certain functionality. However the PCIe + transceiver tiles on a Stratix 10 are separate silicon dies on the package substrate, connected to the main logic fabric (“HyperFlex”) through Intel’s Embedded Multi-die Interconnect Bridge (EMIB) interface. Not that it really matters, but anyhow.

H, L and E tiles provide Gigabit transceivers. H and L tiles come with exactly one PCIe hard IP each, E-tiles with 100G Ethernet. There might be one or more of these tiles on a Stratix 10 device. It seems like the L tile will vanish with time, as it has weaker performance in almost all parameters.

All tiles have 24 Gigabit transceivers. Those not used by the hard IP are vacant for general purpose, even though some might become unusable, subject to certain rules (given in the relevant user guides).

And here comes the hard nut: PCIe has a minimal data interface of 256 bits with the application logic. The other possibility is 512 bits. This can be a significant burden when porting a design from earlier FPGA families, in particular if they were based upon a narrower data interface.

Xillybus supports the Stratix 10 device family, however.

PCIe unsupported request error

Quite interestingly, there were correctable (and hence practically harmless) errors on the PCIe link consistently when booting a PC with the official development kit, with a production grade (i.e. not ES) H-tile FPGA. This is what plain lspci -vv gave me, even before the application logic got a chance to do anything:

01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb (rev 01)
        Subsystem: Altera Corporation Device ebeb
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d0100000 (64-bit, prefetchable) [size=256]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit-
                Address: 00000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #1, Speed 5GT/s, Width x16, ASPM not supported, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-

As highlighted above, Unsupported Request correctable errors took place on the link. Even though this is harmless, it’s nevertheless nothing that should happen on a properly working PCIe link.

Note that I ran the PCIe link on Gen2 only, even though it supports Gen3. Not that it should matter.

Reset release IP

According to Intel’s Configuration Guide for Stratix 10 for Quartus Design Suite 19.2, one can’t rely on the device’s consistent wakeup, but the nINIT_DONE signal must be used to reset all logic:

“The entire device does not enter user mode simultaneously. Intel requires you to include the Intel Stratix 10 Reset Release IP on page 22 to hold your application logic in the reset state until the entire FPGA fabric is in user mode. Failure to include this IP in your design may result in intermittent application logic failures.”

Note nINIT_DONE is asserted (low) when it’s fine to run the logic, so it’s effective an active HIGH reset. It’s so easy to get confused, as the “n” prefix triggers the “active low reset” part of an FPGA designer’s brain.

Failing to have the Reset Release IP included in the project results the following critical warning during synthesis (Quartus Pro 19.2):

Critical Warning (20615): Use the Reset Release IP in Intel Stratix 10 designs to ensure a successful configuration. For more information about the Reset Release IP, refer to the Intel Stratix 10 Configuration User Guide.

The IP just exposes the nINIT_DONE signal as an output and has no parameters. It boils down to the following:

wire ninit_done;
altera_s10_user_rst_clkgate init_reset(.ninit_done(ninit_done));

One could instantiate this directly, but it’s not clear if this is Quartus forward compatible, and it won’t silence the critical warning.

However Quartus Pro 18.0 doesn’t issue any warning if the Reset Release IP is missing, and neither is this issue mentioned in the related configuration guide. Actually, the required IP isn’t available on Quartus Pro 18.0. This issue obviously evolved with time.

Variable core voltage (SmartVID)

Another ramp-up in the usage complexity is the core voltage supply. The good old practice is to set the power supply to whatever voltage the datasheet requires, but no, Stratix 10 FPGAs need to control the power supply, in order to achieve the exact voltage that is required for each specific device. So there’s now a Power Management User Guide to tackle this issue.

This has a reason: As the transistors get smaller, so does the tolerance of the process get a larger impact. To compensate for these tolerances, and not take a hit on the timing performance, each device has its own ideal core voltage. So if you’ve gone as far as using a Stratix 10 FPGA, what’s connecting a few I2C wires to the power supply and let it pick its favorite voltage?

The impact on the FPGA design is the need to inform the tools which pins to use for this purpose, what I2C address to use, which power supply to expect on the other end, and other parameters. This takes the form of a few extra lines, as shown below for the Stratix 10 Development Kit:

set_global_assignment -name USE_PWRMGT_SCL SDM_IO14
set_global_assignment -name USE_PWRMGT_SDA SDM_IO11
set_global_assignment -name VID_OPERATION_MODE "PMBUS MASTER"
set_global_assignment -name PWRMGT_BUS_SPEED_MODE "400 KHZ"
set_global_assignment -name PWRMGT_SLAVE_DEVICE_TYPE LTM4677
set_global_assignment -name PWRMGT_SLAVE_DEVICE0_ADDRESS 4F
set_global_assignment -name PWRMGT_SLAVE_DEVICE1_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE2_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE3_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE4_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE5_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE6_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE7_ADDRESS 00
set_global_assignment -name PWRMGT_PAGE_COMMAND_ENABLE ON
set_global_assignment -name PWRMGT_VOLTAGE_OUTPUT_FORMAT "AUTO DISCOVERY"
set_global_assignment -name PWRMGT_TRANSLATED_VOLTAGE_VALUE_UNIT VOLTS

It’s among the things that are easy when they work, but when designing your own board and something goes wrong with the I2C bus, well, well.

“Self service licensing”

The Stratix 10 Development Kit includes a one-year license for Quartus Pro, which is activated on Intel’s website. It’s recommended to start this process as soon as possible, as it has a potential of getting tangled and time consuming. In particular, be sure to know which email address was reported to Intel along with the purchase of the Kit, and that you have a fully verified account for that email address on Intel’s website.

That’s because the self-service licensing procedure is possible only from the Intel account that is registered with a specific email address. This email address is the one that the distributor reported when forwarding the order for the development kit to Intel. In my case, they used an address they had on record from a previous purchase I made from the same distributor, and it didn’t even cross my mind to try it.

Be sure to fill in the detailed registration form and to confirm the email address. Access to the licensing area is denied otherwise. It continues to be denied for a few days after filling in the details. Probably a matter of validation by a human.

The serial number that needs to be fed in (or does it? see below) is the one that appears virtually everywhere (on the PCB itself, on the package, on the outer box with which the package arrived), and has the form of e.g. 10SHTPCIe0001234. However the instructions said it should be “printed on the side of the development kit box below the bottom bar code”. Well, there is nothing printed under the bottom bar code. It’s not so difficult to find it, as it says “serial number”, but when the registration fails, this misleading direction adds a level of confusion.

Since the serial number is so out in the open, it’s quite clear why another form of authentication is needed. Too bad that the email issue wasn’t mentioned in the instructions.

In my case, there was no need to feed any serial number. Once the Intel account was validated (a few days after filling in the registration details), the license simply appeared on the self-service licensing page. As I contacted Intel’s licensing support twice throughout the process, it’s possible that someone at Intel’s support took care of pairing the serial number  with my account.

Development kit’s power supplies

I put this section last, because it’s the one that is quite pointless reading, frankly speaking. The bottom line is simple, exactly like the user guide says: If you use the board stand-alone, use the power supply that came along with it. If the board goes into the PCIe slot, connect both J26 and J27 to the computer’s ATX power supply, or the board will not power up.

J27 is a plain PCIe power connector (6 pins), and J26 is an 8-pin power connector. On my plain ATX power supply there was a PCIe power connector with a pair of extra pins attached with a cable tie (8 pins total). It fit in nicely into J26, it worked, no smoke came out, so I guess that’s the way it should be done. See pictures below, click to enlarge.

ATX power supply connected to Stratix 10 Development Kit, front side

ATX power supply connected to Stratix 10 Development Kit, front side

ATX power supply connected to Stratix 10 Development Kit, back side

ATX power supply connected to Stratix 10 Development Kit, back side

Now to the part you can safely skip:

As the board is rated at 240 W and may draw up to 20A from its internal +12V power supply, it might be interesting understand how the power load is distributed among the different sources. However the gory details have little practical importance, as the board won’t power up when plugged in as a PCIe card unless power is applied both to J26 and J27 (the power-up is sequencer set up this way, I guess). So this is just a little bit of theory.

There are three power groups, each having a separate 12V power rail: 12V_GROUP1, 12V_GROUP2 and 12V_GROUP3.

12V_GROUP2 will feed 12V_GROUP1 and 12V_GROUP3 with current if their voltage is lower than its own, by virtue of an emulated ideal diode. It’s as if there was two ideal diodes connected with their anodes on 12V_GROUP2 and one diode’s cathode on 12V_GROUP1, and cathode on 12V_GROUP3.

These voltage rails are in turn fed by external connectors, through emulated ideal diodes as follows:

  • J26 (8-pin aux voltage) feeds 12V_GROUP1
  • J27 (6-pin PCIe / power brick) feeds 12V_GROUP2
  • The PCIe slot’s 12V supply feeds 12V_GROUP3

The PCIe slot’s 3.3V supply is not used by the board.

This arrangement makes sense: If the board is used standalone, the brick power supply is connected to J27, and feeds all three groups. When used in a PCIe slot, the slot itself can only power 12V_GROUP3, so by itself, the board can’t power up. Theoretically speaking, J27 needs to be connected to the computer’s power supply through a PCIe power connector, at the very least. For the higher power applications, J26 should be connected as well to the power supply, to allow for the higher current flow. In practice, J27 alone won’t power the board up, probably as a safety measure.

The FPGA’s core voltage is S10_VCC, which is generated from 12V_GROUP1 — this is the heavy lifting, and it’s not surprising that it’s connected to J26, which is intended for the higher currents.

The ideal diode emulation is done with LTC4357 devices, which measure the voltage between the emulated diode’s anode and cathode. If this voltage is slightly positive, the device opens a external power FET by applying voltage to its gate. This FET’s drain and source pins are connected to the emulated diode’s anode and cathode pins, so all in all, when there’s a positive voltage across it, current flows. This reduces the voltage drop considerably, allowing efficient power supply OR-ing, as done extensively on this development kit.

The board’s user guide advises against connecting the brick power supply to J27 when the board is in a PCIe slot, but also mentions the ideal diode mechanism (once again, it won’t power up at all this way). This is understandable, as doing so will cause current to be drawn from the PCIe slot’s 12V supply when its voltage is higher that the one supplied by J27, even momentarily. With the voltage turbulence that is typical to switching power supplies, the currents may end up swinging quite a lot in an unfortunate combination of power supplies.

So even though it’s often more comfortable to control the power of the board separately from the hosting computer’s power, or to connect J27 only if the board is expected to draw less than 75W, both possibilities are eliminated. Both the noisy fan and the board’s refusal to power up unless fed properly prepare the board for the worst case power consumption scenario.

KiCad notes: The one-timer PCB maker’s hints for the next time

Intro

These are my notes to self for the next time I’ll have a PCB design to make. This is definitely not my expertise.

My original intention was to start this post with some kind of apologize, saying I only needed to make a simple board, so KiCad was good enough for this purpose. However it seems like KiCad is about to become the preferred tool many PCB designers quite soon, if it’s not there already. It’s surely handy and relatively easy to learn (at least for me, an experienced Electrical Engineer with no previous hands-on in board design). Does it match up with the expensive tools out there for a heavy and complicated design? I don’t know.

Everything in KiCad is text based in a simple manner. Important for version control, and using it to monitor my changes. Also very handy for fiddling with the files to get the things right on spot. In particular useful when creating footprints.

All said here relates to KiCad 5.1.4 under Linux Mint 19.

Getting started

  • The KiCad version that went along with Mint 19 was heavily outdated. Compare what packages are available directly from KiCad.
  • When upgrading from 4.x to 5.x, be sure to uninstall the previous version and purge it (apt purge) before installing the new one. Also, delete .config/kicad/ as it contains the paths to libraries for symbols and footprints.
  • The ‘Footprint library path “…” does not exist’ error may be a result of not cleaning up an old installation as mentioned in the previous item.
  • The official documentation, in particular the Getting Started with KiCad guide, is definitely the right place to start and refresh memory.

EESchema

  • Before starting to work: Verify that the page size is correct and that the grid is 50 mils or so, no lower: View > Grid Settings… to ensure this.
  • Basic components: In the “Device” library.
  • Aiming the cursor on something accurately and right-clicking is usually the start for modifying virtually anything.
  • Better still: Note the hotkey and use it while hovering over the relevant place.
  • “Drag” keeps connections. “Move” doesn’t. Use “Drag”.
  • If wires don’t snap to the ports, and components are placed regardless to the grid, it’s most likely that the grid is too dense.
  • Useful hotkeys: Hover-Delete (well, delete), Hover-G (drag, think Blender) and Hover-C (duplicate).
  • “Value”: Each component has a value assigned to it. The value appears on the assembly layer (F.Fab or B.Fab) of the PCB. In other words, the person doing the assembly of the board will see this value in the component’s position, and select what to put there accordingly. For resistors and capacitors, it’s obviously the value of the component. For other components, a useful hint on the nature of the device should be there.
  • “Reference”: The symbol’s unique identifier in the schematics and netlist. U1, J15, C225 etc. By default, they’re C?, R? etc. until Annotate Schematic… is done.
  • Net names: Designated with “Place Net Label” or by pressing “l”. Makes things easier during layout, and differential pairs must be with “+” and “-” suffixes (or _p/_n), or PCBNew won’t play ball later on.
  • Footprints are assigned on each component’s Properties (and are not associated with symbols, unlike other tools), or more conveniently, with Tools > Assign Footprints. The association affects the schematic (.sch) file, appearing as a string property consisting of the library name and the footprint’s name, separated by a colon. Don’t associate to library footprints, but copies of them, defined explicitly on a local library. Or upgrading KiCad may change the design. See below on footprints.
  • Plain holes are “virtual components”. Add a MountingHole symbol for each, and assign a matching footprint (hole only).

Symbol Editor

  • Read the Symbol Guidelines in KiCad Library Convention document (KLC), after the daunting naming convention part (start at S3, maybe). It’s how to get it right.
  • Check the grid. 50-100 mils, or the wires won’t snap to ports when the symbol is used (official guideline is 100 mils).
  • Symbol Editor: When drawing a rectangle, click at the corner points, don’t hold the mouse button.
  • It doesn’t hurt setting the “Description” property for each symbol — it appears on the library’s list as well as the output netlist.
  • If a pin has multiple pads assigned to it, KiCad expects them to have track wires on the PCB connecting between them all.

Understanding footprints

Footprints are a set of graphical shapes in different layers that serve several purposes, all of which much be addressed when setting up a custom footprint:

  • Putting the pads in place: Front copper layer (F. Cu, and also B. Cu with thru pads) and solder mask (F. Mask). Actually, the solder mask layer is the hole in the solder mask. Even though the shape of these is the same, the solder mask is widened slightly when Gerber files are created.
  • Solder paste: When used, where the stencil holes should allow solder paste to be applied (F. Paste). As with solder mask, the shape is usually the same as the pads, but the datasheet may require a different pattern.
  • Holes (drilling): These are a side effect of thru-hole pads, and appear in the footprint layers as is there was copper there.
  • Silk Screen: Descriptive information on the board, in particular the component reference (F. SilkS)
  • Assembly: Mark the place where the device actually is (F. Fab). The placement’s origin should also be the footprint’s (x,y) origin for automatic placement. The reference designator (“%R”) is written inside the component’s boundaries, and its “value” (what it is) typically underneath.
  • Clearance: Ensure that nothing else is placed in the space that the device requires. This no-place space is called “courtyard” (F. CrtYd). The idea is simple: If, during layout (with PCBnew), the courtyards of two components overlap, it’s a DRC error at the very least, and possible also a refusal to generate Gerber files. So the courtyard should be the area that the component must occupy exclusively.

The layer assignment assumes that the component is placed on the front. When placing on the back side, the Layout Editor flips the layer assignment as necessary.

Holes is a bit tricky, because they are an implicit result of pads with holes. In particular, NPTH is Non-Plated Thru Holes, and appear in the footprints as pads of this type.

Footprints

There’s another post of mine, which discusses pads and footprint manipulation.

  • Rule number one: Copy or imitate an existing library footprint rather than reinventing the wheel. Preferably imitate one that is likely to be heavily used.
  • Read the Footprint Guidelines in KiCad Library Convention document (KLC). Skip the naming convention, and start at F4 (or so). This might be the best source on the planet for setting up a footprint. Odds are that these rules are tuned based upon real-life experience of a lot of people.
  • Never ever use a footprint directly from the libraries to a component in the design. That makes the outcome dependent on the KiCad version running.
  • Rather, create a project-dedicated footprint library (a directory with a .pretty suffix) in the Footprint editor, and put it in the Project’s library table. Then open the desired footprint in Footprint editor, and save it into the project’s library.
  • Note that the 3D model still points at a KiCad library even after this, but that’s less important.
  • Monitor the changes with git. It’s quite straightforward and allows a safe double-check.
  • Don’t hesitate editing the footprint file manually, in particular for nailing exact positions. Exit the Footprint Editor and re-enter it, and then load the footprint. Using “Revert” for updating changes won’t work.
  • To verify that a footprint is OK:
    • At the very least view each layer separately, and make sure it contains what it should. Drill holes are covered by absence of material in the copper layers.
    • Verify that SMT pads cover F.Cu, F.Mask and F.Paste (check each separately!). For thru pins, also on the back.
    • Read through the text file (with _mod suffix) and look for unexpected stuff (one pad having different dimensions for no reason etc.)
  • Pads may not have a pad number. This is useful for e.g. stencil pattern not matching the shape of the pad.
  • If multiple pads have the same pin number assigned to them, track wires must be connected between them during the layout.
  • Pads and other elements may very well not be on the footprint nor layout grid. I find myself calculating their X/Y position rather than placing them, so any position goes.
  • For adding anything other than a pad, the drawing or text falls on the layer marked with a small arrow in the list of layers.
  • Text substitutions: %R is replaced with the reference, %V with the value.
  • Even though the name of the footprint is shown as the “value” in the Footprint Editor, it will be the component’s “value” field in Pcbnew. So it’s useful information during assembly of the board.
  • Right-click on a pad, select “Array” to produce several identical pads with equal spacing. In extreme cases, write a simple script to manipulate the footprint file directly (exotic positions or pad naming).
  • Use the measurement tool for assurance. It snaps to the grid.
  • Any line thickness is implemented as a filled circle drawing along the line. This makes rounded corners.
  • Pads can have arbitrary shapes. See this post.

PCBnew (layout)

  • When starting a new project, remove the *.kicad_pcb file as well as .config/kicad directory, and start from scratch. KiCad’s defaults are usually good, so if a detail is overlooked, better have the default than some previous setting.
  • File > Board Setup… and set the board’s thickness, trace parameters etc. It’s also a good idea to define the net classes (in particular high frequency nets).
  • PCBnew depends on the current netlist as well as the footprint libraries. They are updated only on explicit request, so each time the schematics or footprints in use are updated, the changes must be synchronized by user request. As for the schematics, there an “Update PCB from Schematics” button. Or export the netlist from EESchema in default format and load it into PCBnew. It’s fine to do this iteratively (and even finer when covered by version control to cover up for mishaps).
  • Grid: Pads may be off the grid, and there’s no problem selecting them as a start and end point for a route. The track itself will generally go on the grid, except for leaving it in order to reach a pad in a sane manner (usually with a 45-degree detour, not on a grid point). The grid pitch should hence be the approximately most used track width + the track clearance if a dense design is desired. Possibly make it in line with crucial footprints’ pitch, to keep straight traces, aligned with the grid.
  • Ratsnest: Straight lines showing connections between pads. View all at once with View > Show Ratsnest (default on) or turn off, and view individual lines with “Display local ratsnets” icon on the toolbar to the right.
  • The origin’s position has no significance, expect that it places the grid. However the absolute position values appear in the Gerber files, but this can be fixed by adding an auxiliary axis (more on this below).
  • There are no 90 degree turns on tracks (except for power lines), only 45 degrees (and possibly smaller angles when aligning to the grid). Feels a bit awkward in the beginning but sharp turns are evil.
  • There is no dragging of components (footprints), only moving them, which results in disconnection of all tracks going to it.
  • While drawing tracks, press “v” for a via, press “/” to change the posture (the position of the 45 degree detour, if present).
  • Always start and end a track by clicking at the center of a pad (the “anchor”), in particular if it’s off-grid.
  • Ground / power planes: Create a fill zone, which is a polygon which is designated to a net. After setting up the zone, select it, right-click and pick Zones > Fill. This may not work unless there is already a track with the desired net going through the area. The track may remain, as it’s harmless. It’s probably the via, which connects the area to the net that does the trick. Once set up, a via is enough to connect to the plane.
  • Repeat the zone fill after each change that may affect it.
  • If the zone fill gets too close to a specific footprint’s pads, open the “Local Clearance and Settings” tab on the footprint’s Properties, and set a larger (actually, non-zero) clearance.
  • In the end, tracks are a sequence of segments with a start and end position, width, layer and net designation in the *.kicad_pcb file (with “segment” keyword), vias are elements with the “via” keyword etc. Everything is absolutely positioned. Therefore moving a component (actually, its footprint) disconnects everything. All in all, PCBnew is just a smart drawing program for geometric shapes.
  • Differential pairs: Route > Differential Pair. Then Route > Tune Differential Pair Skew / Phase, in order to check the skew and possibly tune it (or possibly move a component if that helps).
  • Setting the parameters for track length and skew tuning: After selecting the track to tune, instead of moving the mouse to get those humps, right click and select Length Tuning Settings (or CTRL-L). The amplitude and length of these humps can be set, as well as the target length (instead of the default 100 mm).
  • There’s a measurement tool (“Add dimension”) which is handy to verify certain distances, in particular the dimension of the board. The tool shown an indication when it’s on a line end, so one can be sure the accurate measurement is made. Best done on a separate drawing layer.
  • Optionally, add an auxiliary axis origin to the layout, placing the (0,0) point of Gerber and drill files. This changes the aux_axis_origin parameter of the PCB design from (0 0) to whatever it was set to, and the coordinates in the Gerber files will have an offset relative to the layout files. Useful if the entire layout was drawn with some meaningless origin position. Then check “Use auxiliary axis as origin” when generating Gerbers and drill files.

3D viewer

I didn’t understand the point of a 3D view at first, but it’s surprisingly helpful, in particular during PCB layout — even though it’s also handy with the Footprint Editor. There’s something reassuring seeing the silk screen as white text on a green solder mask, and the drill holes as see-throughs. Click the mouse’s middle button to move the scenery.

Launch with View > 3D viewer (or Alt-3).

Generate Gerber files

You have reviewed the DRCs in EESchema and PCBnew, right…?

So to the final check before taping out. Export the files in PCBnew with File > Plot… and as follows (for a two-layer design):

Screenshot of dialog box for producing Gerber Files in KiCad(click image to enlarge)

and then generate a drill file (click “Generate Drill File”, of course).

Screenshot of dialog box for producing drill files in KiCad(click image to enlarge)

All the settings above are KiCad’s defaults, by the way. Changes in these settings are stored in the *.kicad_pcb file for the next time these windows are opened (and not KiCad-globally).

Note that two drill files are (usually) generated: One for plated holes (vias and pads) and one for non-plated holes (typically mechanical holes for mounting).

A component positioning file can be generated with Fabrication Outputs > Footprint Position (.pos) file, however this file uses the internal footprints and their names. In particular, if the footprint’s origin isn’t where the automatic placement machine thinks it should be, the device may end up in the wrong place.

Viewing Gerber files

KiCad’s Gerber is fine, but it’s recommended to install gerbv as well (plain “apt install gerbv” on Mint 19). It’s part of gEDA, and it allows changing the order of the layers, but even more important, to export the current view to pdf. KiCad’s viewer prints one layer per page, and it’s hardly helpful. Use the export feature, don’t print — the latter generates a microscopic printout.

The main motivation for printing layers is to print the fabrication layer (preferably without values, KiCad has that option) on top of the edge cut and possibly a copper layer. For manual assembly, this gives a good indication of what goes where. To run it, just go

$ gerbv &

For KiCad’s viewer, load both the Gerber files and the drill files into Gerbview, and take a close look. The layers are displayed in the order they were loaded, so load the drill files first. As they are usually loaded with multiple selection of files, that’s probably alphabetical order, putting the bottom layer above the front.

Right-click on the layer list and select “Sort Layers if X2 mode” (should be “in X2 mode”, I guess) for layers ordered in physical order.

The active layer is always visible (drawn last).

Most important: Verify that the drill holes, plated and not, are in the correct places.

The DCodes are the aperture with which each line is drawn. An aperture is like a paintbrush, which is defined in each Gerber file for its own drawings. It’s usually a filled circle or rectangle with specific dimensions. The benefit of having these on the screen is to identify tracks with the same width and pads with the same dimensions, if they have the same DCode and are on the same layer. The same DCode on different layers (and hence Gerber files) has no meaning. For example, D10 and D11 are typically the DCodes of the first apertures defined in Gerber file, and are practically on every layer.

Create assembly sheet

If the design is going to be hand-assembled, better have a simple drawing that tells what goes where. To do this, plot the Gerber files as before, but uncheck “Plot footprint values” (but keep “Plot footprint references” checked, of course). Check F.Fab layer for plotting.

Recall that PCBnew remembers this in the *.kicad_pcb file, so be sure to revert it (with git) afterwards.

Next, open the files with gerbv (not KiCad’s Gerber viewer). Display the F_Fab layer, which should have the component references printed small inside boxes, along with the edge cuts and a copper layer. Change the colors for convenience.

Then File > Export > SVG… and generate a file. Import that file into a Libreoffice Writer document, and export that document as a pdf. Why not directly as pdf? Because the output pdf is broken, and even when I managed to open it, turning it into a pdf with a decent size turned out impossible.

A similar SVG export can be done from PCBNew: Pick File > Export > SVG… and select “Board area only” under “SVG Page Size”. For Pagination, pick “All layers in a single file”. However I haven’t figured out how to get rid of the values, and they clutter everything.

Manufacture

  • Holes: Manufacturers typically make the hole slightly larger than required, as commented on this thread. The difference is typically no more than 0.1mm, so it’s not a big deal.
  • PCB material: FR-4 is de-facto industry standard for rigid boards. 1 oz copper thickness is by far the most common. Pick thicker only for high currents. Board thickness: 0.6/0.8/1.0/1.2/1.6/2.0 mm. Tolerance is typically ±0.1 mm.
  • Drill size: 0.2 – 6.3mm (but don’t go lower than 0.4 mm = 16 mils for basic PCB)
  • Trace width: 6 mil seems to be a common minimal trace width, but Protel used to set 10 mil as the default (the latter seems to carry up to 1A, but check with a calculator). Unless there are current to carry, stick to these figures in general.
  • Minimal trace space: 3 mil seems to be possible, but 10 mils should be chosen unless there’s a reason to push it. Minimal trace to board edge clearance should be 0.3 mm, but standard edge connector footprints tend to take 0.5mm for their pads. So better go for the latter.
  • All in all, it seems like 10 mils for minimal trace width and space, and 0.4mm mils for vias are the sane choice. If the board gets tight, go smaller.
  • If the board has gold fingers, it’s ENIG (Electroless Nickel Immersion Gold) surface finish (sometimes referred to as just Immersion Gold).
  • The assembly is the expensive part. Manufacturing should be at around $5-10 for 5 copies for a small board.
What to submit:
  • Gerber files in RS-274x format for board outline, copper layers, silk screen(s), solder masks and solder paste. NC drill sometimes required in Gerber format instread of Excellon.
  • NC Drill file, in Excellon format
  • For assembly, this page summarizes it well. An assembly sheet as described above is a good idea.

Finding a manufacturer

My board involved small SMT components, so no chance to do the assembly myself. Actually, 10 simple components, on a plain FR4 53 x 22 mm board with two 1 oz layers and a gold finger (hence ENIG finish). The price for a turn-key project is at least $100. Don’t expect less.

And the minimal order quantity is 5. Everywhere.

The place to make a comparison is  PCB shopper. I also suggest look at Manufacturing Reports for some painfully detailed reviews on PCB quality. If you need assembly, select service by assembly costs, as it’s going to be the dominant part.

So these are the estimated prices I got for a small board.

  • Smart Prototyping, assembly at $80.
  • Elecrow, assembly at about $106.
  • PCBWay, assembly at $30 as a Turnkey project, $88 as kitted or partly kitted. Something tells me they’re going to charge a lot for the devices. They do send stuff with regular E-Packet = 16 days.
  • ShenZhen2U, assembly at $50 (pretty much fixed price), allowing plain mail shipping. Headquarters in Hong Kong, but apparently the factory is in Shenzhen (China).

I have no affiliation with any of these companies, and this reflects the situation as of September 2019.

These costs didn’t change much when I changed the parameters of the assembly task. It’s sensible to assume it’s mostly the setup fee.

So I sent the files to PCBWay and ShenZhen2U. My personal impressions:

  • PCBWay: Nice web interface, quick and professional response (an hour or so, during Chinese daytime). The price difference between the turnkey and the kitted was indeed a hint to their tendency to make up for the low assembly cost by adding fees everywhere: The components were 50-100% more expensive than Mouser (in low quantity prices). When I asked about it, I got some BS answer that it’s more expensive for them, taxes etc. PCB manufacturing at $39, which is pretty high. And they add $8 Paypal fee and $20 for shipping and handling with E-packet. And then there’s an “assembly promotion discount” of $30, ending up at $108 next to the “Confirm & Pay” button.
  • ShenZhen2U: Slightly slower response (1 day each time), the website is slightly less useful. The flow was that I ordered the assembly first, then a few emails back and forth (I filled in their BOM form, where I also defined the PCB, and then got an accurate quote) and then the order at the website was updated manually to reflect their quote. They stood behind the $50 assembly cost. Their BOM costs are only slightly higher than Mouser, and they charge $10 for E-packet shipping. PCB manufacturing at $26, which is pretty low. All in all $110, when paid though Paypal (including a Paypal fee of $4.86, which they were clear about and is reasonable).

So there we have it: Almost exactly the same bottom line. One company reached it after fiddling with the prices, and the second just listed them upfront. This is why I went for ShenZhen2U, even though its review on Manufacturing Reviews complained a lot about dirt and bad silkscreen. The parallel review for PCBWay complained about traces being too thin, which is much more serious IMHO.

To be continued as the events unfold…

What to track with git

For a project named projname, this is the list of files I found sufficient to track:

.gitignore
fp-lib-table
projname.dcm
projname.kicad_pcb
projname.lib
projname.pretty/*.kicad_mod
projname.pro
projname.sch
sym-lib-table

as for .gitignore itself, this is my somewhat bloated file, which eventually did the job.

*-save.pro
*-cache.lib
*-rescue.lib
*-save.kicad_pcb
_autosave-*.*
*.kicad_pcb-bak
*.000
*.net
*.net~
*.bak
*.sch-bak
*.bck
*.gbr
*.drl
*.pos
*.wrl
*.csv
*.rpt
*.step
*.stp
*.svg
*.ps
*.pdf
*.dxf
*.plt
*.cad
fp-info-cache
*~

This ignores files produced by KiCad as well as backups. Plus a lot of garlic against files that were never seen.