APC Smart UPS 750 battery replacement notes

Introduction

This post continues my notes on the Smart UPS 750, three years later, when it was time to replace the batteries (because they barely held for 13 minutes). It should have been a simple task, but the fact that I ended up writing this lengthy post about it shows that something was clearly going on.

Note that UPSes and their batteries are not my field. These are just my notes as I found my way through.

So, in short, the main takeaways are these:

  • Update the time of last battery replacement with the UPS’ front panel interface (somewhere under Configuration). This makes the UPS realize there are new batteries inside, which changes the way it calculates the estimated runtime.
  • Two standard 12V / 7AH lead acid batteries can be used instead of APC’s original battery pack. But check that the terminals are 6mm wide.
  • The displayed battery runtime is not reliable.

And now, the deep dive.

Replacing the batteries

The process for battery replacement with non-APC batteries is shown in this video, but it’s really not complicated. Yank off the front panel, then pull down the metal panel behind it, and pull out the batteries gently. Move the harness that connects the two existing batteries over to the new ones, push them in and you’re done. Plus some packing tape to keep the two batteries together.

However, the original batteries’ contact terminals are about 6mm wide, unlike the ones on the batteries I bought, which were considerably smaller. So even though there was no problem connecting the batteries, the smaller contacts weren’t all that reassuring. It seems like there are two standards for the terminal width, 6mm being one of them.

This is a picture taken from above, showing the original pair of batteries I pulled out from the UPS (click to enlarge):

Original battery pack for APC Smart UPS 750

The blue thing in the middle contains a fuse, and the black connector at the top mates with the UPS.

But when I powered up the UPS, the expected runtime shown on the display was just 13 minutes, even though the charge level appeared as 100%. I was surprised to see a 100% charge level on batteries that were just installed, and even more disappointed with the expected runtime. Could it be that bad? Both APC’s runtime chart and my own simple energy calculation (see below) pointed at one hour at least with the load I had. And it didn’t improve after letting the UPS work for a few hours.

My first thought was that I had been sold exceptionally junky batteries. But I bought them at a reputable electronics shop, and they carried a timestamp indicating they were fresh.

And then it occurred to me that I should tell the UPS that I had replaced the batteries. So I went to the part in the UPS’ configuration menu for setting the month and year of the last battery change, and did that. And to my surprise, the runtime was adjusted to 1hr 12 minutes right away. There are a few posts out there (this, for example) on how to “reset the battery constant” manually. It seems like this relates to the same thing.

Cute, I thought. But is that figure correct? So I let the UPS run on battery for a while. The estimated runtime went down in pace with the wall clock, but then suddenly, after 23 minutes, it took the power down.

So I reconnected the UPS back to power, and let the battery charge until it reached 100% again. At which point it reported:

$ apcaccess
APC      : 001,027,0652
DATE     : 2021-08-29 20:21:36 +0300
HOSTNAME : ruhe
VERSION  : 3.14.14 (31 May 2016) debian
UPSNAME  : ruhe
CABLE    : USB Cable
DRIVER   : USB UPS Driver
UPSMODE  : Stand Alone
STARTTIME: 2021-08-29 18:30:15 +0300
MODEL    : Smart-UPS 750
STATUS   : ONLINE
BCHARGE  : 100.0 Percent
TIMELEFT : 23.0 Minutes
MBATTCHG : 5 Percent
MINTIMEL : 3 Minutes
MAXTIME  : 300 Seconds
ALARMDEL : 30 Seconds
BATTV    : 26.6 Volts
NUMXFERS : 0
TONBATT  : 0 Seconds
CUMONBATT: 0 Seconds
XOFFBATT : N/A
STATFLAG : 0x05000008
MANDATE  : 2018-05-22
SERIALNO : AS1821351109
NOMBATTV : 24.0 Volts
FIRMWARE : UPS 09.3 / ID=18
END APC  : 2021-08-29 20:22:01 +0300

Smart UPS or what? If the battery died after 23 minutes last time, how much does it have left when fully charged? Let me think… 23 minutes!

And yet, that sounds way too short for a new battery. More than 24 hours later, the same runtime estimation remained, going up and down a minute or so occasionally. So that’s that.

It could be correct, however. The way to find out is to try again after a month or so. For that, there’s battery calibration. Which for my UPS means “let the battery drain and measure its way down until it’s empty”. I haven’t tried that yet, but it seems more or less like unplugging power from the UPS. Note that the load will lose power at the end of this process. So the computer needs to be taken down safely, and then held in a state where a power failure won’t hurt (e.g. stuck in some boot menu). This way, it remains an electrical load, but nothing bad happens when the power goes down.

Battery calibration is launched from the front panel menu as well.

Why calibration makes sense requires some deep diving into lead acid battery theory. Which is where this post goes next. Once again, lead acid batteries are really not my expertise. For a concise technical background, I recommend reading Power Sonic’s Technical Manual.

A 7AH battery doesn’t really give 7AH

The amount of charge (and energy) that a lead-acid battery supplies until it’s discharged depends dramatically on the discharging current. The capacity printed on the battery is given for a 20-hour discharge, or using the jargon, 0.05C. That “C” is the 7 from the 7AH figure, so 0.05C is 0.35A: the full 7AH are obtained only if the discharge current is 0.35A. For larger currents, expect much less energy out of the battery.

For example, in my specific case: The load is 70W (at 142VA) according to the UPS itself. I’ll assume that the low power factor can be ignored, i.e. that the fact that the VA figure is twice the consumed power makes no difference. This low power factor is natural to switching power supplies, as they draw more current when the voltage is low, so their behavior is far from that of a plain resistor (unless specifically compensated to mitigate this effect). I’ll also assume that the UPS is 100% efficient in its voltage conversion, which is complete rubbish, but for the heck of it.

So for two 12V batteries in series it goes 70W/24V =~ 2.9A, which is about 0.4C (2.9 / 7 =~ 0.4). A ballpark figure can be taken from Figure 4 in Power Sonic’s Technical Manual, showing that the voltage starts to drop after about an hour, and reaches the critical value somewhere after an hour and a half. Note that I have different batteries.

Also from Table 2 of the same Manual, we have that the actual capacity of a 7AH battery, when drained with a 4.34A current, is 4.34AH (one hour). The current is higher than 2.9A, but given that the UPS isn’t really 100% efficient, it’s likely that the real discharge current is closer to 4A than to 2.9A. So that explains why the UPS said 1:12 hours when I updated the battery replacement date.

Now, it could be that different batteries behave differently on higher currents. I really don’t know. I couldn’t find data on my “Bulls Power” batteries. So maybe they could meet the 7AH specification for a 20-hours discharge, and then perform really poorly with higher, real-life currents. I have no idea.

Not directly related, but anyhow: The power consumption more than doubles (165W, 217VA) when compiling a Linux kernel with 12 processes. The power factor improved considerably, in line with Corsair’s promise to attain a power factor of unity at full capacity (which is 850W, a long way to go).

Knowing the battery’s charge level

How does one estimate how much energy a lead-acid battery has? The answer is unpleasant, yet simple: There is really no way to measure it from the battery electronically. After reading quite a lot of material on the subject, this became evident to me: There are plenty of papers describing exotic algorithms for estimating a battery’s health and charge level, and their abundance and variety prove that there’s really no way to tell, except for draining it.

Actually, there is one way that is considered reliable, which is measuring the open circuit voltage (OCV) after the battery has been disconnected for a while (some say a few hours, battery manufacturers typically require 24 hours). Letting the battery rest allows it to reach a chemical equilibrium, at which point the voltage reflects its charge level. This is surely true for a fresh battery.

As for batteries with some history, the picture is less clear, and I haven’t managed to figure out whether the OCV voltages remain the same, and whether the voltage vs. charge percentage relation refers to the original charge capacity, or to the one that is available after the battery is worn out.

For example, Power Sonic claims that the OCV goes from 1.94V/cell to 2.16V/cell for 0% to 100% charge respectively. As a 12-volt battery has 6 cells, this corresponds to 11.64V to 12.96V. These figures are quite similar to those presented by another manufacturer.

But what does 100% charge mean? 7AH, or as much as is left when the battery has worked for some time? My anecdotal measurement of the batteries I took out from the UPS was 12.99V after letting them rest. In other words, they presented an OCV corresponding to 100% charge, even though they had much less than 7AH.

So how does a UPS estimate the remaining runtime? Well, the simple way is to let the battery run out once, and there you have a number. Clearly, Smart UPS uses this method.

Are there any alternatives? In theory, the UPS could let the battery rest for 24 hours, and measure its OCV. This is possible, because most of the time the UPS doesn’t need the battery. But even my anecdotal measurement shows that a 100% charge-like reading doesn’t mean much.

For other types of batteries (Li-ion in particular), measuring the current on the battery, in and out (Coulomb Counting), gives an idea of how much charge it contains. This doesn’t work with lead acid batteries, because the recommended way to maintain a standby battery is to continuously float charge it. That means holding a constant voltage (say, 2.25V per cell, that is 13.5V for a 12V battery, or 27V for a battery pair, as in the Smart UPS 750).

As this voltage is higher than the OCV at rest, this causes a small trickle current (said to be about 0.001C), which compensates for the battery’s self discharge. Even if it overcharges the battery slightly, the gases that are released are recycled internally in a sealed battery, so there’s no damage.

Hence the recommended strategy for charging a lead-acid battery is to charge it quickly as long as its voltage / current pair indicates that it’s far from being fully charged, and then apply a constant, known and safe voltage. This lets it reach full charge slowly, and then maintains the charge without any risk of overcharging. Odds are that this is what the UPS does.

But this makes Coulomb Counting impossible: During the float charge phase (that is, virtually all the time) the current may or may not actually charge the battery.

Why recalibrate

It’s not clear what my Smart UPS 750 did with the batteries when recharging after they were completely empty. Even if it did Coulomb Counting, it has no way to tell how efficiently the battery will perform during discharge, while draining a current that is much higher than 0.05C.

I’m not even 100% sure that it charged the battery fully in any of the cases. Even though lead-acid batteries have a pretty well-known charging voltage graph, which indicates when the battery is about to become full, the UPS might have played it safe and gone for a float charge early. Maybe it doesn’t fast-charge a battery more than it has knowingly discharged it. If that’s the case, the remaining charge is acquired slowly by float charging.

As the UPS has no way to know whether the charging current is just a float charge or if the battery actually gains energy, it won’t update the estimated runtime.

So it’s possible that the UPS actually filled the battery properly to begin with, or that it did so during the float charge phase. Or the batteries installed may be pure junk. One way or another, a battery calibration (or just letting it run on battery until it dies out) a while later is the definitive answer, as it covers the float charge possibility.

Using firejail to throttle network bandwidth for wget and such

Introduction

Occasionally, I download / upload huge files, and it kills my internet connection for plain browsing. I don’t want to halt the download or suspend it, but merely calm it down a bit, temporarily, for doing other stuff. And then let it hog as much as it wants again.

There are many ways to do this, and I went for firejail. I suggest reading this post of mine on this tool as well.

Firejail gives you a shell prompt, which runs inside a mini-container, like those cheap virtual hosting services. Then run wget or youtube-dl as you wish from that shell.

It has access to practically everything on the computer, but the network interface is controlled. Since firejail is based on cgroups, all processes and subprocesses are collectively subject to the network bandwidth limit.

Using firejail this way requires setting up a bridge network interface. This is a bit of container hocus-pocus, and is necessary to get control over the network data flow. But it’s simple, and it can be done once (until the next reboot, unless the bridge is configured permanently, something I don’t bother with).

Setting up a bridge interface

Remember: Do this once, and just don’t remove the interface when done with it.

You might need to

# apt install bridge-utils

So first, set up a new bridge device (as root):

# brctl addbr hog0

and give it an IP address that doesn’t collide with anything else on the system. Otherwise, it really doesn’t matter which:

# ifconfig hog0 10.22.1.1/24
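
For what it’s worth, on systems without brctl, the same can presumably be done with iproute2 (a hedged equivalent; brctl and ifconfig, as shown above, are what I actually used):

# ip link add name hog0 type bridge
# ip addr add 10.22.1.1/24 dev hog0
# ip link set hog0 up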

What’s going to happen is that there will be a network interface named eth0 inside the container, which will behave as if it was connected to a real Ethernet card named hog0 on the computer. Hence the container has access to everything that is covered by the routing table (by means of IP forwarding), and is also subject to the firewall rules. With my specific firewall setting, it prevents some access, but ppp0 isn’t blocked, so who cares.
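
If the container can’t reach the outside world, the usual suspects are IP forwarding and NAT on the host. A hedged sketch, with ppp0 as the outgoing interface as in my case:

# echo 1 > /proc/sys/net/ipv4/ip_forward
# iptables -t nat -A POSTROUTING -s 10.22.1.0/24 -o ppp0 -j MASQUERADE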

To remove the bridge (no real reason to do it):

# brctl delbr hog0

Running the container

Launch a shell with firejail (I called it “nethog” in this example):

$ firejail --net=hog0 --noprofile --name=nethog

This starts a new shell, for which the bandwidth limit is applied. Run wget or whatever from here.

Note that despite the --noprofile flag, there are still some directories that are read-only and some that are temporary as well. It’s done in a sensible way, though, so odds are that it won’t cause any issues. Running “df” inside the container gives an idea of what is mounted and how, and it looks scarier than the actual situation.

But be sure to check that the files that are downloaded are visible outside the container.

From another shell prompt, outside the container, go something like this (doesn’t require root). The two figures are the download and upload limits in KB/s, which is why the 800 shows up as 6400kbps in the output:

$ firejail --bandwidth=nethog set hog0 800 75
Removing bandwith limit
Configuring interface eth0
Download speed  6400kbps
Upload speed  600kbps
cleaning limits
configuring tc ingress
configuring tc egress

To drop the bandwidth limit:

$ firejail --bandwidth=nethog clear hog0

And to get the status (which says, among other things, how many packets have been dropped):

$ firejail --bandwidth=nethog status

Notes:

  • The “eth0” mentioned in firejail’s output blob relates to the interface name inside the container. So the “real” eth0 remains untouched.
  • The actual download speed is slightly lower than the configured limit.
  • New processes can join the existing container with firejail --join, as well as from firetools.
  • Several containers may use the same bridge (hog0 in the example above), in which case each has its own independent bandwidth setting. Note that the commands configuring the bandwidth limits mention both the container’s name and the bridge.

Working with browsers

When starting a browser from within a container, pay attention to whether it really started a new process. Using firetools can help.

If Google Chrome says “Created new window in existing browser session”, it didn’t start a new process inside the container, in which case the window isn’t subject to bandwidth limitation.

So close all of Chrome’s windows before kicking off a new one. Alternatively, this can be worked around by starting the container with:

$ firejail --net=hog0 --noprofile --private --name=nethog

The --private flag creates, among other things, a new volatile home directory, so Chrome doesn’t detect that it’s already running. Because I use some other disk mounts for the large partitions on my computer, it’s still possible to download stuff to them from within the container.

But extra care is required with this, and regardless, the browser running in the private container doesn’t remember passwords and such.

Octave: Creating images from plots for web page

This should have been a trivial task, but it turned out quite difficult. So these are my notes for the next time. Octave 4.2.2 under Linux Mint 19, using the qt5ct plugin with gnuplot (or else I get blank plots).

So this is the small function I wrote for creating a plot and a thumbnail:

function [] = toimg(fname, alt)
  % Save the current figure as a full-size PNG plus a thumbnail,
  % and print the HTML snippet for embedding both in a web page.
  grid on;

  saveas(gcf, sprintf('%s.png', fname), 'png');
  print(gcf, sprintf('%s_thumb.png', fname), '-dpng', '-color', '-S280,210');

  disp(sprintf('<a href="/media/%s.png" target="_blank"><img alt="%s" src="/media/%s_thumb.png" style="width: 280px; height: 210px;"></a>', fname, alt, fname));
endfunction

The @alt argument becomes the image’s alternative text when shown on the web page.

The call to saveas() creates a 1200x900 image, and the print() call creates a 280x210 one (as specified directly). I take it that print() would create a 1200x900 image as well without any specific argument for the size, but I left both methods in, since this is how I ended up after struggling, and it’s better to have both possibilities shown.
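
For reference, a typical session goes something like this (the plotted data, file name and alt text are of course made up):

>> plot(1:10, (1:10).^2);
>> toimg('parabola', 'A parabola');

This writes parabola.png and parabola_thumb.png into the current directory, and prints the HTML snippet to the console.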

To add some extra annoyance, toimg() always operates on the current figure, which is typically the last figure plotted. Which is not necessarily the figure that has focus. As a matter of fact, even if the current figure is closed by clicking the upper-right X, it remains the current figure. Calling toimg() will make it reappear and get plotted. Which is really weird behavior.

Apparently, the only way around this is to use figure() to select the desired current figure before calling toimg(), e.g.

>> figure(4);

The good news is that the figure numbers match those appearing on the windows’ titles. This also explains why the numbering doesn’t reset when closing all figure windows manually. To really clear all figures, go

>> close all hidden

Other oddities

  • ginput() simply doesn’t work. The workaround is to double-click any point (with left button) and the coordinates of this point are copied into the clipboard. Paste it anywhere. Odd, but not all that bad.
  • Zooming in with right-click and then left-click doesn’t affect axis(). As a result, saving the plot as an image is not affected by this zoom feature. Wonky workaround: Use the double-click trick above to obtain the coordinates of relevant corners, and use axis() to set them properly. Bonus: One gets the chance to adjust the figures for a sleek plot. If anyone knows how to save a plot as it’s shown by zooming, please comment below.

 

Looping on file wildcards in Octave

So I have written a function, showfile() for Octave 4.2.2 on Linux, which accepts a file name as its argument. And now I want to run it on all files in the current directory that match a certain pattern. How?

So first, obtain the list of files, and put it in a variable:

>> x=ls('myfiles*.dat');

This creates a matrix of chars, with each row containing the name of one file. The number of columns of this matrix is the length of the longest file name, with shorter names padded with spaces (yes, ASCII 0x20).

So to call the function on all files:

>> for i=1:rows(x) ; showfile(strtrim(x(i,:))); end

The call to strtrim() removes the trailing spaces (those that were padded), so that the argument is the actual file name. If the real file name contains leading or trailing spaces, this won’t work (but who does that?). Spaces in the middle of the file name are OK, as strtrim() doesn’t touch them.
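
By the way, glob() offers an alternative that avoids the padding business altogether: it returns a cell array of file names, so there’s nothing to trim. Something like:

>> x = glob('myfiles*.dat');
>> for i = 1:numel(x) ; showfile(x{i}); end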

When dovecot silently stops delivering mail

After a few days of being happy about not getting spam, I started to suspect that something was completely wrong with receiving mail. As I’m using fetchmail to get mail from my own server running dovecot v2.2.13, I’m used to getting notifications when fetchmail is unhappy. But there was no such notification.

Checking the server’s logs, I found tons of these messages:

dovecot: master: Warning: service(pop3-login): process_limit (100) reached, client connections are being dropped

Restarting dovecot got it back running properly again, and I got a flood of the mails that were pending on the server. This was exceptionally nasty, because mails stopped arriving silently.

So what was the problem? The clue is in these log messages, which occurred about a minute after the system’s boot (it’s a VPS virtual machine):

Jul 13 11:21:46 dovecot: master: Error: service(anvil): Initial status notification not received in 30 seconds, killing the process
Jul 13 11:21:46 dovecot: master: Error: service(log): Initial status notification not received in 30 seconds, killing the process
Jul 13 11:21:46 dovecot: master: Error: service(ssl-params): Initial status notification not received in 30 seconds, killing the process
Jul 13 11:21:46 dovecot: master: Error: service(log): child 1210 killed with signal 9

These three services are helper processes for dovecot, as can be seen in the output of systemctl status:

            ├─dovecot.service
            │ ├─11690 /usr/sbin/dovecot -F
            │ ├─11693 dovecot/anvil
            │ ├─11694 dovecot/log
            │ ├─26494 dovecot/config
            │ ├─26495 dovecot/auth
            │ └─26530 dovecot/auth -w

What seems to have happened is that these processes failed to launch properly within the 30 second timeout limit, and were therefore killed by dovecot. And then attempts to make pop3 connections seem to have gotten stuck, with the processes forked for each connection remaining around. Eventually, they reached the maximum of 100.

The reason this happened only now is probably that the hosting server had some technical failure and was brought down for maintenance. When it went up again, all VMs were booted at the same time, so they were all very slow in the beginning. Hence it took exceptionally long to kick off those helper processes, and the 30-second timeout kicked in.

The solution? Restart dovecot once every 24 hours with a plain cronjob. Ugly, but it works. In the worst case, mail will be delayed for 24 hours. And this is a very rare event to begin with.
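
For example, with a line like this in root’s crontab (the hour is arbitrary, and the restart command may differ on non-systemd systems):

0 4 * * * systemctl restart dovecot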

Critical Warnings after upgrading a PCIe block for Ultrascale+ on Vivado 2020.1

Introduction

Checking Xillybus’ bundle for Kintex Ultrascale+ on Vivado 2020.1, I got several critical warnings related to the PCIe block. As the bundle is intended to show how Xillybus’ IP core is used for simplifying communication with the host, these warnings aren’t directly related, and yet they’re unacceptable.

This bundle is designed to work with Vivado 2017.3 and later: It sets up the project by virtue of a Tcl script, which among other things calls upgrade_ip for updating all IPs. Unfortunately, a bug in Vivado 2020.1 (and possibly other versions) causes the upgraded PCIe block to end up misconfigured.

This bug applies to Zynq Ultrascale+ as well, but curiously enough, not to Virtex Ultrascale+. At least with my setting, there was no problem with the latter.

The problem

Having upgraded an “UltraScale+ Integrated Block (PCIE4) for PCI Express” IP block from Vivado 2017.3 (or 2018.3) to Vivado 2020.1, I got several Critical Warnings. Three during synthesis:

[Vivado 12-4739] create_clock:No valid object(s) found for '-objects [get_pins -filter REF_PIN_NAME=~TXOUTCLK -of_objects [get_cells -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GT*E4_CHANNEL_PRIM_INST}]]'. ["project/pcie_ip_block/source/ip_pcie4_uscale_plus_x0y0.xdc":127]
[Vivado 12-4739] get_clocks:No valid object(s) found for '--of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-4739] get_clocks:No valid object(s) found for '--of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]

and another seven during implementation:

[Vivado 12-4739] create_clock:No valid object(s) found for '-objects [get_pins -filter REF_PIN_NAME=~TXOUTCLK -of_objects [get_cells -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GT*E4_CHANNEL_PRIM_INST}]]'. ["project/pcie_ip_block/source/ip_pcie4_uscale_plus_x0y0.xdc":127]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group '. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group '. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]
[Vivado 12-5201] set_clock_groups: cannot set the clock group when only one non-empty group remains. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-5201] set_clock_groups: cannot set the clock group when only one non-empty group remains. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]

The first warning in each group points at this line in ip_pcie4_uscale_plus_x0y0.xdc, which was automatically generated by the tools:

create_clock -period 4.0 [get_pins -filter {REF_PIN_NAME=~TXOUTCLK} -of_objects [get_cells -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GT*E4_CHANNEL_PRIM_INST}]]

And the other at these two lines in pcie_ip_block_late.xdc, also generated by the tools:

set_clock_groups -asynchronous -group [get_clocks -of_objects [get_ports sys_clk]] -group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]]
set_clock_groups -asynchronous -group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]] -group [get_clocks -of_objects [get_ports sys_clk]]

So this is clearly about a reference to a non-existent logic cell supposedly named gen_channel_container[1200], and in particular that index, 1200, looks suspicious.

I would have been relatively fine with ignoring these warnings had it been just the set_clock_groups that failed, as these create false paths. If the design implements properly without these, it’s fine. But failing a create_clock command is serious, as this can leave paths unconstrained. I’m not sure if this is indeed the case, and it doesn’t matter all that much. One shouldn’t get used to ignoring critical warnings.

Looking at the .xci file for this PCIe block, it’s apparent that several changes were made to it while upgrading to 2020.1. Among those changes, these three lines were added:

<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT">GTHE4_CHANNEL_X49Y99</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_CONTAINER">1200</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_QUAD_INX">3</spirit:configurableElementValue>

Also, somewhere else in the XCI file, this line was added:

<spirit:configurableElementValue spirit:referenceId="PARAM_VALUE.MASTER_GT">GTHE4_CHANNEL_X49Y99</spirit:configurableElementValue>

So there’s a bug in the upgrading mechanism, which sets some internal parameter to select a nonexistent GT site.

The manual fix (GUI)

To rectify the wrong settings manually, enter the settings of the PCIe block, and click the checkbox for “Enable GT Quad Selection” twice: Once for unchecking, and once for checking it. Make sure that the selected GT hasn’t changed.

Then it might be required to return some unrelated settings to their desired values. In particular, the PCI Device ID and similar attributes change to Xilinx’ default as a result of this. It’s therefore recommended to make a copy of the XCI file before making this change, and then use a diff tool to compare the before and after files, looking for irrelevant changes. Given that this revert to default has been going on for so many years, it seems like Xilinx considers this a feature.

But this didn’t solve my problem, as the bundle needs to set itself correctly out of the box.

Modifying the XCI file? (Not)

The immediate thing to check was whether this problem applies to PCIe blocks that are created in Vivado 2020.1 from scratch inside a project which is set to target KCU116 (which is what the said Xillybus bundle targets). As expected, it doesn’t — this occurs just on upgraded IP blocks: With the project that was set up from scratch, the related lines in the XCI file read:

<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT">GTYE4_CHANNEL_X0Y7</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_CONTAINER">1</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_QUAD_INX">3</spirit:configurableElementValue>

and

<spirit:configurableElementValue spirit:referenceId="PARAM_VALUE.MASTER_GT">GTYE4_CHANNEL_X0Y7</spirit:configurableElementValue>

respectively. These are values that make sense.

With this information at hand, my first attempt to solve this was to add the four new lines to the old XCI file. This allowed using the XCI file with Vivado 2020.1 properly; however, synthesizing the PCIe block on older Vivado versions failed: As it turns out, all MODELPARAM_VALUE attributes become instantiation parameters for pcie_uplus_pcie4_uscale_core_top inside the PCIe block. Looking at the source file generated by 2020.1, these parameters are indeed defined there (but not in the sources generated by older versions), and yet they are unused, like many other instantiation parameters in this module. So apparently, Vivado’s machinery generates an instantiation parameter for each of these attributes, even if it’s not used. Those unused parameters are most likely intended for scripting.

So this trick made the older Vivado versions instantiate pcie_uplus_pcie4_uscale_core_top with instantiation parameters that it doesn’t have there, and hence the synthesis failed. Dead end.

I didn’t examine the possibility of deselecting “Enable GT Quad Selection” in the original block, because Vivado 2017.3 chooses the wrong GT for the board without this option.

Workaround with Tcl

Eventually, I solved the problem by adding a few lines to the Tcl script.

Assuming that $ip_name has been set to the name of the PCIe block IP, this Tcl snippet rectifies the bug:

if {![string equal "" [get_property -quiet CONFIG.MASTER_GT [get_ips $ip_name]]]} {
  set_property -dict [list CONFIG.en_gt_selection {true} CONFIG.MASTER_GT {GTYE4_CHANNEL_X0Y7}] [get_ips $ip_name]
}

This snippet should of course be inserted after updating the IP core (with e.g. upgrade_ip [get_ips]). The code first checks if MASTER_GT is defined, and only if so, it sets it to the desired value. This ensures that nothing happens with the older Vivado versions. Note the “quiet” flag of get_property, which prevents it from generating an error if the property isn’t defined. Rather, it returns an empty string in that case, which is what the result is compared against.
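
Schematically, the relevant part of the project script hence goes something like this ($ip_name’s value here is made up; in the real script it’s derived from the project’s settings):

set ip_name "pcie_block" ;# Hypothetical: the PCIe IP's name in the project
upgrade_ip [get_ips]
# ... followed by the CONFIG.MASTER_GT fix shown above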

Setting MASTER_GT this way also sets MASTER_GT_CONTAINER correctly, and surprisingly enough, this doesn’t change anything it shouldn’t. In particular, the Device IDs remain intact.

However, the disadvantage of this solution is that the GT to select is hardcoded in the Tcl code. But that’s fine in my case, for which a specific board (KCU116) is targeted by the bundle.

Another way to go, which is less recommended, is to emulate the check and uncheck of “Enable GT Quad Selection”:

if {![string equal "" [get_property -quiet CONFIG.MASTER_GT [get_ips $ip_name]]]} {
  set_property CONFIG.en_gt_selection {false} [get_ips $ip_name]
  set_property CONFIG.en_gt_selection {true} [get_ips $ip_name]
}

However, turning the en_gt_selection flag off and on again also resets the Device ID to its default, just as with manual toggling of the checkbox. And even though it set MASTER_GT correctly in my specific case, I’m not sure whether this can be relied upon.

Microsoft Windows: Atomic ops and memory barriers

Introduction

This post examines what Microsoft’s compiler does in response to a variety of special functions that implement atomic operations and memory barriers. If you program like a civilized human being, that is with spinlocks and mutexes, these are things you should never need to care about.

I’ve written a similar post regarding Linux, and it’s recommended to read it before this one (even though it repeats some of the stuff here).

To make a long story short:

  • The Interlocked-something functions do not just guarantee atomicity, but also function as a memory barrier to the compiler
  • Memory fences are unnecessary for the sake of synchronizing between processors (on x86, as discussed below)
  • The conclusion is hence that those Interlocked functions also work as multiprocessor memory barriers.

Trying it out

The following code:

LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  LONG y;
  x += *p;
  y = InterlockedExchangeAdd(&atomic_thingy, 0x1234);
  x += *p;

  return x + y;
}

When compiling this as “free” (i.e. optimized), this yields:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 10              mov         edx,dword ptr [eax]
  0000000A: 56                 push        esi
  0000000B: B9 34 12 00 00     mov         ecx,1234h
  00000010: BE 00 00 00 00     mov         esi,offset _atomic_thingy
  00000015: F0 0F C1 0E        lock xadd   dword ptr [esi],ecx
  00000019: 8B 00              mov         eax,dword ptr [eax]
  0000001B: 03 C1              add         eax,ecx
  0000001D: 03 C2              add         eax,edx
  0000001F: 5E                 pop         esi
  00000020: 5D                 pop         ebp
  00000021: C2 04 00           ret         4

First thing to note is that InterlockedExchangeAdd() has been translated into a “LOCK XADD”, which stores the updated value into memory and fetches the previous value into ECX. That previous value is @y in the C code, as InterlockedExchangeAdd() returns the previous value.

XADD by itself isn’t atomic, which is why the LOCK prefix is added. More about LOCK below.

What is crucially important to note is that putting InterlockedExchangeAdd() between the two reads of *p prevents the optimization of these two reads into one. @p isn’t defined as volatile, and yet it’s read from twice (ptr [eax]).

Another variant, now trying InterlockedOr():

LONG atomic;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  LONG y;
  x += *p;
  y = InterlockedOr(&atomic, 0x1234);
  x += *p;

  return x + y;
}

Once again, compiled as “free”, turns into this:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 4D 08           mov         ecx,dword ptr [ebp+8]
  00000008: 8B 11              mov         edx,dword ptr [ecx]
  0000000A: 53                 push        ebx
  0000000B: 56                 push        esi
  0000000C: 57                 push        edi
  0000000D: BE 34 12 00 00     mov         esi,1234h
  00000012: BF 00 00 00 00     mov         edi,offset _atomic
  00000017: 8B 07              mov         eax,dword ptr [edi]
  00000019: 8B D8              mov         ebx,eax
  0000001B: 0B DE              or          ebx,esi
  0000001D: F0 0F B1 1F        lock cmpxchg dword ptr [edi],ebx
  00000021: 75 F6              jne         00000019
  00000023: 8B F0              mov         esi,eax
  00000025: 8B 01              mov         eax,dword ptr [ecx]
  00000027: 5F                 pop         edi
  00000028: 03 C6              add         eax,esi
  0000002A: 5E                 pop         esi
  0000002B: 03 C2              add         eax,edx
  0000002D: 5B                 pop         ebx
  0000002E: 5D                 pop         ebp
  0000002F: C2 04 00           ret         4

This is quite amazing: The OR isn’t implemented as a single atomic instruction, but rather it goes like this: The previous value of @atomic is fetched into EAX and then moved to EBX. It’s ORed with the constant 0x1234, and then the cmpxchg instruction writes the result (in EBX) into @atomic only if its previous value is still the same as EAX. If not, the previous value is stored in EAX instead, and the JNE loops back to try again.

I should mention that cmpxchg compares with EAX and stores the previous value into it if the comparison fails, even though this register isn’t mentioned explicitly in the instruction. It’s just a thing that cmpxchg works with EAX. EBX holds the new value to be written, and it therefore appears in the instruction. Confusing, yes.

Also here, *p is read twice.

It’s also worth noting that using InterlockedOr() with the value 0 as a common way to grab the current value yields basically the same code (only the constant is generated with a “xor esi,esi” instead).

So if you want to use an Interlocked function just to read from a variable, InterlockedExchangeAdd() with zero is probably better, because it doesn’t create all that loop code around it.
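
In other words, something like this as a read-only accessor (a sketch of my own, not an official idiom):

LONG read_atomic(LONG volatile *p)
{
  /* Adding zero leaves the value unchanged, but the Interlocked call
     still supplies atomicity and a compiler memory barrier. */
  return InterlockedExchangeAdd(p, 0);
}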

Another function I’d like to look at is InterlockedExchange(), as it’s used a lot. Spoiler: No surprises are expected.

LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  LONG y;
  x += *p;
  y = InterlockedExchange(&atomic_thingy, 0);
  x += *p;

  return x + y;
}

and this is what it compiles into:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 10              mov         edx,dword ptr [eax]
  0000000A: 56                 push        esi
  0000000B: 33 C9              xor         ecx,ecx
  0000000D: BE 00 00 00 00     mov         esi,offset _atomic_thingy
  00000012: 87 0E              xchg        ecx,dword ptr [esi]
  00000014: 8B 00              mov         eax,dword ptr [eax]
  00000016: 03 C1              add         eax,ecx
  00000018: 03 C2              add         eax,edx
  0000001A: 5E                 pop         esi
  0000001B: 5D                 pop         ebp
  0000001C: C2 04 00           ret         4

And finally, what about writing twice to the same place?

LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG y;
  *p = 0;
  y = InterlockedExchangeAdd(&atomic_thingy, 0);
  *p = 0;

  return y;
}

Writing the same constant value twice to a non-volatile variable. This calls for an optimization. But the Interlocked function prevents it, as expected:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 4D 08           mov         ecx,dword ptr [ebp+8]
  00000008: 83 21 00           and         dword ptr [ecx],0
  0000000B: 33 C0              xor         eax,eax
  0000000D: BA 00 00 00 00     mov         edx,offset _atomic_thingy
  00000012: F0 0F C1 02        lock xadd   dword ptr [edx],eax
  00000016: 83 21 00           and         dword ptr [ecx],0
  00000019: 5D                 pop         ebp
  0000001A: C2 04 00           ret         4

Writing a zero is implemented by ANDing with zero, so it’s a bit confusing. But it’s done twice, before and after the “lock xadd”. Needless to say, these two writes are fused into one by the compiler without the Interlocked statement in the middle (I’ve verified it with disassembly, 32 and 64 bit).

Volatile?

Microsoft’s definition of the InterlockedExchangeAdd() function (and all other similar ones) declares the first operand as a pointer to a LONG volatile. Why volatile? Does the variable really need to be?

The answer is no, if all accesses to the variable are made with Interlocked-something functions. The compiler doesn’t optimize these calls, and each call also functions as a compiler memory barrier.

And it’s a good habit to stick to these functions, because of this implicit compiler memory barrier: That’s usually what we want and need, even if we’re not fully aware of it. Accessing a shared variable almost always has a “before” and “after” thinking around it. The volatile keyword doesn’t protect against reordering optimizations by the compiler (or otherwise).

But if the variable is accessed without these functions in some places, the volatile keyword should definitely be used when defining that variable.
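
For example, a stop flag that one thread polls with plain reads, while another thread sets it atomically. The polling read bypasses the Interlocked functions, hence the volatile (my own illustration, not from Microsoft’s docs):

volatile LONG stop_flag = 0;

void worker(void)
{
  while (!stop_flag)   /* Plain read: volatile keeps it from being optimized away */
    do_work();         /* do_work() is a placeholder */
}

void request_stop(void)
{
  InterlockedExchange(&stop_flag, 1);  /* Atomic write plus compiler memory barrier */
}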

More about LOCK

LOCK is a prefix that is added to an instruction in order to ensure that it’s performed atomically. In many cases, it’s superfluous and sometimes ignored, as the atomic operation is ensured anyhow.

From Intel’s 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (instruction set reference) vol. 2A page 3-537, on the “LOCK” prefix for instructions:

Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

The manual elaborates further on the LOCK prefix, but says nothing about memory barriers / fences. Those are implemented with the dedicated MFENCE, SFENCE and LFENCE instructions.

The LOCK prefix is encoded as a 0xf0 byte preceding the instruction in the binary code, by the way.

Linux counterparts

For x86 only, of course:

  • atomic_set() turns into a plain “mov”
  • atomic_add() turns into “lock add”
  • atomic_sub() turns into “lock sub”

I’m not sure that there are any Windows functions for exactly these.

Are memory barriers (fences) required?

Spoiler: Not in x86 kernel code, including 32 and 64 bits. Surprise. Unless you really yearn for a headache, this is the right place to stop reading this post.

The theoretical problem is that each processor core might reorder the storing or fetching of data between registers, cache and memory in any possible way to increase performance. So if one processor writes to X and then Y, and it’s crucial that the other processor sees the change in Y only when it also sees X updated, a memory barrier is often used. In the Linux kernel, smp_wmb() and smp_rmb() are used in conjunction to ensure this.

For example, say that X is some data buffer, and Y is a flag saying that the data is valid. One processor fills the buffer X and then sets Y to “valid”. The other processor reads Y first, and if it’s valid, it accesses the buffer X. But what if the other processor sees Y as valid before it sees the data in X correctly? A store memory barrier before writing to Y, plus a read memory barrier before reading from X, ensure the correct ordering.
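
In kernel-style C, the pattern goes something like this (a schematic sketch; buf, valid and process() are made up):

/* Writer, on one CPU: fill the buffer, then raise the flag */
buf[n] = new_data;
smp_wmb();             /* Buffer stores complete before the flag store */
WRITE_ONCE(valid, 1);

/* Reader, on another CPU: check the flag, then touch the buffer */
if (READ_ONCE(valid)) {
        smp_rmb();     /* Flag read completes before the buffer reads */
        process(buf[n]);
}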

The thing is that the Linux kernel’s implementation of smp_wmb() and smp_rmb() for x86 is a NOP. Note that it’s only the smp_*() versions that are NOPs. The non-smp fences turn into opcodes that implement fences. Assuming that the Linux guys know what they’re doing (which is a pretty safe assumption in this respect), they’re telling us that the view of ordering is kept intact across processor cores. In other words, if I can assure that X is written before Y on one processor, then Intel promises me that on another processor X will be read with the updated value before Y is seen updated.

Looking at how Microsoft’s examples solve certain multithreading issues, it’s quite evident that they trust the processor to retain the ordering of operations.

Memory fences are hence only necessary for ensuring the ordering on a certain processor on x86. On other architectures (e.g. ARM v7), smp_wmb() and smp_rmb() actually do produce some code.

When are these fences really needed? From Intel’s 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (instruction set reference) vol. 2A page 4-22, on the “MFENCE” instruction:

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction).

That didn’t answer much. I searched for fence instructions in the disassembly of a Linux kernel compiled for x86_64. A lot of fence instructions are used in the initial CPU bringup, in particular after setting CPU registers. Makes sense. Then there are several other fence instructions in drivers, which aren’t necessarily needed, but who has the guts to remove them?

Most interesting is where I didn’t find a single fence instruction: In any of the object files generated in kernel/locking/. In other words, Linux’ mutexes and spinlocks are all implemented without any fence. So this is most likely a good proof that they aren’t needed for anything else but things related to the CPU state itself. I guess. For a 64-bit x86, that is.

Going back to Microsoft, it’s interesting that their docs for userspace Interlocked functions say “This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order”, but not those for kernel space. Compare, for example InterlockedOr() for applications vs. the same function for kernel. Truth is I didn’t bother to do the same disassembly test for application code.

Some barrier functions

(or: A collection of functions you probably don’t need, even if you think you do)

  • KeFlushWriteBuffer(): Undocumented and rarely mentioned, intended for internal kernel use. Probably just makes sure that the cache has been flushed (?).
  • KeMemoryBarrier(): Calls _KeMemoryBarrier(). However, wdm.h contains an implementation of this function, calling FastFence() and LoadFence(). These are just macros for __faststorefence and _mm_lfence, looked at next.
  • _mm_lfence() : Turns into an lfence opcode. Same as rmb() in Linux.
  • _mm_sfence(): Turns into an sfence opcode. Same as wmb() in Linux.
  • _mm_mfence(): Turns into an mfence opcode.

I’ve verified that the _mm_*fence() builtins generated the said opcodes when compiled for x86 and amd64 alike. See some experiments on this matter below.

The deprecated _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() produce no code at all. MemoryBarrier() ends up as a call to _MemoryBarrier().

Experimenting with fence instructions

(or: A major waste of time)

This is the code compiled:

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  x += *p;
  _mm_lfence();
  x += *p;

  return x;
}

With a “checked” compilation this turns into:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 51                 push        ecx
  00000006: C7 45 FC 00 00 00  mov         dword ptr [ebp-4],0
            00
  0000000D: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000010: 8B 4D FC           mov         ecx,dword ptr [ebp-4]
  00000013: 03 08              add         ecx,dword ptr [eax]
  00000015: 89 4D FC           mov         dword ptr [ebp-4],ecx
  00000018: 0F AE E8           lfence
  0000001B: 8B 55 08           mov         edx,dword ptr [ebp+8]
  0000001E: 8B 45 FC           mov         eax,dword ptr [ebp-4]
  00000021: 03 02              add         eax,dword ptr [edx]
  00000023: 89 45 FC           mov         dword ptr [ebp-4],eax
  00000026: 8B 45 FC           mov         eax,dword ptr [ebp-4]
  00000029: 8B E5              mov         esp,ebp
  0000002B: 5D                 pop         ebp
  0000002C: C2 04 00           ret         4

OK, this is too much: There is no optimization at all. So let’s look at the “free” compilation instead:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 08              mov         ecx,dword ptr [eax]
  0000000A: 0F AE E8           lfence
  0000000D: 8B 00              mov         eax,dword ptr [eax]
  0000000F: 03 C1              add         eax,ecx
  00000011: 5D                 pop         ebp
  00000012: C2 04 00           ret         4

So clearly, the fence command made the compiler read the value from memory twice, as opposed to optimizing the second read away. Note that there’s no volatile keyword involved. So except for the fence, there’s no reason to read from *p twice.

The exact same result is obtained with _mm_mfence().

Trying with _mm_sfence() yields an interesting result however:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 00              mov         eax,dword ptr [eax]
  0000000A: 0F AE F8           sfence
  0000000D: 03 C0              add         eax,eax
  0000000F: 5D                 pop         ebp
  00000010: C2 04 00           ret         4

*p is read into eax once, then comes the fence, and then eax is added to itself. As opposed to above, where *p was read into ecx before the fence, then read again into eax after it, and the two were added together.

So the compiler felt free to optimize the two reads into one, because the store fence deals only with writes into memory, not reads. Given that there’s no volatile keyword used, it’s fine to optimize reads, which is exactly what it did.

The same optimization occurs if the fence command is removed completely, of course.

For the record, I’ve verified the equivalent behavior on the amd64 target (I’ll spare you the assembly code).

Windows trusting many more Root Authorities than certmgr shows

This baffled me for a while: I used certmgr to see if a Windows 10 machine had a root certificate that was needed to certify a certain digital signature, and it wasn’t listed. But then the signature was validated. And not only that, the root certificate was suddenly present in certmgr. Huh?

Here’s a quick demonstration. This is the “before” screenshot of the Certificate Manager window (click to enlarge):

Windows Certificate Manager before examining .cab file

Looking at the registry, I found 11 certificates in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\SystemCertificates\AuthRoot\Certificates\ and 12 certificates in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\SystemCertificates\ROOT\Certificates\, so it matches exactly certmgr’s view of 23 root certificates.
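
For the record, the same entries can be listed from a command prompt, one subkey per certificate (reg query is a standard Windows tool):

> reg query HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\SystemCertificates\AuthRoot\Certificates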

And so I had a .cab file with a certificate that requires Certum’s root certificate for validation. As is clear from the screenshot above, that root certificate wasn’t installed.

Then I right-clicked that .cab file, selected Properties, then the “Digital Signature Tab”, selected the certificate and clicked Details, and boom! A new root certificate was automatically installed (click to enlarge):

Windows Certificate Manager after examining .cab file

And suddenly there are 12 certificates in the AuthRoot part of the registry instead of 11. Magic.

And here’s what happens behind the scenes of that trick.

Microsoft publishes a Certificate Trust List (CTL), which every computer downloads automatically every now and then (once a week, typically). It contains the list of root authorities that the computer should trust; however, apparently they are imported into the registry only as needed. This page describes the concept of CTL in general.

I don’t know where this is stored on the disk, however one can download the list and create an .sst file, which opens certmgr when double-clicked. That lists all certificates of the downloaded CTL. 425 of them, as of May 2021, including Certum of course:

> certutil -generateSSTFromWU auth.sst

So it seems like Windows installs certificates from the CTL as necessary to validate certificate chains. This includes the GUI for examining certificates, verifying with signtool, as well as requiring the certificate for something actually useful.

There’s also a utility called CTLInfo out there, which apparently displays the CTL currently loaded in the system, but I haven’t tried it out.

There’s another post on Stackexchange on this matter.

Besides, I’ve written a general post on certificates, if all this sounded like Chinese.

Attestation signing of Windows device driver: An unofficial guide

Introduction

This is my best effort to summarize the steps to attestation signing for Windows drivers (see Microsoft’s main page on this). I’m mostly a Linux guy with no connections inside Microsoft, so everything written below is based upon public sources, trial and (a lot of) error, some reverse engineering, and speculations. This couldn’t be further away from the horse’s mouth, and I may definitely be wrong occasionally (that is, more than usual).

Also, the whole topic of attestation signing seems to be changing all the time, so it’s important to keep in mind that this reflects the situation as of May 10th, 2021. Please comment below as things change, or if I got things wrong to begin with.

Attestation signing replaces the method that was available until April 2021, which was signing the driver locally by its author, just like any code signing. With attestation signing, Microsoft’s own digital signature is used to sign the driver. To achieve that, the driver’s .inf and .sys files are packed in a .cab file, signed by the author, and submitted to Microsoft. Typically 10 minutes later, the driver is signed by Microsoft, and can be downloaded back by the author.
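
For a rough picture, the packing and signing goes something like this (a hedged sketch: all file names are made up, and the exact signtool arguments depend on how the EV certificate is installed). First, a .ddf directive file for makecab:

; driver.ddf
.Set CabinetNameTemplate=mydriver.cab
.Set DiskDirectory1=.
.Set DestinationDir=mydriver
mydriver\mydriver.inf
mydriver\mydriver.sys

And then pack the .cab file and sign it with the EV certificate:

> makecab /f driver.ddf
> signtool sign /fd sha256 /n "Your Company" mydriver.cab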

Unfortunately, the signature obtained this way is recognized by Windows 10 only. In order to obtain a signature that works with Windows 7 and 8, the driver needs to get through an HLK test.

Signing up to the Hardware Program

This seemingly simple first step can be quite confusing and daunting, so let’s begin with the most important point: The only piece of information present in Microsoft’s output (i.e. their signature add-ons) that wasn’t about Microsoft itself was the company’s name, as I stated it during the enrollment. In other words, what happens during the sign-up process doesn’t matter so much, as long as it’s completed.

This is Microsoft’s how-to page for attestation signing in general, and this one is about joining the hardware program. It wasn’t clear to me from these what I was supposed to do, so I’ll try to offer some hints.

The subscription to the Hardware Program can begin when two conditions are met:

  • You have the capability to sign a file with an Extended Validation (EV) code signing certificate.
  • You have an Azure Active Directory Domain Services managed domain (“Azure AD”).

Obtaining an EV certificate is a bureaucratic process, and it’s not cheap. But at least the other side tells you what to do, once you’ve paid. I went for ssl.com as their price was the lowest, and working with them I got the impression that the company has hired people who actually know what they’re doing. In short, recommended.

So what’s the Domain Services thing? Well, this is the explanation from inside the Partner web interface (once it has already been set up): “Partner Center uses Azure Active Directory for identity and access management”. That’s the best I managed to find on why this is necessary.

For a single user scenario, this boils down to obtaining a domain name like something.onmicrosoft.com from Microsoft. It doesn’t matter if the name turns out long and daunting: It doesn’t appear anywhere else, and you’re not going to type it manually.

So here’s what to do: First things first, create a fresh Microsoft account. This isn’t strictly necessary if you already have one, but there’s going to be quite some mail going its way (some of which is promotional, unless you’re good at opting out).

While logged into that account, start off on Azure’s main page. Join the 12-month free trial. It’s free, and yet you’ll need to supply a valid credit card number in the process. As of writing this, I don’t know what happens after 12 months (but see “Emails from Azure calling for an upgrade” below on developments).

The next step is to create that domain service. I believe this is Microsoft’s page on the topic, and this is the moment one wonders why there’s talk about DNSes and virtual networks. Remember that the only goal is to obtain the domain name, not to actually use it.

And here comes the fuzzy part, where I’m not sure I didn’t waste time on useless operations. You may try following this, as it worked for me, but I can’t say I understand why these seemingly pointless actions did any good. I suspect that the bullets marked as possibly skippable below can indeed be skipped: maybe it’s just about creating an Azure account, and not necessarily allocating resources.

So here are the steps that got me going:

  • Log in to your (new?) Azure account.
  • Go to Azure’s Portal (or click the “Portal” link at the top bar on Azure’s main page)
Maybe skip these steps (?):
  • Click “Create a resource” (at the top left) and pick Azure AD Domain Services.
  • For Resource Group I created a new one, “the_resource_group”. I guess the name doesn’t matter.
  • The DNS name doesn’t matter, apparently. Something like yourcompany.onmicrosoft.com. It’s not going to appear anywhere.
  • I set the SKU to Standard, as it appeared to be the least expensive one.
  • After finishing the setup, it took about an hour for Azure to finish the configuration. Time for a long and well deserved coffee break.
  • But then it complained that I needed to set up DNSes or something. So I went along with the automatic fix.

(end of possibly useless steps)

  • There’s this thing on the Register for the Hardware Program page saying that one should log in with the Global administrator account. This page defines Azure AD Global administrator as “This administrator role is automatically assigned to whomever created the Azure AD tenant”. So apparently for a fresh Azure account, it’s OK as is.
  • At this point, you’re hopefully set to register to the Hardware Developer Program. After clicking “Next” on the landing page, you’ll be given the choice of “Sign in to Azure AD” or “Create a new directory for free”. The Azure AD is already set up, so log in with the account just created.
  • A word about that “Create a new directory for free” option. To make things even more confusing, this appears to be a quick and painless shortcut; however, in my case I got “This domain name is not available” for any domain name I tried. Maybe I missed something, but this was a dead end for me. This is the page I didn’t manage to get through. Maybe your luck is better than mine. So this is why I created the Azure AD first, and then went for registration.
  • Going on with the registration, you’re given a file to sign with your EV certificate. I got a .bin file, but in fact it had .exe or .sys format. So it can be renamed to .exe and used with cloud signature services (I used eSigner). BUT do this only if you’re going to sign the .cab files with the same machinery, or you’ll waste a few hours wondering what’s going on. Or read below (“When the signature isn’t validated”) on why this went wrong in my case.
  • And this is the really odd thing: Inside the Microsoft Partner Center, clicking the “your account” button (at the top right) shows the default directory in use. At some point during the enrollment procedure, the link with the Azure AD I created was made (?), but for some reason, the default directory shown was something like microsoftazuremycompany.onmicrosoft.com instead of mycompany.onmicrosoft.com, the domain I created before. This didn’t stop me from signing a driver. But if another directory was used, why did I create one earlier?

After all this, I was set to submit drivers for signature: From this moment on, the entry point for signing drivers is the Microsoft Partner Center dashboard’s main page.

Emails from Azure calling for an upgrade

To make a long story short, quite a few emails arrived on behalf of Microsoft Azure, urging me to “upgrade” my account, i.e. to allow charging my credit card for Azure services. I ignored them all, and had no issues continuing to sign drivers.

And now to the details.

A day after signing up to Azure, I discovered that $20 had been deducted from my promotional free trial credit. Apparently, I had enabled stuff that would have cost real money. So I deleted the resources I had allocated in Azure, including the mycompany.onmicrosoft.com domain, which apparently made no difference to the Partner Center. In practice, this meant deleting the resource group (which contained 7 elements, the domain included): just click the resource group on the main portal page, and then Delete Resource Group at the top. It took several minutes for Azure to digest that.

About a month later, I got a notification from Azure urging me to upgrade my account. It went:

You’re receiving this email because your free credit has expired. Because of this, your Azure subscription and services have been disabled. To restore your services, upgrade to pay-as-you-go pricing.

Scary, heh? Does “services have been disabled” mean that I’m about to lose the ability to sign drivers?

Once again, “upgrade” is a cute expression for giving permission to charge the credit card that I had to supply during signup, the details of which can’t be deleted from the account unless another valid card is submitted instead.

As a side note, it turned out that I had a Network Watcher group activated. Maybe I missed it earlier, and maybe it was somehow added. So I deleted it as well. But it’s not clear if this was related to the fact that the credits expired, whatever that means.

A few days on came another mail from Azure, basically the same, urging me to upgrade. One day after that came an invoice. How come? I hadn’t approved any payment. Well, it turned out to be an invoice for 0.00 USD. Zero. Why it was sent to me is still unclear.

And finally, roughly two months after the initial subscription, I got a “We’re sorry to see you go” email from Azure, saying “Your free credit expired on (this and this date), and because of this we’ve deleted your subscription and any associated data and services”. Uhhm. What about driver signing? Well, I’ve already spoiled the suspense above.

Two weeks after getting this last email, I deleted all cookies on my browser that were related to Microsoft, logged into my account at the Partner Center and submitted a driver for signature. The process went through smoothly.

Checking my Azure cloud account, it seemed to have been reset to its starting state, even with a suggestion to start another $200 free trial credit round. Detaching my credit card was, however, still impossible.

So apparently, there’s no problem just ignoring these emails and continuing to sign drivers forever. Emphasis on “apparently”.

Overview of the signature process

To make a long story short, you prepare a .cab file with the driver’s files, sign it with your EV Certificate, upload it to the Hardware Dashboard, and get it back after 10 minutes with Microsoft’s digital signatures all over the place.

So instead of signing the driver yourself, you pack the whole thing neatly, and send it to Microsoft for adding the .cat file and signing the drivers. And yet, you must sign the .cab file to prove that you’re the one taking responsibility for it. It’s Microsoft’s signature on the driver in the end, but they know who to bash if something goes wrong.

.cab files are exactly like .zip files, in the sense that they contain a directory tree, not just a bunch of files. Unfortunately, when looking at .cab files with Windows’ built-in utilities, the directory structure isn’t presented, and it looks like a heap of files. This holds true both when double clicking a .cab file and when using expand -D, from Windows XP all the way to Windows 10. Ironically enough, double-clicking a .cab file with Linux desktop GUI opens it correctly as a directory tree.
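
For those without a Linux desktop at hand, cabextract’s list mode also shows the stored paths. This is a suggestion of mine, assuming cabextract is installed (it’s a standard package on most Linux distributions):

$ cabextract -l thedriver.cab

Each file is listed along with its path inside the archive, so the directory tree is visible after all.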

It’s important to think of .cab files as .zip files with hierarchy, because the driver is submitted by organizing the files in directories exactly as they appear in the driver package for release, minus the .cat file. So what really happens is that Microsoft uncompresses the .cab file like a .zip, adds the .cat file, and then performs the digital signing. It then compresses it all back into a .zip file and returns it to you. The files remain in the same positions all through.

I guess the only reason .zip files aren’t uploaded instead of .cab is that signtool doesn’t sign zips.

Some people out there, who missed this point, got the impression that the signing is done for each architecture separately. That’s possible, but there’s no reason to go that way. It’s just a matter of preparing the file hierarchy properly.

Preparing the .cab file

For reference, this is Microsoft’s short page on makecab, and this is a very long page on .cab files (which begins with cabarc, but goes on with makecab).

First, set up a .ddf file, looking something like this:

.Set CabinetFileCountThreshold=0
.Set FolderFileCountThreshold=0
.Set FolderSizeThreshold=0
.Set MaxCabinetSize=0
.Set MaxDiskFileCount=0
.Set MaxDiskSize=0
.Set CompressionType=MSZIP
.Set Cabinet=on
.Set Compress=on

;Specify file name for new cab file
.Set CabinetNameTemplate=thedriver.cab
.Set DiskDirectoryTemplate= ; Output .cab files into current directory

.Define pathtodriver=thedriver-dir

.Set DestinationDir=thedriver
%pathtodriver%\thedriver.inf
.Set DestinationDir=thedriver\i386
%pathtodriver%\i386\thedriver.sys
.Set DestinationDir=thedriver\amd64
%pathtodriver%\amd64\thedriver.sys

The .cab file is then generated with something like

> makecab /f thedriver.ddf

“makecab” is in Windows’ execution path by default.

In my case of transitioning from self-signed drivers to attestation signing, there was already a script that generated the directory ready for releasing the driver. So the change I made was to not copy the .cat file into that directory, and to create a .cab file instead of signing the .cat file.

The .ddf file above relates to a driver released for Intel architecture, 32 and 64 bits. The subdirectories in the driver package are i386 and amd64, respectively, as defined in the .inf file.

Changes you should make to the .ddf file:

  • Replace all “thedriver” with the name of your driver (i.e. the name of the .inf and .sys files).
  • Set “pathtodriver” to where the driver package is. Note that makecab’s /d flag allows setting variables, so the Define directive can be removed, and instead go something like
    > makecab /d pathtodriver=..\driverpackage /f thedriver.ddf
  • Then possibly modify the files to be included. Each DestinationDir assignment tells makecab where to place the file(s) that appear after it. This should match your release package’s directory structure.
  • If the line doesn’t start with a dot, it’s the path to a file to copy into the .cab file. The path can be absolute (yuck) or relative to the current directory.

All in all, the important thing is to form a directory tree of a driver for release in the .cab file.
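
For the .ddf file shown above, the tree inside the .cab should hence end up as follows (my own illustration of what the directives produce, not taken from any official source):

thedriver/
    thedriver.inf
    i386/
        thedriver.sys
    amd64/
        thedriver.sys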

The .ddf file shown above is a working example, and it includes only the .inf and .sys files. Including a .cat file is pointless, as Microsoft’s signature machinery generates one of its own.

As for .pdb files, it’s a bit more confusing: Microsoft’s main page includes .pdb files in the list of “typical CAB file submissions” (actually, .cat is listed there too), and yet these files don’t appear in the .ddf file example on the same page. The graphic showing a tree for multiple package submissions is inconsistent with both.

A .pdb file contains the symbol map of the related .sys file, allowing the kernel debugger to display meaningful stack traces and disassemblies, in particular when analyzing a bugcheck. These files are not included in a driver release, not mentioned in the .inf file, and not referenced in the .cat file, and are therefore unrelated to the signature of the driver. Technically, Microsoft doesn’t need these files to complete an attestation signature.

Microsoft nevertheless encourages submitters of drivers to include .pdb files. When these files are missing from a driver submission, a popup shows up in the web interface saying “This submission does not include symbols. It is recommended to upload symbols within each driver folder”. This doesn’t stop the process, however, nor even delay it if you’re slow to dismiss the popup. So it’s up to you whether to include .pdb’s.

Submitting the .cab file

The command for signing the .cab file is:

> signtool.exe sign /fd sha256 thedriver.cab

Note that timestamping is not required, but won’t hurt. The whole idea with timestamping is to keep the signature valid after the certificate expires, but the .cab file is examined soon after it’s signed, and has no importance after that.
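
If you opt for timestamping anyway, signtool’s RFC 3161 flags do the job. A sketch, with ssl.com’s timestamp server URL as my assumption (substitute whatever RFC 3161 server your CA provides):

> signtool.exe sign /fd sha256 /tr http://ts.ssl.com /td sha256 thedriver.cab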

Note that ssl.com also offers an eSigner tool for signing the .cab file with a simple web interface. Just be sure to have registered with a signature made in eSigner as well, or things will go bad; see “When the signature isn’t validated” below. Or add eSigner’s certificate to the existing subscription.

Then the submission itself:

  • In Microsoft Partner Center’s dashboard, click “Drivers” on the left menu bar. It might be necessary to click “Hardware” first to make this item appear.
  • Click the “Submit new hardware” button at the top left to get started.
  • Give the submission a name — it’s used just for your own reference, and surely won’t appear in the signed driver package.
  • Drag the signed cab file to where it says to.
  • The web interface requires selecting Windows releases in a lot of checkboxes. More on this just below.
  • Click “Submit” to start the machinery. Once it finishes, successfully or not, it sends a notification mail (actually, about three identical mails; it’s not clear why not just one).
  • If and when the entire process is completed successfully, the driver can be downloaded: Under “Packages and signing properties”, there’s a “More” link. Click it, and a button saying “Download signed files” appears. So click it, obviously.

Now to the part about selecting Windows versions. It’s an array of checkboxes. This is a screenshot of this stage (click to enlarge):

Selecting OS targets for Attestation Signing

First, the easy part: Don’t check the two at the top saying “Perform test-signing for X”. It says “Leave all checkboxes blank for Attestation Signing” in fine print above these.

Now we’re left with a whole lot of Windows 10 release numbers and architectures. From a purely technical point of view, there’s no need for this information to perform the signature, since the .inf file contains the information on which architectures are targeted.

Rather, this is the “attestation” part: Just above the “Submit” button, it says “You have completed quality testing of your driver for each of the Operating Systems selected above”. So this is where you testify which platforms you’ve tested the driver with. The deal is that instead of going through the via dolorosa of HLK tests, Microsoft signs the driver for you in exchange for this testimony. Or should I say, attestation.

Just to spell it out: The signature can’t and doesn’t limit itself to specific operating system builds, and it would be insane to do so, as it wouldn’t cover future Windows releases.

I have to admit that in the beginning I misunderstood this part, and tried to select as many as possible. And because my driver wasn’t compiled for arm64, and I had checked versions saying “ARM64”, the submission was rejected with “thedriver.inf does not have NTARM64 decorated model sections” (in UniversalLog.txt). It was a bit of a computer game to check the right boxes and avoid the wrong ones.

So no need to be greedy. Common sense is to test the driver on one operating system release for each architecture. In the example above, it’s for a driver released for Intel architecture, 32 and 64 bits. The checkbox selection reflects testing it with Windows 10 release 1607, 32- and 64-bit architecture. This is the proper way to go.

And yet, for the heck of it I tried submitting the same driver package with a single OS checked (1607 x64). To my surprise, the package was accepted and signed despite my declaration that it hadn’t been tested for the 32-bit version, even though a .sys file for that architecture was part of the package.

All in all, there must be a match between the architectures targeted by the driver (as listed in the .inf file) and those inferred by the selection of OSes. Nevertheless, it seems like Microsoft lets you get away with not testing all of them. In short, checking just one checkbox may be enough, even if the driver supports multiple architectures.

Looking at the signed zip

After receiving back the signed driver, I examined the files. My findings were:

  • The .inf file is left completely intact (bytewise identical to the one in the .cab file).
  • A signed .cat file was added.
  • All .sys files were signed as well (contrary to what most of us do when releasing drivers). This makes the driver eligible for inclusion during boot.

Looking at the digital signatures with an ASN.1 dump utility, it appears that the only place where something doesn’t point at Microsoft is a non-standard spcSpOpusInfo entry in the crypto blob, where the company’s name appears in wide char format in the programName field (no, I’m not mistaken). This appears to be taken from the “Publisher display name” as it appears in the Account Settings in the Microsoft Partner Center dashboard.

So all in all, there are almost no traces of the fact that the driver’s origin isn’t Microsoft, except for that entry in the crypto blob, which is most likely invisible unless the signature is analyzed as ASN.1 or string-searched (with a tool that detects wide char strings). So it appears that all information, except for that “Publisher display name”, remains between you and Microsoft.
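
By the way, this inspection is easy to reproduce: a .cat file is a DER-encoded PKCS#7 blob, so openssl’s generic ASN.1 parser dumps it. A sketch of mine (the wide-char company name shows up as a BMPSTRING entry, near the spcSpOpusInfo OID, 1.3.6.1.4.1.311.2.1.12, if I have it right):

$ openssl asn1parse -inform DER -in thedriver.cat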

When the signature isn’t validated

Sometimes, the process fails at the “Preparation” stage. As always on failures, the web interface suggests downloading a “full error report”. That report is a file named UniversalLog.txt. If it says just “SignatureValidationFailed”, something went wrong with the signature validation.

The solution for this is to make sure that the certificate that was used for signing the .cab file is registered: Within Microsoft Partner Center, click the gear icon at the top right, select “Account Settings” and pick “Manage Certificates” at the left menu bar. That’s where the relevant certificate should be listed. The first time I got to this page, I saw the same certificate twice, and deleted one of those.

In my case the problem was that during the registration, I had made the signature with the cloud app (eSigner), but signed the driver with a local USB key dongle. As it turned out, these have different certificates.

So the solution was to delete the registered certificate from the account, and register the new one by signing a file with the local USB dongle. Doing this is a good idea in any case, because if something is wrong with the signature produced by signtool, it will fail the registration as well. So whether this renewed registration succeeds or fails, it brings you closer to the solution.
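
A quick way to tell which certificate actually made the signature is to compare SHA1 thumbprints: signtool verify prints the signing chain (see the sample chains in the next section), and certutil lists the certificates present locally. A sketch, assuming the certificate sits in the user’s personal store (named “My”); with a dongle it may live under a vendor-specific provider instead:

> signtool verify /pa /v thedriver.cab
> certutil -store -user My

If the leaf certificate’s SHA1 hash in signtool’s output doesn’t match any certificate registered under Manage Certificates, that’s the SignatureValidationFailed right there.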

Sample certificate chains

For reference, these are examples of certificate chains: one for a properly signed .cab file, and one for a .cat file that has been attestation signed by Microsoft.

Note the /pa flag, which selects the Default Authenticode Verification Policy; without it, verification may fail. Also note that the file isn’t timestamped, which is OK for an attestation signing submission.

> signtool verify /pa /v thefile.cab

Verifying: thefile.cab

Signature Index: 0 (Primary Signature)
Hash of file (sha256): 388D7AFB058FEAE3AEA48A2E712BCEFEB8F749F107C62ED7A41A131507891BD9

Signing Certificate Chain:
    Issued to: Certum Trusted Network CA
    Issued by: Certum Trusted Network CA
    Expires:   Mon Dec 31 05:07:37 2029
    SHA1 hash: 07E032E020B72C3F192F0628A2593A19A70F069E

        Issued to: SSL.com EV Root Certification Authority RSA R2
        Issued by: Certum Trusted Network CA
        Expires:   Mon Sep 11 02:28:20 2023
        SHA1 hash: 893E994B9C43100155AE310F34D8CC962096AE12

            Issued to: SSL.com EV Code Signing Intermediate CA RSA R3
            Issued by: SSL.com EV Root Certification Authority RSA R2
            Expires:   Wed Mar 22 10:44:23 2034
            SHA1 hash: D2953DBA95086FEB5805BEFC41283CA64C397DF5

                Issued to: THE COMPANY LTD
                Issued by: SSL.com EV Code Signing Intermediate CA RSA R3
                Expires:   Fri May 03 13:09:33 2024
                SHA1 hash: C15A6A7986AE67F1AE4B996C99F3A43F98029A54

File is not timestamped.

Successfully verified: thefile.cab

Number of files successfully Verified: 1
Number of warnings: 0
Number of errors: 0

One possibly confusing situation: on a fresh Windows installation, the root certificate may be absent before this verification is run for the first time. The verification nevertheless succeeds, and the root certificate appears from nowhere. That rare situation is explained in this post.

Next up is the attestation signed .cat file:

> signtool.exe verify /kp /v thedriver.cat

Verifying: thedriver.cat

Signature Index: 0 (Primary Signature)
Hash of file (sha256): ED5231781724DEA1C8DE2B1C97AC55922F4F85736132B36660FE375B44C42370

Signing Certificate Chain:
    Issued to: Microsoft Root Certificate Authority 2010
    Issued by: Microsoft Root Certificate Authority 2010
    Expires:   Sat Jun 23 15:04:01 2035
    SHA1 hash: 3B1EFD3A66EA28B16697394703A72CA340A05BD5

        Issued to: Microsoft Windows Third Party Component CA 2014
        Issued by: Microsoft Root Certificate Authority 2010
        Expires:   Mon Oct 15 13:41:27 2029
        SHA1 hash: 1906DCF62629B563252C826FDD874EFCEB6856C6

            Issued to: Microsoft Windows Hardware Compatibility Publisher
            Issued by: Microsoft Windows Third Party Component CA 2014
            Expires:   Thu Dec 02 15:25:28 2021
            SHA1 hash: 984E03B613E8C2AE9C692F0DB2C031BF3EE3A0FA

The signature is timestamped: Mon May 10 03:10:15 2021
Timestamp Verified by:
    Issued to: Microsoft Root Certificate Authority 2010
    Issued by: Microsoft Root Certificate Authority 2010
    Expires:   Sat Jun 23 15:04:01 2035
    SHA1 hash: 3B1EFD3A66EA28B16697394703A72CA340A05BD5

        Issued to: Microsoft Time-Stamp PCA 2010
        Issued by: Microsoft Root Certificate Authority 2010
        Expires:   Tue Jul 01 14:46:55 2025
        SHA1 hash: 2AA752FE64C49ABE82913C463529CF10FF2F04EE

            Issued to: Microsoft Time-Stamp Service
            Issued by: Microsoft Time-Stamp PCA 2010
            Expires:   Wed Jan 12 10:28:27 2022
            SHA1 hash: AAE5BF29B50AAB88A1072BCE770BBE40F55A9503

Cross Certificate Chain:
    Issued to: Microsoft Root Certificate Authority 2010
    Issued by: Microsoft Root Certificate Authority 2010
    Expires:   Sat Jun 23 15:04:01 2035
    SHA1 hash: 3B1EFD3A66EA28B16697394703A72CA340A05BD5

        Issued to: Microsoft Windows Third Party Component CA 2014
        Issued by: Microsoft Root Certificate Authority 2010
        Expires:   Mon Oct 15 13:41:27 2029
        SHA1 hash: 1906DCF62629B563252C826FDD874EFCEB6856C6

            Issued to: Microsoft Windows Hardware Compatibility Publisher
            Issued by: Microsoft Windows Third Party Component CA 2014
            Expires:   Thu Dec 02 15:25:28 2021
            SHA1 hash: 984E03B613E8C2AE9C692F0DB2C031BF3EE3A0FA

Successfully verified: thedriver.cat

Number of files successfully Verified: 1
Number of warnings: 0
Number of errors: 0

Doing the same with the .sys file yields exactly the same result, with slight and meaningless differences in the timestamp.

Clearly, the certificate chain ends with “Microsoft Root Certificate Authority 2010” rather than the well-known “Microsoft Code Verification Root”, which is the reason the attestation signature isn’t recognized by Windows 7 and 8.

Microsoft as a Certificate Authority, approving itself all through the chain. It’s quite odd this happened only now.

Generation of a certificate request from an existing P12 certificate

The goal

The envisioned workflow for certificate generation is that the end user requests a certificate from a CA by first generating a public / private key pair, and then sending a request for having the public key certified by the CA. This way, the CA is never exposed to the private key.

This is contrary to the common procedure today, where the end user gets the private key from the CA, mostly because the requirement is often that the private key must be on an external hardware device, out of reach even to the end user itself.

Because of this original vision of the flow, openssl generates a certificate in two steps: First, create a request file, which contains the public key and the Subject information. The second step takes the request file as input and generates a certificate, using the CA’s secret key plus the related CA certificate, whose data is copied into the generated certificate’s Issuer information.
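
For reference, this is roughly what the plain two-step flow looks like when starting from scratch. A generic sketch with made-up file names, unrelated to the P12 scenario below:

$ openssl req -new -newkey rsa:2048 -keyout key.pem -out request.csr
$ openssl x509 -req -in request.csr -CA ca-cert.pem -CAkey ca-key.pem -CAcreateserial -days 365 -out newcert.pem

The first command creates the key pair and the request (prompting for a pass phrase and the Subject information), and the second plays the CA’s role.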

But what if I already have a certificate, and I want another one, for the exact same public key and the same Subject? This post is about exactly that, when the previous certificate is in .p12 format.

For a general tutorial on certificates, there’s this post.

Steps

Extract information from existing certificate:

$ openssl pkcs12 -in my-certificate.p12 -nodes -out oldcert.pem

This command prompts for the password of the secret key in the .p12 file, and then creates a PEM file with two sections: one for the certificate, and one for the secret key. Note the -nodes argument, which outputs the secret key without password protection. This makes the process easier, but obviously riskier as well.
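
If leaving the key unprotected is a concern, dropping -nodes makes openssl prompt for a pass phrase to protect the extracted key with. Same command otherwise:

$ openssl pkcs12 -in my-certificate.p12 -out oldcert.pem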

To view the extracted certificate part in textual format:

$ openssl x509 -in oldcert.pem -text

Inspired by this page, generate a CSR with:

$ openssl x509 -x509toreq -in oldcert.pem -out CSR.csr -signkey oldcert.pem

Note that oldcert.pem is used twice: once as the reference for creating a CSR, and once for grabbing the key. If the key is password protected (i.e. -nodes wasn’t used above), openssl prompts for its password again at this point, because the private key is opened.

CSR.csr contains some textual information as well as a PEM-formatted part, which is the one to submit. So I copied the file to clean.csr, manually deleted everything but the PEM segment, and checked it:

$ openssl req -text -in clean.csr -noout -verify

The output should make sense (correct requested name etc.).
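
By the way, instead of deleting the extra text by hand, a one-liner can extract just the PEM segment. My own suggestion, assuming the standard BEGIN / END markers of a certificate request:

$ sed -n '/-----BEGIN CERTIFICATE REQUEST-----/,/-----END CERTIFICATE REQUEST-----/p' CSR.csr > clean.csr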

Now delete oldcert.pem, as it contains the secret key in cleartext!
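
And if you’re extra careful, overwrite it rather than just delete it. A suggestion of mine; how effective shred is depends on the filesystem, but it’s still better than a plain rm:

$ shred -u oldcert.pem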