Linux: When Vivado’s GUI doesn’t start with an error on locale

Trying to running Vivado 2017.3 with GUI and all on a remote host with X forwarding, i.e.

$ ssh -X mycomputer

setting the environment with

$ . /path/to/Vivado/2017.3/settings64.sh

it failed with

$ vivado &
terminate called after throwing an instance of 'std::runtime_error'
  what():  locale::facet::_S_create_c_locale name not valid

Now here’s the odd thing: The error message is actually helpful! It is a locale problem:

$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_IL
LC_CTYPE="en_IL"
LC_NUMERIC="en_IL"
LC_TIME="en_IL"
LC_COLLATE="en_IL"
LC_MONETARY="en_IL"
LC_MESSAGES="en_IL"
LC_PAPER="en_IL"
LC_NAME="en_IL"
LC_ADDRESS="en_IL"
LC_TELEPHONE="en_IL"
LC_MEASUREMENT="en_IL"
LC_IDENTIFICATION="en_IL"
LC_ALL=

Checking on a non-ssh terminal, all read “en_US.UTF-8″ instead. The problem seems to be that the SSH is from a newer Linux distro to an older one. “en_IL” is indeed the locale on the newer machine, which is OK there. And SSH changed the locale (which I believe one can avoid, but it’s not worth the effort given the simple workaround below).

So the fix is surprisingly simple:

$ export LC_ALL=en_US.UTF-8

and then check again:

$ locale
LANG=en_IL
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

OK, so LANG is still rubbish, but after this Vivado 2017.3′s GUI is up and running.

I should mention that there was no problem starting Vivado 2014.4′s GUI, but instead it crashed somewhere in the middle of the implementation. Once again, the fixing the locale solved this.

Xilinx’ Zynq Z007s: Is it really single core?

Introduction

Xilinx’ documentation says that XC7Z007S, among other “S” devices, is a single-core device, as opposed to, for example, its older brother XC7Z010, which is dual-core. So I compared several aspects of the PS part of a Z007S vs. Z010, and to my astonishment, I found that Z007S is exactly the same: Two CPUs are reported by the hardware itself, SMP is kicked off on both, and a simple performance test I made showed that Z007S runs two processes in parallel as fast as Z010.

So the question is: In what sense is XC7Z007S single-core? For now, I have no answer to that. I’ll update this post as soon as someone manages to explain this to me. In the meanwhile, I’ve tried to get this figured out in Xilinx’ forum.

The rest of this post outlines the various similarities between the Z007S vs. Z010 I tested. The PL bitfiles of different Zynq devices are incompatible, so there’s no chance I mistook which devices I worked with.

The tests below were made with Xillinux-2.0 (kernel v4.4) on two Z-turn Lite boards, one carrying Z007S, and one Z010.

Found 2 CPUs?

I started wondering when the kernel’s dmesg log indicated that it had found 2 CPUs on a Z007S:

[    0.132523] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.132586] Setting up static identity map for 0x82c0 - 0x82f4
[    0.310962] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.311065] Brought up 2 CPUs
[    0.311102] SMP: Total of 2 processors activated (2664.03 BogoMIPS).
[    0.311121] CPU: All CPU(s) started in SVC mode.

Also, /proc/cpuinfo consistently listed two CPUs. One could think that it’s because two CPUs are declared in the device tree, but removing one of them makes no difference.

On Z010, the exact same log and appears in this matter, and /proc/cpuinfo says the same.

CPU’s hardware register reporting two CPUs

According to the Zynq-7000 AP SoC Technical Reference Manual (UG585), the processor’s SCU_CONFIGURATION_REGISTER indicates the number of CPUs present in the Cortex-A9 MPCore processor in bits 1:0. Binary ’01′ means two Cortex-A9 processors, CPU0 and CPU1. Binary ’00′ means one Cortex-A9 processor, CPU0.

Using Xillinux-2.0′s poke kernel utility to read the processor’s SCU_CONFIGURATION_REGISTER register, I got exactly the same result on Z007S and Z010:

poke read addr=f8f00004: value=00000511

In other words, both devices report two processors.

I’m under the impression that the kernel uses this register to tell the number of CPUs by virtue of the scu_get_core_count() (defined in arch/arm/kernel/smp_scu.c) function, called by zynq_smp_init_cpus() in arch/arm/mach-zynq/platsmp.c.

The latter function sets the kernel’s “CPU possible” bits, so it’s how the Zynq-specific kernel setup code tells the kernel framework which CPUs indexes are good for use.

A simple benchmark test

The proof is in the pudding. I wrote a simple program, which forks into two processes, each running a certain amount of plain CPU-intensive operations, and then quits. The output of this program is of no interest, but it’s printed out to avoid the compiler from optimizing away the crunching. Its listing is given at the end of this post for reference.

Using the “time” utility to measure the execution times, I ran the program on Z007S and Z010, and consistently got the same results, except for slight fluctuations:

# time ./work 400
Parent process done with LSR at e89c4641
Child process quitting with LSR at e89c4641
Parent process quitting

real	0m3.604s
user	0m7.030s
sys	0m0.010s

The 3.6 seconds given as “real” is the wall clock time. The 7 seconds of “user” time is the amount of consumed CPU. And as one would expect from a program that runs on two processes on a dual core machine, the consumed CPU time is approximately double the wall clock time. This is the result I expected from Z010, but not from Z007S.

Just to be sure I wasn’t being silly, I booted the kernel with “nosmp” in the kernel command line, which forced a single-CPU bringup. Indeed, the kernel reported finding one CPU in its logs, and /proc/cpuinfo reflected that as well.

And the pudding?

# time ./work 400
Parent process done with LSR at e89c4641
Child process quitting with LSR at e89c4641
Parent process quitting

real	0m6.998s
user	0m6.970s
sys	0m0.010s

Exactly as expected: With one processor, forking into two processes has no advantage. The CPU time is the wall clock time. I waited twice as long for it to finish.

At some point I suspected that the specific Linux version I used had a specific scheduler issue, which allowed a single-core CPU to perform as well as a dual-core. However the dual-core results were repeated on a Zybo board with three completely different kernels (except Xillinux-2.0) and yielded the same results (or slightly worse, with older kernels).

Conclusion

Given the results above, it’s not clear why Z007S is labeled as a single-core device. It’s not a matter of how it quacks or walks, but in the end, the device performs twice as fast when the work is split into two processes.

Or I missed something here. Kindly comment below if you found my mistake.

———————————–

Appendix: The benchmark program’s listing

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <signal.h>
#include <errno.h>
#include <string.h>
#include <sys/wait.h>

static unsigned int lsr_state;

int main(int argc, char *argv[]) {

  int count, i, j, bit;
  pid_t pid;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s count\n", argv[0]);
    exit(1);
  }

  count = atoi(argv[1]);

  lsr_state = 1;

  pid = fork();

  if (pid < 0) {
    perror("Failed to fork");
    exit(1);
  }

  for (i=0; i<count; i++)
    for (j=0; j<(1<<20); j++) {
      bit = ((lsr_state >> 19) ^ (lsr_state >> 2)) & 0x01;

      lsr_state = (lsr_state << 1) | bit;

      if (lsr_state == 0) {
	fprintf(stderr, "Huh? The LSR state is zero!\n");
	exit(1);
      }
    }

  if (pid == 0) {
    fprintf(stderr, "Child process quitting with LSR at %x\n", lsr_state);
    return 0;
  }

  fprintf(stderr, "Parent process done with LSR at %x\n", lsr_state);

  pid = wait(&i);

  fprintf(stderr, "Parent process quitting\n");

  return 0;
}

saved as work.c, compiled with

# gcc -O3 -Wall work.c -o work

directly on the Zynq board itself (Xillinux comes with a native gcc compiler). But cross compilation should make no difference.

Soft Linux kernel hacking for dumping ULPI commands to USB PHY

Ever wanted to see how the a Linux USB host talks with its PHY with ULPI commands? Probably not. But if you do, here’s how I did it on a Zynq device, connected to an USB3320 USB 2.0 PHY chip. Note that:

  • The relevant sources must be compiled into the kernel. Modules are loaded too late. The choice of PHY frontend is made when the USB driver is initialized, and if the relevant driver isn’t handy, a generic PHY is picked instead…
  • … which is most likely as good. In retrospect, there’s is very little reason to load the actual driver.
  • In particular, my system works great without the dedicated USB PHY driver.

So it’s about adding a plain pr_info() into the kernel’s drivers/usb/phy/phy-ulpi-viewport.c, so it prints every ULPI register write command to the kernel log. Added code marked in red:

static int ulpi_viewport_write(struct usb_phy *otg, u32 val, u32 reg)
{
	int ret;
	void __iomem *view = otg->io_priv;

	pr_info("ulpi_viewport_write: reg 0x%04x = 0x%02x\n",
		reg, val);

	writel(ULPI_VIEW_WAKEUP | ULPI_VIEW_WRITE, view);
	ret = ulpi_viewport_wait(view, ULPI_VIEW_WAKEUP);
	if (ret)
		return ret;

	writel(ULPI_VIEW_RUN | ULPI_VIEW_WRITE | ULPI_VIEW_DATA_WRITE(val) |
						 ULPI_VIEW_ADDR(reg), view);

	return ulpi_viewport_wait(view, ULPI_VIEW_RUN);
}

And that’s it. One can also cover the ulpi_viewport_read() method in the same way, but it wasn’t important to me (I wanted to the powering on of Vbus).

The relevant part in my device tree read:

	usb_phy0: phy0 {
		compatible = "ulpi-phy";
		#phy-cells = <0>;
		reg = <0xe0002000 0x1000>;
		view-port = <0x0170>;
		drv-vbus;
	};

	usb0: usb@e0002000 {
		compatible = "xlnx,zynq-usb-2.20a", "chipidea,usb2";
		clocks = <&clkc 28>;
		interrupt-parent = <&ps7_scugic_0>;
		interrupts = <0 21 4>;
		reg = <0xe0002000 0x1000>;
		phy_type = "ulpi";
		dr_mode = "host";
		usb-phy = <&usb_phy0>;
	};

And this is what I got in the dmesg log:

[    1.396317] ulpi_phy_probe() invoked
[    1.399968] ulpi_phy_probe() returns successfully
[    1.405148] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    1.418505] ehci-pci: EHCI PCI platform driver
[    1.429765] ehci-platform: EHCI generic platform driver
[    1.441924] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    1.454951] ohci-pci: OHCI PCI platform driver
[    1.466237] ohci-platform: OHCI generic platform driver
[    1.478504] uhci_hcd: USB Universal Host Controller Interface driver
[    1.492047] usbcore: registered new interface driver usb-storage
[    1.505250] chipidea-usb2 e0002000.usb: ci_hdrc_usb2_probe invoked
[    1.518410] e0002000.usb supply vbus not found, using dummy regulator
[    1.532049] ci_hdrc ci_hdrc.0: ChipIdea HDRC found, revision: 22, lpm: 0; cap: e0d0a100 op: e0d0a140
[    1.532062] ulpi_init() invoked
[    1.542033] ULPI transceiver vendor/product ID 0x0424/0x0007
[    1.554611] Found SMSC USB3320 ULPI transceiver.
[    1.566118] ulpi_viewport_write: reg 0x0016 = 0x55
[    1.577825] ulpi_viewport_write: reg 0x0016 = 0xaa
[    1.589383] ULPI integrity check: passed.
[    1.600057] ulpi_viewport_write: reg 0x000a = 0x06
[    1.611516] ulpi_viewport_write: reg 0x0007 = 0x00
[    1.622872] ulpi_viewport_write: reg 0x0004 = 0x41
[    1.634203] ci_hdrc ci_hdrc.0: It is OTG capable controller
[    1.634233] ci_hdrc ci_hdrc.0: EHCI Host Controller
[    1.645628] ci_hdrc ci_hdrc.0: new USB bus registered, assigned bus number 1
[    1.672482] ci_hdrc ci_hdrc.0: USB 2.0 started, EHCI 1.00
[    1.684475] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    1.697770] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.711521] usb usb1: Product: EHCI Host Controller
[    1.722832] usb usb1: Manufacturer: Linux 4.4.30-xillinux-2.0 ehci_hcd
[    1.735797] usb usb1: SerialNumber: ci_hdrc.0
[    1.747236] hub 1-0:1.0: USB hub found
[    1.757421] hub 1-0:1.0: 1 port detected
[    1.767881] ulpi_viewport_write: reg 0x000a = 0x67

The log entries in green above are just some other similar debug outputs I made, and they pretty much explain themselves.

Did you note that the ULPI was detected by vendor ID / product ID? It’s for real. These were obtained by ULPI registers read (not shown above). I’m not all that convinced that this detection made any difference, except for printing out the name of the device.

As for the meaning of these ulpi_viewport_write dumps, most is pretty boring: The first two writes to address 0x16 do nothing. It’s a scratch pad register. Most likely used by the driver to test the ULPI interface.

The following three writes just assign the default values. So this does effectively nothing as well.

The last write to register 0x0a (OTG register) sets bits 6, 5 and 0, which are DrvVbusExternal, DrvVbus and IdPullup. The interesting part to me was DrvVbusExternal and DrvVbus, because setting any of these two (or both) causes the chip’s CPEN pin to go high, which turns on the power supply for Vbus. This is the point where the USB port starts behaving like a host and feeds power.

Quartus: The importance of derive_pll_clocks in the SDC file

Introduction

Whenever a PLL is used in a design to generate one clock from another, it’s quite common to expect the timing tools to figure out the frequencies and timing relations between the different clocks.

With Intel’s Quartus tools, this isn’t the case by default. A derive_pll_clocks command is required in the SDC constraints file for this happen. And indeed, this command appears in virtually any SDC file that is generated automatically by the tools.

But here’s the scary thing: If derive_pll_clocks is omitted, one would expect that the PLL’s output clocks would not be timed at all, and that the relevant paths would be listed as unconstrained. Unfortunately, it’s different: As shown below, timing calculations are made for these paths, but with wrong figures. So one might get the impression that the timing constraints were met and all is fine, but in fact nothing is assured.

An example

Let’s say that the FPGA has an oscillator input of 48 MHz (hence a period of 20.833 ns), from which a PLL generates a 240 MHz clock (with a period of 4.166 ns).

First let’s take a simple, properly written, SDC file going:

create_clock -name root_clk -period 20.833 [get_ports {osc_clock}]

derive_pll_clocks
derive_clock_uncertainty

Note that the derive_pll_clocks command is there.

Now let’s look at the timing report for a path between two registers, which are clocked by the derived clock. The only interesting part is marked with red:

+-------------------------------------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                                               ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; Total   ; Incr     ; RF ; Type ; Fanout ; Location               ; Element                                                                      ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; 0.000   ; 0.000    ;    ;      ;        ;                        ; launch edge time                                                             ;
; 4.937   ; 4.937    ;    ;      ;        ;                        ; clock path                                                                   ;
;   0.000 ;   0.000  ;    ;      ;        ;                        ; source latency                                                               ;
;   0.000 ;   0.000  ;    ;      ; 1      ; PIN_B12                ; osc_clock                                                                    ;
;   0.000 ;   0.000  ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|i                                                            ;
;   0.667 ;   0.667  ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|o                                                            ;
;   2.833 ;   2.166  ; RR ; IC   ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|inclk[0]                     ;
;   1.119 ;   -1.714 ; RR ; COMP ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|observablevcoout             ;
;   1.119 ;   0.000  ; RR ; CELL ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   3.274 ;   2.155  ; RR ; IC   ; 1      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   3.274 ;   0.000  ; RR ; CELL ; 8      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   4.336 ;   1.062  ; RR ; IC   ; 1      ; FF_X40_Y24_N27         ; clkrst_ins|main_state[0]|clk                                                 ;
;   4.937 ;   0.601  ; RR ; CELL ; 1      ; FF_X40_Y24_N27         ; clkrst:clkrst_ins|main_state[0]                                              ;
; 6.774   ; 1.837    ;    ;      ;        ;                        ; data path                                                                    ;
;   5.169 ;   0.232  ;    ; uTco ; 1      ; FF_X40_Y24_N27         ; clkrst:clkrst_ins|main_state[0]                                              ;
;   5.169 ;   0.000  ; FF ; CELL ; 5      ; FF_X40_Y24_N27         ; clkrst_ins|main_state[0]|q                                                   ;
;   5.591 ;   0.422  ; FF ; IC   ; 1      ; LCCOMB_X40_Y24_N24     ; clkrst_ins|Equal1~0|dataa                                                    ;
;   6.002 ;   0.411  ; FR ; CELL ; 1      ; LCCOMB_X40_Y24_N24     ; clkrst_ins|Equal1~0|combout                                                  ;
;   6.370 ;   0.368  ; RR ; IC   ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst_ins|the_register|d                                                    ;
;   6.774 ;   0.404  ; RR ; CELL ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst:clkrst_ins|the_register                                               ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------+
; Data Required Path                                                                                                                              ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; Total   ; Incr     ; RF ; Type ; Fanout ; Location               ; Element                                                                      ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; 4.166   ; 4.166    ;    ;      ;        ;                        ; latch edge time                                                              ;
; 9.005   ; 4.839    ;    ;      ;        ;                        ; clock path                                                                   ;
;   4.166 ;   0.000  ;    ;      ;        ;                        ; source latency                                                               ;
;   4.166 ;   0.000  ;    ;      ; 1      ; PIN_B12                ; osc_clock                                                                    ;
;   4.166 ;   0.000  ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|i                                                            ;
;   4.833 ;   0.667  ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|o                                                            ;
;   6.912 ;   2.079  ; RR ; IC   ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|inclk[0]                     ;
;   5.119 ;   -1.793 ; RR ; COMP ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|observablevcoout             ;
;   5.119 ;   0.000  ; RR ; CELL ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   7.187 ;   2.068  ; RR ; IC   ; 1      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   7.187 ;   0.000  ; RR ; CELL ; 8      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   8.199 ;   1.012  ; RR ; IC   ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst_ins|the_register|clk                                                  ;
;   8.736 ;   0.537  ; RR ; CELL ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst:clkrst_ins|the_register                                               ;
;   9.005 ;   0.269  ;    ;      ;        ;                        ; clock pessimism removed                                                      ;
; 8.985   ; -0.020   ;    ;      ;        ;                        ; clock uncertainty                                                            ;
; 8.890   ; -0.095   ;    ; uTsu ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst:clkrst_ins|the_register                                               ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+

Aside from all the mumbo-jumbo, there’s the “latch edge time” line, which is the time of the edge of the clock that will propagate through the clock network and become the latching clock on the receiving register. As this is the case of a plain register-to-register path, both clocked with the rising edge of the same clock, the “latch edge time” is simply the clock’s period. Indeed 4.166 ns. So far so good.

But then disaster

Let’s see what happens if the derive_pll_clocks command is omitted. In other words, the SDC file reads:

create_clock -name root_clk -period 20.833 [get_ports {osc_clock}]

derive_clock_uncertainty

For exactly the same path, the timing report reads:

+-------------------------------------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                                               ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; Total   ; Incr     ; RF ; Type ; Fanout ; Location               ; Element                                                                      ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; 0.000   ; 0.000    ;    ;      ;        ;                        ; launch edge time                                                             ;
; 4.937   ; 4.937    ;    ;      ;        ;                        ; clock path                                                                   ;
;   0.000 ;   0.000  ;    ;      ;        ;                        ; source latency                                                               ;
;   0.000 ;   0.000  ;    ;      ; 1      ; PIN_B12                ; osc_clock                                                                    ;
;   0.000 ;   0.000  ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|i                                                            ;
;   0.667 ;   0.667  ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|o                                                            ;
;   2.833 ;   2.166  ; RR ; IC   ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|inclk[0]                     ;
;   1.119 ;   -1.714 ; RR ; COMP ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|observablevcoout             ;
;   1.119 ;   0.000  ; RR ; CELL ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   3.274 ;   2.155  ; RR ; IC   ; 1      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   3.274 ;   0.000  ; RR ; CELL ; 8      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   4.336 ;   1.062  ; RR ; IC   ; 1      ; FF_X40_Y24_N19         ; clkrst_ins|main_state[0]|clk                                                 ;
;   4.937 ;   0.601  ; RR ; CELL ; 1      ; FF_X40_Y24_N19         ; clkrst:clkrst_ins|main_state[0]                                              ;
; 6.765   ; 1.828    ;    ;      ;        ;                        ; data path                                                                    ;
;   5.169 ;   0.232  ;    ; uTco ; 1      ; FF_X40_Y24_N19         ; clkrst:clkrst_ins|main_state[0]                                              ;
;   5.169 ;   0.000  ; FF ; CELL ; 5      ; FF_X40_Y24_N19         ; clkrst_ins|main_state[0]|q                                                   ;
;   5.583 ;   0.414  ; FF ; IC   ; 1      ; LCCOMB_X40_Y24_N24     ; clkrst_ins|Equal1~0|datab                                                    ;
;   5.994 ;   0.411  ; FR ; CELL ; 1      ; LCCOMB_X40_Y24_N24     ; clkrst_ins|Equal1~0|combout                                                  ;
;   6.361 ;   0.367  ; RR ; IC   ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst_ins|the_register|d                                                    ;
;   6.765 ;   0.404  ; RR ; CELL ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst:clkrst_ins|the_register                                               ;
+---------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+

+--------------------------------------------------------------------------------------------------------------------------------------------------+
; Data Required Path                                                                                                                               ;
+----------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; Total    ; Incr     ; RF ; Type ; Fanout ; Location               ; Element                                                                      ;
+----------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+
; 20.833   ; 20.833   ;    ;      ;        ;                        ; latch edge time                                                              ;
; 25.672   ; 4.839    ;    ;      ;        ;                        ; clock path                                                                   ;
;   20.833 ;   0.000  ;    ;      ;        ;                        ; source latency                                                               ;
;   20.833 ;   0.000  ;    ;      ; 1      ; PIN_B12                ; osc_clock                                                                    ;
;   20.833 ;   0.000  ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|i                                                            ;
;   21.500 ;   0.667  ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8      ; osc_clock~input|o                                                            ;
;   23.579 ;   2.079  ; RR ; IC   ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|inclk[0]                     ;
;   21.786 ;   -1.793 ; RR ; COMP ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|observablevcoout             ;
;   21.786 ;   0.000  ; RR ; CELL ; 1      ; PLL_3                  ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   23.855 ;   2.069  ; RR ; IC   ; 1      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   23.855 ;   0.000  ; RR ; CELL ; 8      ; CLKCTRL_G13            ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   24.867 ;   1.012  ; RR ; IC   ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst_ins|the_register|clk                                                  ;
;   25.404 ;   0.537  ; RR ; CELL ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst:clkrst_ins|the_register                                               ;
;   25.672 ;   0.268  ;    ;      ;        ;                        ; clock pessimism removed                                                      ;
; 25.572   ; -0.100   ;    ;      ;        ;                        ; clock uncertainty                                                            ;
; 25.477   ; -0.095   ;    ; uTsu ; 1      ; DDIOOUTCELL_X41_Y24_N4 ; clkrst:clkrst_ins|the_register                                               ;
+----------+----------+----+------+--------+------------------------+------------------------------------------------------------------------------+

So it’s exactly the same analysis, only assuming that the clock period is 20.833 ns (note the “latch edge time” again). Note that the analysis traverses the PLL, but simply ignores the fact that the PLL’s output has another frequency. It’s as if the tools were saying: You forgot the derive_pll_clocks constraint? No problem. We’ll play as if the PLL’s input clock went right through it.

Frankly, I can’t think about a single case where this behavior would make sense. Either don’t calculate the timing of paths of the derived clock, or do it correctly. But just throwing in the original clock’s period? These incorrectly constrained paths don’t appear in the unconstrained path summary, nor is there any other indication that the timing is horribly wrong.

To the tools’ defense, the timing analysis produces warnings on this matter, but none at a Critical level, so it’s easy to miss them in the sea of warnings that FPGA tools always generate.

Bottom line

  • Make sure your design has the derive_pll_clocks command if you have a PLL involved (unless you’ve added explicit constraints for the derived clocks)
  • Be sure to generate a timing report, to read and understand it.
  • Always test your constraints by requiring impossible values, and verify that the failing paths are calculated correctly

Nvidia graphics cards on Linux: PCIe link speed and width

Why is it at 2.5 GT/s???

With all said about Nvidia’s refusal to release their drivers as open source, their Linux support is great. I don’t think I’ve ever had such a flawless graphics card experience with Linux. After replacing the nouveau driver with Nvidia’s, of course. Ideology is nice, but a computer that works is nicer.

But then I looked at the output of lspci -vv (on an Asus fanless GT 730 2GB DDR3), and horrors, it’s not running at full PCIe speed!

17:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 730] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GK208B [GeForce GT 730]
[ ... ]
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[ ... ]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Whatwhat? The card declares it supports 5 GT/s, but runs only at 2.5 GT/s? And on my brand new super-duper motherboard, which supports Gen3 PCIe connected directly to an Intel X-family CPU?

It’s all under control

Well, the answer is surprisingly simple: Nvidia’s driver changes the card’s PCIe speed dynamically to support the bandwidth needed. When there’s no graphics activity, the speed drops to 2.5 GT/s.

This behavior can be controlled with Nvidia’s X Server Settings control panel (it has an icon in the system’s setting panel, or just type “Nvidia” on Gnome’s start menu). Under the PowerMizer sub-menu, the card’s behavior can be changed to stay at 5 GT/s if you like your card hot and electricity bill fat.

Otherwise, in “Adaptive mode” it switches back and forth from 2.5 GT/s to 5 GT/s. The screenshot below was taken after a few seconds of idling (click to enlarge):

Screenshot of Nvidia X Server settings in adaptive mode

And this is how to force it to 5 GT/s constantly (click to enlarge):

Screenshot of Nvidia X Server settings in maximum performance mode

With the latter setting, lspci -vv shows that the card is at 5 GT/s, as promised:

17:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 730] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GK208B [GeForce GT 730]
[ ... ]
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

So don’t worry about a low speed on an Nvidia card (or make sure it steps up on request).

A word on GT 1030

I added another fanless card, Asus GT 1030 2GB, to the computer for some experiments. This card is somewhat harder to catch at 2.5 GT/s, because it steps up very quickly in response to any graphics event. But I managed to catch this:

65:00.0 VGA compatible controller: NVIDIA Corporation GP108 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GP108 [GeForce GT 1030]
[ ... ]
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

The running 2.5 GT/s speed vs. the maximal 8 GT/s is pretty clear by now, but the declared maximal Width is 4x? If so, why does it have a 16x PCIe form factor? The GT 730 has an 8x form factor, and uses 8x lanes, but GT 1030 has 16x and declares it can only use 4x? Is this some kind of marketing thing to make the card look larger and stronger?

On the other hand, show me a fairly recent motherboard without a 16x PCIe slot. The thing is that sometimes that slot can be used for something else, and the graphics card could then have gone into a vacant 4x slot instead. But no. Let’s make it big and impressive with a long PCIe plug that makes it look massive. Personally, I find the gigantic heatsink impressive enough.

PCIe on Cyclone 10 GX: Data loss on DMA writes by FPGA

TL;DR

DMA writes from a Cyclone 10 GX PCIe interface may be lost, probably due to a path that isn’t timed properly by the fitter. This has been observed with Quartus Prime Version 17.1.0 Build 240 SJ Pro Edition, and the official Cyclone 10 GX development board. A wider impact is likely, possibly on Arria 10 device as well (as its PCIe block is the same one).

The problem seems to be rare, and appears and disappears depending on how the fitter places the logic. It’s however fairly easy to diagnose if this specific problem is in effect (see “The smoking gun” below).

Computer hardware: Gigabyte GA-B150M-D2V motherboard (with an Intel B150 Chipset) + Intel i5-6400 CPU.

The story

It started with a routine data transport test (FPGA to host), which failed virtually immediately (that is, after a few kilobytes). It was apparent that some portions of data simply weren’t written into the DMA buffer by the FPGA.

So I tried a fix in my own code, and yep, it helped. Or so I thought. Actually, anything I changed seemed to fix the problem. In the end, I changed nothing, but just added

set_global_assignment -name SEED 2

to the QSF file. Which only changes the fitter’s initial placement of the logic elements, which eventually leads to an alternative placement and routing of the design. That should work exactly the same, of course. But it “solved the problem”.

This was consistent: One “magic” build that failed consistently, and any change whatsoever made the issue disappear.

The design was properly constrained, of course, as shown in the development board’s sample SDC file. In fact, there isn’t much to constrain: It’s just setting the main clock to 100 MHz, derive_pll_clocks and derive_clock_uncertainty. And a false path from the PERST pin.

So maybe my bad? Well, no. There were no unconstrained paths in the entire design (with these simple constraints), so one fitting of the design should be exactly like any other. Maybe my application logic? No again:

The smoking gun

The final nail in the coffin was when I noted errors in the PCIe Device Status Registers on both sides. I’ve discussed this topic in this and this other posts of mine, however in the current case no AER kernel messages were produced (unfortunately, and it’s not clear why).

And whatever the application code does, Intel / Altera’s PCIe block shouldn’t produce a link error, and neither it does normally. It’s a violation of the PCIe spec.

These are the steps for observing this issue on a Linux machine. First, find out who the link partners are:

$ lspci
00:00.0 Host bridge: Intel Corporation Device 191f (rev 07)
00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07)
[ ... ]
01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb

and then figuring out that the FPGA card is connected via the bridge at 00:01.0 with

$ lspci -t
-[0000:00]-+-00.0
           +-01.0-[01]----00.0

So it’s between 00:01.0 and 01:00.0. Then, following that post of mine, using setpci to read from the status register to tell an error had occurred.

First, what it should look like: With any bitstream except that specific faulty one, I got

# setpci -s 01:00.0 CAP_EXP+0xa.w
0000
# setpci -s 00:01.0 CAP_EXP+0xa.w
0000

any time and all the time, which says the obvious: No errors sensed on either side.

But with the bitstream that had data losses, before any communication had taken place (except for the driver being loaded):

# setpci -s 01:00.0 CAP_EXP+0xa.w
0009
# setpci -s 00:01.0 CAP_EXP+0xa.w
0000

Non-zero means error. So at this stage the FPGA’s PCIe interface was unhappy with something (more on that below), but the processor’s side had no complaints.

I have to admit that I’ve seen the 0009 status in a lot of other tests, in which communication went through perfectly. So even though reflects some kind of error, it doesn’t necessarily predict any functional fault. As elaborated below, the 0009 status consists of correctable errors. It’s just that such errors are normally never seen (i.e. with any PCIe card that works properly).

Anyhow, back to the bitstream that did have data errors. After some data had been written by the FPGA:

# setpci -s 01:00.0 CAP_EXP+0xa.w
0009
root@diskless:/home/eli# setpci -s 00:01.0 CAP_EXP+0xa.w
000a

In this case, the FPGA card’s link partner complained. To save ourselves the meaning of these numbers (even though the’re listed in that post), use lspci -vv:

# lspci -vv
00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07) (prog-if 00 [Normal decode])
[ ... ]
        Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
[ ... ]

So the bridge complained about an uncorrectable and an unsupported request only after the data transmission, but the FPGA side:

01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb
[ ... ]
        Capabilities: [80] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-

complained about a correctable error and an unsupported request (as seen above, that happened before any payload transmission).

Low-level errors. I couldn’t make this happen even if I wanted to.

Aftermath

The really bad news is that this problem isn’t in the logic itself, but in how it’s placed. It seems to be a rare and random occurrence of a poor job done by the fitter. Or maybe it’s not all that rare, if you let the FPGA heat up a bit. In my case a spinning fan kept an almost idle FPGA quite cool, I suppose.

The somewhat good news is that the data loss comes with these PCIe status errors, and maybe with the relevant kernel messages (not clear why I didn’t see any). So there’s something to hold on to.

And I should also mention that the offending PCIe interface was a Gen2 x 4 running with a 64-bit interface at 250 MHz. which a rather marginal frequency for Arria 10 / Cyclone 10. So going with the speculation that this is a timing issue that isn’t handled properly by the fitter, maybe sticking to 125 MHz interfaces on these devices is good enough to be safe against this issue.

Note to self: The outputs are kept in cyclone10-failure.tar.gz

Quartus / Linux: Programming the FPGA with command-line

Command-line?

Yes, it much more convenient than the GUI programmer. Programming an FPGA is a repeated task, always the same file to the same FPGA on the same board connected to the computer. And somehow the GUI programming tools turn it into a daunting ceremony (and sometimes even a quiz, when it can’t tell exactly which device is connected, so I’m supposed to nail the exact one).

With command line its literally picking the command from bash history, and press Enter. And surprisingly enough, the command line tool doesn’t ask the silly questions that the GUI tool does.

First, some mucking about

Set up the environment:

$ /path/to/quartus/15.1/nios2eds/nios2_command_shell.sh

To list all devices found (cable auto-detected):

$ quartus_pgm --auto
Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 15.1.0 Build 185 10/21/2015 SJ Lite Edition
    Info: Copyright (C) 1991-2015 Altera Corporation. All rights reserved.
[ ... ]
    Info: agreement for further details.
    Info: Processing started: Sun May 27 15:06:22 2018
Info: Command: quartus_pgm --auto
Info (213045): Using programming cable "USB-BlasterII [2-5.1]"
1) USB-BlasterII [2-5.1]
  02B040DD   5CGTFD9(A5|C5|D5|E5)/..
  020A40DD   5M2210Z/EPM2210

[ ... ]

Note that listing the devices as shown above is not necessary for programming. It might be useful to tell the position of the FPGA in the JTAG chain, maybe. Really something that is done once to explore the board.

Programming

quartus_pgm displays most of its output in green. Generally speaking, if there’s no red text, all went fine.

$ quartus_pgm -m jtag -o "p;path/to/file.sof"

Or add the position in the JTAG explicitly (in particular if it’s not the first device). In this case it’s @1, meaning it’s the first device in the JTAG chain. If it’s the second device, pick @2 etc.

$ quartus_pgm -m jtag -o "p;path/to/file.sof@1"
Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 15.1.0 Build 185 10/21/2015 SJ Lite Edition
    Info: Copyright (C) 1991-2015 Altera Corporation. All rights reserved.
    Info: Your use of Altera Corporation's design tools, logic functions
    Info: and other software and tools, and its AMPP partner logic
    Info: functions, and any output files from any of the foregoing
    Info: (including device programming or simulation files), and any
    Info: associated documentation or information are expressly subject
    Info: to the terms and conditions of the Altera Program License
    Info: Subscription Agreement, the Altera Quartus Prime License Agreement,
    Info: the Altera MegaCore Function License Agreement, or other
    Info: applicable license agreement, including, without limitation,
    Info: that your use is for the sole purpose of programming logic
    Info: devices manufactured by Altera and sold by Altera or its
    Info: authorized distributors.  Please refer to the applicable
    Info: agreement for further details.
    Info: Processing started: Sun May 27 15:35:02 2018
Info: Command: quartus_pgm -m jtag -o p;path/to/file.sof@1
Info (213045): Using programming cable "USB-BlasterII [2-5.1]"
Info (213011): Using programming file p;path/to/file.sof@1 with checksum 0x061958E1 for device 5CGTFD9E5F35@1
Info (209060): Started Programmer operation at Sun May 27 15:35:05 2018
Info (209016): Configuring device index 1
Info (209017): Device 1 contains JTAG ID code 0x02B040DD
Info (209007): Configuration succeeded -- 1 device(s) configured
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sun May 27 15:35:09 2018
Info: Quartus Prime Programmer was successful. 0 errors, 0 warnings
    Info: Peak virtual memory: 432 megabytes
    Info: Processing ended: Sun May 27 15:35:09 2018
    Info: Elapsed time: 00:00:07
    Info: Total CPU time (on all processors): 00:00:03

If anything goes wrong — device mismatch, a failure to scan the JTAG chain or whatever, it will be hard to miss because of the errors written in red. The sweet thing with the command line interface is that every attempt starts from fresh, so just turn the board on (the usual reason for errors) and give it another go.

Quartus / sdc: Constraining I/O ports clocked by an internal clock

Introduction

This post is an expansion for another post of mine, which deals with register I/O packing. It’s recommended reading that one first.

Timing constraining of I/O ports is typically intended to ensure timing relations between an external clock and the timing of signals that are clocked by this clock (or derived from this clock, with the same frequency or a simple relation to it).

However in some cases the clock of the I/O registers is generated with an PLL within the FPGA, and is practically unrelated to the originating clock. There are still good reasons to constrain the timing of such ports, among others:

  • Even though the external clock source isn’t involved directly, the timing must still be under control. In particular, when the interface with an external device is bidirectional, the timing of the signals arriving from the device depend on those being generated by the FPGA going to it. Constraining the ports is part of ensuring that this timing loop is fast enough.
  • Ensuring that I/O registers are used. Tight constraints, which can only be met with I/O registers will fail if the tools don’t pack those registers as desired.
  • Ensuring that no delay is inserted by the tools between the input pad and the register.

Clearly, nobody at Altera thought that this kind of constraining was necessary. Consequently, getting this done in a fairly clean manner is nontrivial, to say the least (yours truly wasted a full week of work to figure this out). This post suggests a methodology which is hopefully the clean enough. This is the best I managed to work out, anyhow.

The relevant documentation:

The Intel / Altera toolset used is Quartus Prime 15.1 (Web Edition).

The goal

The naïve approach is to apply set_input_delay and set_output_delay to the output ports as usual, using the clock from the PLL in the -clock argument. Even so, the tools interpret this as constraining the timing relative to the external clock, which is inherently pointless, since this relation has no meaning. To make things worse, if the clock frequency relations aren’t a plain ratio, the timing requirements become unreal, as the closest clock edge relations between the two clocks are applied as the worst case. So this doesn’t work at all.

Ideally, we’d like to constrain only the path between the I/O pin and the register connected directly to it. It appears like there’s no way to do that exactly, as Quartus’ Timing Analyzer automatically mixes in the clock’s path delay when there’s a register involved.

So the goal is to define the delay between the external pin and the register that samples its state or vice versa, with as little interference as possible.

A sample set of constraints

This is the sdc file that worked for me. Each part is explained in detail afterwards.

create_clock -name root_clk -period 20.833 [get_ports {osc_clock}]

# The 60 MHz clock is defined on the global clock buffer's output pin:
create_clock -name main_clk -period 16.666 [get_pins {clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk}]

set_clock_groups -asynchronous -group [ get_clocks root_clk ] \
    -group [ get_clocks main_clk ]

set_annotated_delay -from clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk 0

derive_pll_clocks
derive_clock_uncertainty

set_false_path -hold -from [get_ports pixadc_*] -to [get_registers]
set_false_path -hold -from [get_registers] -to [get_ports pixadc_*]

set_max_delay -from [get_registers] -to [get_ports pixadc_*] 2.7
set_max_delay -from [get_ports pixadc_*] -to [get_registers] 0.7

create_clock assignments

In the relevant design, the external clock source runs at 48 MHz (20.833 ns period) and there’s a PLL on the FPGA generating a 60 MHz clock (16.666 ns period) based upon the external clock.

First and somewhat unrelated, note that neither the duty cycle nor the waveform attributes are given in these definitions. If the duty cycle is 50%, don’t add that rubbish. It’s the default anyhow, but is nevertheless added by the automatic constraint generator.

The definition of root_clk is quite standard. But pay attention to the way the derived clock, main_clk is defined. Not only isn’t it given as a derived clock from root_clk (or I could have relied on an automatic derivation made with “derive_pll_clocks”), but it’s assigned to the PLL’s output pin. It’s not a coincidence: That specific “get_pins” format is mandatory in the create_clock definition, or the PLL’s input-to-output delay is included (around 2 ns). For example, even

create_clock -name main_clk -period 16.666 clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]

(which is what the tools would generate automatically with derive_pll_clocks) will include the delay from inclk[0], even though the identifier given in the constraint is the one of the PLL’s output net. This is also the case if the net is referred to with “get_nets”. Only being specific with get_pins on the PLL’s output pins clarifies that the clock delay should start at the output of the PLL.

A peculiar thing is that the fitter issues a warnings like

Warning (332049): Ignored create_clock at test.sdc: Argument <targets> is an empty collection File: ...
 Info (332050): create_clock -name main_clk -period 16.666 [get_pins {clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk}] File: ...

and main_clk does indeed not appear in the list of clocks made by the fitter, even though the constraint’s effect is clear in the timing analysis made by the timing analyzer.

As shown below, the clock path is included in the timing calculation, no matter what. So this is one of the necessities for keeping that interference in the timing calculations minimal.

IMPORTANT: Setting the clock constraint on the PLL’s output pin as shown above, as well as the set_annotated_delay constraint, disrupt the constraining of register-to-register paths slightly, as the original, rigorous timing calculation assumes that the destination register receives the clock slightly earlier than the source register (presumably to represent a worst-case scenario involving global clock skew and other effects). It’s therefore recommended to compare the timing difference between the clock paths with the clock constraint made the classic way, and deduce this difference from the clock’s period, to ensure an accurate calculation. It should be no more than a few hundred picoseconds.

Ah, and another things I tried but lead nowhere: In the Handbook, it says that a virtual clock can be defined with something like

create_clock -name main_clk -period 16.666

In other words, this is a clock that can be mentioned in constraints, but has no signal in the FPGA related to it. My original thought that if this clock doesn’t relate to anything real, it won’t have any global clock delay in its timing calculations.

But trying it, constraining I/O pins with the virtual clock as the -clock argument, no timing calculations were made at all. The timing report ended up with “Nothing to report”. So the virtual clock concept didn’t help.

set_clock_groups

Nothing special about this. Just declaring that root_clk and main_clk should be considered unrelated (all paths between these clock domains are false).

set_annotated_delay

This command tells Timing Analyzer to consider the delay of the global clock buffer as zero. This is yet another necessity to keep unrelated delays out of the calculation.

Together with the definition of main_clk above, which relates the timed clock to the PLL’s output, the delay of the global clock network in the FPGA fabric is left out of the calculations. As shown below, there’s still a clock delay component that is counted in, but it’s presumably the delay of the clock within the logic element.

As the clock’s delay is left out, it doesn’t matter if the PLL is set to compensate for the global clock delay or not; the same timing is achieved either way. One could, by the way, argue that the PLL’s clock timing compensation is an alternative way to minimize the clock path’s role in the timing calculations. My own attempts to go down that road have however led to nothing else than a lot of wasted time. Note that in order to make sense of the PLL’s timing compensation, the commonplace create_clock definition must be used for main_clk, so the PLL’s own delay is included (it’s compensated for further down the road), and this leads to a total lack of control of what’s timed and what is not.

derive_pll_clocks and derive_clock_uncertainty

derive_pll_clocks is applied even though main_clk is defined explicitly with a create_clock constraint, and the latter overrides the clock generated by derive_pll_clocks. But since the create_clock statement for main_clk is ignored by the synthesizer as well as the fitter (because the relevant pin isn’t found), derive_pll_clocks is necessary during these stages to ensure that the relevant paths are timed. In particular, that the fitter makes sure that register-to-register paths meet timing.

If the clock period given in the create_clock constraint is shorter than the one derived from the PLL (which is recommended for reasons mentioned in this post), there might a situation where timing fails because the fitter didn’t attempt to meet a constraint it was blind to. Or at least theoretically. I’ve never encountered anything like this, partly because it’s quite difficult to fail on a 60 MHz clock.

derive_clock_uncertainty is used, as with any proper set of constraints.

set_max_delay

Finally, the delay constraints themselves. set_max_delay is used rather than set_input_delay and set_output_delay, mainly because set_max_delay expresses the element we want to constrain: A segment between the register and a port. As outlined in this other post of mine, set_input_delay and set_output_delay are tailored to allow copying numbers from the counterpart device’s datasheet directly. However if we want to constrain the internal delay with these, the sampling clock’s period needs to be taken into account. So for the purpose of constraining the internal delay, set_input_delay and set_output_delay’s values must be adjusted if the clock’s frequency changes, and that’s an unnecessary headache.

One could have hoped that there would be a way to constrain the plain combinatoric path between a port and a register. It seems however like there’s no way to do this, but that Timing Analyzer is being a bit too helpful: When any (or both) of the endpoints of a set_max_delay constraint is a register, the clock delay path is taken into consideration. In other words, if the source of the delay path is a register, the clock path delay is added to the constrained path to represent the fact that the data toggle from the source register is delayed by this path. Likewise, if the destination of the constraint is a register, the clock path is added to the timing requirement (relaxes the constraint) to represent that the destination register samples its input later.

This holds true no matter of how the register endpoint is given to the constraint command: Quite obviously, if get_regs was used to select the relevant endpoint, the clock path is included in the math. But it’s less obvious, for example, that if the source endpoint was selected with get_pins on the registers’ output pin (e.g. [ get_pins -hierarchical the_sample_reg|d ]), the clock path is still included. Bottom line: No way to avoid having the clock path in the math. This is the reason for the manipulations with create_clock and set_annotated_delay above.

Examples of the timing obtained with set_max_delay are given below.

set_false_path -hold

These set_false_path constraints disable the timing calculation for the registers’ hold requirement (note the -hold flag). Without these two constraints, Timing Analyzer will mark the relevant I/O ports (partly) unconstrained, even if they have related set_max_delay constraints. This has no practical implication except that the “TimeQuest Timing Analyzer” group in the GUI’s compilation report pane is marked red, indicating there’s a timing problem.

The sole purpose of these set_false_path constraints is hence to tell the tools not to bother about the hold paths, avoiding the said red color in the GUI.

As with any set_false_path constraint, care must be taken not to include any unintended paths.

Hold timing is irrelevant for the purpose of ensuring I/O register packing. Neither does it have any significance when timing against the external device, as its hold timing should be ensured by manual timing calculations. As for timing a loop from the FPGA to the device and back this is unnecessary as well: Failing the receiving register’s hold timing in this case requires that the receiver’s hold time is shorter than the clock-to-output (which involves driving a physical pin and its equivalent capacitor) plus the external device’s response time to the toggling signal. So this is by far unrealistic.

One could think that rather than making a false path, a reasonable set_min_delay constraint would do the job. But no: Any set_min_delay, which in turn activates hold time constraining, leads to an “Input Pin to Input Register Delay” as shown in this other post, but for other reasons and with another behavior. In particular, with the constraint setting of this post, this Input Pin delay is added even if that causes a failure of the set_max_delay constraint.

The underlying reason is to compensate for the clock delay: The tools must ensure that the clock arrives to the input register before the data on its input port toggles. Otherwise, the data sampled for the minimal clock case is different from the case of the maximal clock delay (for which the data toggle is obviously after the clock toggle).

Given the delay in the clock path, this forces the tools to insert a delay before the input register that is at least the clock time delay. When clock delay compensation is enabled at the PLL (and the originating clock is external), the PLL is set to create a negative clock delay, hence eliminating the need for this Input Pin delay.

But it gets worse with a clock generated internally: It’s not completely clear why, but even if the clock path is set to zero with the set_annotate_delay statement as said above, the tools keep adding this delay. Also regardless of whether the PLL is set to compensate for the clock delay. One explanation can be found in set_annotated_delay’s help text saying “This assignment is for timing analysis only, and is not considered during timing-driven compilation”. But this still doesn’t explain why it’s inserted even with the clock path compensation of the PLL enabled. So the conclusion is that the tools weren’t really meant to handle this internally generated clock scheme.

Bottom line: Don’t make any set_min_delay constraints on this path, and surely not set_input_delay -min or set_output_delay -min (the latter two will mess up things even worse. Believe me on that).

Timing example: Register to pin (output)

+-------------------------------------------------------------+
; Path Summary                                                ;
+---------------------+---------------------------------------+
; Property            ; Value                                 ;
+---------------------+---------------------------------------+
; From Node           ; video_adc:video_adc_ins|pixadc_clk[1] ;
; To Node             ; pixadc_clk[1]                         ;
; Launch Clock        ; main_clk                              ;
; Latch Clock         ; n/a                                   ;
; Max Delay Exception ; 2.700                                 ;
; Data Arrival Time   ; 2.666                                 ;
; Data Required Time  ; 2.700                                 ;
; Slack               ; 0.034                                 ;
+---------------------+---------------------------------------+

+---------------------------------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                                           ;
+---------+---------+----+------+--------+-----------------------+----------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location              ; Element                                                                    ;
+---------+---------+----+------+--------+-----------------------+----------------------------------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                       ; launch edge time                                                           ;
; 0.559   ; 0.559   ;    ;      ;        ;                       ; clock path                                                                 ;
;   0.000 ;   0.000 ;    ;      ;        ;                       ; source latency                                                             ;
;   0.000 ;   0.000 ;    ;      ; 13     ; CLKCTRL_G13           ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc_ins|pixadc_clk[1]|clk                                            ;
;   0.559 ;   0.559 ; RR ; CELL ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc:video_adc_ins|pixadc_clk[1]                                      ;
; 2.666   ; 2.107   ;    ;      ;        ;                       ; data path                                                                  ;
;   0.771 ;   0.212 ;    ; uTco ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc:video_adc_ins|pixadc_clk[1]                                      ;
;   1.268 ;   0.497 ; RR ; CELL ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc_ins|pixadc_clk[1]|q                                              ;
;   1.268 ;   0.000 ; RR ; IC   ; 2      ; IOOBUF_X0_Y10_N2      ; pixadc_clk[1]~output|i                                                     ;
;   2.666 ;   1.398 ; RR ; CELL ; 1      ; IOOBUF_X0_Y10_N2      ; pixadc_clk[1]~output|o                                                     ;
;   2.666 ;   0.000 ; RR ; CELL ; 0      ; PIN_R2                ; pixadc_clk[1]                                                              ;
+---------+---------+----+------+--------+-----------------------+----------------------------------------------------------------------------+

+-------------------------------------------------------------------------+
; Data Required Path                                                      ;
+---------+---------+----+------+--------+----------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location ; Element             ;
+---------+---------+----+------+--------+----------+---------------------+
; 2.700   ; 2.700   ;    ;      ;        ;          ; latch edge time     ;
; 2.700   ; 0.000   ;    ;      ;        ;          ; clock path          ;
;   2.700 ;   0.000 ; R  ;      ;        ;          ; clock network delay ;
; 2.700   ; 0.000   ; R  ; oExt ; 0      ; PIN_R2   ; pixadc_clk[1]       ;
+---------+---------+----+------+--------+----------+---------------------+

The interconnect delay on the line after location CLKCTRL_G13 is the global clock’s delay, which the set_annotate_delay constraint forces to zero. Without that, it would have read 1.076 ns instead. Together with the create_clock assignment on the output pin, the only part left in the clock path is the 0.559 ns corresponding to the clock’s delay within the register itself (it’s not the clock-to-output, that one follows as uTco).

A regular create_clock declaration would have yielded the following at the beginning of the datapath instead:

+---------+---------+----+------+--------+-----------------------+------------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location              ; Element                                                                      ;
+---------+---------+----+------+--------+-----------------------+------------------------------------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                       ; launch edge time                                                             ;
; 2.716   ; 2.716   ;    ;      ;        ;                       ; clock path                                                                   ;
;   0.000 ;   0.000 ;    ;      ;        ;                       ; source latency                                                               ;
;   0.000 ;   0.000 ;    ;      ; 1      ; PLL_3                 ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   2.157 ;   2.157 ; RR ; IC   ; 1      ; CLKCTRL_G13           ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   2.157 ;   0.000 ; RR ; CELL ; 13     ; CLKCTRL_G13           ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   2.157 ;   0.000 ; RR ; IC   ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc_ins|pixadc_clk[1]|clk                                              ;
;   2.716 ;   0.559 ; RR ; CELL ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc:video_adc_ins|pixadc_clk[1]

The above relates to a PLL without delay compensation.

Timing example: Pin to register (input)

+-----------------------------------------------------------+
; Path Summary                                              ;
+---------------------+-------------------------------------+
; Property            ; Value                               ;
+---------------------+-------------------------------------+
; From Node           ; pixadc_da[2]                        ;
; To Node             ; video_adc:video_adc_ins|samp_reg[2] ;
; Launch Clock        ; n/a                                 ;
; Latch Clock         ; main_clk                            ;
; Max Delay Exception ; 0.700                               ;
; Data Arrival Time   ; 0.992                               ;
; Data Required Time  ; 1.020                               ;
; Slack               ; 0.028                               ;
+---------------------+-------------------------------------+

+-------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                               ;
+---------+---------+----+------+--------+------------------+-------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location         ; Element                             ;
+---------+---------+----+------+--------+------------------+-------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                  ; launch edge time                    ;
; 0.000   ; 0.000   ;    ;      ;        ;                  ; clock path                          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                  ; clock network delay                 ;
; 0.000   ; 0.000   ; R  ; iExt ; 1      ; PIN_W2           ; pixadc_da[2]                        ;
; 0.992   ; 0.992   ;    ;      ;        ;                  ; data path                           ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; IOIBUF_X0_Y7_N15 ; pixadc_da[2]~input|i                ;
;   0.748 ;   0.748 ; RR ; CELL ; 1      ; IOIBUF_X0_Y7_N15 ; pixadc_da[2]~input|o                ;
;   0.748 ;   0.000 ; RR ; IC   ; 1      ; FF_X0_Y7_N17     ; video_adc_ins|samp_reg[2]|d         ;
;   0.992 ;   0.244 ; RR ; CELL ; 1      ; FF_X0_Y7_N17     ; video_adc:video_adc_ins|samp_reg[2] ;
+---------+---------+----+------+--------+------------------+-------------------------------------+

+------------------------------------------------------------------------------------------------------------------------------------+
; Data Required Path                                                                                                                 ;
+---------+---------+----+------+--------+--------------+----------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location     ; Element                                                                    ;
+---------+---------+----+------+--------+--------------+----------------------------------------------------------------------------+
; 0.700   ; 0.700   ;    ;      ;        ;              ; latch edge time                                                            ;
; 1.124   ; 0.424   ;    ;      ;        ;              ; clock path                                                                 ;
;   0.700 ;   0.000 ;    ;      ;        ;              ; source latency                                                             ;
;   0.700 ;   0.000 ;    ;      ; 13     ; CLKCTRL_G13  ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk ;
;   0.700 ;   0.000 ; RR ; IC   ; 1      ; FF_X0_Y7_N17 ; video_adc_ins|samp_reg[2]|clk                                              ;
;   1.124 ;   0.424 ; RR ; CELL ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                        ;
; 1.020   ; -0.104  ;    ; uTsu ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                        ;
+---------+---------+----+------+--------+--------------+----------------------------------------------------------------------------+

First, note the zero time increment marked in green above. It just confirms that no Input Pin delay was inserted by the tools.

Once again, the zero increment in red is the result of the set_annotate_delay constraint. It would have read 1.028 ns otherwise.

And again, a regular create_clock declaration would have yielded the following at the beginning of the datapath instead:

+--------------------------------------------------------------------------------------------------------------------------------------+
; Data Required Path                                                                                                                   ;
+---------+---------+----+------+--------+--------------+------------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location     ; Element                                                                      ;
+---------+---------+----+------+--------+--------------+------------------------------------------------------------------------------+
; 0.700   ; 0.700   ;    ;      ;        ;              ; latch edge time                                                              ;
; 3.194   ; 2.494   ;    ;      ;        ;              ; clock path                                                                   ;
;   0.700 ;   0.000 ;    ;      ;        ;              ; source latency                                                               ;
;   0.700 ;   0.000 ;    ;      ; 1      ; PLL_3        ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   2.770 ;   2.070 ; RR ; IC   ; 1      ; CLKCTRL_G13  ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   2.770 ;   0.000 ; RR ; CELL ; 13     ; CLKCTRL_G13  ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   2.770 ;   0.000 ; RR ; IC   ; 1      ; FF_X0_Y7_N17 ; video_adc_ins|samp_reg[2]|clk                                                ;
;   3.194 ;   0.424 ; RR ; CELL ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                          ;
; 3.090   ; -0.104  ;    ; uTsu ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                          ;
+---------+---------+----+------+--------+--------------+------------------------------------------------------------------------------+

The figures differ from the corresponding figures for the output timing, because increments in the data required path relax the constraint, so the tools pick the minimal delays here.

Loop timing budget

OK, so we have one constraint requiring the data on output ports to be valid 2.7 ns after main_clk. We have another constraint saying that the delay from an input pin to a register is no more than 0.7 ns. The clock period is 16.666 ns. Does it mean that the difference, 16.666 – (2.7 + 0.7) = 13.266 ns is the time allowed for the device to respond?

In other words, if the output signal is a clock that triggers the outputs of the external device, is it enough that the device’s clock-to-output, plus the PCB trace delay, mount up to less than 13.266 ns?

The answer is almost yes. The only thing not taken into account is the skew between the clocks as they arrive to each of the two I/O registers, because the global clock delay was forced to zero. But the skew is typically less than a few hundred picoseconds. All the rest is covered.

Note in particular that in the input timing calculation, the data path (from the pin to the register) isn’t compared with the constrained time (0.7 ns), but rather with the constrained time, plus the register’s internal clock delay, minus the register’s setup time. In other words, these small adjustments result in an accurate answer to if 0.7 ns from the pin to the clock is OK.

And because the delay calculations for the input and output delays begin at exactly the same global clock toggle at the register’s pins, the overall result is valid and accurate, except for the global clock skew, which isn’t taken into account.

Conclusion

It’s quite peculiar that this seemingly simple task of constraining the I/O timing turned out to be as difficult. It’s also unfortunate that this requires some crippling of the regular register-to-register calculations.

What makes this even more unfortunate, is that this constraining is practically necessary to ensure that no input pin delay is inserted by the tools. It’s not just a safety mechanism to set the alarm if the I/O registers slip away into the logic fabric.

One could argue that if timing is important, an external clock should have been used as a direct reference, in which case this whole issue would not have risen. But the point is that even if the design doesn’t squeeze the best possible timing performance from the FPGA, proper constraining is still required. It’s the designers prerogative to use the FPGA in a suboptimal way for ease and laziness, as long as the application’s requirements are met. It’s too bad that the punishment comes from the tools themselves, turning a straightforward task into a saga.

UPS, Fedex or DHL: Will your neighbor get your package?

Is that for me?

I had some $1,200 worth package sent to me from an electronics vendor (Mouser) with UPS. Free shipping. Got an SMS saying when the courier was expected to arrive. Took a nap and didn’t hear the phone ringing nor the doorbell. Woke up to an SMS saying “thank you for choosing UPS” and a note on the door saying the package was delivered to me neighbor.

Needless to say, I didn’t give my consent to this. Actually, I didn’t know this option existed with large couriers. Don’t get me wrong: My neighbor is great. I just think it’s completely wrong that he should be bothered with my stuff.

This isn’t a rant post. It’s a note to future self, so I can make an informed choice of courier and shipping conditions. Like many other posts on this blog, I’m writing it for myself, but let others see and share their insights (in the comments).

Needless to say, all companies deliver the package to you if you’re at home. The question is what they do if you’re not. Will they get rid of the package as quickly as possible, or will they go on trying (which makes the delivery more expensive to them).

Written in August 2018.

UPS

Yes, the delivery to neighbor was legit, according to UPS’ own website: “Shipments that do not require a signature can be left in a safe place, out of sight and out of weather, at the driver’s discretion. This could include the front porch, side door, back porch, garage area, or with a neighbor or leasing office (which would be noted in a yellow UPS InfoNotice® left by the driver).”

From “UPS’ Tariffs / Terms and Conditions”, “Delivery”: “UPS does not limit Delivery of a Shipment to the person specified as the Receiver in the UPS Shipping System. Unless the Shipper uses Delivery Confirmation service requiring a signature, UPS reserves the right, in its sole and unlimited discretion, to make a Delivery without obtaining a signature.”

The “Signature Required” option adds $4.75 to the tariff, according to their pricing page. Mouser obviously opted this out. So much for “free shipping”.

Fedex

From Fedex’ Service Guide 2018, “FedEx Express Terms and Conditions”, in “Delivery Signature Options”, it says “someone at the delivery address” with respect to who is allowed to acknowledge the delivery. If the sender has chosen “Indirect Signature Required”, a neighbor is perfectly eligible to sign for the parcel. Actually, it gets better: “Shipments to residential addresses may be released without obtaining a signature. If you require a signature for a residential shipment, select one of the Delivery Signature Options.” Let’s hope that the sender does require a signature.

So it seems Fedex is flexible on this issue, requiring the sender to pick the option, possibly at a cost: For example, “Direct Signature Required” costs $4.75 extra if the package’s worth is under $500, according to Fedex’ Fees information leaflet. The Service Guide 2018 confirms this: “Direct Signature Required fees will apply only to those packages within the shipment with a declared value of less than $500″. In other words, they don’t give the shipper the option to be irresponsible.

Conclusion: No package above $500 will reach the neighbor. Or if that extra tariff has been paid.

DHL

DHL’s “Terms and Conditions” (which is remarkably short and concise) says under “Deliveries and Undeliverables”: “Shipments cannot be delivered to PO boxes or postal codes. Shipments are delivered to the Receiver’s address given by Shipper but not necessarily to the named Receiver personally. Shipments to addresses with a central receiving area will be delivered to that area. DHL may notify Receiver of an upcoming delivery or a missed delivery. Receiver may be offered alternative delivery options such as delivery on another day, no signature required, redirection or collection at a DHL Service Point. Shipper may exclude some delivery options on request”.

No neighbors mentioned, no alternative destinations. It’s either the destination address or nothing.

Their German site allows choosing a preferred neighbor or a preferred outdoors location for placing the parcel. This is an active choice made by the recipient, not an ad-hoc improvisation by the courier.

TNT?

Opted out, after I recently had to make several phone calls in order to get the invoice for the customs clearance tariffs. At least in Israel, they’re not up to it.

Bottom line

Judging by the official docs, DHL most careful about where the package ends, but Fedex isn’t so bad either (in particular when the declared worth is above $500, or if those extra $4.75 has been paid).

UPS, well, it seems like they offer good deals to the shippers.

Solved: netcat (nc) doesn’t terminate at end of transmission

Introduction

I often use netcat to transmit chunks of data between two Linux machines. I usually go something like

$ pv backup-image.iso | nc -l 1234

on one machine, and then maybe

# nc 10.1.2.3 1234 > /dev/sdb1

This is an example for using another machine to write data into a USB disk-on-key, because writing to any /dev/sdX on my main computer scares me too much.

But it doesn’t quit when it finishes

So after a couple of hours of operation it’s obviously finished, but with certain couples of computers, neither side quits the netcat program. So it’s not clear if the very last piece of data was fully written on the target.

Immediate thing to try

If you’re stuck like this after a long netcat transmission, this might save you: Press CTRL-D on the console of computer receiving data. If you’re lucky, this releases netcat on both sides.

Why this happens

This was written in July 2018, reflecting the netcat versions I’ve come across.

netcat opens a bidirectional TCP link, passing one side’s stdin to the other side’s stdout and vice versa. When netcat is faced with an EOF on its standard input, it may or may not close the sending part of its TCP connection, depending on which version of netcat it is. If it indeed closed the the sending connection, a FIN TCP packet is sent to the other side. The netcat program on the other side receives this FIN packet, and may or may not quit, once again, depending on which version of netcat it is. If it did quit, it returns a FIN packet, closing the connection altogether.

So we have two “may or may not”, leading to four possibilities of how it all behaves.

The example above works if (and only if) the sending side sends a FIN when it sees EOF on its stdin, and that causes the other side’s netcat to quit, closing the TCP connection completely (sending a FIN packet back), which causes the first netcat to quit as well. And all is good.

Well, no. Formally, this is actually wrong behavior: Considering netcat to be a bidirectional link (this isn’t used a lot, but still), closing one direction shouldn’t cause the closing of the other. Maybe there’s data for waiting for transmission in the opposite direction. It’s perfectly legal, and quite commonplace, to transmit data on a half-open TCP link.

This is probably why recent revisions of netcat will not quit on receiving a FIN packet, but only when there’s no data in either direction: After receiving a FIN on the incoming TCP line and an EOF on its stdin.

Also, recent netcat revisions ignore the EOF on its stdin until the FIN arrives, unless the -q flag is given (which is not supported by earlier versions, but neither is it needed). This causes a potential deadlock: Even if both sides have received an EOF, none will quit, because neither has sent the FIN packet. The -q flag solves this.

Does it matter which side is listening?

I haven’t read the sources (of which revision should I read?), but after quite some experiments I got the impression that the behavior is symmetric: Client and server behave exactly the same way. Doesn’t matter which side was listening and which was initiating the TCP connection.

So what to do

Since there are two different paradigms of netcat out there, there’s no catch-all solution. For each pair of machines, test your netcat pair before starting a heavy operation. Possibly on a short file, possibly by typing data on the console. Be sure both sides quit at the end.

One thing that helps is to change the receiving part to e.g.

# nc 10.1.2.3 1234 > /dev/sdb1 < /dev/null

/dev/null supplies an EOF right away. Older netcats will send a FIN on the TCP link immediately on its establishment, so if there’s an old netcat on the other side, both sides quit right away, possibly before any data is sent. But if there are old netcats on both sides, you’re probably not bothered by this issue at all.

Newer netcats do nothing because of this immediate EOF (they ignore it until a FIN arrives), but it allows them to quit when the other side does.

Another thing to do, is to add the -q flag on the sending netcat (supported only by the newer netcats). For example:

$ pv backup-image.dat | nc -q 0 -l 1234

The “-q 0″ part tells netcat to “quit” immediately after receiving an EOF (or so says the man page). My own anecdotal experiment shows that “-q 0″ doesn’t make netcat quit at all, but just to send the FIN packet when an EOF arrives. In other words, “-q 0″ means “send a FIN packet when EOF arrives on stdin”. Something old netcats do anyhow.

This is good enough to get out of the deadlock mentioned above: When the data stream ends, the sending part sends a FIN because of the “-q 0″ flag. The receiving part now has an EOF by virtue of /dev/null, and a FIN from the sending part, so it quits, sending a FIN back. Now the first side has an EOF and a FIN, and quits as well.

Note that the “-q 0″ is more important than the /dev/null trick: If the receiving side has quit, we know all data has been flushed. It therefore doesn’t matter so much that a CTRL-C is needed to release the sending side. Doesn’t matter, but doesn’t add a feeling of confidence when the transmission is really important.

And this bring me back to what I began with: Each pair of computers needs a test before attempting something heavy. Sadly.