Nvidia graphics cards on Linux: PCIe link speed and width

Why is it at 2.5 GT/s???

For all that has been said about Nvidia’s refusal to release their drivers as open source, their Linux support is great. I don’t think I’ve ever had such a flawless graphics card experience with Linux. After replacing the nouveau driver with Nvidia’s, of course. Ideology is nice, but a computer that works is nicer.

But then I looked at the output of lspci -vv (on an Asus fanless GT 730 2GB DDR3), and horrors, it’s not running at full PCIe speed!

17:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 730] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GK208B [GeForce GT 730]
[ ... ]
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[ ... ]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Whatwhat? The card declares it supports 5 GT/s, but runs only at 2.5 GT/s? And on my brand new super-duper motherboard, which supports Gen3 PCIe connected directly to an Intel X-family CPU?

It’s all under control

Well, the answer is surprisingly simple: Nvidia’s driver changes the card’s PCIe speed dynamically to support the bandwidth needed. When there’s no graphics activity, the speed drops to 2.5 GT/s.
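
By the way, this is easy to watch live by keeping an eye on LnkSta while starting and stopping something graphics-intensive. A possible one-liner for that (the bus address is the one from the listing above; run as root so lspci can read the capability registers):

$ watch -n 1 'lspci -vv -s 17:00.0 | grep -E "LnkCap|LnkSta"'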

This behavior can be controlled with Nvidia’s X Server Settings control panel (it has an icon in the system’s settings panel, or just type “Nvidia” into Gnome’s application search). Under the PowerMizer sub-menu, the card’s behavior can be changed to stay at 5 GT/s, if you like your card hot and your electricity bill fat.

Otherwise, in “Adaptive mode” it switches back and forth between 2.5 GT/s and 5 GT/s. The screenshot below was taken after a few seconds of idling:

Screenshot of Nvidia X Server settings in adaptive mode

And this is how to force it to 5 GT/s constantly:

Screenshot of Nvidia X Server settings in maximum performance mode

With the latter setting, lspci -vv shows that the card is at 5 GT/s, as promised:

17:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 730] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GK208B [GeForce GT 730]
[ ... ]
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

So don’t worry about a low speed on an Nvidia card (or make sure it steps up on request).
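
For the record, the same PowerMizer preference can apparently be set from the command line as well, with nvidia-settings’ -a flag. A hedged example, assuming the attribute’s name and values (1 meaning “prefer maximum performance”) haven’t changed across driver versions:

$ nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'

Assigning 0 instead should return the card to adaptive mode.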

A word on GT 1030

I added another fanless card, an Asus GT 1030 2GB, to the computer for some experiments. This card is somewhat harder to catch at 2.5 GT/s, because it steps up very quickly in response to any graphics activity. But I managed to catch this:

65:00.0 VGA compatible controller: NVIDIA Corporation GP108 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GP108 [GeForce GT 1030]
[ ... ]
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

The running 2.5 GT/s speed vs. the maximal 8 GT/s is clear enough by now, but the declared maximal width is only x4. If so, why does the card have an x16 PCIe form factor? The GT 730 has an x8 connector and uses eight lanes, but the GT 1030 has an x16 connector and declares it can only use four. Is this some kind of marketing thing to make the card look larger and stronger?

On the other hand, show me a fairly recent motherboard without an x16 PCIe slot. The thing is that sometimes that slot could be used for something else, and the graphics card could then have gone into a vacant x4 slot instead. But no. Let’s make it big and impressive, with a long PCIe plug that makes the card look massive. Personally, I find the gigantic heatsink impressive enough.

PCIe on Cyclone 10 GX: Data loss on DMA writes by FPGA

TL;DR

DMA writes from a Cyclone 10 GX PCIe interface may be lost, probably due to a path that isn’t timed properly by the fitter. This has been observed with Quartus Prime Version 17.1.0 Build 240 SJ Pro Edition, and the official Cyclone 10 GX development board. A wider impact is likely, possibly on Arria 10 devices as well (as their PCIe block is the same one).

The problem seems to be rare, and it appears and disappears depending on how the fitter places the logic. It’s fairly easy, however, to diagnose whether this specific problem is in effect (see “The smoking gun” below).

Computer hardware: Gigabyte GA-B150M-D2V motherboard (with an Intel B150 Chipset) + Intel i5-6400 CPU.

The story

It started with a routine data transport test (FPGA to host), which failed virtually immediately (that is, after a few kilobytes). It was apparent that some portions of data simply weren’t written into the DMA buffer by the FPGA.

So I tried a fix in my own code, and yep, it helped. Or so I thought. Actually, anything I changed seemed to fix the problem. In the end, I changed nothing, but just added

set_global_assignment -name SEED 2

to the QSF file, which only changes the fitter’s initial placement of the logic elements, eventually leading to an alternative placement and routing of the design. That should work exactly the same, of course. But it “solved the problem”.

This was consistent: One “magic” build that failed consistently, and any change whatsoever made the issue disappear.

The design was properly constrained, of course, as shown in the development board’s sample SDC file. In fact, there isn’t much to constrain: It’s just setting the main clock to 100 MHz, derive_pll_clocks and derive_clock_uncertainty. And a false path from the PERST pin.
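
For reference, a sketch of such an SDC file (the port names are illustrative, not necessarily those of the actual project; the real file follows the development board’s example):

create_clock -name pcie_refclk -period 10.000 [get_ports {pcie_refclk}]
derive_pll_clocks
derive_clock_uncertainty
set_false_path -from [get_ports {pcie_perstn}]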

So maybe my bad? Well, no. There were no unconstrained paths in the entire design (with these simple constraints), so one fitting of the design should be exactly like any other. Maybe my application logic? No again:

The smoking gun

The final nail in the coffin was when I noted errors in the PCIe Device Status registers on both sides. I’ve discussed this topic in a couple of other posts of mine; however, in the current case no AER kernel messages were produced (unfortunately, and it’s not clear why).

And whatever the application code does, Intel / Altera’s PCIe block shouldn’t produce a link error, and normally it doesn’t: a link error is a violation of the PCIe spec.

These are the steps for observing this issue on a Linux machine. First, find out who the link partners are:

$ lspci
00:00.0 Host bridge: Intel Corporation Device 191f (rev 07)
00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07)
[ ... ]
01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb

and then figure out that the FPGA card is connected via the bridge at 00:01.0, with

$ lspci -t
-[0000:00]-+-00.0
           +-01.0-[01]----00.0

So it’s between 00:01.0 and 01:00.0. Then, following that post of mine, use setpci to read the status register and tell whether an error has occurred.

First, what it should look like: With any bitstream except that specific faulty one, I got

# setpci -s 01:00.0 CAP_EXP+0xa.w
0000
# setpci -s 00:01.0 CAP_EXP+0xa.w
0000

any time and all the time, which says the obvious: No errors sensed on either side.

But with the bitstream that had data losses, before any communication had taken place (except for the driver being loaded):

# setpci -s 01:00.0 CAP_EXP+0xa.w
0009
# setpci -s 00:01.0 CAP_EXP+0xa.w
0000

Non-zero means error. So at this stage the FPGA’s PCIe interface was unhappy with something (more on that below), but the processor’s side had no complaints.

I have to admit that I’ve seen the 0009 status in a lot of other tests, in which communication went through perfectly. So even though it reflects some kind of error, it doesn’t necessarily predict any functional fault. As elaborated below, the 0009 status consists of correctable errors. It’s just that such errors are normally never seen (i.e. with any PCIe card that works properly).
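
A practical note for repeating these checks: the error bits in the Device Status register are write-1-to-clear, so they can be reset between experiments by writing the value back with setpci. For example, for the two devices above:

# setpci -s 01:00.0 CAP_EXP+0xa.w=000f
# setpci -s 00:01.0 CAP_EXP+0xa.w=000f

After this, the register reads 0000 again until the next error is sensed.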

Anyhow, back to the bitstream that did have data errors. After some data had been written by the FPGA:

# setpci -s 01:00.0 CAP_EXP+0xa.w
0009
# setpci -s 00:01.0 CAP_EXP+0xa.w
000a

In this case, the FPGA card’s link partner complained. To save ourselves the meaning of these numbers (even though they’re listed in that post), use lspci -vv:

# lspci -vv
00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07) (prog-if 00 [Normal decode])
[ ... ]
        Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
[ ... ]

So the bridge complained about an uncorrectable error and an unsupported request, but only after the data transmission. The FPGA side, however:

01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb
[ ... ]
        Capabilities: [80] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-

complained about a correctable error and an unsupported request (as seen above, that happened before any payload transmission).

Low-level errors. I couldn’t make this happen even if I wanted to.

Aftermath

The really bad news is that this problem isn’t in the logic itself, but in how it’s placed. It seems to be a rare and random occurrence of a poor job done by the fitter. Or maybe it’s not all that rare, if you let the FPGA heat up a bit. In my case a spinning fan kept an almost idle FPGA quite cool, I suppose.

The somewhat good news is that the data loss comes with these PCIe status errors, and maybe with the relevant kernel messages (not clear why I didn’t see any). So there’s something to hold on to.

And I should also mention that the offending PCIe interface was a Gen2 x4 running with a 64-bit interface at 250 MHz, which is a rather marginal frequency for Arria 10 / Cyclone 10 (a Gen2 x4 link carries 16 Gb/s of payload after 8b/10b encoding, which is exactly 64 bits at 250 MHz). So going with the speculation that this is a timing issue that isn’t handled properly by the fitter, maybe sticking to 125 MHz interfaces on these devices is good enough to be safe against this issue.

Note to self: The outputs are kept in cyclone10-failure.tar.gz

Quartus / Linux: Programming the FPGA with command-line

Command-line?

Yes, it’s much more convenient than the GUI programmer. Programming an FPGA is a repeated task: always the same file to the same FPGA on the same board connected to the computer. And somehow the GUI programming tools turn it into a daunting ceremony (and sometimes even a quiz, when they can’t tell exactly which device is connected, so I’m supposed to nail down the exact one).

With the command line it’s literally picking the command from bash history and pressing Enter. And surprisingly enough, the command line tool doesn’t ask the silly questions that the GUI tool does.

First, some mucking about

Set up the environment:

$ /path/to/quartus/15.1/nios2eds/nios2_command_shell.sh

To list all devices found (cable auto-detected):

$ quartus_pgm --auto
Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 15.1.0 Build 185 10/21/2015 SJ Lite Edition
    Info: Copyright (C) 1991-2015 Altera Corporation. All rights reserved.
[ ... ]
    Info: agreement for further details.
    Info: Processing started: Sun May 27 15:06:22 2018
Info: Command: quartus_pgm --auto
Info (213045): Using programming cable "USB-BlasterII [2-5.1]"
1) USB-BlasterII [2-5.1]
  02B040DD   5CGTFD9(A5|C5|D5|E5)/..
  020A40DD   5M2210Z/EPM2210

[ ... ]

Note that listing the devices as shown above is not necessary for programming. It might be useful for telling the position of the FPGA in the JTAG chain. Really something that is done once to explore the board.
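
By the way, jtagconfig, which ships with Quartus as well, produces a similar listing of the cables and the devices on each chain, and is a bit quicker to type:

$ jtagconfig

Its output is essentially the cable’s name plus the IDCODE and device name of each device in the chain, much like the listing above.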

jtagd

It’s important to be aware of this daemon, which listens on TCP/IP port 1309: It’s responsible for talking with the JTAG adapter through the USB bus, so both the GUI and command line programmer utilities rely on it. If there’s no daemon running, both of these launch it.

But if you use multiple versions of Quartus, this may be a source of confusion, in particular if you make a first attempt to program an FPGA with an older version, and then try a newer one. That’s because the newer version of Quartus will keep using the older version of jtagd, possibly failing to work with recent devices. Bottom line: If wonky things happen, this won’t hurt:

$ killall jtagd
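
To tell which installation the running daemon belongs to in the first place (and hence whether killing it was warranted), its full command line can be checked. Assuming the standard Linux procps tools:

$ ps -C jtagd -o cmd=

which typically shows the path of the Quartus installation it was launched from.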

Programming

quartus_pgm displays most of its output in green. Generally speaking, if there’s no red text, all went fine.

$ quartus_pgm -m jtag -o "p;path/to/file.sof"

Or add the position in the JTAG chain explicitly (in particular if it’s not the first device). In this case it’s @1, meaning the first device in the JTAG chain. If it’s the second device, pick @2, etc.

$ quartus_pgm -m jtag -o "p;path/to/file.sof@1"
Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 15.1.0 Build 185 10/21/2015 SJ Lite Edition
    Info: Copyright (C) 1991-2015 Altera Corporation. All rights reserved.
    Info: Your use of Altera Corporation's design tools, logic functions
    Info: and other software and tools, and its AMPP partner logic
    Info: functions, and any output files from any of the foregoing
    Info: (including device programming or simulation files), and any
    Info: associated documentation or information are expressly subject
    Info: to the terms and conditions of the Altera Program License
    Info: Subscription Agreement, the Altera Quartus Prime License Agreement,
    Info: the Altera MegaCore Function License Agreement, or other
    Info: applicable license agreement, including, without limitation,
    Info: that your use is for the sole purpose of programming logic
    Info: devices manufactured by Altera and sold by Altera or its
    Info: authorized distributors.  Please refer to the applicable
    Info: agreement for further details.
    Info: Processing started: Sun May 27 15:35:02 2018
Info: Command: quartus_pgm -m jtag -o p;path/to/file.sof@1
Info (213045): Using programming cable "USB-BlasterII [2-5.1]"
Info (213011): Using programming file p;path/to/file.sof@1 with checksum 0x061958E1 for device 5CGTFD9E5F35@1
Info (209060): Started Programmer operation at Sun May 27 15:35:05 2018
Info (209016): Configuring device index 1
Info (209017): Device 1 contains JTAG ID code 0x02B040DD
Info (209007): Configuration succeeded -- 1 device(s) configured
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sun May 27 15:35:09 2018
Info: Quartus Prime Programmer was successful. 0 errors, 0 warnings
    Info: Peak virtual memory: 432 megabytes
    Info: Processing ended: Sun May 27 15:35:09 2018
    Info: Elapsed time: 00:00:07
    Info: Total CPU time (on all processors): 00:00:03

If anything goes wrong (device mismatch, a failure to scan the JTAG chain or whatever), it will be hard to miss because of the errors written in red. The sweet thing about the command line interface is that every attempt starts fresh, so just turn the board on (the usual reason for errors) and give it another go.
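
And since it’s always the same file to the same board, even the bash history lookup can be spared with an alias (or a tiny script). A trivial sketch, with a placeholder path, assuming the Quartus environment has already been set up in that shell as shown above:

$ alias fpgaprog='quartus_pgm -m jtag -o "p;/path/to/file.sof@1"'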

Cyclone 10 GX FPGA development kit

This board caused me some extra trouble, so a few words about it. When connected, it appears as 09fb:6810; however, after attempting to program the FPGA (note the “@2” at the end) with

$ quartus_pgm -m jtag -o "p;thecode.sof@2"
Error (213019): Can't scan JTAG chain. Error code 86.

it changes to 09fb:6010. So there’s clearly some reprogramming of firmware (the log shows a disconnection and reconnection with the new ID). The board is detected as GX0000406 by the Quartus GUI Programming Tool, but clicking “Auto Detect” yields “Unable to scan device chain. Hardware is not connected”.

OK, what about a scan?

$ quartus_pgm --auto
[ ... ]
Info (213045): Using programming cable "10CGX0000406 [1-5.1.2]"
1) 10CGX0000406 [1-5.1.2]
  Unable to read device chain - Hardware not attached

The problem in my case was apparently that the running jtagd had been launched by an older version of Quartus, which didn’t recognize Cyclone 10 devices. So follow the advice above, and kill it. After that, programming with the command above worked with Quartus Pro 17.1:

$ quartus_pgm --auto
[...]
Info (213045): Using programming cable "USB-BlasterII [1-5.1.2]"
1) USB-BlasterII [1-5.1.2]
  031820DD   10M08SA(.|ES)/10M08SC
  02E120DD   10CX220Y

Quartus / sdc: Constraining I/O ports clocked by an internal clock

Introduction

This post is an expansion of another post of mine, which deals with register I/O packing. It’s recommended to read that one first.

Timing constraining of I/O ports is typically intended to ensure the timing relation between an external clock and signals that are clocked by it (or by a clock derived from it, with the same frequency or a simple relation to it).

However, in some cases the clock of the I/O registers is generated with a PLL within the FPGA, and is practically unrelated to the originating clock. There are still good reasons to constrain the timing of such ports, among others:

  • Even though the external clock source isn’t involved directly, the timing must still be under control. In particular, when the interface with an external device is bidirectional, the timing of the signals arriving from the device depends on those generated by the FPGA and sent to it. Constraining the ports is part of ensuring that this timing loop is fast enough.
  • Ensuring that I/O registers are used. Tight constraints, which can only be met with I/O registers, will fail if the tools don’t pack those registers as desired.
  • Ensuring that no delay is inserted by the tools between the input pad and the register.

Clearly, nobody at Altera thought that this kind of constraining was necessary. Consequently, getting this done in a fairly clean manner is nontrivial, to say the least (yours truly wasted a full week of work figuring this out). This post suggests a methodology which is hopefully clean enough. It’s the best I managed to work out, anyhow.

The Intel / Altera toolset used is Quartus Prime 15.1 (Web Edition).

The goal

The naïve approach is to apply set_input_delay and set_output_delay to the ports as usual, using the clock from the PLL in the -clock argument. But then the tools interpret this as constraining the timing relative to the external clock, which is inherently pointless, since this relation has no meaning. To make things worse, if the two clock frequencies aren’t related by a simple ratio, the timing requirements become unrealistic, as the closest edge relation between the two clocks is applied as the worst case. So this doesn’t work at all.
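
Just to be concrete, this is the kind of constraint meant above, and what not to do in this scenario (the port and clock names are those of the example design further below; the numbers are arbitrary):

set_output_delay -clock main_clk -max 2.0 [get_ports pixadc_*]
set_input_delay -clock main_clk -max 1.0 [get_ports pixadc_*]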

Ideally, we’d like to constrain only the path between the I/O pin and the register connected directly to it. It appears there’s no way to do exactly that, as Quartus’ Timing Analyzer automatically mixes in the clock’s path delay when a register is involved.

So the goal is to define the delay between the external pin and the register that samples its state or vice versa, with as little interference as possible.

A sample set of constraints

This is the sdc file that worked for me. Each part is explained in detail afterwards.

create_clock -name root_clk -period 20.833 [get_ports {osc_clock}]

# The 60 MHz clock is defined on the global clock buffer's output pin:
create_clock -name main_clk -period 16.666 [get_pins {clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk}]

set_clock_groups -asynchronous -group [ get_clocks root_clk ] \
    -group [ get_clocks main_clk ]

set_annotated_delay -from clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk 0

derive_pll_clocks
derive_clock_uncertainty

set_false_path -hold -from [get_ports pixadc_*] -to [get_registers]
set_false_path -hold -from [get_registers] -to [get_ports pixadc_*]

set_max_delay -from [get_registers] -to [get_ports pixadc_*] 2.7
set_max_delay -from [get_ports pixadc_*] -to [get_registers] 0.7

create_clock assignments

In the relevant design, the external clock source runs at 48 MHz (20.833 ns period) and there’s a PLL on the FPGA generating a 60 MHz clock (16.666 ns period) based upon the external clock.

First and somewhat unrelated, note that neither the duty cycle nor the waveform attributes are given in these definitions. If the duty cycle is 50%, don’t add that rubbish. It’s the default anyhow, but is nevertheless added by the automatic constraint generator.

The definition of root_clk is quite standard. But pay attention to the way the derived clock, main_clk, is defined. Not only is it not given as a clock derived from root_clk (nor did I rely on the automatic derivation made with “derive_pll_clocks”), but it’s assigned to the PLL’s output pin. This is no coincidence: That specific “get_pins” format is mandatory in the create_clock definition, or else the PLL’s input-to-output delay is included (around 2 ns). For example, even

create_clock -name main_clk -period 16.666 clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]

(which is what the tools would generate automatically with derive_pll_clocks) will include the delay from inclk[0], even though the identifier given in the constraint is that of the PLL’s output net. This is also the case if the net is referred to with “get_nets”. Only being specific with get_pins on the PLL’s output pin makes it clear that the clock delay should start at the output of the PLL.

A peculiar thing is that the fitter issues warnings like

Warning (332049): Ignored create_clock at test.sdc: Argument <targets> is an empty collection File: ...
 Info (332050): create_clock -name main_clk -period 16.666 [get_pins {clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk}] File: ...

and main_clk indeed doesn’t appear in the list of clocks made by the fitter, even though the constraint’s effect is clear in the analysis made by the Timing Analyzer.

As shown below, the clock path is included in the timing calculation, no matter what. So this is one of the necessities for keeping that interference in the timing calculations minimal.

IMPORTANT: Setting the clock constraint on the PLL’s output pin as shown above, as well as the set_annotated_delay constraint, disrupts the constraining of register-to-register paths slightly, as the original, rigorous timing calculation assumes that the destination register receives the clock slightly earlier than the source register (presumably to represent a worst-case scenario involving global clock skew and other effects). It’s therefore recommended to compare the clock path timing with the one obtained when the clock constraint is made the classic way, and deduct the difference from the clock’s period, to ensure an accurate calculation. It should be no more than a few hundred picoseconds.

Ah, and another thing I tried, but which led nowhere: The Handbook says that a virtual clock can be defined with something like

create_clock -name main_clk -period 16.666

In other words, this is a clock that can be mentioned in constraints, but has no signal in the FPGA related to it. My original thought was that if this clock doesn’t relate to anything real, it won’t have any global clock delay in its timing calculations.

But when I tried it, constraining I/O pins with the virtual clock as the -clock argument, no timing calculations were made at all. The timing report ended up with “Nothing to report”. So the virtual clock concept didn’t help.

set_clock_groups

Nothing special about this. Just declaring that root_clk and main_clk should be considered unrelated (all paths between these clock domains are false).

set_annotated_delay

This command tells Timing Analyzer to consider the delay of the global clock buffer as zero. This is yet another necessity to keep unrelated delays out of the calculation.

Together with the definition of main_clk above, which relates the timed clock to the PLL’s output, the delay of the global clock network in the FPGA fabric is left out of the calculations. As shown below, there’s still a clock delay component that is counted in, but it’s presumably the delay of the clock within the logic element.

As the clock’s delay is left out, it doesn’t matter whether the PLL is set to compensate for the global clock delay or not; the same timing is achieved either way. One could, by the way, argue that the PLL’s clock delay compensation is an alternative way to minimize the clock path’s role in the timing calculations. My own attempts to go down that road, however, led to nothing but a lot of wasted time. Note that in order to make sense of the PLL’s timing compensation, the commonplace create_clock definition must be used for main_clk, so that the PLL’s own delay is included (it’s compensated for further down the road), and this leads to a total lack of control over what’s timed and what isn’t.

derive_pll_clocks and derive_clock_uncertainty

derive_pll_clocks is applied even though main_clk is defined explicitly with a create_clock constraint, and the latter overrides the clock generated by derive_pll_clocks. But since the create_clock statement for main_clk is ignored by the synthesizer as well as the fitter (because the relevant pin isn’t found), derive_pll_clocks is necessary during these stages to ensure that the relevant paths are timed. In particular, that the fitter makes sure that register-to-register paths meet timing.

If the clock period given in the create_clock constraint is shorter than the one derived from the PLL (which is recommended for reasons mentioned in this post), there might be a situation where timing fails because the fitter didn’t attempt to meet a constraint it was blind to. At least theoretically. I’ve never encountered anything like this, partly because it’s quite difficult to fail on a 60 MHz clock.

derive_clock_uncertainty is used, as with any proper set of constraints.

set_max_delay

Finally, the delay constraints themselves. set_max_delay is used rather than set_input_delay and set_output_delay, mainly because set_max_delay expresses the element we want to constrain: A segment between the register and a port. As outlined in this other post of mine, set_input_delay and set_output_delay are tailored to allow copying numbers from the counterpart device’s datasheet directly. However if we want to constrain the internal delay with these, the sampling clock’s period needs to be taken into account. So for the purpose of constraining the internal delay, set_input_delay and set_output_delay’s values must be adjusted if the clock’s frequency changes, and that’s an unnecessary headache.

One could have hoped that there would be a way to constrain the plain combinatorial path between a port and a register. It seems, however, that there’s no way to do this, and that the Timing Analyzer is being a bit too helpful: When either (or both) of the endpoints of a set_max_delay constraint is a register, the clock delay path is taken into consideration. In other words, if the source of the delay path is a register, the clock path delay is added to the constrained path, to represent the fact that the data toggle from the source register is delayed by this path. Likewise, if the destination of the constraint is a register, the clock path is added to the timing requirement (relaxing the constraint), to represent that the destination register samples its input later.

This holds true no matter how the register endpoint is given to the constraint command: Quite obviously, if get_registers was used to select the relevant endpoint, the clock path is included in the math. But it’s less obvious that, for example, if the source endpoint was selected with get_pins on the register’s output pin (e.g. [ get_pins -hierarchical the_sample_reg|d ]), the clock path is still included. Bottom line: There’s no way to avoid having the clock path in the math. This is the reason for the manipulations with create_clock and set_annotated_delay above.

Examples of the timing obtained with set_max_delay are given below.

set_false_path -hold

These set_false_path constraints disable the timing calculation for the registers’ hold requirement (note the -hold flag). Without these two constraints, Timing Analyzer will mark the relevant I/O ports (partly) unconstrained, even if they have related set_max_delay constraints. This has no practical implication except that the “TimeQuest Timing Analyzer” group in the GUI’s compilation report pane is marked red, indicating there’s a timing problem.

The sole purpose of these set_false_path constraints is hence to tell the tools not to bother about the hold paths, avoiding the said red color in the GUI.

As with any set_false_path constraint, care must be taken not to include any unintended paths.

Hold timing is irrelevant for the purpose of ensuring I/O register packing. Neither does it have any significance when timing against the external device, as its hold timing should be ensured by manual timing calculations. As for timing a loop from the FPGA to the device and back, this is unnecessary as well: Failing the receiving register’s hold timing in this case would require the clock-to-output (which involves driving a physical pin and its equivalent capacitance) plus the external device’s response time to the toggling signal to be shorter than the receiver’s hold time. That is by far unrealistic.

One could think that rather than making a false path, a reasonable set_min_delay constraint would do the job. But no: Any set_min_delay, which in turn activates hold time constraining, leads to an “Input Pin to Input Register Delay” as shown in this other post, but for other reasons and with another behavior. In particular, with the constraint setting of this post, this Input Pin delay is added even if that causes a failure of the set_max_delay constraint.

The underlying reason is to compensate for the clock delay: The tools must ensure that the clock arrives at the input register before the data on its input port toggles. Otherwise, the data sampled in the minimal clock delay case could differ from that of the maximal clock delay case (for which the data toggle is obviously after the clock toggle).

Given the delay in the clock path, this forces the tools to insert a delay before the input register that is at least the clock path’s delay. When clock delay compensation is enabled at the PLL (and the originating clock is external), the PLL is set to create a negative clock delay, hence eliminating the need for this Input Pin delay.

But it gets worse with a clock generated internally: It’s not completely clear why, but even if the clock path is set to zero with the set_annotated_delay statement as said above, the tools keep adding this delay. This happens regardless of whether the PLL is set to compensate for the clock delay. One explanation can be found in set_annotated_delay’s help text, saying “This assignment is for timing analysis only, and is not considered during timing-driven compilation”. But this still doesn’t explain why the delay is inserted even with the PLL’s clock path compensation enabled. So the conclusion is that the tools weren’t really meant to handle this internally generated clock scheme.

Bottom line: Don’t make any set_min_delay constraints on this path, and surely not set_input_delay -min or set_output_delay -min (the latter two will mess things up even worse. Believe me on that).

Constraints for crossing clock domains

This is somewhat unrelated, but it’s another aspect of how set_max_delay paths work.

When crossing clock domains, it’s common to put two registers in series, so that the first register is a metastability guard, and the second samples the signal safely in the destination clock domain.

But since the paths crossing clock domains are not timed by the tools, they may in theory have an arbitrarily high propagation delay. This undermines the whole idea of the metastability guard. So to be extra safe, it makes sense to constrain these paths in order to ensure that the path delay is limited to something sensible.

Unfortunately, there is nothing better than set_max_delay for this purpose, which takes the clock delays into account. As these two clocks are unrelated, this makes no sense at all, but this is what Quartus offers. It would have been much better to constrain just the data path, and maybe creating a special clock and using set_annotated_delay as suggested above would do the trick.

But I’ll suggest the simple and crude method:

set_max_delay -from [ get_clocks *|some_ins|*|tx_clkout] \
    -to [ get_clocks *|some_ins|*|rx_clkout] 4
set_max_delay -from [ get_clocks *|some_ins|*|rx_clkout] \
    -to [ get_clocks *|some_ins|*|tx_clkout] 4

set_false_path -hold -from [ get_clocks *|some_ins|*|tx_clkout] \
    -to [ get_clocks *|some_ins|*|rx_clkout]
set_false_path -hold -from [ get_clocks *|some_ins|*|rx_clkout] \
    -to [ get_clocks *|some_ins|*|tx_clkout]

Choosing the delay as 4 ns as shown above keeps the delays sensibly small on a Cyclone 10, but this is something to verify separately on each design with the Timing Analyzer.
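
That verification can be done in the Timing Analyzer’s Tcl console, for example with something like this (same wildcards as above; report_timing’s exact options may vary between Quartus versions):

report_timing -setup -from [ get_clocks *|some_ins|*|tx_clkout ] \
    -to [ get_clocks *|some_ins|*|rx_clkout ] -npaths 10

which lists the worst paths covered by the set_max_delay exception, along with their slack.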

As for the two false path settings: Note that they are only for hold timing. This is sometimes necessary if the tools consider the clocks related, in which case the hold timing might fail because of the different clock delays. Since the clocks are treated as unrelated in the logic design, the hold timing is pointless.

Timing example: Register to pin (output)

+-------------------------------------------------------------+
; Path Summary                                                ;
+---------------------+---------------------------------------+
; Property            ; Value                                 ;
+---------------------+---------------------------------------+
; From Node           ; video_adc:video_adc_ins|pixadc_clk[1] ;
; To Node             ; pixadc_clk[1]                         ;
; Launch Clock        ; main_clk                              ;
; Latch Clock         ; n/a                                   ;
; Max Delay Exception ; 2.700                                 ;
; Data Arrival Time   ; 2.666                                 ;
; Data Required Time  ; 2.700                                 ;
; Slack               ; 0.034                                 ;
+---------------------+---------------------------------------+

+---------------------------------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                                           ;
+---------+---------+----+------+--------+-----------------------+----------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location              ; Element                                                                    ;
+---------+---------+----+------+--------+-----------------------+----------------------------------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                       ; launch edge time                                                           ;
; 0.559   ; 0.559   ;    ;      ;        ;                       ; clock path                                                                 ;
;   0.000 ;   0.000 ;    ;      ;        ;                       ; source latency                                                             ;
;   0.000 ;   0.000 ;    ;      ; 13     ; CLKCTRL_G13           ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc_ins|pixadc_clk[1]|clk                                            ;
;   0.559 ;   0.559 ; RR ; CELL ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc:video_adc_ins|pixadc_clk[1]                                      ;
; 2.666   ; 2.107   ;    ;      ;        ;                       ; data path                                                                  ;
;   0.771 ;   0.212 ;    ; uTco ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc:video_adc_ins|pixadc_clk[1]                                      ;
;   1.268 ;   0.497 ; RR ; CELL ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc_ins|pixadc_clk[1]|q                                              ;
;   1.268 ;   0.000 ; RR ; IC   ; 2      ; IOOBUF_X0_Y10_N2      ; pixadc_clk[1]~output|i                                                     ;
;   2.666 ;   1.398 ; RR ; CELL ; 1      ; IOOBUF_X0_Y10_N2      ; pixadc_clk[1]~output|o                                                     ;
;   2.666 ;   0.000 ; RR ; CELL ; 0      ; PIN_R2                ; pixadc_clk[1]                                                              ;
+---------+---------+----+------+--------+-----------------------+----------------------------------------------------------------------------+

+-------------------------------------------------------------------------+
; Data Required Path                                                      ;
+---------+---------+----+------+--------+----------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location ; Element             ;
+---------+---------+----+------+--------+----------+---------------------+
; 2.700   ; 2.700   ;    ;      ;        ;          ; latch edge time     ;
; 2.700   ; 0.000   ;    ;      ;        ;          ; clock path          ;
;   2.700 ;   0.000 ; R  ;      ;        ;          ; clock network delay ;
; 2.700   ; 0.000   ; R  ; oExt ; 0      ; PIN_R2   ; pixadc_clk[1]       ;
+---------+---------+----+------+--------+----------+---------------------+

The interconnect delay on the line after location CLKCTRL_G13 is the global clock’s delay, which the set_annotated_delay constraint forces to zero. Without it, that entry would have read 1.076 ns instead. Together with the create_clock assignment on the PLL’s output pin, the only part left in the clock path is the 0.559 ns corresponding to the clock’s delay within the register itself (it’s not the clock-to-output; that one follows as uTco).

A regular create_clock declaration would have yielded the following at the beginning of the data arrival path instead:

+---------+---------+----+------+--------+-----------------------+------------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location              ; Element                                                                      ;
+---------+---------+----+------+--------+-----------------------+------------------------------------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                       ; launch edge time                                                             ;
; 2.716   ; 2.716   ;    ;      ;        ;                       ; clock path                                                                   ;
;   0.000 ;   0.000 ;    ;      ;        ;                       ; source latency                                                               ;
;   0.000 ;   0.000 ;    ;      ; 1      ; PLL_3                 ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   2.157 ;   2.157 ; RR ; IC   ; 1      ; CLKCTRL_G13           ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   2.157 ;   0.000 ; RR ; CELL ; 13     ; CLKCTRL_G13           ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   2.157 ;   0.000 ; RR ; IC   ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc_ins|pixadc_clk[1]|clk                                              ;
;   2.716 ;   0.559 ; RR ; CELL ; 1      ; DDIOOUTCELL_X0_Y10_N4 ; video_adc:video_adc_ins|pixadc_clk[1]

The above relates to a PLL without delay compensation.

Timing example: Pin to register (input)

+-----------------------------------------------------------+
; Path Summary                                              ;
+---------------------+-------------------------------------+
; Property            ; Value                               ;
+---------------------+-------------------------------------+
; From Node           ; pixadc_da[2]                        ;
; To Node             ; video_adc:video_adc_ins|samp_reg[2] ;
; Launch Clock        ; n/a                                 ;
; Latch Clock         ; main_clk                            ;
; Max Delay Exception ; 0.700                               ;
; Data Arrival Time   ; 0.992                               ;
; Data Required Time  ; 1.020                               ;
; Slack               ; 0.028                               ;
+---------------------+-------------------------------------+

+-------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                               ;
+---------+---------+----+------+--------+------------------+-------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location         ; Element                             ;
+---------+---------+----+------+--------+------------------+-------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                  ; launch edge time                    ;
; 0.000   ; 0.000   ;    ;      ;        ;                  ; clock path                          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                  ; clock network delay                 ;
; 0.000   ; 0.000   ; R  ; iExt ; 1      ; PIN_W2           ; pixadc_da[2]                        ;
; 0.992   ; 0.992   ;    ;      ;        ;                  ; data path                           ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; IOIBUF_X0_Y7_N15 ; pixadc_da[2]~input|i                ;
;   0.748 ;   0.748 ; RR ; CELL ; 1      ; IOIBUF_X0_Y7_N15 ; pixadc_da[2]~input|o                ;
;   0.748 ;   0.000 ; RR ; IC   ; 1      ; FF_X0_Y7_N17     ; video_adc_ins|samp_reg[2]|d         ;
;   0.992 ;   0.244 ; RR ; CELL ; 1      ; FF_X0_Y7_N17     ; video_adc:video_adc_ins|samp_reg[2] ;
+---------+---------+----+------+--------+------------------+-------------------------------------+

+------------------------------------------------------------------------------------------------------------------------------------+
; Data Required Path                                                                                                                 ;
+---------+---------+----+------+--------+--------------+----------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location     ; Element                                                                    ;
+---------+---------+----+------+--------+--------------+----------------------------------------------------------------------------+
; 0.700   ; 0.700   ;    ;      ;        ;              ; latch edge time                                                            ;
; 1.124   ; 0.424   ;    ;      ;        ;              ; clock path                                                                 ;
;   0.700 ;   0.000 ;    ;      ;        ;              ; source latency                                                             ;
;   0.700 ;   0.000 ;    ;      ; 13     ; CLKCTRL_G13  ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk ;
;   0.700 ;   0.000 ; RR ; IC   ; 1      ; FF_X0_Y7_N17 ; video_adc_ins|samp_reg[2]|clk                                              ;
;   1.124 ;   0.424 ; RR ; CELL ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                        ;
; 1.020   ; -0.104  ;    ; uTsu ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                        ;
+---------+---------+----+------+--------+--------------+----------------------------------------------------------------------------+

First, note the zero time increment on the line reaching video_adc_ins|samp_reg[2]|d in the data arrival path. It just confirms that no Input Pin delay was inserted by the tools.

Once again, a zero increment is the result of the set_annotated_delay constraint: the interconnect delay on the line after location CLKCTRL_G13 in the data required path. It would have read 1.028 ns otherwise.

And again, a regular create_clock declaration would have yielded the following for the data required path instead:

+--------------------------------------------------------------------------------------------------------------------------------------+
; Data Required Path                                                                                                                   ;
+---------+---------+----+------+--------+--------------+------------------------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location     ; Element                                                                      ;
+---------+---------+----+------+--------+--------------+------------------------------------------------------------------------------+
; 0.700   ; 0.700   ;    ;      ;        ;              ; latch edge time                                                              ;
; 3.194   ; 2.494   ;    ;      ;        ;              ; clock path                                                                   ;
;   0.700 ;   0.000 ;    ;      ;        ;              ; source latency                                                               ;
;   0.700 ;   0.000 ;    ;      ; 1      ; PLL_3        ; clkrst_ins|altpll_component|auto_generated|pll1|clk[0]                       ;
;   2.770 ;   2.070 ; RR ; IC   ; 1      ; CLKCTRL_G13  ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|inclk[0] ;
;   2.770 ;   0.000 ; RR ; CELL ; 13     ; CLKCTRL_G13  ; clkrst_ins|altpll_component|auto_generated|wire_pll1_clk[0]~clkctrl|outclk   ;
;   2.770 ;   0.000 ; RR ; IC   ; 1      ; FF_X0_Y7_N17 ; video_adc_ins|samp_reg[2]|clk                                                ;
;   3.194 ;   0.424 ; RR ; CELL ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                          ;
; 3.090   ; -0.104  ;    ; uTsu ; 1      ; FF_X0_Y7_N17 ; video_adc:video_adc_ins|samp_reg[2]                                          ;
+---------+---------+----+------+--------+--------------+------------------------------------------------------------------------------+

The figures differ from the corresponding figures for the output timing, because increments in the data required path relax the constraint, so the tools pick the minimal delays here.

Loop timing budget

OK, so we have one constraint requiring the data on output ports to be valid 2.7 ns after main_clk. We have another constraint saying that the delay from an input pin to a register is no more than 0.7 ns. The clock period is 16.666 ns. Does it mean that the difference, 16.666 – (2.7 + 0.7) = 13.266 ns is the time allowed for the device to respond?

In other words, if the output signal is a clock that triggers the outputs of the external device, is it enough that the device’s clock-to-output, plus the PCB trace delay, mount up to less than 13.266 ns?

The answer is almost yes. The only thing not taken into account is the skew between the clocks as they arrive at each of the two I/O registers, because the global clock delay was forced to zero. But that skew is typically less than a few hundred picoseconds. All the rest is covered.

Note in particular that in the input timing calculation, the data path (from the pin to the register) isn’t compared with the constrained time (0.7 ns) alone, but rather with the constrained time, plus the register’s internal clock delay, minus the register’s setup time. In other words, these small adjustments result in an accurate answer to whether 0.7 ns from the pin to the register is OK.

And because the delay calculations for the input and output delays begin at exactly the same global clock toggle at the register’s pins, the overall result is valid and accurate, except for the global clock skew, which isn’t taken into account.

Conclusion

It’s quite peculiar that this seemingly simple task of constraining the I/O timing turned out to be this difficult. It’s also unfortunate that it requires some crippling of the regular register-to-register calculations.

What makes this even more unfortunate is that this constraining is practically necessary to ensure that no input pin delay is inserted by the tools. It’s not just a safety mechanism to sound the alarm if the I/O registers slip away into the logic fabric.

One could argue that if timing is important, an external clock should have been used as a direct reference, in which case this whole issue would not have arisen. But the point is that even if the design doesn’t squeeze the best possible timing performance from the FPGA, proper constraining is still required. It’s the designer’s prerogative to use the FPGA in a suboptimal way for the sake of ease and laziness, as long as the application’s requirements are met. It’s too bad that the punishment comes from the tools themselves, turning a straightforward task into a saga.

UPS, Fedex or DHL: Will your neighbor get your package?

Is that for me?

I had a package worth some $1,200 sent to me from an electronics vendor (Mouser) with UPS. Free shipping. Got an SMS saying when the courier was expected to arrive. Took a nap and didn’t hear the phone ringing nor the doorbell. Woke up to an SMS saying “thank you for choosing UPS” and a note on the door saying the package had been delivered to my neighbor.

Needless to say, I didn’t give my consent to this. Actually, I didn’t know this option existed with large couriers. Don’t get me wrong: My neighbor is great. I just think it’s completely wrong that he should be bothered with my stuff.

This isn’t a rant post. It’s a note to future self, so I can make an informed choice of courier and shipping conditions. Like many other posts on this blog, I’m writing it for myself, but let others see and share their insights (in the comments).

Needless to say, all companies deliver the package to you if you’re at home. The question is what they do if you’re not. Will they get rid of the package as quickly as possible, or will they keep trying (which makes the delivery more expensive for them)?

Written in August 2018.

Use lockers

The couriers work with self-service locker companies that allow picking up the package at certain points. There are also places where the pickup is done from a human (in particular, shops that do this as an extra service), but these are better avoided, in particular because they tend to be messy and lose parcels. From experience.

For example, UPS works with PickUp / Paz Yellow Box. So instead of giving my own address, go with something like “Pickup Paz Yellow Box Locker, Sderot HaNassi 132, Haifa” and be sure to get the mobile phone number right. And then just wait for the SMS with the codes to open the locker.

No more pushy couriers. Bliss.

Customs

If the package’s worth (shipping fee not included) exceeds $75, it’s in for full customs clearance. That’s about 200 NIS just for the process, and then there are the taxes. If two packages are sent within a few days, their worth is summed for this purpose. Be careful, in particular with those smaller items.

There seems to be a significant difference between couriers in this respect too: In an anecdotal comparison, UPS charged 193 NIS for their part in releasing from customs, while DHL charged 100 NIS (both including VAT). So UPS is low-cost to the sender only.

UPS

Yes, the delivery to the neighbor was legit, according to UPS’ own website: “Shipments that do not require a signature can be left in a safe place, out of sight and out of weather, at the driver’s discretion. This could include the front porch, side door, back porch, garage area, or with a neighbor or leasing office (which would be noted in a yellow UPS InfoNotice® left by the driver).”

From “UPS’ Tariffs / Terms and Conditions”, “Delivery”: “UPS does not limit Delivery of a Shipment to the person specified as the Receiver in the UPS Shipping System. Unless the Shipper uses Delivery Confirmation service requiring a signature, UPS reserves the right, in its sole and unlimited discretion, to make a Delivery without obtaining a signature.”

The “Signature Required” option adds $4.75 to the tariff, according to their pricing page. Mouser obviously opted out of this. So much for “free shipping”.

Fedex

From Fedex’ Service Guide 2018, “FedEx Express Terms and Conditions”, under “Delivery Signature Options”, it says “someone at the delivery address” with respect to who is allowed to acknowledge the delivery. If the sender has chosen “Indirect Signature Required”, a neighbor is perfectly eligible to sign for the parcel. Actually, it gets better: “Shipments to residential addresses may be released without obtaining a signature. If you require a signature for a residential shipment, select one of the Delivery Signature Options.” Let’s hope that the sender does require a signature.

So it seems Fedex is flexible on this issue, requiring the sender to pick the option, possibly at a cost: For example, “Direct Signature Required” costs $4.75 extra if the package’s worth is under $500, according to Fedex’ fees information leaflet. The Service Guide 2018 confirms this: “Direct Signature Required fees will apply only to those packages within the shipment with a declared value of less than $500”. In other words, they don’t give the shipper the option to be irresponsible.

Conclusion: No package above $500 will reach the neighbor. Nor one for which that extra tariff has been paid.

DHL

DHL’s “Terms and Conditions” (which is remarkably short and concise) says under “Deliveries and Undeliverables”: “Shipments cannot be delivered to PO boxes or postal codes. Shipments are delivered to the Receiver’s address given by Shipper but not necessarily to the named Receiver personally. Shipments to addresses with a central receiving area will be delivered to that area. DHL may notify Receiver of an upcoming delivery or a missed delivery. Receiver may be offered alternative delivery options such as delivery on another day, no signature required, redirection or collection at a DHL Service Point. Shipper may exclude some delivery options on request”.

No neighbors mentioned, no alternative destinations. It’s either the destination address or nothing.

Their German site allows choosing a preferred neighbor or a preferred outdoor location for leaving the parcel. But this is an active choice made by the recipient, not an ad-hoc improvisation by the courier.

TNT?

Opted out, after I recently had to make several phone calls in order to get the invoice for the customs clearance tariffs. At least in Israel, they’re not up to it.

Bottom line

Judging by the official docs, DHL is the most careful about where the package ends up, but Fedex isn’t so bad either (in particular when the declared worth is above $500, or if that extra $4.75 has been paid).

UPS, well, it seems like they offer good deals to the shippers. But if the package goes to a locker, who cares.

The courier chosen by a vendor can also work as an indicator of how serious that vendor is: Prefer those who ship with DHL. From experience. Shipping with UPS might indicate an “I don’t care what happens with the package as long as I get the money” attitude. Which works in reality, because the credit card company can’t cancel the deal if the package was handed over to the courier.


Some notes on customs clearance in Israel

Not directly related to the title of this post, I’ll add some accumulated knowledge on how customs clearance is handled in Israel as of August 2020.

First things first: If the declared worth is below 75 USD, the package goes right through. Otherwise it needs customs clearance, in which case the fees for the process itself may turn out to be the dominant cost, in particular if customs just adds VAT (the typical case). The handling fees alone are about 250 NIS, which can be really annoying for a package worth around 75 USD.

And now a few things I found out while handling a package that arrived with the customs declaration wrong (it was supposed to be a company registered as the importer, not myself). This is true for UPS, but other couriers most likely work in a similar manner.

For expedited packages, the process begins when the waybill is produced on the sender’s side. The waybill (titled “invoice”, and is carried in a small plastic bag outside the parcel) is sent electronically to the destination country, and is processed quickly, sometimes too quickly. The ID number of the recipient is often fetched from a database, probably based upon the name and phone number of the recipient. The process with the customs is then completed, possibly before the package itself has been picked up. The idea is to make the delivery as quick as possible.

This is important in particular for packages that are sent on a Thursday (as was my case), or even worse, on a Thursday afternoon: The process may be finished just before the personnel dealing with it go home for the weekend, and then there’s no way to make corrections. Not even to request that the process be stopped. The package may then arrive on Saturday and pass customs clearance before three stars have appeared in the sky.

As exactly this happened to me (I requested to stop the process on Friday, but in vain), I wanted to change the identity of the importer retroactively. I was told time and time again, by one representative after another, that it’s impossible. Refusing to pay the customs bill, I ended up, after several hours and even more “that can’t be done”s, with their ultimate problem blaster, who initially thought I just wasn’t in the mood to pay. When she realized it’s not about money but formalities, she said the magic words Tikun Rashomon (תיקון רשימון), which means correcting the import entry. It turned out to be possible, after all. But a day later, she came back to me, saying that the importer’s identity can’t be changed if the sender wrote incorrect details, and neither can this be done after the release from customs. Which made me wonder when it can be done at all. I looked it up, and it seems like corrections can be made only when a typing mistake or something of that sort was made.

She also explained what happens if the recipient refuses to pay the custom clearance costs: They fall on the sender of the package. This is what the sender of any package has to agree upon prior to sending the package. The couriers won’t do any work and not get paid. Will not happen.

As for the package itself, the sender may request to have it sent back (at the sender’s expense) or abandon the package (נטישה). In the latter case, the package is stored for 8 months, during which either of the two parties can claim it and have it delivered normally (I guess there’s some cost there too). After these 8 months, the package is destroyed.

So the bottom line is that refusing to pay the customs fees is a very good way to get things moving, in particular if the sender of the package carries part of the blame (which was definitely my case). Even if the package ends up delivered to neither side, the debt goes to the sender.

Solved: netcat (nc) doesn’t terminate at end of transmission

Introduction

I often use netcat to transmit chunks of data between two Linux machines. I usually go something like

$ pv backup-image.iso | nc -l 1234

on one machine, and then maybe

# nc 10.1.2.3 1234 > /dev/sdb1

This is an example for using another machine to write data into a USB disk-on-key, because writing to any /dev/sdX on my main computer scares me too much.

But it doesn’t quit when it finishes

So after a couple of hours of operation it’s obviously finished, but with certain pairs of computers, neither side’s netcat quits. So it’s not clear whether the very last piece of data was fully written to the target.

Immediate thing to try

If you’re stuck like this after a long netcat transmission, this might save you: Press CTRL-D on the console of the computer receiving the data. If you’re lucky, this releases netcat on both sides.

Why this happens

This was written in July 2018, reflecting the netcat versions I’ve come across.

netcat opens a bidirectional TCP link, passing one side’s stdin to the other side’s stdout and vice versa. When netcat is faced with an EOF on its standard input, it may or may not close the sending part of its TCP connection, depending on which version of netcat it is. If it indeed closed the sending connection, a FIN TCP packet is sent to the other side. The netcat program on the other side receives this FIN packet, and may or may not quit, once again, depending on which version of netcat it is. If it did quit, it returns a FIN packet, closing the connection altogether.

So we have two “may or may not”s, leading to four possible combinations of behavior.

The example above works if (and only if) the sending side sends a FIN when it sees EOF on its stdin, and that causes the other side’s netcat to quit, closing the TCP connection completely (sending a FIN packet back), which causes the first netcat to quit as well. And all is good.

Well, no. Formally, this is actually wrong behavior: Considering netcat to be a bidirectional link (this isn’t used a lot, but still), closing one direction shouldn’t cause the closing of the other. Maybe there’s data waiting for transmission in the opposite direction. It’s perfectly legal, and quite commonplace, to transmit data on a half-open TCP link.

This is probably why recent revisions of netcat will not quit on receiving a FIN packet, but only when there’s no data in either direction: After receiving a FIN on the incoming TCP line and an EOF on its stdin.

Also, recent netcat revisions ignore the EOF on their stdin until the FIN arrives, unless the -q flag is given (which is not supported by earlier versions, but neither is it needed there). This causes a potential deadlock: Even if both sides have received an EOF, neither will quit, because neither has sent the FIN packet. The -q flag solves this.

Does it matter which side is listening?

I haven’t read the sources (of which revision should I read?), but after quite some experiments I got the impression that the behavior is symmetric: Client and server behave exactly the same way. Doesn’t matter which side was listening and which was initiating the TCP connection.

So what to do

Since there are two different paradigms of netcat out there, there’s no catch-all solution. For each pair of machines, test your netcat pair before starting a heavy operation. Possibly on a short file, possibly by typing data on the console. Be sure both sides quit at the end.
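
For what it’s worth, this is the kind of quick test I have in mind; a minimal sketch, assuming both machines have some netcat flavor installed and port 1234 is free (the temporary file names are made up, the address and port are the same as in the example above). On the machine that will send (and listen), prepare a small random file and serve it:

$ dd if=/dev/urandom of=/tmp/nc-test.bin bs=1M count=1
$ md5sum /tmp/nc-test.bin
$ nc -l 1234 < /tmp/nc-test.bin

Then, on the receiving machine:

$ nc 10.1.2.3 1234 > /tmp/nc-test.out < /dev/null
$ md5sum /tmp/nc-test.out

If both netcats quit on their own and the two checksums match, the pair is good for the real thing. If they hang, try adding “-q 0” on the sending side, as discussed below.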

One thing that helps is to change the receiving part to e.g.

# nc 10.1.2.3 1234 > /dev/sdb1 < /dev/null

/dev/null supplies an EOF right away. Older netcats will send a FIN on the TCP link immediately on its establishment, so if there’s an old netcat on the other side, both sides quit right away, possibly before any data is sent. But if there are old netcats on both sides, you’re probably not bothered by this issue at all.

Newer netcats do nothing because of this immediate EOF (they ignore it until a FIN arrives), but it allows them to quit when the other side does.

Another thing to do, is to add the -q flag on the sending netcat (supported only by the newer netcats). For example:

$ pv backup-image.dat | nc -q 0 -l 1234

The “-q 0” part tells netcat to “quit” immediately after receiving an EOF (or so says the man page). My own anecdotal experiment shows that “-q 0” doesn’t make netcat quit at all, but just makes it send the FIN packet when an EOF arrives on its stdin. In other words, “-q 0” means “send a FIN packet when EOF arrives on stdin”. Something old netcats do anyhow.

This is good enough to get out of the deadlock mentioned above: When the data stream ends, the sending part sends a FIN because of the “-q 0” flag. The receiving part now has an EOF by virtue of /dev/null, and a FIN from the sending part, so it quits, sending a FIN back. Now the first side has an EOF and a FIN, and quits as well.

Note that the “-q 0” is more important than the /dev/null trick: If the receiving side has quit, we know all data has been flushed. It therefore doesn’t matter so much that a CTRL-C is needed to release the sending side. Doesn’t matter, but doesn’t add a feeling of confidence when the transmission is really important.

And this brings me back to what I began with: Each pair of computers needs a test before attempting something heavy. Sadly.

Synplify Pro on Linux Mint 18.1: The cheat sheet

Introduction

I needed to run Synplify Pro for a short trial period on my Fedora 12 machine (yup, it’s 2018, and still). And I have a full Mint 18.1 as a chroot jail on that machine for installing contemporary software.

So these are my notes on the go. Consider everything below as run on Mint 18.1 x86_64 (which is an Ubuntu derivative), except for the licensing manager, which I eventually ran directly on the Fedora 12 oldie (x86_64 as well).

I should point out that Synopsys officially supports only a few enterprise distributions (SUSE and RHEL), which explains why small tweaks were necessary. But once that was over with, all was fine.

Synopsys offers extensive documentation for all this, of course. As the title implies, this should be considered as a cheat sheet, nothing more.

Download the stuff

In essence, three parts are needed: Synopsys’ installer program, the tool itself and the licensing manager. In my case, I downloaded the contents of the following directories (into separate directories on my own computer):

  • /rev/installer_v4.1
  • /rev/s_fpga_d_vN-2018.03-SP1
  • /rev/scl_v2018.06

These three directories happened to be everything under /rev/ in my case (this is what they prepared for me, I suppose). So I grabbed it all. This included some large files for Windows, which I surely didn’t need, but it’s easier to fetch all files and wait longer than to use my own brain, for example.

Make the system ready

C-shell is used by the installation scripts (and others, possibly):

# apt-get install csh

Synplify itself expects an LSB (Linux Standard Base) setup, in particular a symlink in /lib64/ for the ELF loader.

# apt-get install lsb-core

Without this, the licensing-related programs fail with something like:

$ ./lmhostid
-bash: ./lmhostid: No such file or directory

And then you go “But what??? The file is there!” and you’re right, the file is there, but the ELF loader which the executable requests isn’t, because it’s /lib64/ld-lsb-x86-64.so.3. I’ve discussed this issue in another post.
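
By the way, a quick way to confirm that this is indeed the problem (plain binutils, nothing Synopsys-specific) is to ask the executable which loader it requests:

$ readelf -l ./lmhostid | grep interpreter
      [Requesting program interpreter: /lib64/ld-lsb-x86-64.so.3]

If the file on the right doesn’t exist on the system, bash’s misleading “No such file or directory” is exactly what you get.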

Also, some scripts have their shebang pointing at /bin/sh (!). Unfortunately, Debian’s standard symlink for /bin/sh goes to /bin/dash (because working out of the box is for the weak). So

# cd /bin
# mv sh old-sh
# ln -s bash sh

If you don’t change this symlink, the typical error goes “synplify-pro-exec/fpga/N-2018.03-SP1/bin/config/execute: Syntax error: “(” unexpected (expecting “;;”)”
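
By the way, if replacing the symlink by hand feels too crude, Debian-based systems (Mint included, I suppose) offer a more orderly way to flip the very same symlink through debconf; answer “No” to the question about using dash as the default system shell:

# dpkg-reconfigure dash

Either way, the end result is /bin/sh pointing at bash.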

Then create another symlink from /usr/tmp to /tmp, because the licensing daemon creates its lock file there. As root:

# cd /usr
# ln -s /tmp

Installing

Refer to Synopsys’ installation guide for how to run through the installation. This is just a brief overview.

Installing doesn’t require being root, if the target directories are owned by the user.

First, install the installer. In the directory where the installer was downloaded to, go

$ ./SynopsysInstaller_v4.1.run

And extract the installer into some other directory.

Navigate to the directory to which the installer was installed, and go

$ ./setup.sh

for a GUI installer, or

$ ./installer

for the textual version (for those working over ssh).

The installer should be run (at least) twice: Once to install the tool of interest, and a second time to install the licensing manager.

Each time, tell the installer where Synopsys’ installation files were downloaded to, and then where the installed program should go. Both directories are different for each of the two installations.

Editing the licensing file

Edit the SERVER line, replacing “hostname1” with the actual host name (as returned by “uname -n”).

There is no need to change the VENDOR line. At least in my case, it worked fine as is.
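
Just to give an idea of what these two lines look like, here’s a sketch based on the hostid, port and vendor daemon name that appear in the outputs further down (your license file will obviously differ):

$ grep -E '^(SERVER|VENDOR)' /path/to/license.txt
SERVER myhost.localdomain 200247EDD334 27020
VENDOR snpslmd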

Check the licensing file

Be sure it’s valid. Navigate to where the licensing manager was installed (e.g. scl/2018.06/linux64/bin), and go (below is a successful validation for a temporary key):

$ ./sssverify /path/to/license.txt 

Integrity check report for license file "/path/to/license.txt".
Report generated on 06-Jul-2018 (SCL_2018.06)
---------------------------------------------------------
Checking the integrity of the license file...
Valid SSST feature found.
Licensed to Temp Keys for New Customers
Siteid: 5.1, Server Hostid: 200247EDD334, Issued on: 7/4/2018
License file integrity check PASSED!
---------------------------------------------------------
You may now USE this license file to start your license server.
Please don't edit or manipulate the contents of this license file.

Or use the -pinfo flag for a list of licensed features:

$ ./sssverify -pinfo /path/to/license.txt
=============================================================================================
	PRODUCT TO FEATURE MAPPING REPORT GENERATED ON 6/7/2018
---------------------------------------------------------------------------------------------
License File: /path/to/license.txt
Site ID: NEWSITE
Host ID: 200247EDD334
Key File Date: 07/04/2018
SCL Version: SCL_2018.06
=============================================================================================

=============================================================================================
Product: *****			 Serial Number: (SN=0:0)
---------------------------------------------------------------------------------------------
Feature Name                     Expiry-Date  Daemon       Version Quantity        Start-Date
---------------------------------------------------------------------------------------------
SSST                             22-Jul-2018  snpslmd      1.0            1       04-Jul-2018
=============================================================================================

=============================================================================================
Product: *****			 Serial Number: (SN=4881-0:503161)
---------------------------------------------------------------------------------------------
Feature Name                     Expiry-Date  Daemon       Version Quantity        Start-Date
---------------------------------------------------------------------------------------------
synplifypro_altera               22-jul-2018  snpslmd      2018.03        1
=============================================================================================

Start the licensing manager

OK, this is the only place where I left my Mint chroot jail, because I got

18:07:13 (snpslmd) Cannot open daemon lock file
18:07:13 (snpslmd) EXITING DUE TO SIGNAL 41 Exit reason 9
18:07:13 (lmgrd) snpslmd exited with status 41 (Exited because another server was running)

and then it just worked on Fedora 12, so what the heck with that. Word has it that the licensing manager doesn’t work on reiserfs, and I also spotted with strace that this failure occurs immediately after getdents() system calls on the root directory, which was a fake root in my case. So maybe because of that, maybe something else I didn’t get right, or more precisely: Didn’t bother to get right.

Anyhow, root privileges aren’t required, and neither is any environment variable.

$ cd /path/to/scl/2018.06/linux64/bin
$ ./lmgrd -c /path/to/license.txt

And of course, if you’re really into it, make a service for this on your machine.

Is there a licensing manager running?

Is it up?

$ ./lmstat
lmstat - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
Flexible License Manager status on Fri 7/6/2018 20:59

License server status: 27020@myhost.localdomain
    License file(s) on myhost.localdomain: /path/to/license.txt:

myhost.localdomain: license server UP (MASTER) v11.14.1

Vendor daemon status (on myhost.localdomain):

   snpslmd: UP v11.14.1

What’s its process (for killing)?

$ ps aux | grep lmg
eli      27499  0.0  0.0  17760  1428 pts/15   S    17:41   0:00 ./lmgrd -c /path/to/license.txt

Shut down the licensing manager

$ ./lmdown -c /path/to/synplify-pro-exec/license.txt

This works, however, only if the licensing manager went up OK. Otherwise, it might say it shut down the daemon, but there’s still a process running.
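
In that case, just kill the leftover process by hand, using the PID found with ps as shown above (and check with ps again that no snpslmd process was left behind either):

$ kill 27499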

Running Synplify Pro

Finally there.

Be sure that the licensing manager is up and running, and go:

$ SNPSLMD_LICENSE_FILE='27020@localhost' ./synplify_pro &

Synopsys’ docs tell us to set and export the environment variable, but this way works, and this is how I like it.
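
For the record, the documented variant amounts to the same thing, and is handier if several tools are launched from the same shell:

$ export SNPSLMD_LICENSE_FILE=27020@localhost
$ ./synplify_pro &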

Cyclone V and some transceiver CDR/PLL parameters

Introduction

Connecting an Intel FPGA (Altera) Cyclone V’s Native Transceiver IP to a USB 3.0 channel (which involves a -5000 ppm Spread Spectrum modulation), I got a significant bit error rate and what appeared to be occasional losses of lock. Suspecting that the CDR didn’t catch up with the frequency modulation, I wanted to try out a larger PLL bandwidth, i.e. track more aggressively at the expense of higher jitter. That turned out to be not so trivial.

This post sums up my findings related to Quartus. As for solving the original problem (bit errors and that), changing the bandwidth made no difference.

Toolset: Quartus Lite 15.1 on Linux.

And by the way, the problem turned out to be unrelated to the PLL; rather, it was the lack of an equalizer on Cyclone V’s receiver, hence no canceling of the low-pass filtering effect of the USB 3.0 cable. I worked around this by setting XCVR_RX_LINEAR_EQUALIZER_CONTROL to 2 in the QSF file, and the errors were gone. However, this just activates a constant compensating high-pass filter on the receiver’s input (see the Cyclone V Device Datasheet, CV-51002, 2018.05.07, Figure 4) and consequently works around the problem for a specific cable, not more.

Assignments in the QSF file

In order to change the CDR’s bandwidth, assignments in the QSF file are required, as detailed in the V-Series Transceiver PHY IP Core User Guide (UG-01080, 2017.07.06), in the section “Analog Settings for Cyclone V Devices” and on page 20-28. In principle, CDR_BANDWIDTH_PRESET should be set to High instead of its default “Auto”. In this post, I’ll also set PLL_BANDWIDTH_PRESET to High, even though I’m quite confident it has nothing to do with locking to data (rather, it controls locking to the reference clock). But it causes quite some confusion, as shown below.

So all that is left is to nail down the CDR’s instance name, and assign it these parameters.

Now first, what not to do: Using wildcards. This is quite tempting because the path to the CDR is very long. So at first, I went for this, which is wrong:

set_instance_assignment -name CDR_BANDWIDTH_PRESET High -to *|xcvr_inst|*rx_pma.rx_cdr
set_instance_assignment -name PLL_BANDWIDTH_PRESET High -to *|xcvr_inst|*rx_pma.rx_cdr

And nothing happened, except a small notice in some very important place of the fitter report:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
; Ignored Assignments                                                                                                                                                                                   ;
+--------------------------------------------------+---------------------------+--------------+------------------------------------------------------------+---------------+----------------------------+
; Name                                             ; Ignored Entity            ; Ignored From ; Ignored To                                                 ; Ignored Value ; Ignored Source             ;
+--------------------------------------------------+---------------------------+--------------+------------------------------------------------------------+---------------+----------------------------+
; Merge TX PLL driven by registers with same clear ; altera_xcvr_reset_control ;              ; alt_xcvr_reset_counter:g_pll.counter_pll_powerdown|r_reset ; ON            ; Compiler or HDL Assignment ;
; CDR Bandwidth Preset                             ; myproj                    ;              ; *|xcvr_inst|*rx_pma.rx_cdr                                 ; HIGH          ; QSF Assignment             ;
; PLL Bandwidth Preset                             ; myproj                    ;              ; *|xcvr_inst|*rx_pma.rx_cdr                                 ; HIGH          ; QSF Assignment             ;
+--------------------------------------------------+---------------------------+--------------+------------------------------------------------------------+---------------+----------------------------+

Ayeee. So it seems like there’s no choice but to spell out the entire path. I haven’t investigated this thoroughly, though. Maybe there is some form of wildcards that would work. I also discuss this topic briefly in another post of mine.

So this is more like it:

set_instance_assignment -name CDR_BANDWIDTH_PRESET High -to frontend_ins|xcvr_inst|xcvr_inst|gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|inst_av_pma|av_rx_pma|rx_pmas[0].rx_pma.rx_cdr
set_instance_assignment -name PLL_BANDWIDTH_PRESET High -to frontend_ins|xcvr_inst|xcvr_inst|gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|inst_av_pma|av_rx_pma|rx_pmas[0].rx_pma.rx_cdr

I guess this clarifies why wildcards are tempting.

Verifying something happened

This is where things get confusing. Looking at the fitter report, in the part on transceivers, this was the output before adding the QSF assignments above (pardon the wide line, this is what the Fitter produced):

;         -- Name                                                                                           ; frontend:frontend_ins|xcvr:xcvr_inst|altera_xcvr_native_av:xcvr_inst|av_xcvr_native:gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|av_pma:inst_av_pma|av_rx_pma:av_rx_pma|rx_pmas[0].rx_pma.rx_cdr                                                                                                                                                                                                                                                     ;
;         -- PLL Location                                                                                   ; CHANNELPLL_X0_Y49_N32                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ;
;         -- PLL Type                                                                                       ; CDR PLL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ;
;         -- PLL Bandwidth Type                                                                             ; Auto (Medium)                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ;
;         -- PLL Bandwidth Range                                                                            ; 2 to 4 MHz

And after adding the QSF assignments:

;         -- Name                                                                                           ; frontend:frontend_ins|xcvr:xcvr_inst|altera_xcvr_native_av:xcvr_inst|av_xcvr_native:gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|av_pma:inst_av_pma|av_rx_pma:av_rx_pma|rx_pmas[0].rx_pma.rx_cdr                                                                                                                                                                                                                                                     ;
;         -- PLL Location                                                                                   ; CHANNELPLL_X0_Y49_N32                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ;
;         -- PLL Type                                                                                       ; CDR PLL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ;
;         -- PLL Bandwidth Type                                                                             ; High                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ;
;         -- PLL Bandwidth Range                                                                            ; 4 to 8 MHz

Bingo, huh? Well, not really. Which of these two assignments made this happen? CDR_BANDWIDTH_PRESET or PLL_BANDWIDTH_PRESET? In other words: Does the fitter report tell us about the bandwidth of the PLL on the reference clock or the data?

The answer is PLL_BANDWIDTH_PRESET. Setting CDR_BANDWIDTH_PRESET doesn’t change anything in the Fitter report at all. I know it all too well (after spending some pleasant quality time trying to figure out why, before realizing it’s about PLL_BANDWIDTH_PRESET).

So where does CDR_BANDWIDTH_PRESET do its trick?

To find that, one needs to get down to the post-fitting properties of the rx_cdr instance. The following sequence applies to Quartus 15.1’s GUI:

After fitting, select Tools > Netlist Viewers > Technology Map Viewer (Post-Fitting). Locate the instance in the Find tab (to the left; it’s a plain substring search on the instance name given in the QSF assignment). Once found, click on the block in the graphics display so its bounding box becomes red, and then right-click this block. On the menu that shows up, select Locate in Resource Property Editor.

And that displays a list of properties (which can be exported into a CSV file). One of which is rxpll_pd_bw_ctrl. Changing CDR_BANDWIDTH_PRESET to High altered this property’s value from 300 to 600. Changing it to Low sets it to 240.

And by the way, a change in PLL_BANDWIDTH_PRESET to High has no impact on any of the properties listed in the Resource Property Editor for the said instance, but making it Low takes pfd_charge_pump_current_ctrl from 30 to 20, and rxpll_pfd_bw_ctrl from 4800 to 3200. Whatever that means.

It’s worth mentioning that the CDR is instantiated as an arriav_channel_pll primitive (yes, an Arria V primitive on a Cyclone V FPGA) in the av_rx_pma.sv module (generated automatically for the Transceiver Native PHY IP). One of the instantiation parameters is rxpll_pd_bw_ctrl, which is assigned 300 by default. The source file doesn’t change as a result of the said change in the QSF file. So the tools somehow change something post-synthesis. I guess.

There are, however, no instantiation parameters for either pfd_charge_pump_current_ctrl or rxpll_pfd_bw_ctrl. So the rxpll_pd_bw_ctrl naming match is probably more of a coincidence. Once again, I guess.

A closer look at the PLL

It’s quite clear from the above that CDR_BANDWIDTH_PRESET influenced rxpll_pd_bw_ctrl (note the _pd_ part) and that PLL_BANDWIDTH_PRESET is related to a couple of parameters with pfd in them. This terminology goes along with the one used in the documentation (see e.g. Figure 1-17, “Channel PLL Block Diagram” in Cyclone V Device Handbook Volume 2: Transceivers, cv_5v3.pdf, 2016.01.28): There, PFD relates to the Lock-To-Reference loop, which locks on the reference clock, and PD relates to the Lock-To-Data loop, which is the CDR.

This isn’t just a curiosity, because the VCO’s output dividers, L, are assigned separately for the PD and PFD loops (see the fitter report as well as Table 1-9).

As for the numbers in the fitter report, they match the doc’s as shown in the two relevant segments below. The first relates to a Native PHY IP, and the second to a PCIe PHY, both on the same design, both targeted at 5 Gb/s (and hence having the same “Output Clock Frequency”).

;         -- Reference Clock Frequency                                                                      ; 100.0 MHz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ;
;         -- Output Clock Frequency                                                                         ; 2500.0 MHz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ;
;         -- L Counter PD Clock Disable                                                                     ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- M Counter                                                                                      ; 25                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ;
;         -- PCIE Frequency Control                                                                         ; pcie_100mhz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ;
;         -- PD L Counter                                                                                   ; 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- PFD L Counter                                                                                  ; 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- Powerdown                                                                                      ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- Reference Clock Divider                                                                        ; 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;

versus

;         -- Reference Clock Frequency                                                                      ; 100.0 MHz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ;
;         -- Output Clock Frequency                                                                         ; 2500.0 MHz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ;
;         -- L Counter PD Clock Disable                                                                     ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- M Counter                                                                                      ; 25                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ;
;         -- PCIE Frequency Control                                                                         ; pcie_100mhz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ;
;         -- PD L Counter                                                                                   ; 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- PFD L Counter                                                                                  ; 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- Powerdown                                                                                      ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- Reference Clock Divider                                                                        ; 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;

In both transceivers, a 2500 MHz clock is generated from a 100 MHz reference clock. It seems like the trick to understanding what’s going on is noting footnote (2) of Table 1-17, saying that the output of L_PD is the one that applies when the PLL is configured as a CDR.

In the first case, the reference clock is fed to the phase detector undivided, so 100 MHz reaches one of its inputs. As the VCO’s output is divided by PFD_L = 2 and then by M = 25 on its way back to the phase detector, the VCO has to run at 5000 MHz so that its output, divided by 50, matches the 100 MHz reference. That doesn’t seem very clever to me (why not pick L = 1, and avoid 5 GHz, which I’m not even sure is possible on that silicon?). But at least the math adds up: The VCO’s output is divided by PD_L = 2, and we have 2500 MHz.

Now to the second case (PCIe): The reference clock is divided by 2, so the phase detector is fed with a 50 MHz reference. The VCO’s clock is divided by PFD_L = 2 and then by M = 25, and hence the VCO runs at 2500 MHz. This way, the total division by 50 (again) matches the 50 MHz reference on the phase detector. PD_L = 1, so the VCO’s output is used undivided, hence an output clock of 2500 MHz, again.
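
To spell out the arithmetic of the two cases (my own reading of the numbers in the report segments above):

Native PHY: 100 MHz refclk / 1 = 100 MHz at the phase detector
            VCO = 100 MHz x 2 (PFD L) x 25 (M) = 5000 MHz
            output = 5000 MHz / 2 (PD L) = 2500 MHz

PCIe PHY:   100 MHz refclk / 2 = 50 MHz at the phase detector
            VCO = 50 MHz x 2 (PFD L) x 25 (M) = 2500 MHz
            output = 2500 MHz / 1 (PD L) = 2500 MHz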

I’m not sure that I’m buying this explanation myself, actually, but it’s the only way I found to make sense of these figures. At some point I tried to convince the tools to divide the reference clock by 2 on the Native PHY (first case above) by adding

set_instance_assignment -name PLL_PFD_CLOCK_FREQUENCY "50 MHz" -to "frontend:frontend_ins|xcvr:xcvr_inst|altera_xcvr_native_av:xcvr_inst|av_xcvr_native:gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|av_pma:inst_av_pma|av_rx_pma:av_rx_pma|rx_pmas[0].rx_pma.rx_cdr"

to the QSF file. This assignment was silently ignored. It wasn’t mentioned anywhere in the reports (not even in the Ignored Assignments part), but the Reference Clock Divider remained at 1. I should mention that this assignment isn’t documented for Cyclone V, but Quartus’ Assignment Editor nevertheless agreed to generate it. And Quartus usually refuses to load a project if anything is fishy in the QSF file.

Quartus, timing closure: Obtaining a concise multi-corner timing path report

Introduction

The natural thing to do when an FPGA design fails timing is to take a detailed look at the critical paths, based upon a timing report showing the logic elements along each path and their delays.

If you’re not a heavy user of Intel’s FPGAs (a.k.a. Altera), it may not be so trivial to figure out how to obtain this report. And even worse, you might unknowingly be looking at the wrong report.

The first thing to have sorted out is the concept of multi-corner timing analysis. Without proving that this is sufficient and/or necessary (mainly because I don’t know. Can anyone present a solid proof?), the common practice is to verify an FPGA’s timing validity by ensuring that the timing constraints are met in four cases: The minimal and maximal temperature, and a “slow” and “fast” model, which makes four combinations, or as they are referred to, four corners.

When looking at the critical paths, it’s therefore important to look at the paths at all four corners. This is often overlooked: For example, just generating a timing report in TimeQuest typically produces the report for a single corner.

So this post describes how to get the report that says something. It relates to Quartus Prime 17.1 Lite.

Everything said below (including the scripts) works on Quartus Prime 15.1 Lite as well, except that this version (and earlier, I suppose) doesn’t generate multi-corner reports in any form. This makes the HTML report generation option attractive, as these reports are easier to work with.

I should mention two other related posts in this blog: One taking a look at the relation between input / output constraints and the timing report, and another experimenting a bit with Tcl scripting with TimeQuest.

Getting a multi-corner report: Scripted & quick

First, copy the following Tcl script into a file, say, timing.tcl:

create_timing_netlist
read_sdc
update_timing_netlist

foreach_in_collection op [get_available_operating_conditions] {
  set_operating_conditions $op

  report_timing -setup -npaths 20 -detail full_path -multi_corner \
    -panel_name "Critical paths"
}

Don’t let the “multi_corner” flag confuse you: Each call to report_timing covers one corner. It’s not clear if this flag does anything.

Now to action:

  • In Quartus, expand the TimeQuest group in the Task pane, and open TimeQuest Timing Analyzer.
  • In TimeQuest Timing Analyzer, pick Script > Run Tcl Script… from the menu bar, and select the Tcl script (e.g. timing.tcl).
  • An entry named “Critical paths” is added to the TimeQuest Timing Analyzer’s Report pane. Click on Multi-Corner Summary. A list of paths and their details now fill the main panes.
  • To export all path information into a textual file, right-click Multi-Corner Summary, and select “Export…”. Choose a name for an output file with a .rpt suffix. HTML reports are not supported (they will be empty).

There will also be four separate reports in the same entry, one for each corner. On earlier versions of Quartus, only these will appear (i.e., no Multi-Corner Summary).

Generate HTML / text reports only

The tools can generate neat HTML reports, which are considerably more comfortable to read than TimeQuest’s own GUI. Alas, these reports only cover one corner each. This script generates four HTML reports (it’s a whole bunch of files, JQuery script files, CSS and whatnot. Bells and whistles, but not a multi-corner report).

Save the following script as timing-html.tcl:

#project_open myproj
create_timing_netlist
read_sdc
update_timing_netlist

foreach_in_collection op [get_available_operating_conditions] {
  set_operating_conditions $op

  report_timing -setup -npaths 20 -detail full_path -multi_corner \
    -file "timing_paths_$op.html" \
    -panel_name "Critical paths for $op"
}

For a plain textual report, change the -file flag’s argument, so the suffix is .rpt or .txt instead of .html.

Note the “project_open” command which is commented out at the top of the script. If it’s uncommented and “myproj” is replaced with the actual project name, a plain shell command line can be used to generate the HTML reports with something like

$ /path/to/quartus/bin/quartus_sta -t timing-html.tcl

I haven’t however found a way to generate a multi-corner report like this.

In order to have these reports generated in each implementation (which is recommended), add a line like the following to the QSF file:

set_global_assignment -name TIMEQUEST_REPORT_SCRIPT relative/path/to/timing-html.tcl

When included in a QSF file, the said Tcl script should not call project_open (comment it out or delete it).

The GUI only method

A multi-corner report can be obtained with just pointing and clicking:

  • In Quartus, expand the TimeQuest group in the Task pane, and open TimeQuest Timing Analyzer.
  • Inside the Timing Analyzer’s Tasks pane, double-click “Update Timing Netlist”.
  • In the same pane, scroll down to “Custom Reports” and double-click “Report Timing…”
  • A dialog box opens. Accept the defaults, and click “Report Timing” below.
  • In the Report pane, a “Report Timing” entry will be added. Expand it and right-click it. In the menu that opens, click “Generate in All Corners”.
  • Click on the “Multi Corner Summary” group and possibly export the report as outlined above.

Making any IP in the IP Catalog available in QSys

Introduction

I needed the Cyclone V Transceiver Native PHY IP Core inside QSys. Why? Actually, as part of a failed attempt to solve a compilation error.

The IP is available in Quartus 15.1’s IP Catalog, but inside the same toolkit’s QSys it doesn’t appear in the list of IPs. As discussed in this forum thread, this is intentional: Altera doesn’t support having it inside QSys, seemingly because it’s not “fully verified”. OK, so I’ll take the risk. How do I make QSys list this IP, so it can be included?

The fix

As mentioned in this guide, the thing is that IPs which are hidden from QSys have the INTERNAL property set to “true”. All that is left is hence to edit the relevant Tcl file, and update the IP database.

Mission number one is to find the correct Tcl file. The hints on the file’s name are:

  • It’s probably related to the IP’s name and functionality
  • It ends with *_hw.tcl
  • The FPGA family is denoted by “av”, “cv”, “sv” etc.

Eventually the file I was looking for was at /path/to/quartus/ip/altera/alt_xcvr/altera_xcvr_native_phy/cv/tcl/altera_xcvr_native_cv_hw.tcl. Unlike many other HW Tcl files, it doesn’t just assign parameters directly (in which case it’s easy to spot the assignment to INTERNAL), but it merely consists of adding a couple of directories to some search path, and then it goes:

::altera_xcvr_native_cv::module::declare_module

which refers to module.tcl, which has the following code snippet:

  namespace export \
    declare_module

  # Internal variables
  variable module {\
    {NAME                   VERSION                 INTERNAL  ANALYZE_HDL EDITABLE  ELABORATION_CALLBACK                        PARAMETER_UPGRADE_CALLBACK                    DISPLAY_NAME                        GROUP                                 AUTHOR                DESCRIPTION DATASHEET_URL                                           DESCRIPTION  }\
    {altera_xcvr_native_cv  15.1  true      false       false     ::altera_xcvr_native_cv::module::elaborate  ::altera_xcvr_native_cv::parameters::upgrade  "Cyclone V Transceiver Native PHY"  "Interface Protocols/Transceiver PHY" "Altera Corporation"  NOVAL       "http://www.altera.com/literature/ug/xcvr_user_guide.pdf" "Cyclone V Transceiver Native PHY."}\
  }
}

This is an assignment of multiple variables: The names of the variables are listed in the first pair of curly brackets, and the values in the second. As the third variable is INTERNAL, that’s the one to fix. So the actual edit consists of changing the first “true” in the value row (the third field, right after the version number) to “false”.
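
In practice, something along these lines; a sketch only, with the path taken from my installation, and with the edit itself done in a plain text editor (the exact whitespace in the file may differ from what’s shown above, so I wouldn’t trust a sed one-liner here):

$ cd /path/to/quartus/ip/altera/alt_xcvr/altera_xcvr_native_phy/cv/tcl
$ cp module.tcl module.tcl.orig
$ nano module.tcl
$ diff module.tcl.orig module.tcl

The diff should show exactly one changed line, with “true” turning into “false”.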

Updating the IP catalog

Making the change above isn’t enough on its own. The IP Catalog cache must be updated as well.

Change directory to something like /path/to/quartus/ip/altera/ and set up the environment variables:

$ ../../nios2eds/nios2_command_shell.sh

and then create an IP Catalog cache:

$ ip-make-ipx

Once done, overwrite the previous file (you may want to make a copy of it first):

$ mv components.ipx altera_components.ipx

And now restart Quartus. The said IP now appears in QSys’ IP Catalog.