01signal: Quartus: Packing registers into I/O cells

Often I prefer to handle I/O timing by ensuring that all registers are put in into the I/O cells. Where timing matters, that is.

It seems like I/O register packing isn’t the default with Quartus. Anyhow, here’s the lazy man’s recipe for this scenario.

In a previous version of this post, I suggested to disable timing checking on all I/Os. This silences the unconstrained path warning during implementation, and in particular, it prevents the "TimeQuest Timing Analyzer" section in Quartus’ reports pane from turning red:

set_false_path -from [get_ports]
set_false_path -to [get_ports]

This isn’t such a good idea, it turns out, in particular regarding input ports. This is elaborated further below.

Nevertheless, one needs to convince the fitter to put the registers in the I/O block. In the QSF, add

set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to *
set_instance_assignment -name FAST_INPUT_REGISTER ON -to *
set_instance_assignment -name FAST_OUTPUT_ENABLE_REGISTER ON -to *

It’s somewhat aggressive to make these assignments on absolutely every register, but it does the job. The fitter issues warnings for the I/O elements that it fails to enforce these constraints on, which is actually a good thing.

To see how well it went, look in the "Resource Section" of the fitter report (possibly find it in Quartus’ reports pane) and look for "Input Registers" etc., whatever applies.

The difference is evident in timing reports of paths that involve I/O cells. For example, compare this path which involves an I/O register:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-----------------------+-----------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location              ; Element         ;
+---------+---------+----+------+--------+-----------------------+-----------------+
; 2.918   ; 2.918   ;    ;      ;        ;                       ; data path       ;
;   0.000 ;   0.000 ;    ;      ; 1      ; DDIOOUTCELL_X3_Y0_N32 ; rst             ;
;   0.465 ;   0.465 ; RR ; CELL ; 1      ; DDIOOUTCELL_X3_Y0_N32 ; rst|q           ;
;   0.465 ;   0.000 ; RR ; IC   ; 1      ; IOOBUF_X3_Y0_N30      ; RESETB~output|i ;
;   2.918 ;   2.453 ; RR ; CELL ; 1      ; IOOBUF_X3_Y0_N30      ; RESETB~output|o ;
;   2.918 ;   0.000 ; RR ; CELL ; 0      ; PIN_P3                ; RESETB          ;
+---------+---------+----+------+--------+-----------------------+-----------------+

Note the DDIOOUTCELL element, and the zero increment in the routing between the register and the IOOBUF.

For comparison, here’s a path for which an I/O register wasn’t applied (because it was prevented by logic):

+--------------------------------------------------------------------------------+
; Data Arrival Path                                                              ;
+---------+---------+----+------+--------+-----------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location        ; Element             ;
+---------+---------+----+------+--------+-----------------+---------------------+
; 8.284   ; 8.284   ;    ;      ;        ;                 ; data path           ;
;   0.000 ;   0.000 ;    ;      ; 1      ; FF_X3_Y0_N17    ; Dir_flop_sig        ;
;   0.496 ;   0.496 ; RR ; CELL ; 8      ; FF_X3_Y0_N17    ; Dir_flop_sig|q      ;
;   2.153 ;   1.657 ; RR ; IC   ; 1      ; IOOBUF_X3_Y0_N9 ; DATA[7]~output|oe   ;
;   8.284 ;   6.131 ; RF ; CELL ; 1      ; IOOBUF_X3_Y0_N9 ; DATA[7]~output|o    ;
;   8.284 ;   0.000 ; FF ; CELL ; 1      ; PIN_T3          ; DATA[7]             ;
+---------+---------+----+------+--------+-----------------+---------------------+

Here we see how a general-purpose flip-flop generates the signal, leading to a routing delay that amounts to 1.657 ns. The main problem is that this routing delay will be different each implementation, so if there’s a signal integrity issue with the board, the FPGA might be blamed for it, since different FPGA design versions will seem to fix the problem or make it reappear.

Timing constraints

Both the input ports and the output ports should have tight timing constraints, so they can’t be met other than taking full advantage of I/O registers. Not only will this generate a timing failure if something goes wrong with the desired register packing, but it’s also necessary to achieve the minimal input-to-register timing, as explained next.

The discussion below applies only when the clock that drives the registers is directly related to an external clock (i.e. with a PLL that multiplies the clock with e.g. an integer number). If the clock that drives the registers is practically unrelated to the external clock, things get significantly more complicated, as discussed in this post.

To demonstrate this issue, consider the following Verilog code:

module top
  (
   input        clk,
   input        in,
   output reg   out
   );

   reg 		in_d, in_d2;
   wire  	pll_clk;

   always @(posedge pll_clk)
     begin
	in_d <= in;
	in_d2 <= in_d;
	out <= in_d2;
     end

  /* Here comes an instantiation of a phase-compensating PLL, which
     doesn't change the frequency */
endmodule

Also consider the following constraint in the SDC file:

create_clock -name main_clk -period 10 -waveform { 0 5 } [get_ports {clk}]

derive_pll_clocks
derive_clock_uncertainty

set_input_delay -clock main_clk -max 8.5 [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]

As explained on this post, set_input_delay is the maximal delay of the source of the signal, from clock to a valid logic state. Since the clock’s time period is set to 10 ns, setting the delay constraint to 8.5 ns allows for 1.5 ns until the following clock arrives (at 10 ns). In other words, the setup time on the FPGA's pin is has a constraint that forces it not to exceed 1.5 ns.

Note that set_max_delay can be used as well for this purpose (in some cases it’s the only way) as discussed in this post.

A compilation of this (along with the FAST_INPUT_REGISTER ON QSF assignment shown above) yields the following segment in the timing report:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-------------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location          ; Element             ;
+---------+---------+----+------+--------+-------------------+---------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                   ; launch edge time    ;
; 0.000   ; 0.000   ;    ;      ;        ;                   ; clock path          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                   ; clock network delay ;
; 8.500   ; 8.500   ; F  ; iExt ; 1      ; PIN_F2            ; in                  ;
; 9.550   ; 1.050   ;    ;      ;        ;                   ; data path           ;
;   8.500 ;   0.000 ; FF ; IC   ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|i          ;
;   9.308 ;   0.808 ; FF ; CELL ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|o          ;
;   9.308 ;   0.000 ; FF ; IC   ; 1      ; FF_X0_Y22_N17     ; in_d|d              ;
;   9.550 ;   0.242 ; FF ; CELL ; 1      ; FF_X0_Y22_N17     ; in_d                ;
+---------+---------+----+------+--------+-------------------+---------------------+

Unlike the case with the output register, there is no flip-flop with the type "DDIOINCELL" in the list, but there's something that looks like a regular flip-flop instead. However, note that the wiring to this flip-flop has zero delay (marked in red), which is a clear indication that the flip-flop and input buffer are fused together.

The datasheet report for this input says:

+---------------------------------------------------------------------------------------------------+
; Setup Times                                                                                       ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise  ; Fall  ; Clock Edge ; Clock Reference                             ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; in        ; main_clk   ; 1.282 ; 1.461 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+-------+-------+------------+---------------------------------------------+

+-----------------------------------------------------------------------------------------------------+
; Hold Times                                                                                          ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise   ; Fall   ; Clock Edge ; Clock Reference                             ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; in        ; main_clk   ; -0.683 ; -0.862 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+--------+--------+------------+---------------------------------------------+

As required, the setup time that is required by the FPGA is lower than the 1.5 ns limit set by the constraint.

Now let’s loosen the input setup delay by 2 ns, and leave everything else as it was, and rerun the compilation:

set_input_delay -clock main_clk -max 6.5 [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]

The segment in the timing report now says:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-------------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location          ; Element             ;
+---------+---------+----+------+--------+-------------------+---------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                   ; launch edge time    ;
; 0.000   ; 0.000   ;    ;      ;        ;                   ; clock path          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                   ; clock network delay ;
; 6.500   ; 6.500   ; F  ; iExt ; 1      ; PIN_F2            ; in                  ;
; 8.612   ; 2.112   ;    ;      ;        ;                   ; data path           ;
;   6.500 ;   0.000 ; FF ; IC   ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|i          ;
;   7.308 ;   0.808 ; FF ; CELL ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|o          ;
;   8.370 ;   1.062 ; FF ; IC   ; 1      ; FF_X0_Y22_N17     ; in_d|d              ;
;   8.612 ;   0.242 ; FF ; CELL ; 1      ; FF_X0_Y22_N17     ; in_d                ;
+---------+---------+----+------+--------+-------------------+---------------------+

Huh? The interconnect suddenly rose to 1.062 ns?! Note that the placement of the register didn’t change, so there’s no doubt that in_d is an I/O register. So where did this delay come from?

To answer this, a closer look on the design is required. After a full compilation and selecting Tools > Netlist Viewers > Technology Map Viewer (Post-Fitting), the following diagram appears (partly shown below, click to enlarge):

Right-clicking in_d (the register) and selecting Locate Note > Locate in Resource Property Editor reveals the following (click to enlarge):

To the right of this drawing (not shown above), the property "Input Pin to Input Register Delay" is set to 2. This is the reason for the delay. Before the constraint was loosened up, it was set to 0. The immediate lesson is:

If the setup constraint isn’t set to the technology’s best possible value, Quartus may add a delay on its expense.

But why, Quartus, why?

So one may wonder why Quartus inserts this delay between the input pad and the register. Wasn’t the whole point to sample as soon as possible? To answer this, let’s look at the updated datasheet report:

---------------------+
; Data Port ; Clock Port ; Rise  ; Fall  ; Clock Edge ; Clock Reference                             ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; in        ; main_clk   ; 2.205 ; 2.523 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+-------+-------+------------+---------------------------------------------+

+-----------------------------------------------------------------------------------------------------+
; Hold Times                                                                                          ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise   ; Fall   ; Clock Edge ; Clock Reference                             ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; in        ; main_clk   ; -1.570 ; -1.882 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+--------+--------+------------+---------------------------------------------+

Recall that 2 ns were reduced from the delay constraint. Hence the maximal allowed setup time went up from 1.5 ns to 3.5 ns. It’s easy to see that this requirement is met, with a slack of almost 1 ns.

So Quartus said something like "I can meet the setup requirement easily, with a surplus of 2 ns. Let’s give 1 ns extra to the setup time, and 1 ns to the hold time requirement (which is 0 ns)". And indeed, by adding 1.062 ns with this delay, the hold time improved from -0.683 ns to -1.570 ns (and please don’t pick on me on why the difference isn’t exact).

Bottom line: Quartus widened the margin for both setup and hold, making the input more robust to jitter. While this is a rather sensible thing to do, this is often not desired nor expected to happen.

Conclusion: If you want to get the absolutely minimal delay from the input to the register, run a compilation with a delay constraint that fails, and then loosen the constraint just enough to resolve this failure. This ensures that Quartus won’t try to "improve" the timing by adding this input delay for the sake of a better hold time.

Using DDR primitives

Intel’s FPGAs have logic on or near the I/O cells that is dedicated to allow for producing output as well as sample the input at a double clock rate. This topic is detailed in the relevant user guide, ug_altddio.pdf. Instantiating a DDR primitive (or using the ALTDDIO_BIDIR megafunction) is an appealing way to force the tools into putting the registers into the I/O cells. However, it’s not necessarily a good idea.

For example, an instantiation like this:

altddio_bidir ioddr
 (
 .padio(pin),
 .aclr (1'b0),
 .datain_h(datain_h),
 .datain_l(datain_l),
 .inclock(clk),
 .oe(oe),
 .outclock(clk),
 .dataout_h(dataout_h),
 .dataout_l(dataout_l),
 .oe_out (),
 .aset (1'b0),
 .combout(),
 .dqsundelayedout(),
 .inclocken(1'b1),
 .outclocken(1'b1),
 .sclr(1'b0),
 .sset(1'b0));
 defparam
   ioddr.extend_oe_disable = "OFF",
   ioddr.implement_input_in_lcell = "OFF",
   ioddr.intended_device_family = "Cyclone IV E",
   ioddr.invert_output = "OFF",
   ioddr.lpm_hint = "UNUSED",
   ioddr.lpm_type = "altddio_bidir",
   ioddr.oe_reg = "REGISTERED",
   ioddr.power_up_high = "OFF",
   ioddr.width = 1;

This indeed results in logic that implements a bidirectional DDR interface, but with a partial success as far as timing is concerned, at least on Cyclone IV. The clock-to-output timing is exactly the same as a plain output register that is packed into the I/O cell, but the delay on the input path is actually worse with the instantiation above. The results may be different with other Intel FPGA families.

Note that in order to mimic plain SDR registers with a DDR primitive, its datain_h port and datain_l port must be connected to the same wire, so the clock’s falling edge doesn’t change anything. Likewise, the dataout_l port’s value should be ignored, as it's sampled on the falling edge. Also note that the output enable port (oe) is an SDR input — as far as I can understand, it’s not possible to turn high-Z on and off with DDR rate with Intel FPGAs. At least not with the supplied logic primitives.

Now to why it worked nicely on the output registers, and not with the input: The hint is in the timing reports above: Even for a plain I/O cell register, a DDIOOUTCELL_Xn_Ym_Nk component is used as the register. In other words, the DDR output register is used even for single-rate outputs, but only with one clock edge. As for the input path, the timing reports above show that a logic fabric register (FF_Xn_Ym_Nk) is used. And here’s the crux: The DDR input logic is implemented in the logic fabric as well. And to make it worse, combinatorial blocks are squeezed between the I/O cell and the flip-flop in the case with DDR. Frankly, I don’t understand why, because each such combinatorial block is just a pass-through between a single input to a single output.

These observations are backed by timing reports as well as the drawings displayed by Quartus’ Post-Fit Technology Map Viewer. In particular, those useless combinatorial blocks are evident in these information sources.

This entire issue most likely varies from one FPGA family to another. As for Cyclone IV, it only makes sense to use DDR primitives for outputs.

Even more important, the fact that a DDR primitive output is used when an output register is required, allows producing an output clock that is aligned with the the other outputs. To accomplish this, feed a DDR output primitive with constant '1' and '0' on the datain_h port and datain_l port, respectively. As for the other outputs, require the use of output register packing. The toggling of the other outputs is hence aligned to the rising edge of the clock that comes from the DDR output.

Well, almost. The timing analysis of a output clock is different, because the clock toggles a mux that selects which of the two output registers feeds the output (scroll horizontally for the details):

+------------------------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                                  ;
+---------+---------+----+------+--------+-------------------------+-----------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location                ; Element                                                         ;
+---------+---------+----+------+--------+-------------------------+-----------------------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                         ; launch edge time                                                ;
; 0.000   ; 0.000   ;    ;      ;        ;                         ; clock path                                                      ;
;   0.000 ;   0.000 ; R  ;      ;        ;                         ; clock network delay                                             ;
; 0.000   ; 0.000   ; R  ;      ; 1      ; PIN_B12                 ; osc_clock                                                       ;
; 5.610   ; 5.610   ;    ;      ;        ;                         ; data path                                                       ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|i                                               ;
;   0.667 ;   0.667 ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|o                                               ;
;   0.853 ;   0.186 ; RR ; IC   ; 1      ; CLKCTRL_G12             ; osc_clock~inputclkctrl|inclk[0]                                 ;
;   0.853 ;   0.000 ; RR ; CELL ; 165    ; CLKCTRL_G12             ; osc_clock~inputclkctrl|outclk                                   ;
;   1.971 ;   1.118 ; RR ; IC   ; 1      ; DDIOOUTCELL_X16_Y29_N11 ; sram_controller_ins|ddr_clk|auto_generated|ddio_outa[0]|muxsel  ;
;   3.137 ;   1.166 ; RR ; CELL ; 1      ; DDIOOUTCELL_X16_Y29_N11 ; sram_controller_ins|ddr_clk|auto_generated|ddio_outa[0]|dataout ;
;   3.137 ;   0.000 ; RR ; IC   ; 1      ; IOOBUF_X16_Y29_N9       ; sram_clk~output|i                                               ;
;   5.610 ;   2.473 ; RR ; CELL ; 1      ; IOOBUF_X16_Y29_N9       ; sram_clk~output|o                                               ;
;   5.610 ;   0.000 ; RR ; CELL ; 0      ; PIN_E10                 ; sram_clk                                                        ;
+---------+---------+----+------+--------+-------------------------+-----------------------------------------------------------------;

Note that this isn’t a register-to-pin analysis, but clock-to-pin. A set_output_delay constraint will include this path nevertheless. However, a set_max_delay constraint from registers to ports, if used, won’t include this path, so it has to be handled separately. In other words, if set_max_delay is used, it has to be of the form:

set_max_delay -from [get_clocks main_clk] -to [get_ports sram_clk] 3.8

Now, compare this with another pin with the same voltage standard etc., only driven by a register:

+----------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                    ;
+---------+---------+----+------+--------+-------------------------+---------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location                ; Element                                           ;
+---------+---------+----+------+--------+-------------------------+---------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                         ; launch edge time                                  ;
; 2.507   ; 2.507   ;    ;      ;        ;                         ; clock path                                        ;
;   0.000 ;   0.000 ;    ;      ;        ;                         ; source latency                                    ;
;   0.000 ;   0.000 ;    ;      ; 1      ; PIN_B12                 ; osc_clock                                         ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|i                                 ;
;   0.667 ;   0.667 ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|o                                 ;
;   0.853 ;   0.186 ; RR ; IC   ; 1      ; CLKCTRL_G12             ; osc_clock~inputclkctrl|inclk[0]                   ;
;   0.853 ;   0.000 ; RR ; CELL ; 165    ; CLKCTRL_G12             ; osc_clock~inputclkctrl|outclk                     ;
;   1.970 ;   1.117 ; RR ; IC   ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller_ins|dq_wr_data[6]|clk             ;
;   2.507 ;   0.537 ; RR ; CELL ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller:sram_controller_ins|dq_wr_data[6] ;
; 5.645   ; 3.138   ;    ;      ;        ;                         ; data path                                         ;
;   2.717 ;   0.210 ;    ; uTco ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller:sram_controller_ins|dq_wr_data[6] ;
;   3.182 ;   0.465 ; RR ; CELL ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller_ins|dq_wr_data[6]|q               ;
;   3.182 ;   0.000 ; RR ; IC   ; 1      ; IOOBUF_X37_Y29_N9       ; sram_dq[6]~output|i                               ;
;   5.645 ;   2.463 ; RR ; CELL ; 1      ; IOOBUF_X37_Y29_N9       ; sram_dq[6]~output|o                               ;
;   5.645 ;   0.000 ; RR ; CELL ; 1      ; PIN_G14                 ; sram_dq[6]                                        ;
+---------+---------+----+------+--------+-------------------------+---------------------------------------------------;

The total clock-to-output time differs by no more than 35 ps, even though the latter path is completely different on the face of it. This isn’t a coincidence. The FPGA is clearly designed to produce this similarity. Specifically, the timing analysis above is slow 1200 mV at a temperature of 100°C, but this small difference is consistent in the other analyzed conditions as well.