Quartus: Packing registers into I/O cells

This post was written by eli on April 3, 2017
Posted Under: Altera,FPGA

Often I prefer to handle I/O timing simply by ensuring that all registers are pushed into the I/O cells. Where timing matters, that is.

It seems like I/O register packing isn’t the default in Quartus. Anyhow, here’s the lazy man’s recipe for this scenario.

In a previous version of this post, I suggested to disable timing checking on all I/Os. This silences the unconstrained path warning during implementation, and in particular prevents the “TimeQuest Timing Analyzer” section in Quartus’ reports pane turning red:

set_false_path -from [get_ports]
set_false_path -to [get_ports]

This isn’t such a good idea, it turns out, in particular regarding input ports. This is elaborated further below.

Nevertheless, one needs to convince the fitter to push registers into the I/O block. In the QSF, add

set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to *
set_instance_assignment -name FAST_INPUT_REGISTER ON -to *
set_instance_assignment -name FAST_OUTPUT_ENABLE_REGISTER ON -to *

It’s somewhat aggressive to assign these assignments to absolutely everything, but it does the job. The fitter issues warnings for the I/O elements it fails to enforce these constraints on, which is actually a good thing.

To see how well it went, look in the “Resource Section” of the fitter report (possibly find it in Quartus’ reports pane) and look for “Input Registers” etc., whatever applies.

The difference is evident in timing reports of paths involving I/O cells. For example, compare this path which involves an I/O register:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-----------------------+-----------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location              ; Element         ;
+---------+---------+----+------+--------+-----------------------+-----------------+
; 2.918   ; 2.918   ;    ;      ;        ;                       ; data path       ;
;   0.000 ;   0.000 ;    ;      ; 1      ; DDIOOUTCELL_X3_Y0_N32 ; rst             ;
;   0.465 ;   0.465 ; RR ; CELL ; 1      ; DDIOOUTCELL_X3_Y0_N32 ; rst|q           ;
;   0.465 ;   0.000 ; RR ; IC   ; 1      ; IOOBUF_X3_Y0_N30      ; RESETB~output|i ;
;   2.918 ;   2.453 ; RR ; CELL ; 1      ; IOOBUF_X3_Y0_N30      ; RESETB~output|o ;
;   2.918 ;   0.000 ; RR ; CELL ; 0      ; PIN_P3                ; RESETB          ;
+---------+---------+----+------+--------+-----------------------+-----------------+

Note the DDIOOUTCELL element, and the zero increment in the routing between the register and the IOOBUF.

For comparison, here’s a path for which an I/O register wasn’t applied (prevented by logic):

+--------------------------------------------------------------------------------+
; Data Arrival Path                                                              ;
+---------+---------+----+------+--------+-----------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location        ; Element             ;
+---------+---------+----+------+--------+-----------------+---------------------+
; 8.284   ; 8.284   ;    ;      ;        ;                 ; data path           ;
;   0.000 ;   0.000 ;    ;      ; 1      ; FF_X3_Y0_N17    ; Dir_flop_sig        ;
;   0.496 ;   0.496 ; RR ; CELL ; 8      ; FF_X3_Y0_N17    ; Dir_flop_sig|q      ;
;   2.153 ;   1.657 ; RR ; IC   ; 1      ; IOOBUF_X3_Y0_N9 ; DATA[7]~output|oe   ;
;   8.284 ;   6.131 ; RF ; CELL ; 1      ; IOOBUF_X3_Y0_N9 ; DATA[7]~output|o    ;
;   8.284 ;   0.000 ; FF ; CELL ; 1      ; PIN_T3          ; DATA[7]             ;
+---------+---------+----+------+--------+-----------------+---------------------+

Here we see how a general-purpose flip-flop generates the signal, leading to routing of 1.657 ns. The main problem is that this routing delay will be different each implementation, so if there’s a signal integrity issue with the board, the FPGA might be blamed for it, since different FPGA versions seem to fix the problem or make it reappear.

Timing constraints

Both input and output ports should be tightly constrained, so they can’t be met other than making the best of I/O registers. Not only will this generate a timing failure if something goes wrong with the desired register packing, but it’s also necessary to achieve the minimal input-to-register timing, as explained next.

To demonstrate this issue, consider the following Verilog snippet:

module top
  (
   input        clk,
   input        in,
   output reg   out
   );

   reg 		in_d, in_d2;
   wire  	pll_clk;

   always @(posedge pll_clk)
     begin
	in_d <= in;
	in_d2 <= in_d;
	out <= in_d2;
     end

  /* Here comes an instantiation of a phase-compensating PLL, which
     doesn't change the frequency */
endmodule

with the following constraint in the SDC file

create_clock -name main_clk -period 10 -waveform { 0 5 } [get_ports {clk}]

derive_pll_clocks
derive_clock_uncertainty

set_input_delay -clock main_clk -max 8.5 [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]

As explained on this post, set_input_delay is the maximal delay of the source of the signal, from clock to a valid logic state. Since the clock’s period is set to 10 ns, setting the delay constraint to 8.5 ns leaves 1.5 ns until the following clock arrives (at 10 ns). In other words, the setup time on the FPGA pin is constrained not to exceed 1.5 ns.

Compiling this (along with the FAST_INPUT_REGISTER ON QSF assignment shown above) yields the following segment in the timing report:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-------------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location          ; Element             ;
+---------+---------+----+------+--------+-------------------+---------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                   ; launch edge time    ;
; 0.000   ; 0.000   ;    ;      ;        ;                   ; clock path          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                   ; clock network delay ;
; 8.500   ; 8.500   ; F  ; iExt ; 1      ; PIN_F2            ; in                  ;
; 9.550   ; 1.050   ;    ;      ;        ;                   ; data path           ;
;   8.500 ;   0.000 ; FF ; IC   ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|i          ;
;   9.308 ;   0.808 ; FF ; CELL ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|o          ;
;   9.308 ;   0.000 ; FF ; IC   ; 1      ; FF_X0_Y22_N17     ; in_d|d              ;
;   9.550 ;   0.242 ; FF ; CELL ; 1      ; FF_X0_Y22_N17     ; in_d                ;
+---------+---------+----+------+--------+-------------------+---------------------+

Unlike the output register, there is no “DDIOINCELL” flip-flop listed, but what appears to be a regular flip-flop. However note that the interconnect to this flip-flop has zero delay (marked in red), which is a clear indication that the flip-flop and input buffer are fused together.

The datasheet report for this input goes:

+---------------------------------------------------------------------------------------------------+
; Setup Times                                                                                       ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise  ; Fall  ; Clock Edge ; Clock Reference                             ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; in        ; main_clk   ; 1.282 ; 1.461 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+-------+-------+------------+---------------------------------------------+

+-----------------------------------------------------------------------------------------------------+
; Hold Times                                                                                          ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise   ; Fall   ; Clock Edge ; Clock Reference                             ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; in        ; main_clk   ; -0.683 ; -0.862 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+--------+--------+------------+---------------------------------------------+

As required, the setup time required by the FPGA is lower than the 1.5 ns limit set by the constraint.

Now let’s loosen the input setup delay by 2 ns, leave everything else as it was, and rerun the compilation:

set_input_delay -clock main_clk -max 6.5 [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]

The segment in the timing report is now:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-------------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location          ; Element             ;
+---------+---------+----+------+--------+-------------------+---------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                   ; launch edge time    ;
; 0.000   ; 0.000   ;    ;      ;        ;                   ; clock path          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                   ; clock network delay ;
; 6.500   ; 6.500   ; F  ; iExt ; 1      ; PIN_F2            ; in                  ;
; 8.612   ; 2.112   ;    ;      ;        ;                   ; data path           ;
;   6.500 ;   0.000 ; FF ; IC   ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|i          ;
;   7.308 ;   0.808 ; FF ; CELL ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|o          ;
;   8.370 ;   1.062 ; FF ; IC   ; 1      ; FF_X0_Y22_N17     ; in_d|d              ;
;   8.612 ;   0.242 ; FF ; CELL ; 1      ; FF_X0_Y22_N17     ; in_d                ;
+---------+---------+----+------+--------+-------------------+---------------------+

Huh? The interconnect suddenly rose to 1.062 ns?! Note that the placement of the register didn’t change, so there’s no doubt that in_d is an I/O register. So where did this delay come from?

To answer this, a closer look on the design is required. After a full compilation and selecting Tools > Netlist Viewers > Technology Map Viewer (Post-Fitting), the following diagram appears (partly shown below, click to enlarge):

Design diagramRight-clicking in_d (the register) and selecting Locate Note > Locate in Resource Property Editor reveals the following (click to enlarge):

Property Editor ViewTo the right of this drawing (not shown above), the property “Input Pin to Input Register Delay” is set to 2. This is the reason for the delay. Before the constraint was loosened up, it was set to 0. The immediate lesson is:

If the setup constraint isn’t set to the technology’s best possible value, Quartus may add a delay on its expense.

But why, Quartus, why?

So one may wonder why Quartus inserts this delay between the input pad and the register. Wasn’t the whole point to sample as soon as possible? To answer this, let’s look at the updated datasheet report:

---------------------+
; Data Port ; Clock Port ; Rise  ; Fall  ; Clock Edge ; Clock Reference                             ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; in        ; main_clk   ; 2.205 ; 2.523 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+-------+-------+------------+---------------------------------------------+

+-----------------------------------------------------------------------------------------------------+
; Hold Times                                                                                          ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise   ; Fall   ; Clock Edge ; Clock Reference                             ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; in        ; main_clk   ; -1.570 ; -1.882 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+--------+--------+------------+---------------------------------------------+

Recall that 2 ns were reduced from the delay constraint, hence the maximal allowed setup time went up from 1.5 ns to 3.5 ns. It’s easy to see that this requirement is met, with a slack of almost 1 ns.

So what Quartus did was saying “I can meet the setup requirement easily, with a spare of 2 ns. Let’s give 1 ns extra to the setup time, and one 1 ns to the hold time requirement (which is 0 ns)”. And indeed, by adding this 1.062 ns delay, the hold time improved from -0.683 ns to -1.570 ns (and please don’t pick on me on why the difference isn’t exact).

Bottom line: Quartus widened the margin for both setup and hold, making the input more robust to jitter. While this is a rather sensible thing to do, this is often not desired nor expected to happen.

Conclusion: If you want to get the absolutely minimal delay from the input to the register, run a compilation with a delay constraint that fails, and then loosen the constraint just enough to resolve this failure. This ensures Quartus won’t try to “improve” the timing by adding this input delay for the sake of a better hold time.

Add a Comment

required, use real name
required, will not be published
optional, your blog address