Quartus: Packing registers into I/O cells

This post was written by eli on April 3, 2017
Posted Under: Altera,FPGA

Often I prefer to handle I/O timing simply by ensuring that all registers are pushed into the I/O cells. Where timing matters, that is.

It seems like I/O register packing isn’t the default in Quartus. Anyhow, here’s the lazy man’s recipe for this scenario.

In a previous version of this post, I suggested disabling timing checking on all I/Os. This silences the unconstrained path warning during implementation, and in particular prevents the “TimeQuest Timing Analyzer” section in Quartus’ reports pane from turning red:

set_false_path -from [get_ports]
set_false_path -to [get_ports]

This isn’t such a good idea, it turns out, in particular regarding input ports. This is elaborated further below.

Nevertheless, one needs to convince the fitter to push registers into the I/O block. In the QSF, add

set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to *
set_instance_assignment -name FAST_INPUT_REGISTER ON -to *
set_instance_assignment -name FAST_OUTPUT_ENABLE_REGISTER ON -to *

It’s somewhat aggressive to apply these assignments to absolutely everything, but it does the job. The fitter issues warnings for the I/O elements it fails to enforce these constraints on, which is actually a good thing.
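For a less sweeping variant, the same assignments can be limited to specific ports. This is only a sketch: the port names RESETB and DATA[*] are borrowed from the timing reports further down, and whether DATA is also used as an input is an assumption, so substitute the names and directions of your own design.

```tcl
# Hypothetical port selection, borrowed from the timing reports in this post
set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to RESETB
set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to DATA[*]
set_instance_assignment -name FAST_OUTPUT_ENABLE_REGISTER ON -to DATA[*]
# Assumption: DATA is bidirectional, so it has an input register as well
set_instance_assignment -name FAST_INPUT_REGISTER ON -to DATA[*]
```

With this form, any fitter warning about a failed packing concerns a port that was listed deliberately.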

To see how well it went, look in the “Resource Section” of the fitter report (possibly find it in Quartus’ reports pane) and look for “Input Registers” etc., whatever applies.

The difference is evident in timing reports of paths involving I/O cells. For example, compare this path which involves an I/O register:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-----------------------+-----------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location              ; Element         ;
+---------+---------+----+------+--------+-----------------------+-----------------+
; 2.918   ; 2.918   ;    ;      ;        ;                       ; data path       ;
;   0.000 ;   0.000 ;    ;      ; 1      ; DDIOOUTCELL_X3_Y0_N32 ; rst             ;
;   0.465 ;   0.465 ; RR ; CELL ; 1      ; DDIOOUTCELL_X3_Y0_N32 ; rst|q           ;
;   0.465 ;   0.000 ; RR ; IC   ; 1      ; IOOBUF_X3_Y0_N30      ; RESETB~output|i ;
;   2.918 ;   2.453 ; RR ; CELL ; 1      ; IOOBUF_X3_Y0_N30      ; RESETB~output|o ;
;   2.918 ;   0.000 ; RR ; CELL ; 0      ; PIN_P3                ; RESETB          ;
+---------+---------+----+------+--------+-----------------------+-----------------+

Note the DDIOOUTCELL element, and the zero increment in the routing between the register and the IOOBUF.

For comparison, here’s a path for which an I/O register wasn’t applied (prevented by logic):

+--------------------------------------------------------------------------------+
; Data Arrival Path                                                              ;
+---------+---------+----+------+--------+-----------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location        ; Element             ;
+---------+---------+----+------+--------+-----------------+---------------------+
; 8.284   ; 8.284   ;    ;      ;        ;                 ; data path           ;
;   0.000 ;   0.000 ;    ;      ; 1      ; FF_X3_Y0_N17    ; Dir_flop_sig        ;
;   0.496 ;   0.496 ; RR ; CELL ; 8      ; FF_X3_Y0_N17    ; Dir_flop_sig|q      ;
;   2.153 ;   1.657 ; RR ; IC   ; 1      ; IOOBUF_X3_Y0_N9 ; DATA[7]~output|oe   ;
;   8.284 ;   6.131 ; RF ; CELL ; 1      ; IOOBUF_X3_Y0_N9 ; DATA[7]~output|o    ;
;   8.284 ;   0.000 ; FF ; CELL ; 1      ; PIN_T3          ; DATA[7]             ;
+---------+---------+----+------+--------+-----------------+---------------------+

Here we see how a general-purpose flip-flop generates the signal, leading to a routing delay of 1.657 ns. The main problem is that this routing delay will differ with each implementation, so if there’s a signal integrity issue with the board, the FPGA might be blamed for it, since different FPGA builds seem to fix the problem or make it reappear.
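Path listings similar to the two above can probably be obtained from the TimeQuest Tcl console with report_path. This is a sketch, assuming the timing netlist has already been created and updated; the port names are those from the reports above, and the panel names are arbitrary labels.

```tcl
# Longest path ending at each output pin, one path per report
report_path -to [get_ports RESETB] -npaths 1 -panel_name {RESETB output path}
report_path -to [get_ports {DATA[7]}] -npaths 1 -panel_name {DATA7 output path}
```

A zero "IC" increment right before the IOOBUF row is the thing to look for in the result.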

Timing constraints

Both input and output ports should be tightly constrained, so that the constraints can’t be met without making the best of the I/O registers. Not only will this generate a timing failure if something goes wrong with the desired register packing, but it’s also necessary to achieve the minimal input-to-register timing, as explained next.

The discussion below applies only when the clock that drives the registers is directly related to an external clock (i.e. with a PLL that doesn’t multiply it with some exotic ratio). If the driving clock is practically unrelated to the external clock, things get significantly more complicated, as discussed in this post.

To demonstrate this issue, consider the following Verilog snippet:

module top
  (
   input        clk,
   input        in,
   output reg   out
   );

   reg          in_d, in_d2;
   wire         pll_clk;

   always @(posedge pll_clk)
     begin
        in_d <= in;
        in_d2 <= in_d;
        out <= in_d2;
     end

  /* Here comes an instantiation of a phase-compensating PLL, which
     doesn't change the frequency */
endmodule

with the following constraint in the SDC file

create_clock -name main_clk -period 10 -waveform { 0 5 } [get_ports {clk}]

derive_pll_clocks
derive_clock_uncertainty

set_input_delay -clock main_clk -max 8.5 [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]

As explained in this post, set_input_delay is the maximal delay of the source of the signal, from clock to a valid logic state. Since the clock’s period is set to 10 ns, setting the delay constraint to 8.5 ns leaves 1.5 ns until the following clock edge arrives (at 10 ns). In other words, the setup time on the FPGA pin is constrained not to exceed 1.5 ns.
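Since an SDC file is plain Tcl, the same arithmetic can be written out in the constraint itself, starting from the setup time one is willing to allow at the pin. This is just a sketch; the variable names clk_period and tsu_max are made up for the illustration.

```tcl
# Hypothetical helper variables (not Quartus keywords)
set clk_period 10.0   ;# must match the period given to create_clock
set tsu_max     1.5   ;# largest setup time tolerated at the FPGA pin, in ns

# 10.0 - 1.5 = 8.5 ns, the value used in the constraint above
set_input_delay -clock main_clk -max [expr {$clk_period - $tsu_max}] [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]
```

Writing it this way keeps the intended setup time visible when the clock period changes.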

Note that set_max_delay can be used as well for this purpose (in some cases it’s the only way) as discussed in this post.

Compiling this (along with the FAST_INPUT_REGISTER ON QSF assignment shown above) yields the following segment in the timing report:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-------------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location          ; Element             ;
+---------+---------+----+------+--------+-------------------+---------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                   ; launch edge time    ;
; 0.000   ; 0.000   ;    ;      ;        ;                   ; clock path          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                   ; clock network delay ;
; 8.500   ; 8.500   ; F  ; iExt ; 1      ; PIN_F2            ; in                  ;
; 9.550   ; 1.050   ;    ;      ;        ;                   ; data path           ;
;   8.500 ;   0.000 ; FF ; IC   ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|i          ;
;   9.308 ;   0.808 ; FF ; CELL ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|o          ;
;   9.308 ;   0.000 ; FF ; IC   ; 1      ; FF_X0_Y22_N17     ; in_d|d              ;
;   9.550 ;   0.242 ; FF ; CELL ; 1      ; FF_X0_Y22_N17     ; in_d                ;
+---------+---------+----+------+--------+-------------------+---------------------+

Unlike the output register, there is no “DDIOINCELL” flip-flop listed, but what appears to be a regular flip-flop. However, note that the interconnect to this flip-flop has zero delay (the in_d|d row), which is a clear indication that the flip-flop and input buffer are fused together.

The datasheet report for this input goes:

+---------------------------------------------------------------------------------------------------+
; Setup Times                                                                                       ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise  ; Fall  ; Clock Edge ; Clock Reference                             ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; in        ; main_clk   ; 1.282 ; 1.461 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+-------+-------+------------+---------------------------------------------+

+-----------------------------------------------------------------------------------------------------+
; Hold Times                                                                                          ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise   ; Fall   ; Clock Edge ; Clock Reference                             ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; in        ; main_clk   ; -0.683 ; -0.862 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+--------+--------+------------+---------------------------------------------+

As required, the setup time required by the FPGA is lower than the 1.5 ns limit set by the constraint.

Now let’s loosen the input setup delay by 2 ns, leave everything else as it was, and rerun the compilation:

set_input_delay -clock main_clk -max 6.5 [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]

The segment in the timing report is now:

+----------------------------------------------------------------------------------+
; Data Arrival Path                                                                ;
+---------+---------+----+------+--------+-------------------+---------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location          ; Element             ;
+---------+---------+----+------+--------+-------------------+---------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                   ; launch edge time    ;
; 0.000   ; 0.000   ;    ;      ;        ;                   ; clock path          ;
;   0.000 ;   0.000 ; R  ;      ;        ;                   ; clock network delay ;
; 6.500   ; 6.500   ; F  ; iExt ; 1      ; PIN_F2            ; in                  ;
; 8.612   ; 2.112   ;    ;      ;        ;                   ; data path           ;
;   6.500 ;   0.000 ; FF ; IC   ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|i          ;
;   7.308 ;   0.808 ; FF ; CELL ; 1      ; IOIBUF_X0_Y22_N15 ; in~input|o          ;
;   8.370 ;   1.062 ; FF ; IC   ; 1      ; FF_X0_Y22_N17     ; in_d|d              ;
;   8.612 ;   0.242 ; FF ; CELL ; 1      ; FF_X0_Y22_N17     ; in_d                ;
+---------+---------+----+------+--------+-------------------+---------------------+

Huh? The interconnect suddenly rose to 1.062 ns?! Note that the placement of the register didn’t change, so there’s no doubt that in_d is an I/O register. So where did this delay come from?

To answer this, a closer look at the design is required. After a full compilation and selecting Tools > Netlist Viewers > Technology Map Viewer (Post-Fitting), the following diagram appears (partly shown below):

[Image: Design diagram]

Right-clicking in_d (the register) and selecting Locate Node > Locate in Resource Property Editor reveals the following:

[Image: Property Editor View]

To the right of this drawing (not shown above), the property “Input Pin to Input Register Delay” is set to 2. This is the reason for the delay. Before the constraint was loosened up, it was set to 0. The immediate lesson is:

If the setup constraint isn’t set to the technology’s best possible value, Quartus may add a delay at its expense.

But why, Quartus, why?

So one may wonder why Quartus inserts this delay between the input pad and the register. Wasn’t the whole point to sample as soon as possible? To answer this, let’s look at the updated datasheet report:

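+---------------------------------------------------------------------------------------------------+
; Setup Times                                                                                       ;
+-----------+------------+-------+-------+------------+---------------------------------------------+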
; Data Port ; Clock Port ; Rise  ; Fall  ; Clock Edge ; Clock Reference                             ;
+-----------+------------+-------+-------+------------+---------------------------------------------+
; in        ; main_clk   ; 2.205 ; 2.523 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+-------+-------+------------+---------------------------------------------+

+-----------------------------------------------------------------------------------------------------+
; Hold Times                                                                                          ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; Data Port ; Clock Port ; Rise   ; Fall   ; Clock Edge ; Clock Reference                             ;
+-----------+------------+--------+--------+------------+---------------------------------------------+
; in        ; main_clk   ; -1.570 ; -1.882 ; Rise       ; altpll_component|auto_generated|pll1|clk[0] ;
+-----------+------------+--------+--------+------------+---------------------------------------------+

Recall that 2 ns were subtracted from the delay constraint, hence the maximal allowed setup time went up from 1.5 ns to 3.5 ns. It’s easy to see that this requirement is met, with a slack of almost 1 ns (3.5 ns minus 2.523 ns).

So what Quartus did was to say “I can meet the setup requirement easily, with a spare of 2 ns. Let’s give 1 ns extra to the setup time, and 1 ns to the hold time requirement (which is 0 ns)”. And indeed, by adding this 1.062 ns delay, the hold time improved from -0.683 ns to -1.570 ns (and please don’t pick on me about why the difference isn’t exact).

Bottom line: Quartus widened the margin for both setup and hold, making the input more robust to jitter. While this is a rather sensible thing to do, it is often neither desired nor expected.

Conclusion: If you want to get the absolutely minimal delay from the input to the register, run a compilation with a delay constraint that fails, and then loosen the constraint just enough to resolve this failure. This ensures Quartus won’t try to “improve” the timing by adding this input delay for the sake of a better hold time.
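With the numbers from the example above, the two-step procedure could look like this. The 9.0 ns figure is an arbitrary choice for the illustration; any value that leaves less setup time than the FPGA needs will do for the first step.

```tcl
# Step 1: deliberately too tight. This allows only 10 - 9.0 = 1.0 ns of setup
# time, less than the 1.461 ns reported above, so a setup failure is expected.
# set_input_delay -clock main_clk -max 9.0 [get_ports in*]

# Step 2: loosened just enough to pass (1.5 ns allowed vs. 1.461 ns reported)
set_input_delay -clock main_clk -max 8.5 [get_ports in*]
set_input_delay -clock main_clk -min 0 [get_ports in*]
```

After step 2, it’s worth verifying in the Resource Property Editor that “Input Pin to Input Register Delay” is still 0.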

Using DDR primitives

Intel’s FPGAs have dedicated logic on or near the I/O cells to allow for DDR output and sampling, as detailed in the relevant user guide, ug_altddio.pdf. Instantiating such (or using the ALTDDIO_BIDIR megafunction) is an appealing way to force the tools into pushing the register(s) into the I/O cells. Spoiler: It’s not necessarily a good idea.

For example, instantiating something like

altddio_bidir ioddr
 (
 .padio(pin),
 .aclr (1'b0),
 .datain_h(datain_h),
 .datain_l(datain_l),
 .inclock(clk),
 .oe(oe),
 .outclock(clk),
 .dataout_h(dataout_h),
 .dataout_l(dataout_l),
 .oe_out (),
 .aset (1'b0),
 .combout(),
 .dqsundelayedout(),
 .inclocken(1'b1),
 .outclocken(1'b1),
 .sclr(1'b0),
 .sset(1'b0));
 defparam
   ioddr.extend_oe_disable = "OFF",
   ioddr.implement_input_in_lcell = "OFF",
   ioddr.intended_device_family = "Cyclone IV E",
   ioddr.invert_output = "OFF",
   ioddr.lpm_hint = "UNUSED",
   ioddr.lpm_type = "altddio_bidir",
   ioddr.oe_reg = "REGISTERED",
   ioddr.power_up_high = "OFF",
   ioddr.width = 1;

indeed results in logic that implements a bidirectional DDR interface, but it’s a partial success as far as timing is concerned, at least on Cyclone IV: While the clock-to-output timing is exactly the same as a plain output register that is packed into the I/O cell, the delay on the input path is actually worse with the instantiation above. YMMV with other Intel FPGA families.

Note that in order to mimic plain SDR registers with a DDR primitive, its datain_h and datain_l ports must be connected to the same wire, so the clock’s falling edge doesn’t change anything. Likewise, the dataout_l port’s value should be ignored, as it’s sampled on the falling edge. Also note that the output enable port (oe) is an SDR input: As far as I can understand, it’s not possible to switch in and out of high-Z at DDR rate with Intel FPGAs. At least not natively.

Now to why it worked nicely on the output registers, and not with the input: The hint is in the timing reports above: Even for a plain I/O cell register, a DDIOOUTCELL_Xn_Ym_Nk component is the register used. In other words, the DDR output register is used even for single-rate outputs, but only with one clock edge. As for the input path, the timing reports above show that a logic fabric register (FF_Xn_Ym_Nk) is used. And here’s the crux: The DDR input logic is implemented in fabric as well, and to make it worse, combinatoric blocks are squeezed between the I/O cell and the flip-flop in the DDR case. Frankly, I don’t understand why, because these combinatoric blocks are just single-input-single-output passthroughs.

These observations are backed by timing reports as well as the drawings displayed by Quartus’ Post-Fit Technology Map Viewer. In particular those useless combinatoric blocks.

This entire issue most likely varies from one FPGA family to another. As for Cyclone IV, it only makes sense to use DDR primitives for outputs.

Even more important, the fact that a DDR primitive output uses the same logic as a packed output register allows producing an output clock that is aligned with the other outputs: Feed a DDR output primitive with constant '1' and '0' on the datain_h and datain_l ports, respectively, and apply plain output register packing for the other outputs. The toggling of the other outputs is aligned to the rising edge of the clock that comes from the DDR output.

Well, almost. The timing analysis of an output clock is different, because the clock toggles a mux that selects which of the two output registers feeds the output (scroll horizontally for the details):

+------------------------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                                  ;
+---------+---------+----+------+--------+-------------------------+-----------------------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location                ; Element                                                         ;
+---------+---------+----+------+--------+-------------------------+-----------------------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                         ; launch edge time                                                ;
; 0.000   ; 0.000   ;    ;      ;        ;                         ; clock path                                                      ;
;   0.000 ;   0.000 ; R  ;      ;        ;                         ; clock network delay                                             ;
; 0.000   ; 0.000   ; R  ;      ; 1      ; PIN_B12                 ; osc_clock                                                       ;
; 5.610   ; 5.610   ;    ;      ;        ;                         ; data path                                                       ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|i                                               ;
;   0.667 ;   0.667 ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|o                                               ;
;   0.853 ;   0.186 ; RR ; IC   ; 1      ; CLKCTRL_G12             ; osc_clock~inputclkctrl|inclk[0]                                 ;
;   0.853 ;   0.000 ; RR ; CELL ; 165    ; CLKCTRL_G12             ; osc_clock~inputclkctrl|outclk                                   ;
;   1.971 ;   1.118 ; RR ; IC   ; 1      ; DDIOOUTCELL_X16_Y29_N11 ; sram_controller_ins|ddr_clk|auto_generated|ddio_outa[0]|muxsel  ;
;   3.137 ;   1.166 ; RR ; CELL ; 1      ; DDIOOUTCELL_X16_Y29_N11 ; sram_controller_ins|ddr_clk|auto_generated|ddio_outa[0]|dataout ;
;   3.137 ;   0.000 ; RR ; IC   ; 1      ; IOOBUF_X16_Y29_N9       ; sram_clk~output|i                                               ;
;   5.610 ;   2.473 ; RR ; CELL ; 1      ; IOOBUF_X16_Y29_N9       ; sram_clk~output|o                                               ;
;   5.610 ;   0.000 ; RR ; CELL ; 0      ; PIN_E10                 ; sram_clk                                                        ;
+---------+---------+----+------+--------+-------------------------+-----------------------------------------------------------------+

Note that this isn’t a register-to-pin analysis, but clock-to-pin. A set_output_delay constraint will include this path nevertheless. However, a set_max_delay constraint from registers to ports, if used, won’t include this path, so it has to be handled separately. In other words, if set_max_delay is used, it has to be of the form:

set_max_delay -from [get_clocks main_clk] -to [get_ports sram_clk] 3.8
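Put next to a register-to-port constraint, the pair might look as follows. This is only a sketch: the sram_dq* port pattern is taken from the timing report below, and reusing 3.8 ns for the data pins is an assumption made for the sake of the example.

```tcl
# Covers paths that start at a register, e.g. those ending at sram_dq[*]
set_max_delay -from [get_registers *] -to [get_ports sram_dq*] 3.8

# Needed in addition: the forwarded clock's path starts at the clock itself,
# not at a register, so the constraint above doesn't cover it
set_max_delay -from [get_clocks main_clk] -to [get_ports sram_clk] 3.8
```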

Now, compare this with another pin with the same voltage standard etc., only driven by a register:

+----------------------------------------------------------------------------------------------------------------------+
; Data Arrival Path                                                                                                    ;
+---------+---------+----+------+--------+-------------------------+---------------------------------------------------+
; Total   ; Incr    ; RF ; Type ; Fanout ; Location                ; Element                                           ;
+---------+---------+----+------+--------+-------------------------+---------------------------------------------------+
; 0.000   ; 0.000   ;    ;      ;        ;                         ; launch edge time                                  ;
; 2.507   ; 2.507   ;    ;      ;        ;                         ; clock path                                        ;
;   0.000 ;   0.000 ;    ;      ;        ;                         ; source latency                                    ;
;   0.000 ;   0.000 ;    ;      ; 1      ; PIN_B12                 ; osc_clock                                         ;
;   0.000 ;   0.000 ; RR ; IC   ; 1      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|i                                 ;
;   0.667 ;   0.667 ; RR ; CELL ; 2      ; IOIBUF_X19_Y29_N8       ; osc_clock~input|o                                 ;
;   0.853 ;   0.186 ; RR ; IC   ; 1      ; CLKCTRL_G12             ; osc_clock~inputclkctrl|inclk[0]                   ;
;   0.853 ;   0.000 ; RR ; CELL ; 165    ; CLKCTRL_G12             ; osc_clock~inputclkctrl|outclk                     ;
;   1.970 ;   1.117 ; RR ; IC   ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller_ins|dq_wr_data[6]|clk             ;
;   2.507 ;   0.537 ; RR ; CELL ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller:sram_controller_ins|dq_wr_data[6] ;
; 5.645   ; 3.138   ;    ;      ;        ;                         ; data path                                         ;
;   2.717 ;   0.210 ;    ; uTco ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller:sram_controller_ins|dq_wr_data[6] ;
;   3.182 ;   0.465 ; RR ; CELL ; 1      ; DDIOOUTCELL_X37_Y29_N11 ; sram_controller_ins|dq_wr_data[6]|q               ;
;   3.182 ;   0.000 ; RR ; IC   ; 1      ; IOOBUF_X37_Y29_N9       ; sram_dq[6]~output|i                               ;
;   5.645 ;   2.463 ; RR ; CELL ; 1      ; IOOBUF_X37_Y29_N9       ; sram_dq[6]~output|o                               ;
;   5.645 ;   0.000 ; RR ; CELL ; 1      ; PIN_G14                 ; sram_dq[6]                                        ;
+---------+---------+----+------+--------+-------------------------+---------------------------------------------------+

The total clock-to-output time differs by no more than 35 ps, even though the latter path is completely different on the face of it. This isn’t a coincidence. The FPGA is clearly designed to produce this similarity. Specifically, the timing analysis above is slow 1200 mV at 100 degrees, but this small difference is consistent in the other analyzed conditions as well.

 

Reader Comments

Hi Eli, Thank you for your insightful articles on Altera FPGA Timing issues. At present I am faced with making a faster and more complex version of my working base design. Can I ask a simple philosophical question?

I have seen several instances of expert users showing how set_max_delay can fix setup timing problems. Also I see expert advice cautioning against using set_max_delay.

One Altera note includes “For cases where you want to use set_max_delay and set_min_delay to establish
an I/O timing requirement (tSU, tH, tCO, and tCO-min), you must constrain the port using set_input_delay/output_delay with a virtual clock. The delay value can be 0 for -max/min in set_output_delay/set_input_delay because
set_max_delay/set_min_delay is used to override the setup and hold requirement and thereby establishing the tSU, tH, tCO, and tCO-min requirement. Because you set your requirement in set_max/min_delay, you do not need to specify a value for the set_input_delay or set_output_delay constraint, but you still must use a virtual clock to make the clock transfer be correctly identified as an I/O transfer. In this way, derive_clock_uncertainty applies uncertainty correctly on this path.”

Question – If Quartus has AI to achieve timing closure, why do we sometimes need additional undesirable? set_max_delay constraints on some paths, to achieve closure ? i.e. if a solution is possible, why does Quartus make us add these quote Altera “dangerous” constraints to make it happen ? Thanks, Steve

#1 
Written By Steve Maslen on August 28th, 2018 @ 12:44

Hello,

FPGA vendors often allow for practices that aren’t recommended, simply because there are users out there that employ them.

Specifically, I have a follow-up post to this one, which shows a not-too-exotic case which required set_max_delay assignments:

http://billauer.co.il/blog/2018/08/quartus-sdc-constraining-pins-derived-clock/

#2 
Written By eli on August 28th, 2018 @ 14:59

Hi Eli, Thanks for the additional pointers. Seems odd that in some cases the only way to get closure is to use non-recommended methods.

#3 
Written By Steve Maslen on August 28th, 2018 @ 16:06
