PCIe: Xilinx’ pipe_clock module and its timing constraints
Introduction
In several versions of Xilinx’ wrapper for the integrated PCIe block, it’s the user application logic’s duty to instantiate the module which generates the “pipe clock”. It typically looks something like this:
pcie_myblock_pipe_clock #
  (
    .PCIE_ASYNC_EN      ( "FALSE" ),                 // PCIe async enable
    .PCIE_TXBUF_EN      ( "FALSE" ),                 // PCIe TX buffer enable for Gen1/Gen2 only
    .PCIE_LANE          ( LINK_CAP_MAX_LINK_WIDTH ), // PCIe number of lanes
    // synthesis translate_off
    .PCIE_LINK_SPEED    ( 2 ),
    // synthesis translate_on
    .PCIE_REFCLK_FREQ   ( PCIE_REFCLK_FREQ ),        // PCIe reference clock frequency
    .PCIE_USERCLK1_FREQ ( PCIE_USERCLK1_FREQ ),      // PCIe user clock 1 frequency
    .PCIE_USERCLK2_FREQ ( PCIE_USERCLK2_FREQ ),      // PCIe user clock 2 frequency
    .PCIE_DEBUG_MODE    ( 0 )
  )
pipe_clock_i
  (
    //---------- Input -------------------------------------
    .CLK_CLK            ( sys_clk ),
    .CLK_TXOUTCLK       ( pipe_txoutclk_in ),        // Reference clock from lane 0
    .CLK_RXOUTCLK_IN    ( pipe_rxoutclk_in ),
    .CLK_RST_N          ( pipe_mmcm_rst_n ),         // Allow system reset for error_recovery
    .CLK_PCLK_SEL       ( pipe_pclk_sel_in ),
    .CLK_PCLK_SEL_SLAVE ( pipe_pclk_sel_slave),
    .CLK_GEN3           ( pipe_gen3_in ),
    //---------- Output ------------------------------------
    .CLK_PCLK           ( pipe_pclk_out),
    .CLK_PCLK_SLAVE     ( pipe_pclk_out_slave),
    .CLK_RXUSRCLK       ( pipe_rxusrclk_out),
    .CLK_RXOUTCLK_OUT   ( pipe_rxoutclk_out),
    .CLK_DCLK           ( pipe_dclk_out),
    .CLK_OOBCLK         ( pipe_oobclk_out),
    .CLK_USERCLK1       ( pipe_userclk1_out),
    .CLK_USERCLK2       ( pipe_userclk2_out),
    .CLK_MMCM_LOCK      ( pipe_mmcm_lock_out)
  );
Consequently, some timing constraints that are related to the PCIe block’s internal functionality aren’t added automatically by the wrapper’s own constraints, but must be given explicitly by the user of the block, typically by following an example design.
This post discusses the implications of this situation. Obviously, none of this applies to PCIe block wrappers which handle this instantiation internally.
What is the pipe clock?
For our narrow purposes, the PIPE interface is the parallel data part of the SERDES attached to the Multi-Gigabit Transceivers (MGTs), which drive the physical PCIe lanes. For example, a Gen1 lane, running at 2.5 GT/s, carries 2.0 Gbit/s of payload data (the payload is expanded by a 10/8 ratio with 8b/10b encoding). If the SERDES is fed with 16 bits in parallel, a 125 MHz clock yields the correct data rate (125 MHz * 16 = 2 Gbit/s).
By the same token, a Gen2 interface requires a 250 MHz clock to support a payload data rate of 4.0 Gbit/s per lane (expanded into 5 GT/s with 8b/10b encoding).
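The arithmetic for both generations can be reproduced in the Vivado Tcl console (or plain tclsh). This is only a sketch: pipe_clock_mhz is a made-up helper name, and the 16-bit parallel width and the 10/8 encoding ratio are the assumptions stated above.

```tcl
# Sketch: derive the pipe clock frequency from the line rate.
# Assumes a 10/8 encoding overhead and a 16-bit parallel interface.
proc pipe_clock_mhz {line_rate_gtps {width_bits 16}} {
    # 8 payload bits are carried by every 10 bits on the wire
    set payload_gbps [expr {$line_rate_gtps * 8.0 / 10.0}]
    # Gbit/s -> Mbit/s, then divide by the number of bits fed per clock cycle
    return [expr {$payload_gbps * 1000.0 / $width_bits}]
}

puts [pipe_clock_mhz 2.5]   ;# Gen1: prints 125.0
puts [pipe_clock_mhz 5.0]   ;# Gen2: prints 250.0
```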
The clock mux
If a PCIe block is configured for Gen2, it must support both rates: It normally runs at 5 GT/s, but must also be able to fall back to 2.5 GT/s if the link partner doesn’t support Gen2 or if the link doesn’t work properly at the higher rate.
In the most common setting (or always?), the pipe clock is muxed between two source clocks by this piece of code (in the pipe_clock module):
//---------- PCLK Mux ----------------------------------
BUFGCTRL pclk_i1
(
    //---------- Input ---------------------------------
    .CE0     (1'd1),
    .CE1     (1'd1),
    .I0      (clk_125mhz),
    .I1      (clk_250mhz),
    .IGNORE0 (1'd0),
    .IGNORE1 (1'd0),
    .S0      (~pclk_sel),
    .S1      ( pclk_sel),
    //---------- Output --------------------------------
    .O       (pclk_1)
);
end
So pclk_sel, which is a registered version of the CLK_PCLK_SEL input port, is used to switch between a 125 MHz clock (pclk_sel == 0) and a 250 MHz clock (pclk_sel == 1). Both clocks are generated by the same MMCM_ADV block in the pipe_clock module.
The BUFGCTRL’s output, pclk_1, is assigned as the pipe clock output (CLK_PCLK). It’s also used in other ways, depending on the instantiation parameters of pipe_clock.
Constraints for Gen1 PCIe blocks
If a PCIe block is configured for Gen1 only, there’s no question about the pipe clock’s frequency: It’s 125 MHz. As a matter of fact, if the PCIE_LINK_SPEED instantiation parameter is set to 1, one gets (by virtue of Verilog’s generate statements):
BUFG pclk_i1
(
    //---------- Input ---------------------------------
    .I (clk_125mhz),
    //---------- Output --------------------------------
    .O (clk_125mhz_buf)
);
assign pclk_1 = clk_125mhz_buf;
But never mind this — it’s never used: Even when the block is configured as Gen1 only, PCIE_LINK_SPEED is set to 3 in the example design’s instantiation, and we all copy from it.
Instead, the clock mux is used and fed with pclk_sel=0. The constraints reflect this with the following lines appearing in the example design’s XDC file for Gen1 PCIe blocks (only!):
set_case_analysis 1 [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S0}]
set_case_analysis 0 [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S1}]
set_property DONT_TOUCH true [get_cells -of [get_nets -of [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S0}]]]
The first two commands tell the timing analysis tools to assume that the clock mux’ inputs are S0=1 and S1=0, and hence that the mux forwards the 125 MHz clock (connected to I0).
The DONT_TOUCH constraint works around a bug in early Vivado revisions, as explained in AR #62296: The S0 input is assigned ~pclk_sel, which requires a logic inverter. This inverter was optimized into the BUFGCTRL primitive by the synthesizer, flipping the meaning of the first set_case_analysis constraint. This caused the timing tools to analyze the design as if both S0 and S1 were set to zero, hence no clock output, and no constraining of the relevant paths.
The problem with this set of constraints is their cryptic nature: It’s not clear at all why they are there, just by reading the XDC file. If the user of the PCIe block decides, for example, to change from an 8x Gen1 configuration to 4x Gen2, everything will appear to work nicely, since all clocks except the pipe clock remain the same. It takes some initiative and effort to figure out that these constraints are incorrect for a Gen2 block.
To make things even worse, almost all relevant paths will meet the 250 MHz (4 ns) requirement even when constrained for 125 MHz on a sparsely filled FPGA, simply because there’s little logic along these paths. So odds are that everything will work fine during the initial tests (before the useful logic is added to the design), and later on the PCIe interface may become shaky throughout the design process, as some paths accidentally exceed the 4 ns limit.
Dropping the set_case_analysis constraints
As these constraints are relaxing by their nature, what happens if they are dropped? One could expect that the tools would work a bit harder to ensure that all relevant paths meet timing with either 125 MHz or 250 MHz, or simply put, that the constraining would occur as if pclk_1 was always driven with a 250 MHz clock.
But this isn’t how timing calculations are made. The tools can’t just pick the faster clock from a clock mux and follow through, since the logic driven by the clock might interact with other clock domains. If so, a slower clock might require stricter timing due to different relations between the source and target clock’s frequencies.
So what actually happens is that the timing tools mark all logic driven by the pipe clock as having multiple clocks: The timing of each path going to and from any such logic element is calculated for each of the two clocks. Even the timing for paths going between logic elements that are both driven by the pipe clock is calculated four times, covering the four combinations of the 125 MHz and 250 MHz clocks as source and destination clocks.
From a practical point of view, this is rather harmless, since both clocks come from the same MMCM_ADV, and are hence aligned. Making these excessive timing calculations always ends up with the equivalent for the 250 MHz clock only (some clock skew uncertainty possibly added for going between the two clocks). Since timing is met easily on these paths, this extra work adds very little to the implementation efforts (and how long it takes to finish).
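To see why the four combinations boil down to the 250 MHz requirement, here is a rough Tcl sketch that finds the smallest launch-to-capture edge separation between two clocks whose edges are aligned. The proc names are made up for illustration, clock skew and uncertainty are ignored, and this is a simplification of the idea, not the timing tools’ actual algorithm.

```tcl
# Greatest common divisor, used for finding the common period of two clocks
proc edge_gcd {a b} {
    while {$b != 0} {
        set t [expr {$a % $b}]
        set a $b
        set b $t
    }
    return $a
}

# Smallest separation between a launch edge and the next capture edge.
# Periods are integers in picoseconds; both clocks have a rising edge at t=0.
proc setup_requirement_ps {launch_ps capture_ps} {
    set common [expr {$launch_ps * $capture_ps / [edge_gcd $launch_ps $capture_ps]}]
    set best $common
    # Walk through every launch edge within one common period
    for {set l 0} {$l < $common} {incr l $launch_ps} {
        # First capture edge strictly after this launch edge
        set c [expr {($l / $capture_ps + 1) * $capture_ps}]
        set d [expr {$c - $l}]
        if {$d < $best} { set best $d }
    }
    return $best
}

# The four source/destination combinations of 125 MHz (8000 ps) and 250 MHz (4000 ps)
foreach {src dst} {8000 8000  4000 4000  8000 4000  4000 8000} {
    puts "$src ps -> $dst ps: requirement [setup_requirement_ps $src $dst] ps"
}
```

Only the 125 MHz to 125 MHz case gives 8000 ps. The other three give 4000 ps, so a path that is analyzed for all four combinations is effectively held to the 250 MHz requirement.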
On the other hand, this adds some dirt to the timing report. First, the multiple clocks are reported (excerpt from the Timing Report):
7. checking multiple_clock
--------------------------
 There are 2598 register/latch pins with multiple clocks. (HIGH)
Later on, the paths between logic driven by the pipe clock are counted as inter clock paths: Once from 125 MHz to 250 MHz, and vice versa. This adds up to a large number of bogus inter clock paths:
------------------------------------------------------------------------------------------------
| Inter Clock Table
| -----------------
------------------------------------------------------------------------------------------------
From Clock    To Clock      WNS(ns)  TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints  WHS(ns)  THS(ns)  THS Failing Endpoints  THS Total Endpoints
----------    --------      -------  -------  ---------------------  -------------------  -------  -------  ---------------------  -------------------
clk_250mhz    clk_125mhz      0.114    0.000                      0                 5781    0.053    0.000                      0                 5781
clk_125mhz    clk_250mhz      0.114    0.000                      0                 5764    0.053    0.000                      0                 5764
Since a single endpoint might produce many paths (e.g. a block RAM), there’s no need for a correlation between the number of endpoints and the number of paths. However the similarity between the figures of the two directions seems to indicate that the vast majority of these paths are bogus.
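To look at these paths in the Vivado Tcl console, something along these lines should do. The clock names are those appearing in the Inter Clock Table and may differ in another design, and the number of paths to report is an arbitrary choice.

```tcl
# Summarize the timing relations between every pair of clocks in the design
report_clock_interaction

# List the worst setup paths launched by the 125 MHz clock and captured by
# the 250 MHz clock (clock names assumed to match the timing report)
report_timing -from [get_clocks clk_125mhz] -to [get_clocks clk_250mhz] \
    -delay_type max -max_paths 10
```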
So dropping the set_case_analysis constraints boils down to some noise in the timing report. I can think of two ways to eliminate it:
- Issue set_case_analysis constraints setting S0=0, S1=1, so the tools assume a 250 MHz clock. This covers the Gen2 case as well as Gen1.
- Use the constraints of the example design for a Gen2 block (shown below).
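For the first option, the constraints could look something like the following. This is an untested sketch, derived from the Gen1 constraints of the example design with the two case values swapped. The pin paths are assumed to be the same as in the example design.

```tcl
# Assume the mux always forwards I1 (250 MHz): S0=0, S1=1
set_case_analysis 0 [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S0}]
set_case_analysis 1 [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S1}]

# Kept as in the example design, so that the inverter feeding S0 isn't
# absorbed into the BUFGCTRL (which would flip the meaning of the S0 value)
set_property DONT_TOUCH true [get_cells -of [get_nets -of [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S0}]]]
```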
Even though both ways (in particular the second) seem OK to me, I prefer taking the dirt in the timing report over adding constraints without understanding the full implications. Being more restrictive never hurts (as long as the design meets timing).
Constraints for Gen2 PCIe blocks
If a PCIe block is configured for Gen2, it has to be able to work as Gen1 as well. So the set_case_analysis constraints are out of the question.
Instead, this is what one gets in the example design:
create_generated_clock -name clk_125mhz_x0y0 [get_pins pcie_myblock_support_i/pipe_clock_i/mmcm_i/CLKOUT0]
create_generated_clock -name clk_250mhz_x0y0 [get_pins pcie_myblock_support_i/pipe_clock_i/mmcm_i/CLKOUT1]
create_generated_clock -name clk_125mhz_mux_x0y0 \
    -source [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/I0] \
    -divide_by 1 \
    [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/O]
#
create_generated_clock -name clk_250mhz_mux_x0y0 \
    -source [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/I1] \
    -divide_by 1 -add -master_clock [get_clocks -of [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/I1]] \
    [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/O]
#
set_clock_groups -name pcieclkmux -physically_exclusive -group clk_125mhz_mux_x0y0 -group clk_250mhz_mux_x0y0
This may seem tangled, but says something quite simple: The 125 MHz and 250 MHz clocks are physically exclusive (see AR #58961 for an elaboration on this). In other words, these constraints declare that no path exists between logic driven by one clock and logic driven by the other. If such a path is found, it’s bogus.
So this drops all the bogus paths mentioned above. Each path between logic driven by the pipe clock is now calculated twice (for 125 MHz and 250 MHz, but not across the clocks). This seems to yield the same practical results as without these constraints, but without complaints about multiple clocks, and of course no inter-clock paths.
Both clocks are still related to the pipe clock however. For example, checking a register driven by the pipe clock yields (Tcl session):
get_clocks -of_objects [get_pins -hier -filter {name=~*/pipe_clock_i/pclk_sel_reg1_reg[0]/C}]
clk_250mhz_mux_x0y0 clk_125mhz_mux_x0y0
Not surprisingly, this register is attached to two clocks. The multiple clock complaint disappeared thanks to the set_clock_groups constraint (even the lower “asynchronous” flag is enough for this purpose).
So can these constraints be used for a Gen1-only block, as a safer alternative to the set_case_analysis constraints? It seems so. Is it a good bargain for getting rid of those extra notes in the timing report? It’s a matter of personal choice. Or of knowing for sure.
Bonus: Meaning of some instantiation parameters of pipe_clock
This is the meaning according to a dissection of Kintex-7’s pipe_clock Verilog file. It’s probably the same for other targets.
PCIE_REFCLK_FREQ: The frequency of the reference clock
- 1 => 125 MHz
- 2 => 250 MHz
- Otherwise: 100 MHz
CLKFBOUT_MULT_F is set so that the MMCM_ADV’s internal VCO always runs at 1 GHz. Hence the constant CLKOUT0_DIVIDE_F = 8 makes clk_125mhz run at 125 MHz (dividing by 8), and CLKOUT1_DIVIDE = 4 makes clk_250mhz run at 250 MHz (dividing by 4).
PCIE_USERCLK1_FREQ: The frequency of the module’s CLK_USERCLK1 output, which is, among other things, the clock of the user interface (a.k.a. user_clk_out or axi_clk)
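The MMCM arithmetic can be sketched in Tcl as follows. Note the assumptions: the multiplier is derived here from the 1 GHz VCO target rather than copied from the Verilog file, the reference clock is assumed not to be divided before the multiplication (DIVCLK_DIVIDE = 1), and mmcm_outputs_mhz is a made-up helper name.

```tcl
# Sketch: multiplier needed for a 1 GHz VCO, and the resulting output clocks.
# Assumes f_vco = f_ref * CLKFBOUT_MULT_F, i.e. DIVCLK_DIVIDE = 1.
proc mmcm_outputs_mhz {refclk_mhz} {
    set vco_mhz 1000.0
    set mult   [expr {$vco_mhz / $refclk_mhz}]  ;# multiplier for a 1 GHz VCO
    set clk125 [expr {$vco_mhz / 8}]            ;# CLKOUT0_DIVIDE_F = 8
    set clk250 [expr {$vco_mhz / 4}]            ;# CLKOUT1_DIVIDE = 4
    return [list $mult $clk125 $clk250]
}

# The three reference clock frequencies selectable with PCIE_REFCLK_FREQ
foreach ref {100 125 250} {
    lassign [mmcm_outputs_mhz $ref] mult clk125 clk250
    puts "refclk $ref MHz: multiplier $mult, CLKOUT0 $clk125 MHz, CLKOUT1 $clk250 MHz"
}
```

The output clocks come out as 125.0 and 250.0 MHz for every reference clock frequency. Only the multiplier changes (10.0, 8.0 and 4.0 respectively).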
- 1 => 31.25 MHz
- 2 => 62.5 MHz
- 3 => 125 MHz
- 4 => 250 MHz
- 5 => 500 MHz
- Otherwise: 62.5 MHz
PCIE_USERCLK2_FREQ: The frequency of the module’s CLK_USERCLK2 output. Not used in most applications. Same frequency mapping as PCIE_USERCLK1_FREQ.