Intel / Altera: Proper access of the Configuration Space Registers (tl_cfg_ctl)

This post was written by eli on August 4, 2019
Posted Under: FPGA,Intel FPGA (Altera)


The PCIe blocks on Intel (formerly Altera) FPGAs have a somewhat peculiar, yet useful interface for exposing some of the interface’s configuration information, called “Transaction Layer Configuration”. For the purposes of this post, it consists of two output signals, tl_cfg_add[3:0] and tl_cfg_ctl[31:0]. Both are driven by the PCIe block, and clocked by the same clock. The documentation is however inconsistent on whether this clock is pld_clk or coreclkout_hip (even between different versions of the same user guide).

Recall that coreclkout_hip is driven by the PCIe block, and pld_clk is an input to it. The straightforward choice is to connect the pld_clk input to coreclkout_hip, so the PCIe block, as well as the application logic, are both driven by the clock the PCIe block generates. It’s actually not clear to me why anyone would go for a different solution, but the user guide keeps this option open.

So I’ll assume pld_clk and coreclkout_hip are the same signal here. Hence the said confusion in the documentation makes no difference. However, it may be a hint as to why Intel changed the guidelines for interacting with the tl_cfg_add / tl_cfg_ctl pair, probably around the release of Quartus 16.0.

The interface in brief

It’s more or less like a cyclic slide show: tl_cfg_add is incremented every 4th or 8th clock (i.e. pld_clk = coreclkout_hip), and tl_cfg_ctl contains a value that corresponds to the position in some register array, at the address shown in tl_cfg_add. Both change on the same rising edge of the same clock.
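For simulation purposes, this slide show can be mimicked with a small behavioral model. The following is just a sketch under my own assumptions: the module name tl_cfg_model, the HOLD parameter and the contents of the register array are all made up for illustration, and say nothing about how the real PCIe block is implemented.

```verilog
// Behavioral sketch of the tl_cfg_add / tl_cfg_ctl "slide show".
// Hypothetical: The module name, the HOLD parameter and the register
// contents are made up for illustration. This is not Intel's logic.
module tl_cfg_model #(parameter HOLD = 8) // 4 or 8 clocks per address
  (
   input             clk,
   output reg [3:0]  tl_cfg_add,
   output reg [31:0] tl_cfg_ctl
   );

   reg [31:0] regs [0:15]; // Stand-in for the 16-word register array
   reg [3:0]  dwell;       // Clocks spent so far on the current address
   integer    i;

   wire [3:0] next_add = tl_cfg_add + 1; // Wraps from 15 back to 0

   initial begin
      for (i = 0; i < 16; i = i + 1)
        regs[i] = 32'hc0de0000 + i; // Arbitrary, recognizable pattern
      tl_cfg_add = 0;
      tl_cfg_ctl = regs[0];
      dwell = 0;
   end

   // Address and data change together, on the same rising edge
   always @(posedge clk)
     if (dwell == HOLD - 1) begin
        dwell <= 0;
        tl_cfg_add <= next_add;
        tl_cfg_ctl <= regs[next_add];
     end else
       dwell <= dwell + 1;
endmodule
```

Setting HOLD to 4 or 8 selects how many clocks each address is held, which is handy for checking application logic against both cases.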

This register array appears in the relevant FPGA family’s user guide, has no relation to anything standard, and varies from one FPGA family to another. The data is packed into this 16-word register array in a creative manner at times. The information that can be obtained on an endpoint PCIe interface is in particular the bus address, Max_Payload_Size, Max_Read_Request_Size and whether the RCB is 128 bytes. On post-Cyclone IV devices, the data for producing an MSI interrupt can be obtained from this interface as well.

The old API

In older revisions of the user guides (and even the current revision for Cyclone IV), the guideline is to treat tl_cfg_add and tl_cfg_ctl as synchronous signals, so it made perfect sense to go (on Cyclone 10 and several others)

always @(posedge pld_clk)
  case (tl_cfg_add)
    0: cfg_dcommand <= tl_cfg_ctl[31:16];
    2: cfg_lcommand <= tl_cfg_ctl[31:16];
  endcase

The fact that tl_cfg_add dwells a few clocks on each address makes no difference, as tl_cfg_ctl contains the same, correct, value on all of these clock cycles.
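As for what application logic typically does with these two registers, here’s a sketch of decoding them. It assumes that cfg_dcommand holds the PCIe Device Control register and cfg_lcommand holds the Link Control register, with the bit layout defined in the PCI Express specification. That’s what their names suggest, but it should be verified against the user guide of the relevant FPGA family. The module name and output names are made up.

```verilog
// Sketch only. Assumption: cfg_dcommand is the Device Control register
// and cfg_lcommand is the Link Control register, as laid out in the
// PCI Express specification. Module and output names are hypothetical.
module cfg_decode
  (
   input [15:0]  cfg_dcommand,
   input [15:0]  cfg_lcommand,
   output [12:0] max_payload_bytes,
   output [12:0] max_read_req_bytes,
   output        rcb_is_128
   );

   // Both size fields are encoded as 128 bytes << value: 0 means
   // 128 bytes, 5 means 4096 bytes. 6 and 7 are reserved, and yield
   // zero here.
   assign max_payload_bytes  = 13'd128 << cfg_dcommand[7:5];
   assign max_read_req_bytes = 13'd128 << cfg_dcommand[14:12];

   // Read Completion Boundary: 0 means 64 bytes, 1 means 128 bytes
   assign rcb_is_128 = cfg_lcommand[3];
endmodule
```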

The update

In more recent user guides issued by Intel, in particular for Series V and Series 10 FPGAs, the guidelines have changed. Apparently, the timing constraints were not properly applied before Quartus 16.0.1, and one can’t treat tl_cfg_add and tl_cfg_ctl as synchronous signals anymore. Rather, the guidelines require that application logic detect the change in tl_cfg_add’s least significant bit, and then sample both signals with a safe time margin. Without saying it explicitly, the user guides treat these two as signals from an unrelated clock domain.

The fact that both signals remain constant for a fixed number of clock cycles makes it possible to write simple logic that ensures proper sampling.

From a practical point of view, I can testify that the old API works well regardless of the new, stricter, guidelines. It’s not clear whether the sampling mechanism is actually required for proper operation, or if it’s a leftover in the guidelines for Quartus revisions that didn’t enforce the timing constraints properly on the said signals. Either way, if the user guide requires something, do it. Nevertheless, odds are that FPGA designs already out there, based upon the old API, are still fine.

The new guideline

In short, updated user guides require that tl_cfg_add and tl_cfg_ctl be sampled in the middle of the time window during which their values are stable. The beginning of each such time window is detected by the application logic by a change in the least significant bit of tl_cfg_add.

There are however a few things to note:

  • The user guides state that the said time window is either 4 clocks or 8 clocks, “depending on the parameterization”, but don’t say how to tell which one applies for a given design.
  • As tl_cfg_add is considered an asynchronous signal, changes in its least significant bit must be detected by comparing registers that sample it on each clock, rather than by comparing a register directly with tl_cfg_add[0].
  • It’s not possible to sample exactly in the middle of a time window consisting of an even number of clock cycles. In the user guide’s example, the sampling is timed at the clock cycle after the middle of the time window.

The user guide shows sample Verilog code for a time window of 8 clocks, but doesn’t address the 4-clock case.

Verilog code

The following Verilog code can be used to implement the generation of the sampling strobe:

reg [3:0] tl_cfg_add0_d;
reg       strobe;
reg [3:0] counter, prev_counter;

always @(posedge pld_clk)
  begin
    // Shift register: Bit 0 is the most recent sample of tl_cfg_add[0]
    tl_cfg_add0_d <= { tl_cfg_add0_d[2:0], tl_cfg_add[0] };

    if (prev_counter > 5) // 8-clock window
      strobe <= (tl_cfg_add0_d[2] != tl_cfg_add0_d[3]);
    else // 4-clock window
      strobe <= (tl_cfg_add0_d[0] != tl_cfg_add0_d[1]);

    if (tl_cfg_add0_d[0] == tl_cfg_add0_d[1])
      counter <= counter + 1;
    else
      begin
        prev_counter <= counter;
        counter <= 0;
      end
  end

and then this strobe can be used to sample the two signals:

always @(posedge pld_clk)
  if (strobe)
    begin
      tl_cfg_add_samp <= tl_cfg_add;
      tl_cfg_ctl_samp <= tl_cfg_ctl;
    end

The idea is to detect whether the time window is 4 or 8 clock cycles by counting them, and storing the counter’s value in prev_counter just before the counter is reset back to zero. As the counter starts from zero, prev_counter is expected to be 3 for a 4-clock window, but may also turn out 2 or 4 due to momentary timing glitches. Likewise, prev_counter is expected to be 7 when an 8-clock window is in effect, but may turn out 6 or 8.

prev_counter is used to select when to assert strobe: If it’s larger than 5, a timing suitable for an 8-clock window is selected. If not, the timing for a 4-clock window is selected. Even though prev_counter may fluctuate from one window to another, it’s not expected to change in a way that alters the selection. Therefore, the fact that prev_counter was measured on one time window, and is used to time the sampling of another, has no significance. It might as well have been measured on one time window and applied forever afterwards, but that would have required logic that determines when it’s valid.

The sampling instant for a 4-clock time window is two clocks after the change in tl_cfg_add[0], which is consistent with the guideline to sample at the middle of the window (actually, on the closest clock cycle after the middle). For an 8-clock window, the delay is 4 clocks, exactly as demonstrated in the examples in the user guides.
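These delays can be checked in simulation. The following testbench is a sketch under my own assumptions: the stimulus (the window variable and the hold counter) is a made-up stand-in for the PCIe block, and the delay is counted from the first clock edge at which the new tl_cfg_add value is visible. The strobe logic itself is copied from the code above.

```verilog
// Simulation-only sketch: Measures when strobe is active, relative to
// the change in tl_cfg_add. The stimulus is a hypothetical stand-in
// for the PCIe block.
module strobe_tb;
   reg       clk = 0;
   always #5 clk = ~clk;

   // Stimulus: tl_cfg_add advances every "window" clocks
   integer   window = 4;
   reg [3:0] tl_cfg_add = 0;
   reg [3:0] hold = 0;

   always @(posedge clk)
     if (hold == window - 1) begin
        hold <= 0;
        tl_cfg_add <= tl_cfg_add + 1;
     end else
       hold <= hold + 1;

   // Generation of the sampling strobe, same logic as in this post
   reg [3:0] tl_cfg_add0_d = 0;
   reg       strobe = 0;
   reg [3:0] counter = 0, prev_counter = 0;

   always @(posedge clk) begin
      tl_cfg_add0_d <= { tl_cfg_add0_d[2:0], tl_cfg_add[0] };

      if (prev_counter > 5)
        strobe <= (tl_cfg_add0_d[2] != tl_cfg_add0_d[3]);
      else
        strobe <= (tl_cfg_add0_d[0] != tl_cfg_add0_d[1]);

      if (tl_cfg_add0_d[0] == tl_cfg_add0_d[1])
        counter <= counter + 1;
      else begin
         prev_counter <= counter;
         counter <= 0;
      end
   end

   // Measurement: age is 0 on the first clock edge that sees a new
   // tl_cfg_add value, 1 on the next one, and so on. Blocking
   // assignments are OK here, as no other block reads these variables
   // on the rising clock edge.
   reg [3:0] add_prev = 0;
   integer   age = 0, last_delay = 0, delay4;

   always @(posedge clk) begin
      if (tl_cfg_add != add_prev) age = 0; else age = age + 1;
      add_prev = tl_cfg_add;
      if (strobe) last_delay = age; // Sampling takes place on this edge
   end

   initial begin
      repeat (64) @(negedge clk);  // Run with a 4-clock window
      delay4 = last_delay;
      window = 8;
      repeat (128) @(negedge clk); // Run with an 8-clock window
      $display("4-clock window: sampled %0d clocks after the change", delay4);
      $display("8-clock window: sampled %0d clocks after the change", last_delay);
      $finish;
   end
endmodule
```

With this way of counting, the simulation is expected to report 2 clocks for the 4-clock window and 4 clocks for the 8-clock window. Note that the first 8-clock window after the switch is still sampled with the 4-clock timing, because prev_counter is updated only when that window ends.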

Finally, the sampled data can be consumed, e.g.

always @(posedge pld_clk)
  begin
    strobe_d <= strobe;

    if (strobe_d)
      case (tl_cfg_add_samp)
         0: cfg_dcommand <= tl_cfg_ctl_samp[31:16];
         2: cfg_lcommand <= tl_cfg_ctl_samp[31:16];
      endcase
  end

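Note that strobe_d is used rather than strobe for consuming the sampled values, so that they are consumed on the clock cycle after they were sampled. It would work with strobe as well, except that the registers would then be updated with the values sampled on the previous time window, i.e. with a slight delay of the update.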
