Preface
While writing a custom SDMA script for copying data arriving from an eCSPI peripheral into memory, it occurred to me that there is more than one way to fetch the data from the peripheral. This post summarizes my rather decisive findings on the matter. Spoiler: The Linux driver could have done better (Freescale’s v4.1.15).
I’ve written a tutorial on SDMA scripts in general, by the way, which is recommended before diving into this one.
Using the Peripheral DMA Unit
This is the method used by the official eCSPI driver for Linux. That is, the one obtained from Freescale’s / NXP’s Linux git repository. Specifically, spi_imx_sdma_init() in drivers/spi/spi-imx.c sets up the DMA transaction with
spi_imx->rx_config.direction = DMA_DEV_TO_MEM;
spi_imx->rx_config.src_addr = res->start + MXC_CSPIRXDATA;
spi_imx->rx_config.src_addr_width = DMA_SLAVE_BUSWIDTH_1_BYTE;
spi_imx->rx_config.src_maxburst = spi_imx_get_fifosize(spi_imx) / 2;
ret = dmaengine_slave_config(master->dma_rx, &spi_imx->rx_config);
if (ret) {
dev_err(dev, "error in RX dma configuration.\n");
goto err;
}
Since res->start points at the address resource obtained from the device tree (0x2008000 for eCSPI1), this is the very same address used for accessing the peripheral registers (only the software uses the virtual address mapped to the relevant region).
In essence, it means issuing an stf command to set the PSA (Peripheral Source Address), and then reading the data with an ldf command on the PD register. For example, if the physical address (e.g. 0x2008000) is in register r1:
69c3 (0110100111000011) | stf r1, 0xc3 # PSA = r1 for 32-bit frozen peripheral read
62c8 (0110001011001000) | ldf r2, 0xc8 # Read peripheral register into r2
One would expect this to be the correct way, or why does this unit exist? Or why does Linux’ driver use it? On the other hand, if this is the right way, why is there a “DMA mapping”?
Using the Burst DMA Unit
This might sound like a bizarre idea: Use the DMA unit intended for accessing RAM for peripheral registers. I wasn’t sure this would work at all, but it does: If the same address that was fed into PSA for accessing a peripheral goes into MSA instead, the data can be read correctly from MD. After all, the same address space is used by the processor, Peripheral DMA unit and Burst DMA unit, and it turns out that the buses are interconnected (which isn’t obvious).
So the example above changes into
6910 (0110100100010000) | stf r1, 0x10 # To MSA, NO prefetch, address is frozen
620b (0110001000001011) | ldf r2, 0x0b # Read peripheral register into r2
The motivation for this type of access is using copy mode — a burst of up to 8 read/write operations in a single SDMA command. This is possible only from PSA to PDA, or from MSA to MDA. But there is no burst mode from PSA to MDA. So treating the peripheral register as a memory element works around this.
Spoiler: It’s not such a good idea. The speed results below tell why.
Using the SDMA internal bus mapping
The concept is surprisingly simple: It’s possible to access some peripherals’ registers directly in the SDMA assembly code’s memory space. In other words, to access eCSPI1, one can go just
5201 (0101001000000001) | ld r2, (r1, 0) # Read peripheral register from plain SDMA address space
and achieve the equivalent result of the examples above. But r1 needs to be set to a different address. And this is where it gets a bit confusing.
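The three opcodes shown so far follow a simple bit layout, which can be sketched in C. Note that the layouts below are inferred only from the hex listings in this post (69c3, 6910, 62c8, 620b, 5201), not copied from the reference manual, and the function names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Opcode bases inferred from the listings in this post (an assumption) */
#define SDMA_OP_LDF 0x6000u /* ldf rN, field     -> 0x6000 | N << 8 | field */
#define SDMA_OP_STF 0x6800u /* stf rN, field     -> 0x6800 | N << 8 | field */
#define SDMA_OP_LD  0x5000u /* ld rD, (rB, disp) -> 0x5000 | D << 8 | disp << 3 | B */

/* Encode "ldf rN, field": read from a functional unit into rN */
static uint16_t sdma_ldf(unsigned reg, unsigned field)
{
    assert(reg < 8 && field < 256);
    return (uint16_t)(SDMA_OP_LDF | (reg << 8) | field);
}

/* Encode "stf rN, field": write rN to a functional unit */
static uint16_t sdma_stf(unsigned reg, unsigned field)
{
    assert(reg < 8 && field < 256);
    return (uint16_t)(SDMA_OP_STF | (reg << 8) | field);
}

/* Encode "ld rD, (rB, disp)": plain load from SDMA data address space */
static uint16_t sdma_ld(unsigned dst, unsigned base, unsigned disp)
{
    assert(dst < 8 && base < 8 && disp < 32);
    return (uint16_t)(SDMA_OP_LD | (dst << 8) | (disp << 3) | base);
}
```

For example, sdma_stf(1, 0xc3) reproduces the 69c3 word of the first listing, and sdma_ld(2, 1, 0) reproduces 5201.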
The base address is fairly easy to obtain. For example, i.MX6’s reference manual lists the address for eCSPI1 as 0x2000 in section 2.4 (“DMA memory map”), where it also says that the relevant section spans 4 kB. Table 55-14 (“SDMA Data Memory Space”) in the same document assigns the region 0x2000-0x2fff to “per2”, declares its size as 16 kB, and in the description it says “peripheral 2 memory space (4 Kbyte peripheral’s address space)”. So what is it? 4 kB or 16 kB?
The answer is both: The address 0x2000 is given in SDMA data address format, meaning that each address points at a 32-bit word. Therefore, the SDMA map region of 0x2000-0x2fff indeed spans 16 kB. But the mapping to the peripheral registers was done in a somewhat creative way: The address offsets of the registers apply directly on the SDMA mapping’s addresses.
For example, let’s consider the ECSPI1_STATREG, which is placed at “Base address + 18h offset”. In the Application Processor’s address space, it’s quite clear that it’s 0x2008000 + 0x18 = 0x2008018. The 0x18 offset means 0x18 (24 in decimal) bytes away from the base.
In the SDMA mapping, the same register is accessed at 0x2000 + 0x18 = 0x2018. At first glance, this might seem obvious, but an 0x18 offset means 24 x 4 = 96 bytes away from the base address. A bit odd, but that’s the way it’s implemented.
So even though each address increment in SDMA data address space moves 4 bytes, they mapped the multiply-by-4 offsets directly, placing the registers 16 bytes apart. Attempting to access addresses like 0x2001 yields nothing noteworthy (in my experiments, they all read zero). I believe that the SDMA submodule was designed in France, by the way.
Almost needless to say, these addresses (e.g. 0x2000) can’t be used to access peripherals with Peripheral / Burst DMA units — these units work with the Application Processor’s bus infrastructure and memory map.
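To make the two address views concrete, here is a small sketch that computes both for a given register offset. The base addresses are those quoted above for eCSPI1. The macro and function names are my own, for illustration only:

```c
#include <stdint.h>

#define ECSPI1_AP_BASE   0x02008000u /* Application Processor physical base */
#define ECSPI1_SDMA_BASE 0x2000u     /* SDMA data space base, counted in 32-bit words */

/* Address for the ARM core and for the Peripheral / Burst DMA units (PSA / MSA) */
static uint32_t ap_reg_addr(uint32_t byte_offset)
{
    return ECSPI1_AP_BASE + byte_offset;
}

/* Address to load into the base register of a plain SDMA "ld" instruction.
 * The register's byte offset is added as is, even though each SDMA
 * data address step is a 32-bit word. */
static uint32_t sdma_reg_addr(uint32_t byte_offset)
{
    return ECSPI1_SDMA_BASE + byte_offset;
}

/* How far, in bytes of SDMA data space, the register sits from the base */
static uint32_t sdma_byte_distance(uint32_t byte_offset)
{
    return byte_offset * 4u;
}
```

With ECSPI1_STATREG's offset of 0x18, this gives 0x2008018 for the Application Processor, 0x2018 for the SDMA script, and a distance of 96 bytes in SDMA data space.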
Speed tests
As all three methods work, the question is how fast each is. So I ran a speed test. I only tested the peripheral read operation (my application didn’t involve writes), but I would expect more or less the same results for writes. The speed tests were carried out by starting the SDMA script from a Linux kernel module, and issuing a printk when the SDMA script was kicked off. When the interrupt arrived at the completion of the script (resulting from a “done 3” opcode, not shown in the table below), another printk was issued. The timestamps in dmesg’s output were used to measure the time difference.
In order to keep the influence of the Linux overhead delays low, the tested command was executed within a hardware loop, so that the overall execution would take a few seconds. A few milliseconds of printk delay hence became fairly negligible.
The results are given in the following table:
|                | Peripheral DMA Unit | Burst DMA Unit  | Internal bus mapping | Non-IO command  |
| Assembly code  | stf r1, 0xc3        | stf r1, 0x10    | loop endloop, 0      | loop endloop, 0 |
|                | loop endloop, 0     | loop endloop, 0 | ld r2, (r1, 0)       | addi r5, 2      |
|                | ldf r2, 0xc8        | ldf r2, 0x0b    | endloop:             | endloop:        |
|                | endloop:            | endloop:        |                      |                 |
| Execution rate | 7.74 Mops/s         | 3.88 Mops/s     | 32.95 Mops/s         | 65.97 Mops/s    |
Before concluding the results, a word on the rightmost one, which tested the speed of a basic command. The execution rate, almost 66 Mops/s, shows the SDMA machine’s upper limit. Where this came from isn’t all that clear, as I couldn’t find a matching clock rate in any of the three clocks enabled by Linux’ SDMA driver: clk_ahb, clk_ipg and clk_per.
The reference manual’s section 55.4.6 claims that the SDMA core’s frequency is limited to 104 MHz, but calling clk_get_rate() for clk_ahb returned 132 MHz (which is 2 x 66 MHz…). For the two other clocks that the imx-sdma.c driver declares that it uses, clk_ipg and clk_per (the same clock, I believe), clk_get_rate() returned 60 MHz, so it’s not that one. In short, it’s not 100% clear what’s going on, except that the figure is max 66 Mops/s.
By the way, I verified that the hardware loop doesn’t add extra cycles by duplicating the addi command, so it ran 10 times for each loop. The execution rate dropped to exactly 1/10, so there’s definitely no loop overhead.
OK, so now to the conclusions:
- The clear winner is using the internal bus. Note that the result isn’t all that impressive, after all. With 33 Mops/s, 4 bytes each, there’s a theoretical limit of 132 MB/s for just reading. That doesn’t include doing something with the data. More about that below.
- Note that reading from the internal bus takes just 2 execution cycles.
- There is a reason for using the Peripheral DMA Unit, after all: It’s twice as fast compared with the Burst DMA Unit.
- It probably doesn’t pay off to use the Burst DMA Unit for burst copying from a peripheral to memory, even though I didn’t give it a go: The read is twice as slow, and writing to memory with autoflush is rather quick (see below).
- The use of the Peripheral DMA Unit in the Linux kernel driver is quite questionable, given the results above. On the other hand, the standard set of scripts aren’t really designed for efficiency anyhow.
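The cycle counts behind these conclusions can be worked out from the table, taking the addi loop's 65.97 Mops/s as one execution cycle per instruction (that is my reading of the measurements, not a figure from the manual). A minimal sketch, with made-up function names:

```c
/* Measured rate of the plain addi loop, taken as one cycle per instruction */
#define SDMA_PEAK_MOPS 65.97

/* Execution cycles spent on each operation, given its measured rate */
static double cycles_per_op(double mops)
{
    return SDMA_PEAK_MOPS / mops;
}

/* Read throughput in MB/s, with one 32-bit word per operation */
static double read_mbytes_per_sec(double mops)
{
    return mops * 4.0;
}
```

Feeding the table's figures into cycles_per_op() gives about 2 cycles for the internal bus mapping, about 8.5 for the Peripheral DMA Unit and about 17 for the Burst DMA Unit.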
Copying data from peripheral to RAM
In this last pair of speed tests, the loop reads one value from the peripheral with Internal bus mapping (the fastest way found) and writes it to the general RAM with an stf command, using autoincrement. This is hence a realistic scenario for bulk copying of data from a peripheral data register into memory that is available to the Application Processor.
The test code had to be modified slightly, so the destination address is brought back to the beginning of the buffer every 1,000,000 write operations, since the buffer size is limited, quite naturally. So when the script begins, r7 contains the number of times to loop until resetting the destination address (that is, r7 = 1000000) and r3 contains the number of such sessions to run (was set to 200). The overhead of this larger loop is literally one in a million.
The assembly code used was:
| bigloop:
0000 008f (0000000010001111) | mov r0, r7
0001 6e04 (0110111000000100) | stf r6, 0x04 # MDA = r6, incremental write
|
0002 7802 (0111100000000010) | loop endloop, 0
0003 5201 (0101001000000001) | ld r2, (r1, 0)
0004 6a0b (0110101000001011) | stf r2, 0x0b # Write 32-bit word, no flush
| endloop:
0005 2301 (0010001100000001) | subi r3, 1 # Decrement big loop counter
0006 7cf9 (0111110011111001) | bf bigloop # Loop until r3 == 0
| quit:
0007 0300 (0000001100000000) | done 3 # Quit MCU execution
The result was 20.70 Mops/s, that is 20.7 Million pairs of read-writes per second. This sets the realistic hard upper limit for reading from a peripheral to 82.8 MB/s. Note that deducting the known time it takes to execute the peripheral read, one can estimate that the stf command runs at ~55.5 Mops/s. In other words, it’s a single cycle instruction until an autoflush is forced every 8 writes. However dropping the peripheral read command (leaving only the stf command) yields only 35.11 Mops/s. So it seems like the DMA burst unit takes advantage of the small pauses between accesses to it.
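The estimate for the stf command follows from assuming that the time of each loop iteration is simply the sum of the read time and the write time. A sketch of that arithmetic (the function names are hypothetical, and the additive model is an assumption):

```c
/* Rate of the remaining instruction, once the time of the known
 * instruction is deducted from the time of the pair. */
static double residual_mops(double pair_mops, double known_mops)
{
    double t_pair = 1.0 / pair_mops;   /* microseconds per read + write pair */
    double t_known = 1.0 / known_mops; /* microseconds per read alone */

    return 1.0 / (t_pair - t_known);
}

/* Copy throughput in MB/s, 4 bytes per read-write pair */
static double copy_mbytes_per_sec(double pair_mops)
{
    return pair_mops * 4.0;
}
```

With 20.70 Mops/s for the pair and 32.95 Mops/s for the read alone, residual_mops() lands in the mid-50s, in line with the estimate above.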
I should mention that the Linux system was overall idle while performing these tests, so there was little or no congestion on the physical RAM. The results were repeatable within 0.1% of the execution time.
Note that automatic flush was enabled during this test, so the DMA burst unit received 8 writes (32 bytes) before flushing the data into RAM. When reattempting this test, with explicit flush on each write to RAM (exactly the same assembly code as listed above, with a peripheral read and then stf r7, 0x2b instead of 0x0b), the result dropped to 6.83 Mops/s. Which is tantalizingly similar to the 7.74 Mops result obtained for reading from the Peripheral DMA Unit.
Comparing with non-DMA
Even though not directly related, it’s worth comparing how fast the host accesses the same registers. For example, how much time will this take (in Linux kernel code, of course)?
for (i=0; i<10000000; i++)
rc += readl(ecspi_regs + MX51_ECSPI_STAT);
So the results are as follows:
- Reading from an eCSPI register (as shown above): 4.10 Mops/s
- The same, but from RAM (non-cacheable, allocated with dma_alloc_coherent): 6.93 Mops/s
- The same, reading with readl() from a region handled by RAM cache (so it’s considered volatile): 58.14 Mops/s
- Writing to an eCSPI register (with writel(), loop similar to above): 3.8696 Mops/s
This was carried out on an i.MX6 processor with a clock frequency of 996 MHz.
The figures echo well with those found in the SDMA tests, so it seems like the dominant delays come from i.MX6’s bus bridges. It’s also worth noting the surprisingly slow performance of readl() from cacheable memory, maybe because of the memory barriers.
Motivation
Even though SPI is commonly used for controlling rather low-speed peripherals on an embedded system, it can also come in handy for communicating data with an FPGA.
When using the official Linux driver, the host can only be the SPI master. It means, among others, that transactions are initiated by the host: When the bursts take place is completely decided by software, and so is how long they are. It’s not just about who drives which lines, but also the fact that the FPGA is on the responding side. This may not be a good solution when the data rates are anything but really slow: If the FPGA is slave, it must wait for the host to poll it for data (a bit like a USB peripheral). That can become a bit tricky at the higher end of data rates.
For example, if the FPGA’s FIFO is 16 kbit deep, and is filled at 16 Mbit/s, it takes 1 ms for it to overflow, unless drained by the host. This can be a difficult real-time task for a user-space Linux program (based upon spidev, for example). Not to mention how twisted such a solution will end up, having the processor constantly spinning in a loop collecting data, whether there is data to collect or not.
Another point is that the SPI clock is always driven by the SPI master, and it’s usually not a free-running one. Rather, bursts of clock edges are presented on the clock wire to advance the data transaction.
Handling a gated clock correctly on an FPGA isn’t easy when it’s controlled by an external device (unless its frequency is quite low). From an FPGA design point of view, it’s by far simpler to drive the SPI clock and handle the timing of the MOSI/MISO signals with respect to it.
And finally: If a good utilization of the upstream (FPGA to host) SPI channel is desired, putting the FPGA as master has another advantage. For example, on i.MX6 Dual/Quad, the SPI clock cycle is limited to a cycle of 15 ns for write transactions, but to 40 ns or 55 ns on read transactions, depending on the pins used. The same figures are true, regardless of whether the host is master or slave (compare sections 4.11.2.1 and 4.11.2.2 in the relevant datasheet, IMX6DQCEC.pdf). So if the FPGA needs to send data faster than 25 Mbps, it can only use write cycles, hence it has to be the SPI master.
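The data rate limits follow directly from the minimal clock periods, as one bit is transferred on each SPI clock cycle. A trivial sketch (the function name is mine):

```c
/* Maximal SPI bit rate in Mbit/s for a given minimal clock period in ns,
 * assuming one bit per clock cycle. */
static double max_mbit_per_sec(double sclk_period_ns)
{
    return 1000.0 / sclk_period_ns;
}
```

So 15 ns allows about 66.7 Mbit/s, 40 ns allows 25 Mbit/s and 55 ns allows only about 18.2 Mbit/s.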
CS is useless…
This is the “Chip Select” signal, or “Slave Select” (SS) in Freescale / NXP terminology.
The reference manual, along with NXP’s official errata ERR009535, clearly state that deasserting the SPI’s CS wire is not a valid way to end a burst. Citing the description for the SS_CTL field of ECSPIx_CONFIGREG, section 21.7.4 in the i.MX6 Reference Manual:
In slave mode – an SPI burst is completed when the number of bits received in the shift register is equal to (BURST_LENGTH + 1). Only the n least-significant bits (n = BURST_LENGTH[4:0] + 1) of the first received word are valid. All bits subsequent to the first received word in RXFIFO are valid.
So the burst length is fixed. The question is, what value to pick. Short answer: 32 bits (set BURST_LENGTH to 31).
Why 32? First, let’s recall that RXFIFO is 32 bits wide. So what is more natural than packing the incoming data into full 32 bits entries in the RXFIFO, fully utilizing its storage capacity? Well, maybe the natural data alignment isn’t 32 bits, so another packing scheme could have been better. In theory.
That’s where the second sentence in the citation above comes in. What it effectively says is that if BURST_LENGTH + 1 is chosen to be anything other than a multiple of 32, the first word ever pushed into RXFIFO since the SPI module’s reset will contain less than 32 received bits. All the rest, no matter what BURST_LENGTH is set to, will contain 32 bits of received data. This is really what happens. So in the long run, data is packed into 32 bit words no matter what. Choosing BURST_LENGTH + 1 other than a multiple of 32 will just mess up things on the first word the RXFIFO receives after waking up from reset. Nothing else.
So why not set BURST_LENGTH to anything other than 31? Simply because there’s no reason to do so. We’re going to end up with an SPI slave that shifts bits into RXFIFO as 32 bit words anyhow. The term “burst” has no significance, since deassertions of CS are ignored anyhow. In fact, I’m not sure if there’s any difference between the different values that satisfy the multiple-of-32 rule.
Note that since CS doesn’t function as a frame for bursts, it’s important that the eCSPI module is brought out of reset while there’s no traffic (i.e. clock edges), or it will pack the data in an unaligned and unpredictable manner. Also, if the FPGA accidentally toggles the clock (due to a bug), alignment is lost until the eCSPI is reset and reinitialized.
Bottom line: The SPI slave receiver just counts 32 clock edges, and packs the received data into RXFIFO. Forever. There is no other useful alternative when the host is slave.
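To illustrate this behavior, here is a toy software model of the receiver with BURST_LENGTH = 31: it shifts in one bit per clock edge, MSB first, and pushes a word after every 32 edges. This is only my simplified view of what was described above, not a description of the actual hardware. All names are made up, and the 64-entry depth is arbitrary:

```c
#include <stddef.h>
#include <stdint.h>

#define RX_MODEL_DEPTH 64 /* arbitrary depth for this model */

struct rx_model {
    uint32_t shift;  /* shift register collecting incoming bits */
    unsigned count;  /* clock edges counted since the last push */
    uint32_t fifo[RX_MODEL_DEPTH];
    size_t fill;     /* number of words pushed so far */
};

/* One SPI clock edge: shift in a data bit, push a word every 32 edges */
static void rx_model_clock(struct rx_model *m, unsigned data_bit)
{
    m->shift = (m->shift << 1) | (data_bit & 1u);

    if (++m->count == 32) {
        if (m->fill < RX_MODEL_DEPTH)
            m->fifo[m->fill++] = m->shift;
        m->shift = 0;
        m->count = 0;
    }
}

/* Send a byte on the wire, MSB first */
static void rx_model_send_byte(struct rx_model *m, uint8_t byte)
{
    for (int bit = 7; bit >= 0; bit--)
        rx_model_clock(m, (byte >> bit) & 1u);
}
```

Sending the bytes 0x12, 0x34, 0x56, 0x78 to a zero-initialized model yields the single word 0x12345678. A single stray clock edge before that shifts everything by one bit, and nothing in the model ever restores the alignment.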
… but must be taken care of properly
Since the burst length doesn’t depend on the CS signal, it might as well be kept asserted all the time. With the register setting given below, that means holding the pin constantly low. It’s however important to select the correct pin in the CHANNEL_SELECT field of ECSPIx_CONREG: The host will ignore the activity on the SPI bus unless CS is selected. In other words, you can’t terminate a burst with CS, but if it isn’t asserted, bits aren’t sampled.
Another important thing to note, is that the CS pin must be IOMUXed as a CS signal. In the typical device tree for the mainstream Linux SPI master driver, it’s assigned as a GPIO pin. That’s no good for an SPI slave.
So, for example, if the ECSPI entry in the device tree says:
&ecspi1 {
[ ... ]
pinctrl-names = "default";
pinctrl-0 = <&pinctrl_ecspi1_1>;
status = "okay";
};
meaning that the IOMUX settings given in pinctrl_ecspi1_1 should be applied, when the Linux driver related to ecspi1 is probed. It should say something like
&iomuxc {
imx6qdl-var-som-mx6 {
[ ... ]
pinctrl_ecspi1_1: ecspi1grp {
fsl,pins = <
MX6QDL_PAD_DISP0_DAT22__ECSPI1_MISO 0x1f0b1
MX6QDL_PAD_DISP0_DAT21__ECSPI1_MOSI 0x1f0b1
MX6QDL_PAD_DISP0_DAT20__ECSPI1_SCLK 0x130b1
MX6QDL_PAD_DISP0_DAT23__ECSPI1_SS0 0x1f0b1
>;
};
[ ... ]
The actual labels differ depending on the processor’s variant, which pins were chosen etc. The point is that the _SS0 usage was selected for the pin, and not the GPIO alternative (in which case it would say MX6QDL_PAD_DISP0_DAT23__GPIO5_IO17). The list of IOMUX defines for the i.MX6 DL variant can be found in arch/arm/boot/dts/imx6dl-pinfunc.h.
Endianness
The timing diagrams for SPI communication in the Reference Manual show only 8 bit examples, with MSB received first. But this applies to 32 bit words as well. But what happens if 4 bytes are sent with the intention of being treated as a string of bytes?
Because the first byte is treated as the MSB of a 32-bit word, it’s going to end up as the last byte when the 32-bit word is copied (by virtue of a single 32-bit read and write) into RAM, whether done by the processor or by SDMA. This ensures that a 32-bit integer is interpreted correctly by the Little Endian processor when transmitted over the SPI bus, but messes up single bytes transmitted.
Where exactly this flipping takes place, I’m not sure, but it doesn’t really matter. Just be aware that if a sequence of bytes is sent over the SPI link, the bytes need to be swapped in groups of 4 to appear in the correct order in the processor’s memory.
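The byte swap can be done in place on the receive buffer. A minimal sketch, working on plain bytes so it doesn't depend on the endianness of the machine running it (the function name is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order within each group of 4 bytes, in place.
 * Trailing bytes that don't make up a full group are left untouched. */
static void swap_bytes_in_words(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint8_t tmp;

        tmp = buf[i];
        buf[i] = buf[i + 3];
        buf[i + 3] = tmp;

        tmp = buf[i + 1];
        buf[i + 1] = buf[i + 2];
        buf[i + 2] = tmp;
    }
}
```

So if "ABCDEFGH" was sent on the wire and arrived in memory as "DCBAHGFE", this restores the original order. In Linux kernel code, applying swab32() on each 32-bit word does the same job.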
Register setting
In terms of a Linux kernel driver, the probe of an SPI slave is pretty much the same as the SPI master, with a few obvious differences. For example, the SPI clock’s frequency isn’t controlled by the host, so it probably doesn’t matter so much how the dividers are set (but it’s probably wise to set these dividers to 1, in case the internal clock is used for something).
ctrl = MX51_ECSPI_CTRL_ENABLE | /* Enable module */
/* MX51_ECSPI_CTRL_MODE_MASK not set, so it's slave mode */
/* Both clock dividers set to 1 => 60 MHz, not clear if this matters */
MX51_ECSPI_CTRL_CS(which_cs) | /* Select CSn */
(31 << MX51_ECSPI_CTRL_BL_OFFSET); /* Burst len = 32 bits */
cfg = 0; /* All defaults, in particular, no clock phase / polarity change */
/* CTRL register always go first to bring out controller from reset */
writel(ctrl, regs + MX51_ECSPI_CTRL);
writel(cfg, regs + MX51_ECSPI_CONFIG);
/*
* Wait until the changes in the configuration register CONFIGREG
* propagate into the hardware. It takes exactly one tick of the
* SCLK clock, but we will wait 10 us to be sure (SCLK is 60 MHz)
*/
udelay(10);
/*
Turn off DMA requests (revert the register to its defaults)
But set the RXFIFO watermark as required by device tree.
*/
writel(MX51_ECSPI_DMA_RX_WML(rx_watermark),
regs + MX51_ECSPI_DMA);
/* Enable interrupt when RXFIFO reaches watermark */
writel(MX51_ECSPI_INT_RDREN, regs + MX51_ECSPI_INT);
The example above shows the settings that apply when the host reads from the RXFIFO directly. Given the measurements I present in another post of mine, showing ~4 Mops/s with a plain readl() call, it means that at the maximal bus rate of 66 Mbit/s, which is ~2.06 Mops/s (32 bits per read), we have a processor core 50% busy just on readl() calls.
So for higher data rates, SDMA is pretty much a must.
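The processor load estimate is simple arithmetic, sketched here with hypothetical function names. It assumes that each readl() fetches one 32-bit word and that the measured readl() rate is the only cost:

```c
/* Required RXFIFO reads per second, in Mops/s, for a given SPI bit rate */
static double required_read_mops(double mbit_per_sec)
{
    return mbit_per_sec / 32.0;
}

/* Fraction of a processor core spent on readl() calls alone */
static double readl_cpu_load(double mbit_per_sec, double readl_mops)
{
    return required_read_mops(mbit_per_sec) / readl_mops;
}
```

With 66 Mbit/s and the 4.10 Mops/s measured for readl() on an eCSPI register, the load comes out at roughly half a core.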
The speed test
Eventually, I ran a test. With a dedicated SDMA script, SPI clock running at 112 MHz, 108.6 Mbit/s actual throughput:
# time dd if=/dev/myspi of=/dev/null bs=64k count=500
500+0 records in
500+0 records out
32768000 bytes (33 MB, 31 MiB) copied, 2.41444 s, 13.6 MB/s
real 0m2.434s
user 0m0.000s
sys 0m1.610s
This data rate is, of course, way above the allowed SPI clock frequency of 66 MHz, but it’s not uncommon that real-life results are so much better. I didn’t bother pushing the clock higher.
I ran a long and rigorous test looking for errors on the data transmission line (~ 1 TB of data) and it was completely clean at 112 MHz, so the SPI slave is reliable. For a production system, I wouldn’t consider exceeding 66 MHz, despite this result. Just to have that said.
But the bottom line is that the SPI slave mode can be used as a simple transmission link of 32-bit words. Often that’s good enough.
For some weird reason (actually, running an old Atari 800 emulator), my autorepeat was suddenly not working. On any window. That’s the moment one becomes aware of how often it’s used.
Quick solution:
$ xset r on
and it was all fine again. Or
$ xset q
to see all settings.
About this post
These are things I wrote down at different stages of introducing myself to Nios II and its environment. Nothing really consistent nor necessarily the right way to do things.
Jots
- Open Qsys. Follow this post.
- Went for Nios II classic, used Nios/e (no hardware multiplication, as the target device doesn’t have it). Set instruction cache to 2 kB, and no data cache
- Add 16 kB on-chip memory (Basic > On-Chip Memory > On-Chip Memory (RAM or ROM) ). Data width 32 bits, set hex file to raminit.hex (to be found at verilog/raminit.hex)
- Attach memory to processor’s Avalon master
- Attach peripherals
- Connect clk_0’s clock output to all clock inputs (including processor’s).
- Same with reset
- Assign base addresses automatically: System > Assign Base Addresses
- Enter the CPU configuration, and assign the Reset and Exception Vectors to the onchip memory (this issues an offset to the addresses, per the peripheral’s offsets).
- Build the Qsys project. Among all (Verilog) files, it generates a processor.sopcinfo file.
Software
Running against hardware
Note: Quartus’ programmer and the “Run” environment on Eclipse are mutually exclusive, competing for the USB bitblaster.
- Make sure you’ve quit Quartus’ programmer (actually not necessary. Just be sure that the blue LED on the USB Blaster is off).
- Also make sure to “terminate launch” on the Eclipse side before attempting to reprogram the FPGA (pressing the red stop-like button on the Nios Console is enough).
- Pick the “hello” project (that is, not the BSP) and go to top menu: Run > Run configurations…, pick Target Connection tab. Both a processor and a byte stream device should be enlisted (the latter is the jtaguart). Refresh to make sure it’s actually there.
- If it says “Connected system ID hash not found on target at the expected base address” at the top, select “Ignore mismatched system ID” and “Ignore mismatched system timestamp”. This happens when there’s no system ID peripheral in the Qsys design.
- The “Hello world from NIOS!!” should appear in the Nios II console
- The base addresses etc. are listed in system.h inside the BSP (hello_bsp in my case).
- This program printed out “Hello world” as well as blinked the LEDs:
#include <stdio.h>
#include <unistd.h>
#include <io.h>
#include <system.h>
#include <altera_avalon_pio_regs.h>
int main()
{
int i = 0; /* Must be initialized, as it is read in the loop below */
printf("Hello from Nios II!\n");
while (1) {
IOWR_ALTERA_AVALON_PIO_DATA(PIO_BASE, ((-i) & 7));
i++;
usleep(100000);
}
return 0;
}
- To generate a hex file, right-click the project (“hello”) and pick Make Targets > Build…, choose mem_init_generate and click the Build button. The juicy part in the process was
elf2hex hello.elf 0x00010000 0x00019fff --width=32 --little-endian-mem --create-lanes=0 mem_init/raminit.hex
- Alternatively, go (skip to the “make” statement if already in a NIOS shell)
/path/to/altera/15.1/nios2eds/nios2_command_shell.sh make mem_init_generate
- It’s noteworthy that the tools spotted my choice of the file name, even though it’s not located where Quartus expects it.
- Giving the hex file to Quartus resulted in a lot of lines saying
Warning (113015): Width of data items in "raminit.hex" is greater than the memory width. Wrapping data items to subsequent addresses. Found 1280 warnings, reporting 10
Warning (113009): Data at line (2) of memory initialization file "raminit.hex" is too wide to fit in one memory word. Wrapping data to subsequent addresses.
Warning (113009): Data at line (3) of memory initialization file "raminit.hex" is too wide to fit in one memory word. Wrapping data to subsequent addresses.
Warning (113009): Data at line (4) of memory initialization file "raminit.hex" is too wide to fit in one memory word. Wrapping data to subsequent addresses.
etc.
But this is most probably OK, as the processor worked immediately after FPGA configuration.
- Redirect printf() and other stdout to UART: By default, the standard output goes to the JTAG UART. To change this, right-click the BSP project, pick Nios II > BSP Editor. Pick the “Main” tab, navigate to “hal > common” (it usually starts there anyhow) and change the stdout target to the desired UART. And regenerate the BSP.
Introduction
This post relates to Altera (or should I say Intel FPGA?) Cyclone IV FPGAs loaded from an EPCQ flash in Active Serial x 1 (AS x 1) mode. Things written below are probably relevant to other Altera FPGAs as well, but keep in mind that Cyclone IV FPGAs have several peculiarities you won’t find on other Altera device families.
“Remote Update” is the feature in some Altera FPGAs, which allows application logic / software to safely update the bitstream from which the FPGA is loaded. The trick is to always have a Factory (“Golden”) bitstream image on the flash, and update the “Application” image only. When power goes up, the FPGA ends up with the Application bitstream if it’s OK, or the Factory bitstream if it’s absent or corrupt.
Since the bitstreams carry a CRC, it’s guaranteed that only valid bitstreams are used. It’s therefore safe to overwrite a previous Application bitstream image: If something goes wrong in the middle of writing, it won’t be deemed a valid bitstream, so the FPGA will end up with the Factory bitstream.
The basics
To implement a remote update feature on an FPGA design, there are two functional elements needed:
- The ability to write data into the configuration flash with user-designed logic / software. This is discussed in this post.
- The logic / software that makes sure the FPGA ends up with the right configuration (and, in particular, prevents an endless configuration loop as explained next)
Note that the Remote Update IP Core has nothing to do with flash programming: Its function is merely to allow the FPGA’s logic to issue a reconfiguration, and offer some information on how and why the current bitstream was loaded.
When an FPGA powers up, it always configures from a constant address of the flash, which is zero on EPCQ flashes. In other words, the FPGA always powers up from the Factory bitstream, no matter what. It’s the user application logic / software’s duty to force the configuration of the Application bitstream when adequate. This means that during normal operation, there are always two configurations of the FPGA at powerup, one for the Factory bitstream, and one for the Application. This doubles the configuration time, of course.
How it happens: The FPGA is powered up, and loads the Factory bitstream from a fixed address. Through the Remote Update IP Core, the logic / software in the FPGA sets the address of the Application image at the flash, from which the FPGA should configure itself. It can then trigger a reconfiguration of the FPGA.
The FPGA’s configuration state machine attempts to load a bitstream from the flash at the given address. If the bitstream’s magic words are in place and the CRC is OK, it starts running on the new bitstream. If not, it loads the Factory bitstream again as a fallback.
By virtue of a register of the Remote Update IP Core, the software / logic in the Factory bitstream detects that it was loaded due to a failure, and takes action (or no action) accordingly. It may try another address at the flash, or refrain from another reconfiguration altogether (i.e. stay with the “Golden Image”). The decision depends on the design requirements, but the crucial point here is to prevent an endless loop of configurations.
Some reading
This post is not a user guide or a substitute for these two must-read documents:
Spoiler
This Nios II code implements the loading of the Application bitstream. It’s written so it can be used on any bitstream, as it does nothing when run from an Application bitstream.
It’s also safe for use with JTAG configuration: It won’t issue a reconfiguration in that case (well, sort of: see note on peculiarity below).
void do_remote_update(void) {
alt_u32 app_bitstream_addr = 0x100000;
alt_u32 mode = IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0) & 3;
alt_u32 config_reason = IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0x64);
if ((mode == 0) && (config_reason == 0)) {
IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x30, 0); // Turn off watchdog
IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x40, app_bitstream_addr);
IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x74, 1); // Trigger reconfiguration
while (1); // Wait briefly until configuration takes place
}
}
do_remote_update() should be called first thing in the Nios II code entry. If the function returns, the FPGA is either running on the Application bitstream, or the Factory (“Golden”) bitstream with a good reason not to reconfigure (i.e. a previous failure to load the Application bitstream or after a JTAG bitstream load).
Please refer to the “Programming the flash with NIOS software” section in this post on how to generate the image of the Application bitstream.
The code above works with the following setting:
- The FPGA’s NCONFIG pin is tied high. This will not work if the NCONFIG pin is driven by some power supply watch logic or alike, because config_reason won’t be zero if NCONFIG triggered the configuration.
- REMOTE_UPDATE_0_BASE is the base address in NIOS’ address space of a Remote Update IP core, which has the “writing configuration parameters” option enabled.
- The application bitstream image is loaded at flash address 0x100000 (i.e. can be read with epcs_read_buffer() using this address)
- The Golden image is at address zero, of course.
If loading the application image fails once, no other attempts are made. This is the straightforward thing to do if there’s no additional image to try from. There’s no sensible reason to try the same image again, unless the PCB designer has done a really bad job.
Now the peculiarity note promised above: If the Factory bitstream didn’t have the Remote Update IP instantiated (or possibly if STRATIXIII_UPDATE_MODE, mentioned below, wasn’t set; this isn’t clear), the first JTAG bitstream loaded after it will not be detected as a JTAG load, so the bitstream loaded from JTAG might mistake itself for a first-attempt Factory bitstream and attempt to load the Application bitstream immediately. This kind-of makes sense, because the Current State register always reflects Factory mode after a JTAG bitstream load, and the difference is told from the Trigger Condition register, which reflects the reason for triggering the bitstream load. However, this is a Read Past Status 1 (read_source = 1) register, reflecting something stored before the current bitstream load. In the absence of the Remote Update feature in the previous bitstream, it seems like this register wasn’t updated at all before the JTAG load, so it reads all zeros after it, hence the misinterpretation by the current bitstream.
This scenario is however irrelevant in all but rather messed up settings (why wouldn’t the Factory bitstream support Remote Update?). Anyhow, see another note below on another issue with reflecting the status after a JTAG configuration.
How this function works, briefly:
- It verifies that the configuration mode is 0, that is Factory mode. If we’re in Application mode, the function returns.
- It verifies that the trigger for configuration was a powerup by checking config_reason, or it returns. This prevents an endless loop of configurations in the case of a fallback into the Factory bitstream in the event of a failed attempt to load the Application bitstream.
Note that if the configuration was triggered as a result of an assertion of the FPGA’s NCONFIG pin, or on a JTAG configuration, config_reason will read 0x10 (most of the time, see note below).
- The watchdog is disabled, so the Application bitstream doesn’t have to deal with it
- The Application bitstream’s address is set
- A configuration is forced by writing to the dedicated register
- An endless while (1) loop is entered to prevent execution from going on, not that it would get very far.
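The decision made in the first two steps can be isolated into a pure function, which makes it easy to check off-target. This is only a sketch: the function and macro names are made up for illustration, and the values are simply those quoted above (mode 0 for Factory, trigger register reading 0 after power-up).

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical names; the values are those quoted in the text above. */
#define RU_MODE_FACTORY     0u  /* bits [1:0] of the register at offset 0x00 */
#define RU_TRIGGER_POWERUP  0u  /* value read at offset 0x64 after power-up */

/* Returns true only when the Factory bitstream is running after a plain
 * power-up, which is the only case in which do_remote_update() goes on
 * to load the Application bitstream. */
bool should_load_application(uint32_t state_reg, uint32_t config_reason)
{
    uint32_t mode = state_reg & 3u;

    return (mode == RU_MODE_FACTORY) && (config_reason == RU_TRIGGER_POWERUP);
}
```

On the target, the two arguments would be the values read from offsets 0 and 0x64 with IORD_32DIRECT(), as in do_remote_update().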
General notes
Accessing registers
In short, the register map is a mess. Out of the long list given in Tables 20 and 21 in the Remote Update IP Core User Guide, only a handful have a meaning.
It’s important to realize that some registers are valid when the Remote Update IP core is in Factory mode, and others when it’s in Application mode. These two register sets are mutually exclusive (except for the CURRENT_STATE_MODE register). The test program shown further down this post demonstrates which registers are valid in each mode.
This is a list of things to keep in mind regarding these registers:
- Reading from a Factory mode register in Application mode (and vice versa) returns meaningless (and rather confusing) data.
- The way to make sense of the registers from the docs is to refer to tables 16 and 17 in the Remote Update IP Core User Guide to tell what you want to access in terms of which param and which read_source, and then find the address for them in table 21. Several registers in table 21 constitute combinations of param and read_source that aren’t listed in table 17, which probably renders them meaningless.
- … except for RU_RESET_TIMER and RU_RECONFIG, which are interpreted in logic to generate a reset signal / reconfiguration signal respectively, and are therefore not listed in table 17.
- To add more confusion, readbacks don’t work as one might expect. For example, the boot address for the next configuration is set at address offset 0x40, but reading back from the same address always yields the factory boot address. To get the boot address for the next configuration (i.e. the one written to 0x40), read it back at 0x4c.
- More confusion: The translation from the param numbers to the Nios access register isn’t some arithmetic operation, but rather some lookup logic of the avl_controller_cycloneiii_iv module in Qsys_remote_update_0_remote_update_controller.sv, which is generated automatically by Qsys.
- The registers listed in the BSP’s drivers/inc/altera_remote_update_regs.h are those of all Altera FPGAs except Cyclone IV. For example, the docs as well as the Qsys Verilog file () place RU_WATCHDOG_TIMEOUT at address 0x08 (actually, addresses 0x08-0x0b) but the BSP’s altera_remote_update_regs.h
- Note that compared with other FPGA families, Cyclone IV’s register interface is considerably more extensive, allowing the controller to query the status of two configuration cycles back in history. Seems like this feature was dropped on later FPGAs (due to lack of interest vs complication…?)
The RU_RECONFIG_TRIGGER_CONDITIONS register
This register is particularly interesting, as it tells us what caused the FPGA to load the bitstream that is currently running:
IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0x64); // Register 0x19 in the guide
And to obtain the reason for the configuration before that:
IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0x68); // Register 0x1a in the guide
These read the remote config core’s param 3'b111 with read source 2'b01 and 2'b10 respectively. Note that the translation from the param number of 3'b111 to the Nios access register isn’t just a multiplication, but rather some lookup logic as mentioned (but not detailed) above.
Running some tests of my own (with the test program below), I got the following values. There’s nothing surprising about these results; they are exactly as documented.
- On cold configuration: 0
- When not disabling the watchdog (not handling it after configuration): 2 (bit 1 set, User watchdog timer timeout)
- After failed application configuration due to lack of image: 4 (bit 2 set, nSTATUS asserted by an external device as the result of an error)
- After failed application configuration due to damaged image: 8 (bit 3 set, CRC error during application configuration)
- On configuration from JTAG: 0x10 (bit 4 set, External configuration reset (nCONFIG) assertion)
Ah, but there’s a thing with configuration from JTAG: Some other tests I’ve run showed that after an attempt to load an application image with a CRC error (bit 3 set), this bit remains set even after JTAG configurations (note the plural: even after several consecutive JTAG configurations). So instead of reading 0x10, this register reads 0x08, no matter how many times the bitstream was correctly loaded into the FPGA after that.
So the bottom line is that this register doesn’t work well with JTAG configurations (see peculiarity note above).
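For reference, the values listed above can be turned into a small decoder. This is a sketch based only on the observations in this post: the function name is made up, bit 0 wasn't observed here and is therefore reported as unknown, and when several bits are set, only the lowest listed one is reported.

```c
#include <stdint.h>

/* Translates a value read from offset 0x64 (or 0x68) into a description,
 * following the values observed in the tests above. */
const char *ru_trigger_reason(uint32_t value)
{
    if (value == 0)
        return "cold configuration (power-up)";
    if (value & 0x02u)
        return "user watchdog timer timeout";
    if (value & 0x04u)
        return "nSTATUS asserted by an external device";
    if (value & 0x08u)
        return "CRC error during application configuration";
    if (value & 0x10u)
        return "nCONFIG assertion";
    return "unknown"; /* e.g. bit 0, which isn't covered by these tests */
}
```

Keep the JTAG peculiarity in mind: a value of 0x08 doesn't necessarily say that the most recent configuration failed.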
Without NIOS/Avalon interface
It’s also possible to instantiate a Remote Update IP without a NIOS processor. These are my few observations as I interfaced such an IP without an Avalon interface:
- The clock frequency should be below 20 MHz (or 10 MHz on some other device families). Refer to the Altera Remote Update IP Core User Guide on this.
- All of the Remote Update IP’s signals toggle on the rising edge of the clock.
- Asserting read_param makes busy rise on the rising edge of the clock for which it is asserted, and busy stays high for about 62 clocks. In other words, (read_param && busy) is constantly zero (because of the one-clock nature of read_param) but (read_param || busy) will be logic '1' from the first assertion of read_param and until the read cycle is done. See scope shot below.
- write_param follows the same relation with busy, but the busy pulse is shorter: Around 47 clock cycles.
- The read_source bits are always bits 3:2 of the address as used by NIOS software to access the respective register.
- The param address is usually bits 6:4 of the address as used by NIOS software to access the respective register. The exception is when these bits are 3'b101, 3'b110 or 3'b111: The first two are replaced with 3'b110 and 3'b111 respectively. The last, 3'b111, addresses the FPGA logic directly (asserts a reconfiguration, among others).
- Nevertheless, refer to Table 17 of ug_altremote.pdf for the outline of param and read_source.
- The data_out is updated on the same clock cycle that busy goes low. In other words, for any clock cycle, if busy is low, data_out is valid. See scope shot below.
- data_out remains valid until the following read cycle. It seems like data_out goes to zero a few clocks after busy goes high, but the actual value is of course valid only when busy goes low again.
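The relation between a Nios byte address and the read_source / param pair, as described in the list above, can be sketched as follows. This is my reading of the observations in this post and not the generated lookup logic itself; the struct and function names are made up for illustration.

```c
#include <stdint.h>

struct ru_access {
    unsigned read_source; /* bits 3:2 of the byte address */
    unsigned addr_field;  /* bits 6:4 of the byte address, untranslated */
    int param;            /* translated param, -1 = direct access to logic */
};

/* Splits a Nios byte address into read_source and param, applying the
 * exceptions described above: 3'b101 -> 3'b110, 3'b110 -> 3'b111, and
 * 3'b111 addresses the FPGA logic directly. */
struct ru_access ru_decode_address(uint32_t byte_addr)
{
    struct ru_access a;

    a.read_source = (byte_addr >> 2) & 0x3u;
    a.addr_field = (byte_addr >> 4) & 0x7u;

    if (a.addr_field <= 4u)
        a.param = (int)a.addr_field;
    else if (a.addr_field == 7u)
        a.param = -1;
    else
        a.param = (int)a.addr_field + 1;

    return a;
}
```

For example, address 0x64 decodes into read_source 1 and param 7, which matches the trigger conditions register discussed earlier, and 0x74 (the reconfiguration trigger) decodes into a direct access to logic.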
And now a couple of oscilloscope screenshots, made after wiring some of the signals to external pins. In these samples, the signals shown are, from bottom to top: clock, busy, read_param, data_out[0].
First, this is the relation between read_param and busy:
And next, we have the deassertion of busy vs. the update of data_out[0]:
A test program
On my way to understanding how the whole thing works, I wrote a small test program that ran on the Nios II processor, which dumps all registers that are relevant for each mode. As a bonus, it can be used as a register reference, as it lists all registers available for reading in Factory vs. Application mode in the respective structures.
#include <system.h>
#include <alt_types.h>
#include <io.h>
#include "sys/alt_stdio.h"
#include <unistd.h>
int main()
{
  int mode;
  struct regitem {
    int read_source;
    int param;
    const char *desc;
  };
  const struct regitem factoryparams[] = {
    { 0, 0x00, "Current Machine State Mode" },
    { 0, 0x10, "Factory Boot Address" },
    { 1, 0x10, "Previous Boot Address" },
    { 1, 0x18, "Previous reconfiguration trigger source" },
    { 2, 0x10, "One before previous Boot Address" },
    { 2, 0x18, "One before previous reconfiguration trigger source" },
    { 3, 0x04, "Early confdone check bits" },
    { 3, 0x08, "Watchdog timeout value" },
    { 3, 0x0c, "Watchdog enable bit" },
    { 3, 0x10, "Boot address" },
    { 3, 0x14, "Force internal oscillator" },
    {}
  };
  const struct regitem applicationparams[] = {
    { 0, 0x00, "Current Machine State Mode" },
    { 1, 0x08, "Watchdog timeout value" },
    { 1, 0x0c, "Watchdog enable bit" },
    { 2, 0x10, "Boot address" },
    {}
  };
  const struct regitem unknownparams[] = { {} };
  const struct {
    const struct regitem *list;
    char *desc;
  } modetab[4] = {
    { factoryparams, "Factory mode" },
    { applicationparams, "Application mode" },
    { unknownparams, "Unknown mode" },
    { applicationparams, "Application mode with watchdog enabled" },
  };
  const struct regitem *item;
  alt_putstr("\r\n---------------- BASE IMAGE ---------------------\r\n\r\n");
  mode = IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0) & 3;
  alt_printf("Remote update register dump\r\nMode: %s\r\n",
             modetab[mode].desc);
  alt_putstr("\r\nParameters:\r\n");
  for (item = modetab[mode].list; item->desc; item++) {
    int addr = (item->param + item->read_source) * 4;
    alt_printf("%s (0x%x) = 0x%x\r\n", item->desc, addr,
               IORD_32DIRECT(REMOTE_UPDATE_0_BASE, addr));
  }
  if (mode == 0) { // Factory mode only
    usleep(500000);
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x30, 0); // Turn off watchdog
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x40, 0x100000);
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x74, 1);
  }
  /* Event loop never exits. */
  while (1);
  return 0;
}
Test results
The test program above was compiled and included in the bitstream that was loaded into flash address 0 (Factory image).
The first alt_putstr was then changed to say “Application Image”, and the compiled version of that was included in the bitstream loaded at address 0x100000 of the flash (Application Image).
Standard output was directed to a physical UART (instead of the JTAG UART) for the purpose of this test (Eclipse’s JTAG UART console didn’t like these games with configurations).
And then I powered on:
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0xc
Previous reconfiguration trigger source (0x64) = 0x0
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
---------------- APPLICATION IMAGE ---------------------
Remote update register dump
Mode: Application mode
Parameters:
Current Machine State Mode (0x0) = 0x1
Watchdog timeout value (0x24) = 0x1ffe0008
Watchdog enable bit (0x34) = 0x0
Boot address (0x48) = 0x400000
Note that if the register writes in the example are done before showing the registers, the following two lines would replace their respective outputs in the Base Image parameter list:
Watchdog enable bit (0x3c) = 0x0
Boot address (0x4c) = 0x100000
The same, with the application image wiped out (zeros):
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0xc
Previous reconfiguration trigger source (0x64) = 0x0
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x4
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x4
One before previous Boot Address (0x48) = 0x400000
One before previous reconfiguration trigger source (0x68) = 0x4
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
[ ... etc ... ]
The same, with the Application image loaded in place, but with a small error (changed a single bit):
(this caused a CRC error)
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0xc
Previous reconfiguration trigger source (0x64) = 0x0
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x8
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x8
One before previous Boot Address (0x48) = 0x400000
One before previous reconfiguration trigger source (0x68) = 0x8
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
[ ... etc ... ]
Loading with JTAG: I set up both flash images properly, powered up so the FPGA stayed on the Application Image. At that point, I loaded the SOF of the Factory bitstream into the FPGA through JTAG (with a USB Blaster). The JTAG operation yielded this:
---------------- BASE IMAGE ---------------------
Remote update register dump
Mode: Factory mode
Parameters:
Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x10
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1
---------------- APPLICATION IMAGE ---------------------
Remote update register dump
Mode: Application mode
Parameters:
Current Machine State Mode (0x0) = 0x1
Watchdog timeout value (0x24) = 0x1ffe0008
Watchdog enable bit (0x34) = 0x0
Boot address (0x48) = 0x400000
When loading the same bitstream through JTAG once again, the same result is obtained, only with “One before previous reconfiguration trigger source” set to 0x10 as well.
The classic way:
$ export QUARTUS_ROOTDIR=/path/to/altera/15.1/quartus
$ . $QUARTUS_ROOTDIR/adm/qenv.sh
Or open a shell (will set path, but not a full environment):
$ /path/to/altera/15.1/nios2eds/nios2_command_shell.sh
This is good for compiling for NIOS etc.
The disk is hammering
For some unknown reason, possibly after a VMPlayer upgrade, running any Windows virtual machine on my Linux machine with VMware Player caused non-stop heavy hard disk activity, even when the guest machine was effectively idle and had no I/O activity of its own.
Besides being surprisingly annoying, it also made the mouse pointer non-responsive, and it had an adverse effect on the hosting machine as well.
So eventually I managed to get things normal by editing the virtual machine’s .vmx file as described below.
I have VMPlayer 6.0.2 on Fedora 12 (I suppose both are considered quite old).
Following this post, add
isolation.tools.unity.disable = "TRUE"
unity.allowCompositingInGuest = "FALSE"
unity.enableLaunchMenu = "FALSE"
unity.showBadges = "FALSE"
unity.showBorders = "FALSE"
unity.wasCapable = "FALSE"
(unity.wasCapable was already in the file, so remove it first)
That appeared to help somewhat. But what really gave the punch was also adding
MemTrimRate = "0"
sched.mem.pshare.enable = "FALSE"
MemAllowAutoScaleDown = "FALSE"
Don’t ask me what it means. Your guess is as good as mine.
The Linux desktop freezes
Freezes = Cinnamon’s clock stops advancing for a minute or so. Apparently, it’s the graphics that doesn’t update for about 1.5 seconds each time the mouse pointer goes on or off the area belonging to the guest’s display. But it accumulates, so moving the mouse all over the place trying to figure out what’s going on easily turns this freeze into a whole minute.
Just turn off the display’s hardware acceleration. That is, enter the virtual machine settings through the GUI menus, pick the display, and uncheck “Accelerate 3D graphics”. Bliss.
Nope, it didn’t help. :(
November 2023 update: Could this be related to keyboard mapping? I had a similar issue when playing with xmodmap.
Also tried to turn off the usage of OpenGL with
mks.noGL = "FALSE"
and indeed there was nothing OpenGL related in the log file (vmware.log), but the problem remained.
This command was taken from a list of undocumented parameters (there’s also this one).
Upgrading to VMPlayer 15.5.6 didn’t help. Neither did adding vmmouse.present = "FALSE".
But after the upgrade, my Windows XP got horribly slow, and it seems like it had problems accessing the disk as well (upgrading is always a good idea, as we all know). Programs didn’t seem to launch properly and such. I may have worked around that by setting the VM’s type to “Other” (i.e. not something Windows related). That turns VMTools off, and maybe that’s actually a good idea.
The solution I eventually adopted was to use VMPlayer as a VNC server. So I ignore the emulated display window that is opened directly by VMPlayer, and connect to it with a VNC viewer on the local machine instead. Rather odd, but works. The only annoying thing is that Alt-Tab and Alt-Shift keystrokes etc. aren’t captured by the guest. To set this up, go to the virtual machine settings > Options > VNC Connections and set to enabled. If the port number is set to 5901 (i.e. 5900 with an offset of 1), the connection is done with
$ vncviewer :1 &
(or pick your other favorite viewer).
The computer is a slug
On a newer machine, with 64 GiB RAM and a more recent version of VMPlayer, it took a few seconds to go back and forth from the VMPlayer window to anything else. The fix, as root, is:
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
taken from here. There are still some slight freezes when working on a window that overlaps the VMPlayer window (and other kinds of backs and forths with VMPlayer), but it’s significantly better this way.
A few notes on where to find USB related kernel files on a Linux system (kernel 3.12.20 in my case)
$ lsusb
[ ... ]
Bus 001 Device 059: ID 046d:c52b Logitech, Inc.
Now find the position in the tree. It should be device 59 under bus number 1:
$ lsusb -t
[ ... ]
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/6p, 480M
|__ Port 4: Dev 4, If 0, Class=hub, Driver=hub/4p, 480M
|__ Port 1: Dev 59, If 0, Class=HID, Driver=usbhid, 12M
|__ Port 1: Dev 59, If 1, Class=HID, Driver=usbhid, 12M
|__ Port 1: Dev 59, If 2, Class=HID, Driver=usbhid, 12M
|__ Port 3: Dev 98, If 0, Class=vend., Driver=pl2303, 12M
|__ Port 6: Dev 94, If 0, Class=vend., Driver=rt2800usb, 480M
So it’s bus 1, hub on port 4 and then port 1. Verify by checking the IDs (the paths can be much shorter, see below):
$ cat /sys/bus/usb/devices/usb1/1-4/1-4.1/idVendor
046d
$ cat /sys/bus/usb/devices/usb1/1-4/1-4.1/idProduct
c52b
or look at the individual interfaces:
$ cat /sys/bus/usb/devices/usb1/1-4/1-4.1/1-4.1\:1.2/bInterfaceClass
03
or get everything in one go, with the “uevent” file:
$ cat /sys/bus/usb/devices/usb1/1-4/1-4.1/uevent
MAJOR=189
MINOR=56
DEVNAME=bus/usb/001/059
DEVTYPE=usb_device
DRIVER=usb
PRODUCT=46d/c52b/1209
TYPE=0/0/0
BUSNUM=001
DEVNUM=059
Even though “uevent” was originally intended for generating a udev event by writing to it, reading from it provides the variables supplied to the udev mechanism. The DRIVER entry, if present, contains the driver currently assigned to the device (or interface), and is absent if no such driver is assigned (e.g. after an rmmod of the relevant module). It will usually not contain anything interesting except for when looking at directories of interfaces, because all other parts of the hierarchy are USB infrastructure, driven by drivers for such.
The device file accessed for raw userspace I/O with a USB device (with e.g. libusb) is in /dev/bus/usb/ followed by the bus number and address. For example, the Logitech device mentioned above is at bus 1, address 59 (and note DEVNAME from the uevent file), hence
$ ls -l /dev/bus/usb/001/059
crw-rw-r-- 1 root root 189, 58 2017-05-17 09:57 /dev/bus/usb/001/059
Note the permissions and major/minor numbers. The major is 189 (usb_devices on my system, according to /proc/devices). The minor is ((bus_number - 1) * 128) + address - 1.
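As a quick sanity check of the formula and the path convention, here is a minimal sketch in C. The function names are made up for illustration; the formula and the /dev/bus/usb/BBB/DDD layout are those shown above.

```c
#include <stdio.h>

/* Minor number of the device file, per the formula above. */
int usb_devnode_minor(int bus_number, int address)
{
    return ((bus_number - 1) * 128) + address - 1;
}

/* Writes the device file's path, e.g. /dev/bus/usb/001/059, into out.
 * Returns 0 on success, -1 if the buffer is too small. */
int usb_devnode_path(char *out, size_t outlen, int bus_number, int address)
{
    int n = snprintf(out, outlen, "/dev/bus/usb/%03d/%03d",
                     bus_number, address);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For bus 1 and address 59, this gives minor 58, in agreement with the ls -l output above.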
The permissions and ownership are those in effect for who’s allowed to access this device. This is the place to check if udev rules that allow wider access to a device have done their job.
/sys/bus/usb/devices/
This was mentioned briefly above, and now let’s do the deep dive. The sysfs structure for USB devices is rather tangled, because each device has many references: Through the host controller it’s connected to (typically a PCI/PCIe device on a PC), as the device itself, and as the interfaces it provides.
It helps to note that those numeric-dash-dot-colon directory names actually contain all information about the position in the USB bus hierarchy, and all of these are present directly in /sys/bus/usb/devices, as symbolic links.
Also in /sys/bus/usb/devices, there are usbN directories, each representing a USB root hub, with N being the USB bus number. One can travel down the USB bus hierarchy starting from the usbN directories, and find the directories that those symbolic links point at, in a directory hierarchy that represents the bus hierarchy.
So let’s look, for example, at a directory name 1-5.1.4:1.3.
- The “1-” part means bus number one.
- The “5.1.4” part describes the path through hubs until the device is reached: port number 5 of the root hub, port 1 of the hub connected to it, and port 4 of the hub connected to that one. Without any physical hubs, this part is just one digit, so one gets those short “1-4” names.
Note that the chain of ports is delimited by dots. It seems like there used to be dashes a long time ago, so it would read “5-1-4” instead. But that’s probably ancient history.
- Then we have the “:1.3” part, which means interface number 3 on the device running in configuration number 1.
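The naming scheme described above is simple enough to parse mechanically. The following is a sketch only: the struct, the function name and the limit on the number of hub levels are my own choices for illustration, and the parsing is deliberately lenient.

```c
#include <stdio.h>
#include <stdlib.h>

#define USB_MAX_DEPTH 8 /* arbitrary limit for this sketch */

struct usb_sysfs_name {
    int bus;
    int ports[USB_MAX_DEPTH]; /* chain of ports, root hub first */
    int depth;                /* number of valid entries in ports[] */
    int config;               /* -1 if there is no ":config.interface" part */
    int interface;            /* -1 likewise */
};

/* Parses names like "1-4" or "1-5.1.4:1.3". Returns 0 on success. */
int usb_parse_sysfs_name(const char *name, struct usb_sysfs_name *out)
{
    char *end;
    long v = strtol(name, &end, 10);

    if (end == name || *end != '-')
        return -1;

    out->bus = (int)v;
    out->depth = 0;
    out->config = -1;
    out->interface = -1;

    do {
        const char *p = end + 1; /* skip the '-' or '.' delimiter */

        v = strtol(p, &end, 10);
        if (end == p || out->depth >= USB_MAX_DEPTH)
            return -1;
        out->ports[out->depth++] = (int)v;
    } while (*end == '.');

    if (*end == '\0')
        return 0; /* a device directory, no interface part */
    if (*end != ':')
        return -1;
    if (sscanf(end + 1, "%d.%d", &out->config, &out->interface) != 2)
        return -1;
    return 0;
}
```

Note that root hub names such as usb1 don't follow this pattern, so the function rejects them.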
This specific directory can be found in /sys/bus/usb/devices/usb1/1-5/1-5.1/1-5.1.4/1-5.1.4:1.3/, where it appears to be a plain directory, or as /sys/bus/usb/devices/1-5.1.4:1.3, where it appears to be a symbolic link to ../../../devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.1/1-5.1.4/1-5.1.4:1.3/. But the symbolic link actually points at the former, because /sys/devices/pci0000:00/0000:00:14.0/usb1/ and /sys/bus/usb/devices/usb1/ are exactly the same. The bonus with having the symbolic link pointing at the PCI device is that we can tell which PCI/PCIe device it belongs to.
Part of the reason for this mess is that the sysfs directory tree is a representation of the references between device structures inside the Linux kernel. Since these structures point at each other in every possible direction, so does the directory structure, sometimes with symbolic links, and sometimes with identical entries.
Messy or not, this directory structure allows traveling down the USB bus tree quite easily. For example, starting from /sys/bus/usb/devices/usb1/, one can travel down all the way to 1-5.1.4:1.3, each directory providing product and vendor IDs in both numerical and string format. Except for the final leaf (with the name including a colon-suffix, e.g. :1.3) which represents an interface, so it carries different information (about endpoints, for example).
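Since each level of the hierarchy is named by a prefix of the final name, the nested path can be derived from the name alone. Here is a sketch of that, with a made-up function name, producing the path relative to /sys/bus/usb/devices/. It assumes a well-formed name and does little validation.

```c
#include <stdio.h>
#include <string.h>

/* Given e.g. "1-5.1.4:1.3", writes "usb1/1-5/1-5.1/1-5.1.4/1-5.1.4:1.3"
 * into out. Returns 0 on success, -1 on a malformed name or if the
 * buffer is too small. */
int usb_sysfs_relpath(const char *name, char *out, size_t outlen)
{
    const char *dash = strchr(name, '-');
    size_t used;
    size_t i;
    int n;

    if (!dash || dash == name)
        return -1;

    /* The root hub: "usb" followed by the bus number */
    n = snprintf(out, outlen, "usb%.*s", (int)(dash - name), name);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    used = (size_t)n;

    for (i = (size_t)(dash - name) + 1; ; i++) {
        char c = name[i];

        if (c != '.' && c != ':' && c != '\0')
            continue;

        /* One directory level: the name's first i characters */
        n = snprintf(out + used, outlen - used, "/%.*s", (int)i, name);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;

        if (c == '\0')
            return 0;

        if (c == ':') { /* The interface leaf carries the full name */
            n = snprintf(out + used, outlen - used, "/%s", name);
            if (n < 0 || (size_t)n >= outlen - used)
                return -1;
            return 0;
        }
    }
}
```

Each intermediate directory in the resulting path is a device directory, and only the last one (if the name has a colon part) is an interface.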
The numbers in the directories in sysfs relate to the physical topology, and should not be confused with the bus address that is assigned to each device. The only thing they have in common is the bus number, and I’m not sure that can be trusted either. But in reality, that initial “1” and the “usb1” part in the path actually represent the bus number of all devices in that hierarchy. Recall that all devices that are connected to a USB root port have the same bus number, even if there are hubs in between (unlike PCI/PCIe and switches).
Ah, and once again: “usb1” means USB bus 1. If you were tempted to interpret this as a USB protocol level, well, no.
To obtain the enumerated addresses (those that are used to talk with the device, and appear with a plain lsusb), read the “uevent” file, which even supplies the path in /dev. Or read the “busnum” and “devnum” files in each directory. Now one can ask if “busnum” is redundant, since it’s supposed to be known from the directory path itself. But one could likewise ask what “devpath” is doing there, as it consists of the part that comes after the dash in the directory name. Go figure.
/sys/bus/usb/drivers/
But hey, it’s not over yet. The USB devices are also divided according to their drivers. This happens in /sys/bus/usb/drivers, which is a nice place to start if you’re looking for a device doing a specific task, hence easily found by its driver.
Back to the example above, /sys/bus/usb/drivers/usb-storage has a symbolic link named 1-5.1.4:1.3, pointing at ../../../../devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.1/1-5.1.4/1-5.1.4:1.3/, which is exactly the same leaf directory as before. Not surprisingly, the driver is attached to an interface, and not to a device. So this provides the entire path to the functional part, going from the PCI entry to the USB bus, and all the way down. If we want the bus address, fetch it from the leaf directory’s parent directory, 1-5.1.4 in this case.
The dedicated usb-storage directory also has the “bind” and “unbind” files, which allow detaching the driver from the USB device and re-attaching it. This may be equivalent to unplugging the device and plugging it back, but not necessarily: It will result in a certain level of re-initialization, but not as full as detaching the device completely (or unbinding the USB controller from the PCI bus).
Note that a device can have multiple interfaces, which are possibly handled by different drivers. For example, a camera can function as webcam for showing a live picture, but also as a mass storage device for exposing the SD card. It’s still one USB device, with one Vendor / Product ID, and with a single pool of endpoints. So a device may appear under several directories of /sys/bus/usb/drivers/.
Try lsusb -t and lsusb -vv. And now also appreciate what this utility does…
Introduction
This post outlines some technical details on accessing an Altera EPCQ flash from a Nios II processor for read, write and erase. A non-OS (“bare metal”) setting is assumed.
And as a bonus (at the bottom of this post), how to program the flash based upon a SOF file, both with JTAG and by writing directly.
Remote Update is discussed in this post.
Hardware setup
In the Qsys project, there should be an instance of the Legacy EPCS/EPCQx1 Flash Controller, configured with the default parameters (not that there is much to configure). The peripheral’s epcs_control_port should be connected to the Nios II’s data master Avalon port (no point connecting it to the instruction master too).
In this example, we’ll assume that the name of the Flash Controller in Qsys is epcs_flash_controller_0.
The interrupt signal isn’t used in the software setting given below, but as the connection to the Nios processor, as well as the interrupt number assignment is automatic, let it be.
Clock and reset — like the other peripherals.
The external conduit is connected as follows to an EPCQ flash, for x1 access:
- Flash pin DATA0 to epcs_flash_controller_0_sdo (FPGA pin ASDO)
- Flash pin DCLK to epcs_flash_controller_0_dclk (FPGA pin DCLK)
- Flash pin nCS to epcs_flash_controller_0_sce (FPGA pin NCSO)
- Flash pin DATA1 to epcs_flash_controller_0_data (FPGA pin DATA0)
The FPGA pins above relate to the dual use of the configuration pins, which allows the FPGA to configure in Active Serial (AS) x 1 mode. Once the configuration is done, these pins become general-purpose I/O (when so required by assignments), which allows regular access to the flash device.
Note that the flash pin DATA1 is connected to the FPGA pin DATA0 — this is not a mistake, but the correct wiring for AS x 1 interface.
It’s of course possible to connect the flash to regular I/O pins, but then the FPGA won’t be able to configure from the flash.
Software
Altera’s BSP includes drivers for flash operations with multiple layers of abstraction. This abstraction is not always necessary, and makes it somewhat difficult to figure out what’s going on (in particular when things go wrong). Specifically, the higher-level drivers erase flash sectors automatically before writing, which can lead to some counterintuitive behavior, for example if multiple write requests are made on the same sector.
I therefore prefer working with the lowest-level drivers, which merely translate the flash commands into SPI communication. It leaves the user with the responsibility to erase sectors before writing to them.
The rule is simple: The flash is divided into sectors of 64 kiB each. An erase operation is performed on an entire 64 kiB sector, leaving all its bits at '1' (all bytes are 0xff).
Writing can then be done to arbitrary addresses, but effectively the data in the flash is the written data ANDed with the previous content of the memory cells. This amounts to a plain write if the region has been previously erased. It’s commonly believed that it’s unhealthy for the flash to write to a byte cell twice without an erase in the middle.
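The erase and write rules above can be put as plain arithmetic. This is a minimal sketch that doesn't touch any hardware: sector_base(), cell_after_write() and example_usage() are names I made up for illustration, and they are not part of Altera's drivers.

```c
#include <assert.h>
#include <stdint.h>

#define FLASH_SECTOR_SIZE 0x10000u /* 64 kiB, as stated above */

/* Start address of the sector that gets wiped by an erase at 'addr'. */
static uint32_t sector_base(uint32_t addr)
{
    return addr & ~(FLASH_SECTOR_SIZE - 1u);
}

/* Content of a byte cell after a write: bits can only go from 1 to 0. */
static uint8_t cell_after_write(uint8_t previous, uint8_t written)
{
    return (uint8_t)(previous & written);
}

static void example_usage(void)
{
    /* Any address inside a sector maps to the same sector start. */
    assert(sector_base(0x10abcdu) == 0x100000u);

    /* Writing to an erased cell (0xff) is a plain write... */
    assert(cell_after_write(0xffu, 0x78u) == 0x78u);

    /* ...but a second write without an erase only clears more bits. */
    assert(cell_after_write(0x78u, 0x0fu) == 0x08u);
}
```

The last assertion is the counterintuitive case: Writing 0x0f over 0x78 without an erase in between leaves 0x08 in the cell, not 0x0f.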
This is a simple program that runs on the Nios II processor, which demonstrates read, write and erase.
#include <system.h>
#include <alt_types.h>
#include <io.h>
#include "sys/alt_stdio.h"
#include "epcs_commands.h"
static void hexprint(alt_u8 *buf, int num) {
  int i;
  const char hexes[] = "0123456789abcdef";
  for (i = 0; i < num; i++) {
    alt_putchar(hexes[(buf[i] >> 4) & 0xf]);
    alt_putchar(hexes[buf[i] & 0xf]);
    if ((i & 0xf) == 0xf)
      alt_putchar(10); // "\n"
    else
      alt_putchar(32); // " "
  }
  alt_putchar(10); // "\n"
}
int main()
{
  alt_u32 register_base = EPCS_FLASH_CONTROLLER_0_BASE + EPCS_FLASH_CONTROLLER_0_REGISTER_OFFSET;
  alt_u32 silicon_id;
  alt_u8 buf[256];
  alt_u32 junk = 0x12345678;
  const alt_u32 flash_address = 0x100000;
  silicon_id = epcs_read_device_id(register_base);
  alt_printf("ID = %x\n", silicon_id);
  // epcs_read_buffer always returns the length of the buffer, so no
  // point checking its return value.
  alt_printf("Before doing anything:\n");
  epcs_read_buffer(register_base, flash_address, buf, sizeof(buf), 0);
  hexprint(buf, 16);
  // epcs_sector_erase erases the 64 kiB sector that contains the address
  // given as its second argument, and waits for the erasure to complete
  // by polling the status register and waiting for the WIP (write in progress)
  // bit to clear.
  epcs_sector_erase(register_base, flash_address, 0);
  alt_printf("After erasing\n");
  epcs_read_buffer(register_base, flash_address, buf, sizeof(buf), 0);
  hexprint(buf, 16);
  // epcs_write_buffer must be used on a region previously erased. The
  // command waits for the operation to complete by polling the status
  // register and waiting for the WIP (write in progress) bit to clear.
  epcs_write_buffer(register_base, flash_address, (void *) &junk, sizeof(junk), 0);
  alt_printf("After writing\n");
  epcs_read_buffer(register_base, flash_address, buf, sizeof(buf), 0);
  hexprint(buf, 16);
  /* Event loop never exits. */
  while (1);
  return 0;
}
The program reads 256 bytes each time, even though only 16 bytes are displayed. Any byte count is allowed in read and write. Needless to say, flash_address can be changed to any address in the device’s range. The choice of 0x100000 kept it off the configuration bitstream for the relevant FPGA.
This is the output of the program above running against an EPCQ16:
ID = 20ba15
Before doing anything:
78 56 34 12 ff ff ff ff ff ff ff ff ff ff ff ff
After erasing
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
After writing
78 56 34 12 ff ff ff ff ff ff ff ff ff ff ff ff
The data in the “Before doing anything” part can be anything that was left in the flash when the program ran. In the case above, it’s the results of the previous run of the same program.
As a side note, all EPCQ flashes also support erasing subsectors, each of 4 kiB size (hence 16 subsectors per sector). Altera’s low-level drivers don’t support subsector erase, but it’s quite easy to expand the code to do so.
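To get a feeling for what the finer granularity buys, here is a small sketch that counts how many erase units overlap a given byte range, for either unit size. It's arithmetic only: erase_units_covering() and example_usage() are hypothetical helpers of mine, and nothing here issues a subsector erase command to the flash.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE    0x10000u /* 64 kiB */
#define SUBSECTOR_SIZE 0x1000u  /* 4 kiB */

/* Number of erase units of size 'unit' (must be a power of two) that
 * overlap the byte range starting at 'addr' and spanning 'len' bytes. */
static uint32_t erase_units_covering(uint32_t addr, uint32_t len,
                                     uint32_t unit)
{
    uint32_t first, last;

    assert(unit != 0 && (unit & (unit - 1u)) == 0);
    if (len == 0)
        return 0;
    first = addr / unit;
    last = (addr + len - 1u) / unit;
    return last - first + 1u;
}

static void example_usage(void)
{
    /* The four bytes written by the program above sit in one unit. */
    assert(erase_units_covering(0x100000u, 4, SECTOR_SIZE) == 1);
    assert(erase_units_covering(0x100000u, 4, SUBSECTOR_SIZE) == 1);

    /* Four bytes that straddle a sector boundary need two erases. */
    assert(erase_units_covering(0x10fffeu, 4, SECTOR_SIZE) == 2);

    /* A full sector consists of 16 subsectors. */
    assert(erase_units_covering(0x100000u, SECTOR_SIZE, SUBSECTOR_SIZE) == 16);
}
```

With sector erase, updating those four bytes costs the other 64 kiB minus four bytes in the sector. With subsector erase, only the rest of a 4 kiB region is lost.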
Programming the flash with a SOF file
As promised, here’s the outline of how to program the EPCQ flash with a bitstream configuration file. Not as fancy as the topic above, but nevertheless useful. The flash needs to be connected as follows:
- Flash pin DATA0 to FPGA pin ASDO
- Flash pin DCLK to FPGA pin DCLK
- Flash pin nCS to FPGA pin NCSO
- Flash pin DATA1 to FPGA pin DATA0 (once again, this is not a mistake. DATA1 to DATA0 indeed)
First things first: Generate a JIC file. Command-line style, e.g.:
quartus_cpf -c -d EPCQ16 -s EP4CE15 projname.sof projname.jic
In the example above, the EPCQ16 argument is the flash device, and the EP4CE15 is the FPGA that will be used to program the flash, which is most likely the same FPGA the SOF targets.
Or do it with GUI:
- In Quartus, pick File > Convert Programming File…
- Choose jic output file format, and set the output file name.
- Set the configuration device to e.g. EPCQ16, Active Serial (not x4).
- Pick the SOF Data row, Page_0, click Add File… and pick SOF file.
- Pick the Flash Loader and click Add Device…, and choose e.g. Cyclone IV E, and then the same device as listed for the SOF file.
- If you want to write to the flash with your own utility, check “Create Config data RPD”
- Click Generate. A window saying the JIC file has been generated successfully should appear.
- Click Close to close this tool.
Programming the flash with JTAG:
- Open the regular JTAG programmer in Quartus (not the one in Eclipse). The one used to configure the FPGA via JTAG with a bitstream, that is.
- Click Add File… and select the JIC file created above.
- The FPGA with its flash attached should appear in the diagram part of the window.
- Select the Program/Configure checkbox on the flash’s (e.g. EPCQ16) row
- Click Start.
- This should take some 10 seconds or so (for EP4CE15’s bitfile), and end successfully.
- The flash is now programmed.
Note that there’s an “Erase” checkbox on the flash’s row. There is no need to enable it along with Program/Configure: The Programmer gets the hint, and erases the flash before programming it.
Programming the flash with NIOS software (or similar)
Note that I have another post focusing on remote update.
To program the flash with your own utility, make sure that you’ve checked “Create Config data RPD” when generating the JIC. Then, using the flash API mentioned above, copy the RPD file into the flash from address 0 to make it load when the FPGA powers up, or to a higher address for using the bitstream with a Remote Update core (allowing configuration from higher addresses).
And note the following, which relates to my experience with using the EPCQ16 flash for AS configuring a Cyclone IV E FPGA, and running Quartus Prime Version 15.1.0 Build 185 (YMMV):
- Bit reversal is mandatory if epcs_write_buffer() is used for writing to the flash (or any other Nios API, I suppose). That means that for each byte in the RPD file, move bit 7 to bit 0, bit 6 to bit 1 etc. There are small hints of bit reversal spread out in the docs, for example, in the “Read Bytes Operation” section of the Quad-Serial Configuration (EPCQ) Devices Datasheet.
- All my attempts to generate RBF or RPD files in other ways, including using the command line tool (quartus_cpf) to create an RBF from the SOF or an RPD from a POF, failed. That is, I got RBF and RPD files, but they were slightly different from the file that eventually worked. In particular, the RBF file obtained with
quartus_cpf -c project.sof project.rbf
was almost identical to the RPD file that worked properly, with a few bytes different in the 0x20-0x4f positions of the files. And that difference probably made the FPGA refuse to configure from it. Go figure.
- If you’re really into generating the flash image with command line tools, generate a COF file (containing the configuration parameters) with the GUI, and use it with something like
quartus_cpf -c project.cof
The trick about this COF is that it should generate a proper JIC file, but have the <auto_create_rpd> part set to “1”.
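The bit reversal mentioned in the first bullet above (bit 7 to bit 0, bit 6 to bit 1 etc.) can be sketched as follows, assuming that a chunk of the RPD file has already been loaded into a buffer in memory. reverse_bits(), reverse_bits_in_place() and example_usage() are my own names, not functions from Altera's BSP. The buffer would be passed to epcs_write_buffer() only after this step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror the bit order of one byte: bit 7 <-> bit 0, bit 6 <-> bit 1 etc. */
static uint8_t reverse_bits(uint8_t b)
{
    b = (uint8_t)(((b & 0xf0u) >> 4) | ((b & 0x0fu) << 4)); /* nibbles */
    b = (uint8_t)(((b & 0xccu) >> 2) | ((b & 0x33u) << 2)); /* bit pairs */
    b = (uint8_t)(((b & 0xaau) >> 1) | ((b & 0x55u) << 1)); /* single bits */
    return b;
}

/* Bit-reverse each byte of a buffer, e.g. a chunk of the RPD file. */
static void reverse_bits_in_place(uint8_t *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        buf[i] = reverse_bits(buf[i]);
}

static void example_usage(void)
{
    uint8_t chunk[3] = { 0x01, 0x12, 0xff };

    reverse_bits_in_place(chunk, sizeof(chunk));
    assert(chunk[0] == 0x80);
    assert(chunk[1] == 0x48);
    assert(chunk[2] == 0xff);
}
```

Only the bits within each byte are mirrored. The order of the bytes in the file remains as is.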
And finally, just a few sources I found (slightly unrelated):
- Srunner is a command line utility for programming an EPCS flash. Since its source code is available, it can give some insights, and so can its documentation.
- The format of POF files is outlined in fmt_pof.pdf.
Using an EPCQ16A device instead
The EPCQ16 device is obsolete, and replaced with EPCQ16A. Unfortunately, the AN822 Migration Guide supplies confusing and rather discouraging information, but at the end of the day, it’s a drop-in replacement for all purposes mentioned above. Except that it replies with an ID = ef4015 instead of the 20ba15 shown above. Which is fine, because it’s only the lower 8 bits that Altera / Intel stand behind. The other 16 bits are considered junk data during “dummy clock cycles” according to the datasheet (even though they are taken seriously somewhere in Altera’s old drivers, don’t ask me where I saw it).
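So if software checks the ID at all, it should look at the lower 8 bits only. A minimal sketch of such a comparison, where id_low_byte(), ids_match() and example_usage() are names I invented, and the two ID values are those reported by epcs_read_device_id() as quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* The part of the ID word that is meaningful: its lower 8 bits. */
static uint8_t id_low_byte(uint32_t silicon_id)
{
    return (uint8_t)(silicon_id & 0xffu);
}

/* Nonzero if two ID words agree on their meaningful part. */
static int ids_match(uint32_t a, uint32_t b)
{
    return id_low_byte(a) == id_low_byte(b);
}

static void example_usage(void)
{
    const uint32_t epcq16_id = 0x20ba15u;  /* EPCQ16, as shown above */
    const uint32_t epcq16a_id = 0xef4015u; /* EPCQ16A */

    assert(epcq16_id != epcq16a_id);          /* Full words differ */
    assert(id_low_byte(epcq16_id) == 0x15u);
    assert(ids_match(epcq16_id, epcq16a_id)); /* Lower 8 bits agree */
}
```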
The Migration Guide lists different Altera IP cores related to the flash, and points out which are compatible and which are not. The Legacy EPCS/EPCQx1 Flash Controller isn’t mentioned at all in this list, but as this controller is merely an SPI controller, it’s the opcode compatibility that matters. According to the Migration Guide, the relevant opcodes remain the same, which is probably all that matters: The 4BYTEADDREN/4BYTEADDEX commands that are gone in EPCQA are never used (the flash writing application never requests 4-byte write), and the 0x0b / 0xeb (fast read commands) aren’t even listed in epcs_commands.h.
Bottom line: No problem using the “A” version in the usage scenarios shown above.
It worked all so nicely on my Fedora 12 machine, and then on Ubuntu 14.04.1 it failed colossally:
$ make
gcc -Wall -O3 -g -lusb-1.0 -c -o bulkread.o bulkread.c
gcc -Wall -O3 -g -lusb-1.0 -c -o usberrors.o usberrors.c
gcc -Wall -O3 -g -lusb-1.0 bulkread.o usberrors.o -o bulkread
bulkread.o: In function `main':
bulkread.c:39: undefined reference to `libusb_init'
bulkread.c:46: undefined reference to `libusb_set_debug'
bulkread.c:48: undefined reference to `libusb_open_device_with_vid_pid'
[ ... ]
And it went on and on. Note that there was no complaint about not finding the library, and yet it failed to find the symbols.
The problem was the position of the -l flag. It turns out that Ubuntu silently adds an --as-needed flag to the linker, which means that the -l flag must appear after the object files that need the symbols, or it will be effectively ignored.
So the correct way is:
$ make
gcc -Wall -O3 -g -c -o bulkread.o bulkread.c
gcc -Wall -O3 -g -c -o usberrors.o usberrors.c
gcc -Wall -O3 -g bulkread.o usberrors.o -o bulkread -lusb-1.0
It’s all about the flag’s position…