NXP / Freescale SDMA and the art of accessing peripheral registers

This post was written by eli on August 10, 2017
Posted Under: ARM,Linux kernel,NXP (Freescale)

Preface

While writing a custom SDMA script for copying data arriving from an eCSPI peripheral into memory, it occurred to me that there is more than one way to fetch the data from the peripheral. This post summarizes my rather decisive findings in this matter. Spoiler: Linux’ driver (Freescale’s v4.1.15) could have done better.

I’ve written a tutorial on SDMA scripts in general, by the way, which is recommended before diving into this one.

Using the Peripheral DMA Unit

This is the method used by the official eCSPI driver for Linux. That is, the one obtained from Freescale’s / NXP’s Linux git repository. Specifically, spi_imx_sdma_init() in drivers/spi/spi-imx.c sets up the DMA transaction with

	spi_imx->rx_config.direction = DMA_DEV_TO_MEM;
	spi_imx->rx_config.src_addr = res->start + MXC_CSPIRXDATA;
	spi_imx->rx_config.src_addr_width = DMA_SLAVE_BUSWIDTH_1_BYTE;
	spi_imx->rx_config.src_maxburst = spi_imx_get_fifosize(spi_imx) / 2;
	ret = dmaengine_slave_config(master->dma_rx, &spi_imx->rx_config);
	if (ret) {
		dev_err(dev, "error in RX dma configuration.\n");
		goto err;
	}

Since res->start points at the address resource obtained from the device tree (0x2008000 for eCSPI1), this is the very same address used for accessing the peripheral registers (only the software uses the virtual address mapped to the relevant region).

In essence, it means issuing an stf command to set the PSA (Peripheral Source Address), and then reading the data with an ldf command on the PD register. For example, if the physical address (e.g. 0x2008000) is in register r1:

69c3 (0110100111000011) | 	stf	r1, 0xc3	# PSA = r1 for 32-bit frozen peripheral read
62c8 (0110001011001000) | 	ldf	r2, 0xc8	# Read peripheral register into r2
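The hex words in the listing follow a simple pattern. As a minimal sketch, assuming the bit layout that can be read off the listings in this post (a 5-bit opcode, a 3-bit register number and an 8-bit functional unit field), the two instructions could be assembled as follows. The function names are made up for illustration, and the layout is inferred from the listings rather than quoted from the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout, inferred from the listings in this post:
 *   ldf rN, fu  ->  01100 nnn ffffffff
 *   stf rN, fu  ->  01101 nnn ffffffff
 */

/* Hypothetical helper: encode "ldf rN, fu" */
uint16_t sdma_encode_ldf(unsigned reg, unsigned fu)
{
	assert(reg < 8 && fu < 256);
	return (uint16_t)(0x6000u | (reg << 8) | fu);
}

/* Hypothetical helper: encode "stf rN, fu" */
uint16_t sdma_encode_stf(unsigned reg, unsigned fu)
{
	assert(reg < 8 && fu < 256);
	return (uint16_t)(0x6800u | (reg << 8) | fu);
}
```

Calling sdma_encode_stf(1, 0xc3) and sdma_encode_ldf(2, 0xc8) reproduces the two words in the listing above, which is a handy sanity check when assembling short scripts by hand.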

One would expect this to be the correct way, or why does this unit exist? Or why does Linux’ driver use it? On the other hand, if this is the right way, why is there a “DMA mapping”?

Using the Burst DMA Unit

This might sound like a bizarre idea: Use the DMA unit intended for accessing RAM for peripheral registers. I wasn’t sure this would work at all, but it does: If the same address that was fed into PSA for accessing a peripheral goes into MSA instead, the data can be read correctly from MD. After all, the same address space is used by the processor, Peripheral DMA unit and Burst DMA unit, and it turns out that the buses are interconnected (which isn’t obvious).

So the example above changes into

6910 (0110100100010000) | 	stf	r1, 0x10    # To MSA, NO prefetch, address is frozen
620b (0110001000001011) | 	ldf	r2, 0x0b    # Read peripheral register into r2

The motivation for this type of access is using copy mode — a burst of up to 8 read/write operations in a single SDMA command. This is possible only from PSA to PDA, or from MSA to MDA. But there is no burst mode from PSA to MDA. So treating the peripheral register as a memory element works around this.

Spoiler: It’s not such a good idea. The speed results below tell why.

Using the SDMA internal bus mapping

The concept is surprisingly simple: It’s possible to access some peripherals’ registers directly in the SDMA assembly code’s memory space. In other words, to access eCSPI1, one can go just

5201 (0101001000000001) | 	ld	r2, (r1, 0) # Read peripheral register from plain SDMA address space

and achieve the equivalent result of the examples above. But r1 needs to be set to a different address. And this is where it gets a bit confusing.

The base address is fairly easy to obtain. For example, i.MX6’s reference manual lists the address for eCSPI1 as 0x2000 in section 2.4 (“DMA memory map”), where it also says that the relevant section spans 4 kB. Table 55-14 (“SDMA Data Memory Space”) in the same document assigns the region 0x2000-0x2fff to “per2”, declares its size as 16 kB, and in the description it says “peripheral 2 memory space (4 Kbyte peripheral’s address space)”. So what is it? 4 kB or 16 kB?

The answer is both: The address 0x2000 is given in SDMA data address format, meaning that each address points at a 32-bit word. Therefore, the SDMA map region of 0x2000-0x2fff indeed spans 16 kB. But the mapping to the peripheral registers was done in a somewhat creative way: The address offsets of the registers apply directly on the SDMA mapping’s addresses.

For example, let’s consider the ECSPI1_STATREG, which is placed at “Base address + 18h offset”. In the Application Processor’s address space, it’s quite clear that it’s 0x2008000 + 0x18 = 0x2008018. The 0x18 offset means 0x18 (24 in decimal) bytes away from the base.

In the SDMA mapping, the same register is accessed at 0x2000 + 0x18 = 0x2018. At first glance, this might seem obvious, but an 0x18 offset means 24 x 4 = 96 bytes away from the base address. A bit odd, but that’s the way it’s implemented.

So even though each address increment in SDMA data address space moves 4 bytes, the registers’ byte offsets (all multiples of 4) were mapped directly onto SDMA addresses, placing consecutive registers 4 addresses, or 16 bytes, apart. Attempting to access addresses like 0x2001 yields nothing noteworthy (in my experiments, they all read zero). I believe that the SDMA submodule was designed in France, by the way.
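To make the two address calculations concrete, here is a minimal sketch for eCSPI1, using the base addresses quoted above (0x2008000 on the Application Processor side, 0x2000 in SDMA data address space). The macro and function names are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define ECSPI1_AP_BASE   0x02008000u /* Application Processor physical address */
#define ECSPI1_SDMA_BASE 0x2000u     /* SDMA data address space (32-bit words) */

/* Address as seen by the processor, and by the Peripheral / Burst DMA units */
uint32_t ecspi1_ap_addr(uint32_t reg_offset)
{
	assert((reg_offset & 3u) == 0 && reg_offset < 0x1000u);
	return ECSPI1_AP_BASE + reg_offset;
}

/* Address to put in the base register of an ld / st instruction */
uint32_t ecspi1_sdma_addr(uint32_t reg_offset)
{
	assert((reg_offset & 3u) == 0 && reg_offset < 0x1000u);
	/* The byte offset is added as is, NOT divided by 4 */
	return ECSPI1_SDMA_BASE + reg_offset;
}
```

For ECSPI1_STATREG (offset 0x18), the first function gives 0x2008018 and the second gives 0x2018, matching the two cases discussed above.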

Almost needless to say, these addresses (e.g. 0x2000) can’t be used to access peripherals with Peripheral / Burst DMA units — these units work with the Application Processor’s bus infrastructure and memory map.

Speed tests

As all three methods work, the question is how fast each is. So I ran a speed test. I only tested the peripheral read operation (my application didn’t involve writes), but I would expect more or less the same results for writes. The speed tests were carried out by starting the SDMA script from a Linux kernel module, and issuing a printk when the SDMA script was kicked off. When the interrupt arrived at the completion of the script (resulting from a “done 3” opcode, not shown in the table below), another printk was issued. The timestamps in dmesg’s output were used to measure the time difference.

In order to keep the influence of the Linux overhead delays low, the tested command was executed within a hardware loop, so that the overall execution would take a few seconds. A few milliseconds of printk delay hence became fairly negligible.

The results are given in the following table:

| | Peripheral DMA Unit | Burst DMA Unit | Internal bus mapping | Non-IO command |
|---|---|---|---|---|
| Assembly code | `stf r1, 0xc3` `loop endloop, 0` `ldf r2, 0xc8` `endloop:` | `stf r1, 0x10` `loop endloop, 0` `ldf r2, 0x0b` `endloop:` | `loop endloop, 0` `ld r2, (r1, 0)` `endloop:` | `loop endloop, 0` `addi r5, 2` `endloop:` |
| Execution rate | 7.74 Mops/s | 3.88 Mops/s | 32.95 Mops/s | 65.97 Mops/s |

Before concluding the results, a word on the rightmost one, which tested the speed of a basic command. The execution rate, almost 66 Mops/s, shows the SDMA machine’s upper limit. Where this came from isn’t all that clear, as I couldn’t find a matching clock rate in any of the three clocks enabled by Linux’ SDMA driver: clk_ahb, clk_ipg and clk_per.

The reference manual’s section 55.4.6 claims that the SDMA core’s frequency is limited to 104 MHz, but calling clk_get_rate() for clk_ahb returned 132 MHz (which is 2 x 66 MHz…). For the two other clocks that the imx-sdma.c driver declares that it uses, clk_ipg and clk_per (the same clock, I believe), clk_get_rate() returned 60 MHz, so it’s not that one. In short, it’s not 100% clear what’s going on, except that the figure is max 66 Mops/s.

By the way, I verified that the hardware loop doesn’t add extra cycles by duplicating the addi command, so it ran 10 times for each loop iteration. The execution rate dropped to exactly 1/10, so there’s definitely no loop overhead.

OK, so now to the conclusions:

The clear winner is using the internal bus. Note that the result isn’t all that impressive, after all. With 33 Mops/s, 4 bytes each, there’s a theoretical limit of 132 MB/s for just reading. That doesn’t include doing something with the data. More about that below.
Note that reading from the internal bus takes just 2 execution cycles.
There is a reason for using the Peripheral DMA Unit, after all: It’s twice as fast compared with the Burst DMA Unit.
It probably doesn’t pay off to use the Burst DMA Unit for burst copying from a peripheral to memory, even though I didn’t give it a go: The read is twice as slow, and writing to memory with autoflush is rather quick (see below).
The use of the Peripheral DMA Unit in the Linux kernel driver is quite questionable, given the results above. On the other hand, the standard set of scripts isn’t really designed for efficiency anyhow.
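The arithmetic behind these conclusions can be sketched with two small helpers. This is only an illustration: the function names are invented here, and the cycle count assumes that the 65.97 Mops/s measured for the plain addi loop corresponds to one execution cycle per instruction.

```c
#include <assert.h>

/* Theoretical data rate in MB/s when each operation moves one 32-bit word */
double mops_to_mbytes_per_sec(double mops)
{
	assert(mops >= 0.0);
	return mops * 4.0;
}

/*
 * Execution cycles per operation, taking the rate of a plain
 * (non-IO) instruction as the rate of one cycle per operation.
 */
double cycles_per_op(double plain_mops, double op_mops)
{
	assert(plain_mops > 0.0 && op_mops > 0.0);
	return plain_mops / op_mops;
}
```

Feeding in the figures from the table, cycles_per_op(65.97, 32.95) comes out very close to 2 for the internal bus read, while the Peripheral and Burst DMA Units land at roughly 8.5 and 17 cycles per read, respectively.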

Copying data from peripheral to RAM

In this last pair of speed tests, the loop reads one value from the peripheral with Internal bus mapping (the fastest way found) and writes it to the general RAM with an stf command, using autoincrement. This is hence a realistic scenario for bulk copying of data from a peripheral data register into memory that is available to the Application Processor.

The test code had to be modified slightly, so the destination address is brought back to the beginning of the buffer every 1,000,000 write operations, since the buffer size is limited, quite naturally. So when the script begins, r7 contains the number of times to loop until resetting the destination address (that is, r7 = 1000000) and r3 contains the number of such sessions to run (was set to 200). The overhead of this larger loop is literally one in a million.

The assembly code used was:

                             | bigloop:
0000 008f (0000000010001111) | 	mov	r0, r7
0001 6e04 (0110111000000100) | 	stf	r6, 0x04	# MDA = r6, incremental write
                             |
0002 7802 (0111100000000010) | 	loop endloop, 0
0003 5201 (0101001000000001) | 	ld	r2, (r1, 0)
0004 6a0b (0110101000001011) | 	stf	r2, 0x0b	# Write 32-bit word, no flush
                             | endloop:
0005 2301 (0010001100000001) | 	subi	r3, 1		# Decrement big loop counter
0006 7cf9 (0111110011111001) | 	bf	bigloop		# Loop until r3 == 0
                             | quit:
0007 0300 (0000001100000000) | 	done 3			# Quit MCU execution

The result was 20.70 Mops/s, that is 20.7 million pairs of read-writes per second. This sets the realistic hard upper limit for reading from a peripheral to 82.8 MB/s. Note that by deducting the known time it takes to execute the peripheral read, one can estimate that the stf command runs at ~55.5 Mops/s. In other words, it’s a single cycle instruction until an autoflush is forced every 8 writes. However, dropping the peripheral read command (leaving only the stf command) yields only 35.11 Mops/s. So it seems like the Burst DMA Unit takes advantage of the small pauses between accesses to it.
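The estimate for the stf command alone comes from subtracting per-iteration times, not rates. A minimal sketch of that calculation, with a hypothetical function name, could look like this:

```c
#include <assert.h>

/*
 * Given the rate of a loop running two instructions (pair_mops) and the
 * rate of a loop running only one of them (known_mops), estimate the
 * rate of the other instruction by deducting the time per iteration.
 */
double residual_mops(double pair_mops, double known_mops)
{
	assert(pair_mops > 0.0 && known_mops > pair_mops);

	double t_pair = 1.0 / pair_mops;   /* microseconds per read + write */
	double t_known = 1.0 / known_mops; /* microseconds per read alone */

	return 1.0 / (t_pair - t_known);
}
```

With the rounded figures given in this post, residual_mops(20.70, 32.95) lands in the region of 55 to 56 Mops/s, in line with the estimate above. The small difference from that estimate presumably comes from rounding of the measured rates.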

I should mention that the Linux system was overall idle while performing these tests, so there was little or no congestion on the physical RAM. The results were repeatable within 0.1% of the execution time.

Note that automatic flush was enabled during this test, so the DMA burst unit received 8 writes (32 bytes) before flushing the data into RAM. When reattempting this test, with explicit flush on each write to RAM (exactly the same assembly code as listed above, with a peripheral read and then stf r7, 0x2b instead of 0x0b), the result dropped to 6.83 Mops/s. Which is tantalizingly similar to the 7.74 Mops result obtained for reading from the Peripheral DMA Unit.

Comparing with non-DMA

Even though not directly related, it’s worth comparing how fast the host accesses the same registers. For example, how much time will this take (in Linux kernel code, of course)?

  for (i=0; i<10000000; i++)
    rc += readl(ecspi_regs + MX51_ECSPI_STAT);

So the results are as follows:

Reading from an eCSPI register (as shown above): 4.10 Mops/s
The same, but from RAM (non-cacheable, allocated with dma_alloc_coherent): 6.93 Mops/s
The same, reading with readl() from a region handled by RAM cache (so it’s considered volatile): 58.14 Mops/s
Writing to an eCSPI register (with writel(), loop similar to above): 3.8696 Mops/s

This was carried out on an i.MX6 processor with a clock frequency of 996 MHz.

The figures echo well with those found in the SDMA tests, so it seems like the dominant delays come from i.MX6’s bus bridges. It’s also worth noting the surprisingly slow performance of readl() from cacheable memory, maybe because of the memory barriers.
