NXP / Freescale i.MX6 as an SPI slave

This post was written by eli on August 10, 2017
Posted Under: ARM,Linux kernel,NXP (Freescale)

Motivation

Even though SPI is commonly used for controlling rather low-speed peripherals on an embedded system, it can also come in handy for exchanging data with an FPGA.

When using the official Linux driver, the host can only be the SPI master. This means, among other things, that transactions are initiated by the host: When the bursts take place is completely decided by software, and so is how long they are. It’s not just about who drives which lines, but also the fact that the FPGA is on the responding side. This may not be a good solution unless the data rates are really slow: If the FPGA is the slave, it must wait for the host to poll it for data (a bit like a USB peripheral). That can become tricky at the higher end of data rates.

For example, if the FPGA’s FIFO is 16 kbit deep and is filled at 16 Mbit/s, it takes 1 ms for it to overflow unless drained by the host. This can be a difficult real-time task for a user-space Linux program (based upon spidev, for example). Not to mention how twisted such a solution would end up, with the processor constantly spinning in a loop collecting data, whether there is data to collect or not.
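The same time budget can be worked out for other FIFO depths and fill rates. Below is a minimal sketch; the function name is made up for illustration, and decimal units are assumed (1 kbit = 1000 bits, 1 Mbit/s = 10^6 bit/s). With a binary 16 kbit FIFO (16384 bits), the budget becomes 1.024 ms, so the conclusion doesn’t change.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case time (in microseconds) before an FPGA FIFO of fifo_bits
 * overflows when it is filled at fill_rate_bps and nobody drains it.
 * Hypothetical helper, for illustration only.
 */
static double fifo_overflow_us(uint64_t fifo_bits, uint64_t fill_rate_bps)
{
    assert(fill_rate_bps > 0);
    return (double)fifo_bits * 1e6 / (double)fill_rate_bps;
}
```

So fifo_overflow_us(16000, 16000000) gives 1000 µs, which is the 1 ms mentioned above. That’s the window within which user-space software must react, every single time.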

Another point is that the SPI clock is always driven by the SPI master, and it’s usually not a free-running one. Rather, bursts of clock edges are presented on the clock wire to advance the data transaction.

Handling a gated clock correctly on an FPGA isn’t easy when it’s controlled by an external device (unless its frequency is quite low). From an FPGA design point of view, it’s by far simpler to drive the SPI clock and handle the timing of the MOSI/MISO signals with respect to it.

And finally: If good utilization of the upstream (FPGA to host) SPI channel is desired, making the FPGA the master has another advantage. For example, on i.MX6 Dual/Quad, the SPI clock cycle is limited to a minimum of 15 ns for write transactions, but to 40 ns or 55 ns for read transactions, depending on the pins used. The same figures apply regardless of whether the host is master or slave (compare sections 4.11.2.1 and 4.11.2.2 in the relevant datasheet, IMX6DQCEC.pdf). So if the FPGA needs to send data faster than 25 Mbit/s, it can only use write cycles, hence it has to be the SPI master.
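Since one bit is shifted per SPI clock cycle, these minimal cycle times translate directly into maximal bit rates: 15 ns gives ~66.7 Mbit/s, 40 ns gives 25 Mbit/s and 55 ns gives ~18.2 Mbit/s. As a trivial sketch of that conversion (the helper’s name is hypothetical):

```c
#include <assert.h>

/*
 * Maximal SPI bit rate in Mbit/s for a given minimal clock cycle in ns.
 * One bit per clock cycle, so rate = 1 / cycle = 1000 / cycle_ns.
 */
static double max_spi_mbps(double min_cycle_ns)
{
    assert(min_cycle_ns > 0.0);
    return 1000.0 / min_cycle_ns;
}
```

This is where the 25 Mbit/s limit for read cycles comes from, and why write cycles (~66 Mbit/s) are the only way to get a fast upstream channel.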

CS is useless…

This is the “Chip Select” signal, or “Slave Select” (SS) in Freescale / NXP terminology.

The reference manual, along with NXP’s official errata ERR009535, clearly state that deasserting the SPI’s CS wire is not a valid way to end a burst. Citing the description for the SS_CTL field of ECSPIx_CONFIGREG, section 21.7.4 in the i.MX6 Reference Manual:

In slave mode – an SPI burst is completed when the number of bits received in the shift register is equal to (BURST_LENGTH + 1). Only the n least-significant bits (n = BURST_LENGTH[4:0] + 1) of the first received word are valid. All bits subsequent to the first received word in RXFIFO are valid.

So the burst length is fixed. The question is what value to pick. Short answer: 32 bits (set BURST_LENGTH to 31).

Why 32? First, recall that RXFIFO is 32 bits wide. So what is more natural than packing the incoming data into full 32-bit entries in the RXFIFO, fully utilizing its storage capacity? Well, maybe the natural data alignment isn’t 32 bits, so another packing scheme could have been better. In theory.

That’s where the second sentence in the citation above comes in. What it effectively says is that if BURST_LENGTH + 1 is anything other than a multiple of 32, the first word pushed into RXFIFO after the SPI module’s reset will contain fewer than 32 received bits. All the rest, no matter what BURST_LENGTH is set to, will contain 32 bits of received data. This is really what happens. So in the long run, data is packed into 32-bit words no matter what. Choosing BURST_LENGTH + 1 other than a multiple of 32 just messes up the first word that RXFIFO receives after coming out of reset. Nothing else.

So why not set BURST_LENGTH to something other than 31? Simply because there’s no reason to do so. We’re going to end up with an SPI slave that shifts bits into RXFIFO as 32-bit words anyhow. The term “burst” has no significance, since deassertions of CS are ignored anyway. In fact, I’m not sure whether there’s any difference between values that satisfy the multiple-of-32 rule.

Note that since CS doesn’t frame the bursts, it’s important that the eCSPI module is brought out of reset while there’s no traffic (i.e. no clock edges), or it will pack the data in an unaligned and unpredictable manner. Also, if the FPGA accidentally toggles the clock (due to a bug), alignment is lost until the eCSPI is reset and reinitialized.

Bottom line: The SPI slave receiver just counts 32 clock edges, and packs the received data into RXFIFO. Forever. There is no other useful alternative when the host is slave.
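To make the packing rule concrete, here’s a small user-space model of the behavior described above. It’s not NXP code and says nothing about how the hardware is implemented; it just assumes, per the Reference Manual quote, that the first word after reset holds BURST_LENGTH[4:0] + 1 bits and every later word holds exactly 32, with bits arriving MSB first. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the eCSPI slave receiver packing a bit stream into RXFIFO words.
 * Assumption: first word after reset = BURST_LENGTH[4:0] + 1 bits, every
 * subsequent word = 32 bits. A partially filled word is not pushed.
 * Returns the number of words written to fifo.
 */
static size_t pack_rxfifo(const uint8_t *bytes, size_t nbytes,
                          unsigned burst_length,
                          uint32_t *fifo, size_t max_words)
{
    unsigned want = (burst_length & 0x1f) + 1; /* bits in the first word */
    unsigned have = 0;
    uint32_t word = 0;
    size_t words = 0;

    assert(burst_length < 4096); /* BURST_LENGTH is a 12-bit field */

    for (size_t i = 0; i < nbytes * 8 && words < max_words; i++) {
        /* Bit i of the stream, MSB of each byte first (as on the wire) */
        uint32_t bit = (bytes[i / 8] >> (7 - i % 8)) & 1u;

        word = (word << 1) | bit;
        if (++have == want) {
            fifo[words++] = word;
            word = 0;
            have = 0;
            want = 32; /* all later words are full 32-bit words */
        }
    }
    return words;
}
```

Feeding it the bytes 11 22 33 44 55 66 77 88 with BURST_LENGTH = 31 (or 63, for that matter) yields 0x11223344 and 0x55667788. With BURST_LENGTH = 7, the first word is just 0x11, and the next one is 0x22334455: every word after the first is shifted by a byte, forever. The same kind of permanent misalignment is what a stray clock edge causes.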

… but must be taken care of properly

Since the burst length doesn’t depend on the CS signal, it might as well be kept asserted all the time. With the register setting given below, that means holding the pin constantly low. It’s however important to select the correct pin in the CHANNEL_SELECT field of ECSPIx_CONREG: The host will ignore the activity on the SPI bus unless CS is selected. In other words, you can’t terminate a burst with CS, but if it isn’t asserted, bits aren’t sampled.

Another important thing to note is that the CS pin must be IOMUXed as a CS signal. In the typical device tree for the mainstream Linux SPI master driver, it’s assigned as a GPIO pin. That’s no good for an SPI slave.

So, for example, if the ECSPI entry in the device tree says:

&ecspi1 {
[ ... ]
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_ecspi1_1>;
	status = "okay";
 };

This means that the IOMUX settings given in pinctrl_ecspi1_1 are applied when the Linux driver for ecspi1 is probed. The pinctrl_ecspi1_1 entry should then say something like:

&iomuxc {
	imx6qdl-var-som-mx6 {
[ ... ]

		pinctrl_ecspi1_1: ecspi1grp {
			fsl,pins = <
				MX6QDL_PAD_DISP0_DAT22__ECSPI1_MISO	0x1f0b1
				MX6QDL_PAD_DISP0_DAT21__ECSPI1_MOSI	0x1f0b1
				MX6QDL_PAD_DISP0_DAT20__ECSPI1_SCLK	0x130b1
				MX6QDL_PAD_DISP0_DAT23__ECSPI1_SS0	0x1f0b1
			>;
		};
[ ... ]

The actual labels differ depending on the processor’s variant, which pins were chosen etc. The point is that the _SS0 usage was selected for the pin, and not the GPIO alternative (in which case it would say MX6QDL_PAD_DISP0_DAT23__GPIO5_IO17). The list of IOMUX defines for the i.MX6 DL variant can be found in arch/arm/boot/dts/imx6dl-pinfunc.h.

Endianness

The timing diagrams for SPI communication in the Reference Manual show only 8-bit examples, with the MSB received first. This applies to 32-bit words as well. But what happens if 4 bytes are sent with the intention of being treated as a string of bytes?

Because the first byte is treated as the MSB of a 32-bit word, it ends up as the last byte when the 32-bit word is copied into RAM (by virtue of a single 32-bit read and write), whether by the processor or by SDMA. This ensures that a 32-bit integer transmitted over the SPI bus is interpreted correctly by the Little Endian processor, but it messes up the order of bytes that are transmitted as a plain sequence.

Where exactly this flipping takes place, I’m not sure, but it doesn’t really matter. Just be aware that if a sequence of bytes is sent over the SPI link, it needs to be byte-swapped in groups of 4 bytes to appear in the correct order in the processor’s memory.
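A minimal sketch of that fix-up, assuming the buffer holds whole 32-bit words as read from RXFIFO (the function name is made up). GCC’s __builtin_bswap32() or the kernel’s swab32() do the same thing one 32-bit word at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Restore the original order of a byte stream that went through RXFIFO:
 * reverse each group of 4 bytes in place.
 */
static void swap_bytes_in_words(uint8_t *buf, size_t len)
{
    assert(len % 4 == 0); /* data arrives as whole 32-bit words */

    for (size_t i = 0; i < len; i += 4) {
        uint8_t b0 = buf[i];
        uint8_t b1 = buf[i + 1];

        buf[i] = buf[i + 3];
        buf[i + 1] = buf[i + 2];
        buf[i + 2] = b1;
        buf[i + 3] = b0;
    }
}
```

For example, if the FPGA sends the bytes 1, 2, 3, 4, they show up in memory as 4, 3, 2, 1, and this function turns them back into 1, 2, 3, 4.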

Register setting

In terms of a Linux kernel driver, the probe of an SPI slave is pretty much the same as for an SPI master, with a few obvious differences. For example, the SPI clock’s frequency isn’t controlled by the host, so it probably doesn’t matter much how the dividers are set (but it’s wise to set them to 1 anyhow, in case the internal clock is used for something).

  ctrl = MX51_ECSPI_CTRL_ENABLE | /* Enable module */
    /* MX51_ECSPI_CTRL_MODE_MASK not set, so it's slave mode */
    /* Both clock dividers set to 1 => 60 MHz, not clear if this matters */
    MX51_ECSPI_CTRL_CS(which_cs) | /* Select CSn */
    (31 << MX51_ECSPI_CTRL_BL_OFFSET); /* Burst len = 32 bits */

  cfg = 0; /* All defaults, in particular, no clock phase / polarity change */

  /* CTRL register always goes first to bring the controller out of reset */
  writel(ctrl, regs + MX51_ECSPI_CTRL);

  writel(cfg, regs + MX51_ECSPI_CONFIG);

  /*
   * Wait until the changes in the configuration register CONFIGREG
   * propagate into the hardware. It takes exactly one tick of the
   * SCLK clock, but we will wait 10 us to be sure (SCLK is 60 MHz)
   */

  udelay(10);

  /*
    Turn off DMA requests (revert the register to its defaults)
    But set the RXFIFO watermark as required by device tree.
  */
  writel(MX51_ECSPI_DMA_RX_WML(rx_watermark),
	 regs + MX51_ECSPI_DMA);

  /* Enable interrupt when RXFIFO reaches watermark */
  writel(MX51_ECSPI_INT_RDREN, regs + MX51_ECSPI_INT);

The example above shows the settings that apply when the host reads from the RXFIFO directly. Given the measurements I present in another post of mine, showing ~4 Mops/s with a plain readl() call, the maximal bus rate of 66 Mbit/s, which is ~2.06 Mops/s (32 bits per read), means that a processor core is ~50% busy just on readl() calls.
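The estimate boils down to the word rate divided by the readl() rate. A minimal sketch, taking the ~4 Mops/s figure as an input (the helper name is hypothetical):

```c
#include <assert.h>

/*
 * Fraction of one CPU core spent on readl() calls when draining RXFIFO
 * by PIO. Each readl() fetches one 32-bit word.
 * bus_rate_bps:  SPI bit rate
 * readl_per_sec: measured readl() throughput (~4e6 in my measurements)
 */
static double pio_cpu_load(double bus_rate_bps, double readl_per_sec)
{
    assert(readl_per_sec > 0.0);
    double words_per_sec = bus_rate_bps / 32.0;
    return words_per_sec / readl_per_sec;
}
```

At 66 Mbit/s this gives ~0.52, and at 112 Mbit/s it would be 0.875, i.e. most of a core doing nothing but register reads (assuming the same readl() throughput).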

So for higher data rates, SDMA is pretty much a must.
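For reference, the CONREG value written in the probe code above can be spelled out in plain bits, to check it against the ECSPIx_CONREG description in the Reference Manual. The macro definitions below are copied, to the best of my knowledge, from the mainline spi-imx.c driver, where they are private to that file; slave_conreg() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Bit fields as defined in mainline drivers/spi/spi-imx.c */
#define MX51_ECSPI_CTRL_ENABLE     (1 << 0)
#define MX51_ECSPI_CTRL_MODE_MASK  (0xf << 4)
#define MX51_ECSPI_CTRL_CS(cs)     ((cs) << 18)
#define MX51_ECSPI_CTRL_BL_OFFSET  20

/* ECSPIx_CONREG value for slave mode with 32-bit bursts on channel which_cs */
static uint32_t slave_conreg(unsigned which_cs)
{
    uint32_t ctrl;

    assert(which_cs < 4); /* CHANNEL_SELECT is a 2-bit field */

    ctrl = MX51_ECSPI_CTRL_ENABLE |          /* EN, bit 0 */
           MX51_ECSPI_CTRL_CS(which_cs) |    /* CHANNEL_SELECT, bits 19:18 */
           (31u << MX51_ECSPI_CTRL_BL_OFFSET); /* BURST_LENGTH, bits 31:20 */

    /* CHANNEL_MODE bits left clear, so every channel is in slave mode */
    assert((ctrl & MX51_ECSPI_CTRL_MODE_MASK) == 0);
    return ctrl;
}
```

With the SS0 pin from the device tree example (which_cs = 0), this gives 0x01F00001.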

The speed test

Eventually, I ran a test. With a dedicated SDMA script, SPI clock running at 112 MHz, 108.6 Mbit/s actual throughput:

# time dd if=/dev/myspi of=/dev/null bs=64k count=500
500+0 records in
500+0 records out
32768000 bytes (33 MB, 31 MiB) copied, 2.41444 s, 13.6 MB/s

real	0m2.434s
user	0m0.000s
sys	0m1.610s

This clock rate is, of course, way above the maximal SPI clock frequency of 66 MHz given in the datasheet, but it’s not uncommon for real-life results to be so much better. I didn’t bother pushing the clock higher.
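The 108.6 Mbit/s figure follows directly from dd’s output: 32768000 bytes × 8 / 2.41444 s. Compared with the 112 MHz clock, that’s roughly 97% of the raw bit rate reaching user space. A trivial sketch of that arithmetic (helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Effective payload rate in Mbit/s, from dd's byte count and elapsed time */
static double throughput_mbps(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * 8.0 / seconds / 1e6;
}

/* Share of the raw SPI clock rate (one bit per cycle) that ends up as payload */
static double link_efficiency(double payload_mbps, double sclk_mhz)
{
    assert(sclk_mhz > 0.0);
    return payload_mbps / sclk_mhz;
}
```

throughput_mbps(32768000, 2.41444) comes out at ~108.57, and dividing by 112 gives ~0.97.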

I ran a long and rigorous test looking for errors on the data transmission line (~1 TB of data), and it was completely clean at 112 MHz, so the SPI slave is reliable. Still, for a production system I wouldn’t consider exceeding 66 MHz, despite this result. Just to have that said.

But the bottom line is that the SPI slave mode can be used as a simple transmission link of 32-bit words. Often that’s good enough.

Reader Comments

Did you try with more than one slave on the bus? From what I’ve seen on an i.MX6ULL the MISO line doesn’t go tristate properly on deasserting SS.

#1 
Written By Chris Fryer on July 13th, 2018 @ 12:11

Do you have a full working example of the implemented SPI slave driver?

#2 
Written By Curtis on January 8th, 2019 @ 23:55

Yes, but it’s commissioned work, so I can’t disclose it.

#3 
Written By eli on January 8th, 2019 @ 23:59

Ah, bummer!
In that case can you recommend any resources to achieve the same?

#4 
Written By Curtis on January 9th, 2019 @ 23:09

I’m afraid not. Maybe googling around will help.

#5 
Written By eli on January 9th, 2019 @ 23:14

I have a SPI driver for slave mode that works, but have found that it will clock 1 byte out when the SPI clock runs even if CS is inactive. So ideally, the SPI clock should be gated with CS on the slave device if there are other slave devices on the bus.

#6 
Written By Bill on January 15th, 2019 @ 21:24

“Maybe googling around will help” Funny guy. :-)

#7 
Written By ralph on May 17th, 2019 @ 05:49

My SPI Slave driver works, but without DMA..
it seems DMA is disabled in slave mode, how can you enable it?

#8 
Written By noodlefighter on July 21st, 2022 @ 09:53
