Examples of SDMA-assembler for Freescale i.MX51

This post was written by eli on November 5, 2011
Posted Under: ARM,Linux,NXP (Freescale)

These are a couple of examples of SDMA assembly code, which copy data using the DMA functional unit. The first one shows how to copy data from application memory space to SDMA memory. The second example copies data from one application memory chunk to another, and hence works as an offloaded memcpy().

To actually use this code and generally understand what’s going on here, I’d warmly suggest reading a previous post of mine about SDMA assembly code, which also explains how to compile the code and gives the context for the C functions given below.

Gotchas

  • Never let either the source address or the destination address cross a 32-byte boundary during a burst from or to the internal FIFO. Even though I haven’t seen this restriction in the official documentation, several unexplained misbehaviors have surfaced when allowing this to happen, in particular when accessing EIM. So just don’t.
  • When accessing EIM, the EIM’s maximal burst length must be set to allow 32 bytes in one burst with the BL parameter, or data gets corrupted.
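
The first rule can be checked in driver code before a transfer is set up. The following is only a sketch: burst_within_block() and BURST_BOUNDARY are hypothetical names, not part of any SDMA driver, and the check merely restates the 32-byte rule above in terms of plain address arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_BOUNDARY 32u /* bytes; the boundary named in the gotcha above */

/* Hypothetical helper: returns true if the byte range [addr, addr + len)
 * lies inside a single 32-byte aligned block, i.e. the burst does not
 * cross a 32-byte boundary. */
static bool burst_within_block(uint64_t addr, size_t len)
{
    assert(len > 0 && len <= BURST_BOUNDARY);
    return (addr / BURST_BOUNDARY) == ((addr + len - 1) / BURST_BOUNDARY);
}
```

For example, a full 8-DW (32-byte) burst passes this check only if it starts on a 32-byte aligned address, while a single DW at offset 0x1c of a block still fits.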

Application space memory to SDMA space

The assembly code goes

$ ./sdma_asm.pl app2sdma.asm
 | # Always in context (not altered by script):
 | #
 | # r4 : Physical address to source in AP memory space
 | # r6 : Address in SDMA space to copy to
 | # r7 : Number of DWs to copy   
 | #
 | # Both r4 and r5 must be DW aligned.
 | # Note that prefetching is allowed, so up to 8 useless DWs may be read.
 |
 | # First, load the status registers into SDMA space
                             | start:
0000 6c20 (0110110000100000) |     stf r4, 0x20 # To MSA, prefetch on, address is nonfrozen
0001 008f (0000000010001111) |     mov r0, r7
0002 018e (0000000110001110) |     mov r1, r6
0003 7803 (0111100000000011) |     loop postloop, 0
0004 622b (0110001000101011) |     ldf r2, 0x2b # Read 32 bits from MD with prefetch
0005 5a01 (0101101000000001) |     st r2, (r1, 0) # Address in r1
0006 1901 (0001100100000001) |     addi r1, 1
                             | postloop:
0007 0300 (0000001100000000) |     done 3
0008 0b00 (0000101100000000) |     ldi r3, 0
0009 4b00 (0100101100000000) |     cmpeqi r3, 0 # Always true
000a 7df5 (0111110111110101) |     bt start # Always branches

------------ CUT HERE -----------

static const int sdma_code_length = 6;
static const u32 sdma_code[6] = {
 0x6c20008f, 0x018e7803, 0x622b5a01, 0x19010300, 0x0b004b00, 0x7df50000,
};

Note that the arguments for stf and ldf are given as numbers, and not following the not-so-helpful notation used in the Reference Manual.

The basic idea behind the assembly code is that each DW (Double Word, 32 bits) is read automatically by the functional unit from application space memory, and then fetched from the FIFO into r2. Then the register is written to SDMA memory with a plain “st” opcode.
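
For readers who think better in C, here is a rough software model of what the loop does. It is a sketch only: app2sdma_model() is a made-up name, sdma_mem stands in for SDMA data memory indexed by DW address, and src stands in for the application buffer that r4 points to. It assumes that the hardware loop runs r0 times, and it ignores prefetching and the FIFO altogether.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model only (hypothetical): mirrors the register usage of the
 * script, one C statement per opcode inside the loop. */
static void app2sdma_model(const uint32_t *src, uint32_t *sdma_mem,
                           size_t sdma_mem_dws, uint32_t r6, uint32_t r7)
{
    uint32_t r1 = r6;                        /* mov r1, r6 */

    /* The destination range must fit in the modelled memory. */
    assert((size_t)r6 + r7 <= sdma_mem_dws);

    for (uint32_t r0 = r7; r0 > 0; r0--) {   /* mov r0, r7; loop postloop, 0 */
        uint32_t r2 = *src++;                /* ldf r2, 0x2b */
        sdma_mem[r1] = r2;                   /* st r2, (r1, 0) */
        r1 += 1;                             /* addi r1, 1 */
    }
}
```

With r6 = 4 and r7 = 3, the first three DWs of the source land at DW addresses 4, 5 and 6 of the modelled memory, and nothing else is touched.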

The relevant tryrun() function to test this is:

static int tryrun(struct sdma_engine *sdma)
{
 dma_addr_t src_phys;
 void *src_virt;

 const int channel = 1;
 struct sdma_channel *sdmac = &sdma->channel[channel];
 static const u32 sdma_code[6] = {
   0x6c20008f, 0x018e7803, 0x622b5a01, 0x19010300, 0x0b004b00, 0x7df50000,
 };

 static const u32 sample_data[8] = {
   0x12345678, 0x11223344, 0xdeadbeef, 0xbabecafe,
   0xebeb0000, 0, 0xffffffff, 0xabcdef00 };

 const int origin = 0xe00; // In data space terms (32 bits/address)

 struct sdma_context_data *context = sdma->context;

 int ret;

 src_virt = dma_alloc_coherent(NULL,
                               4096, // 4096 bytes, just any buffer size
                               &src_phys, GFP_KERNEL);
 if (!src_virt) {
   printk(KERN_ERR "Failed to allocate source buffer memory\n");
   return -ENOMEM;
 }

 memset(src_virt, 0, 4096);

 memcpy(src_virt, sample_data, sizeof(sample_data));

 sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

 ret = sdma_request_channel(sdmac);

 if (ret) {
   printk(KERN_ERR "Failed to request channel\n");
   return ret;
 }

 sdma_disable_channel(sdmac);
 sdma_config_ownership(sdmac, false, true, false);

 memset(context, 0, sizeof(*context));

 context->channel_state.pc = origin * 2; // In program space addressing...
 context->gReg[4] = src_phys;
 context->gReg[6] = 0xe80;
 context->gReg[7] = 3; // Number of DWs to copy

 ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                          0x800 + (sizeof(*context) / 4) * channel);

 if (ret) {
   printk(KERN_ERR "Failed to load context\n");
   return ret;
 }

 ret = sdma_run_channel(&sdma->channel[1]);

 sdma_print_mem(sdma, 0xe80, 128);

 if (ret) {
   printk(KERN_ERR "Failed to run script!\n");
   return ret;
 }

 return 0; /* Success! */
}

Note that the C code snippet, which is part of the output of the assembler compilation, actually appears in the tryrun() function.
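
Comparing the listing with the sdma_code array suggests how the array is built: the 16-bit opcodes are packed in pairs, the first of each pair in the upper half of the 32-bit word, and an odd trailing opcode is padded with 0x0000. The sketch below reproduces that packing. It is inferred from the output shown above, not taken from the assembler script, and pack_opcodes() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: packs 16-bit opcodes into 32-bit words, first
 * opcode of each pair in the upper half. An odd trailing opcode is padded
 * with 0x0000. Returns the number of words written. */
static size_t pack_opcodes(const uint16_t *ops, size_t n_ops,
                           uint32_t *words, size_t max_words)
{
    size_t n_words = (n_ops + 1) / 2;

    assert(n_words <= max_words);

    for (size_t i = 0; i < n_words; i++) {
        uint32_t hi = ops[2 * i];
        uint32_t lo = (2 * i + 1 < n_ops) ? ops[2 * i + 1] : 0;

        words[i] = (hi << 16) | lo;
    }
    return n_words;
}
```

Feeding it the 11 opcodes of the listing above (0x6c20, 0x008f, ... 0x7df5) gives the six words of sdma_code, the last one being 0x7df50000.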

Fast memcpy()

Assembly goes

$ ./sdma_asm.pl copydma.asm
 | # Should be set up at invocation
 | #
 | # r0 : Number of DWs to copy (is altered as script runs)
 | # r1 : Source address (DW aligned)
 | # r2 : Destination address (DW aligned)
 |
0000 6920 (0110100100100000) |     stf r1, 0x20 # To MSA, prefetch on, address is nonfrozen
0001 6a04 (0110101000000100) |     stf r2, 0x04 # To MDA, address is nonfrozen
0002 0c08 (0000110000001000) |     ldi r4, 8 # Number of DWs to copy each round
                             | copyloop:
0003 04d8 (0000010011011000) |     cmphs r4, r0 # Is 8 larger than or equal to the number of DWs left to copy?
0004 7d03 (0111110100000011) |     bt lastcopy  # If so, jump to last transfer label
0005 6c18 (0110110000011000) |     stf r4, 0x18 # Copy 8 words from MSA to MDA address.
0006 2008 (0010000000001000) |     subi r0, 8   # Decrement counter
0007 7cfb (0111110011111011) |     bf copyloop  # Always branches, because r0 > 0
                             | lastcopy:
0008 6818 (0110100000011000) |     stf r0, 0x18 # Copy 8 or fewer DWs (r0 is always > 0)
                             | exit:
0009 0300 (0000001100000000) |     done 3
000a 0b00 (0000101100000000) |     ldi r3, 0
000b 4b00 (0100101100000000) |     cmpeqi r3, 0 # Always true
000c 7dfc (0111110111111100) |     bt exit # Endless loop, just to be safe

------------ CUT HERE -----------

static const int sdma_code_length = 7;
static const u32 sdma_code[7] = {
 0x69206a04, 0x0c0804d8, 0x7d036c18, 0x20087cfb, 0x68180300, 0x0b004b00, 0x7dfc0000,
};

For a frozen (constant) source address (e.g. when reading from a FIFO) the first stf should be done with argument 0x30 rather than 0x20. For a frozen destination address, the second stf has the argument 0x14 instead of 0x04.

This script should be started with r0 > 0. It may be OK to have r0=0, but I’m not sure about that: I don’t know whether it’s a problem not to read any data after a prefetch (possibly related to section 52.22.1 in the Reference Manual).
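
To see how the script splits a transfer into bursts, and what exactly happens with r0=0, the control flow can be modelled in C. This is a sketch that follows the compare and branch logic of the listing only; copydma_bursts() is a hypothetical name, and no data is moved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_DWS 8u /* ldi r4, 8 */

/* Hypothetical model: records the DW count of each "stf ..., 0x18" that
 * the script would issue for a given initial r0. Returns the number of
 * bursts written to bursts[]. */
static size_t copydma_bursts(uint32_t r0, uint32_t *bursts, size_t max_bursts)
{
    size_t n = 0;

    while (BURST_DWS < r0) {      /* cmphs r4, r0 is false: no jump to lastcopy */
        assert(n < max_bursts);
        bursts[n++] = BURST_DWS;  /* stf r4, 0x18 */
        r0 -= BURST_DWS;          /* subi r0, 8 */
    }

    assert(n < max_bursts);
    bursts[n++] = r0;             /* lastcopy: stf r0, 0x18 */
    return n;
}
```

With r0 = 18 the model yields bursts of 8, 8 and 2 DWs. With r0 = 0 it yields a single burst of zero DWs, which is the doubtful case mentioned above: the script goes straight to lastcopy and issues an stf with a zero count.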

The endless loop to “exit” should never be needed. It’s there just in case the script is rerun by mistake, so it responds with a “done” right away. And the example above is not really optimal: To make a branch that is always taken, I could have put “bt exit” with “bf exit” immediately after it, doing this in two opcodes instead of three. Wasteful me.

The tryrun() function for this case then goes

static int tryrun(struct sdma_engine *sdma)
{
 dma_addr_t buf_phys;
 u8 *buf_virt;

 const int channel = 1;
 struct sdma_channel *sdmac = &sdma->channel[channel];

 static const u32 sdma_code[7] = {
   0x69206a04, 0x0c0804d8, 0x7d036c18, 0x20087cfb, 0x68180300, 0x0b004b00, 0x7dfc0000,
 };

 static const u32 sample_data[8] = {
   0x12345678, 0x11223344, 0xdeadbeef, 0xbabecafe,
   0xebeb0000, 0, 0xffffffff, 0xabcdef00 };

 const int origin = 0xe00; // In data space terms (32 bits/address)

 struct sdma_context_data *context = sdma->context;

 int ret;

 buf_virt = dma_alloc_coherent(NULL, 4096,
                               &buf_phys, GFP_KERNEL);
 if (!buf_virt) {
   printk(KERN_ERR "Failed to allocate source buffer memory\n");
   return -ENOMEM;
 }

 memset(buf_virt, 0, 4096);

 memcpy(buf_virt, sample_data, sizeof(sample_data));

 sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

 ret = sdma_request_channel(sdmac);

 if (ret) {
   printk(KERN_ERR "Failed to request channel\n");
   return ret;
 }

 sdma_disable_channel(sdmac);
 sdma_config_ownership(sdmac, false, true, false);

 memset(context, 0, sizeof(*context));

 context->channel_state.pc = origin * 2; // In program space addressing...
 context->gReg[0] = 18; // Number of DWs to copy
 context->gReg[1] = buf_phys;
 context->gReg[2] = buf_phys + 0x40;

 ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                          0x800 + (sizeof(*context) / 4) * channel);

 if (ret) {
   printk(KERN_ERR "Failed to load context\n");
   return ret;
 }

 ret = sdma_run_channel(&sdma->channel[1]);

 do {
   int i;
   const int len = 0xa0;

   unsigned char line[128];
   int pos = 0;

   for (i=0; i<len; i++) {
     if ((i % 16) == 0)
       pos = sprintf(line, "%04x ", i);

     pos += sprintf(&line[pos], "%02x ", buf_virt[i]);

     if ((i % 16) == 15)
       printk(KERN_WARNING "%s\n", line);
   }
 } while (0);

 if (ret) {
   printk(KERN_ERR "Failed to run script!\n");
   return ret;
 }

 return 0; /* Success! */
}

The memory’s content is printed out here from tryrun() directly, since the dumped memory is in application space.
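
Both tryrun() functions rely on the same address arithmetic: the script is loaded at a data space address (32 bits per address), the program counter is given in program space (16 bits per address, hence the multiplication by 2), and each channel’s context sits at 0x800 plus the context size in DWs times the channel number. The sketch below spells this out. The helper names are hypothetical, and the context size of 32 DWs is an assumption; check sizeof(struct sdma_context_data) in the driver actually used.

```c
#include <assert.h>
#include <stdint.h>

#define CONTEXT_BASE_DW 0x800u /* context area base, as used in tryrun() */
#define CONTEXT_DWS     32u    /* ASSUMPTION: sizeof(struct sdma_context_data) / 4 */

/* Data space counts 32-bit words and program space counts 16-bit opcodes,
 * so the same location has twice the address in program space. */
static uint32_t data_to_program_addr(uint32_t data_addr)
{
    return data_addr * 2;
}

/* Data space address of a channel's context, computed as in tryrun(). */
static uint32_t context_addr(uint32_t channel)
{
    assert(channel < 32); /* the SDMA has 32 channels */
    return CONTEXT_BASE_DW + CONTEXT_DWS * channel;
}
```

With origin = 0xe00 the program counter becomes 0x1c00, and under the 32-DW assumption the context of channel 1 is written to data space address 0x820.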

Reader Comments

Eli,

Thanks for this series of quite helpful articles on the SDMA.

I am wondering if you have more details on the “Gotchas” you put on this page, in particular the 32 bytes boundary crossing on the EIM bus.

I think I have the same kind of trouble on an i.MX25 processor. So far it seems I need the EIM address to be 32 bytes aligned to get the SDMA working correctly.

I was wondering if you got more insight on the problem and how to work with/around it (as mandating 32 bytes alignment is not always practical, in particular when “speaking” to a device).

Thanks

JC

#1 
Written By Jean-Christophe DUBOIS on March 8th, 2012 @ 23:33

I’m afraid I can’t help any further on this. I haven’t done anything more with the SDMA, and I don’t think I will do so again. And neither have I investigated these issues further.

#2 
Written By eli on March 8th, 2012 @ 23:39

OK, thanks, I will try to sort out things.

#3 
Written By JEAN-CHRISTOPHE DUBOIS on March 9th, 2012 @ 21:26

Hi
I would suggest some simplification to ‘mx51_sdma_set.pm’ – if you change your ‘get_number()’ to be like this

sub get_number {
my ($arg) = @_;
return eval($arg);
}

(perl) operations in numbers may be used:
stf r3, 0x12 | 23

and if you have ‘#include “defs.h”’ in your asm-file (say ‘my-code.asm’), where ‘defs.h’ is a file with defines like:

#define CPY 0x10

and pass it through cpp:

cat my-code.asm | cpp | sdma_asm.pl

you will be able to use in your code this kind of expressions:

stf r1, MD | SZ0 | FL

provided (of course) ‘MD’, ‘SZ0’ (and so on) are defined in ‘defs.h’

Voila

#4 
Written By Michael on February 14th, 2013 @ 18:34
