my tech blog - Anything I found worthy to write down.

Freescale i.MX SDMA tutorial (part IV)

This is part IV of a brief tutorial about the i.MX51′s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into part goes as follows:

Part I: Introduction, addressing and the memory map
Part II: Contexts, Channels, Scripts and their execution
Part III: Events and Interrupts
Part IV: Running custom SDMA scripts in Linux (this page)

Running custom scripts

I’ll try to show the basics of getting a simple custom script to run on the SDMA core. Since there’s a lot of supporting infrastructure involved, I’ll show my example as a hack on the drivers/dma/imx-sdma.c Linux kernel module per version 2.6.38. I’m not going to explain the details of kernel hacking, so without experience in that field, it will be pretty difficult to try this out yourself.

The process of running an application-driven custom script consists of the following steps:

Initialize the SDMA module
Initialize the SDMA channel and clearing its HE flag
Copy the SDMA assembly code from application space memory to SDMA memory space RAM.
Set up the channel’s context
Enable the channel’s HE flag (so the script runs pretty soon)
Wait for interrupt (assuming that the script ends with a “DONE 3″)
Possibly copy back the context to application processor space, to inspect the registers upon termination, and verify that their values are as expected.
Possibly copy SDMA memory to application processor space in order to inspect if the script worked as expected (if the script writes to SDMA RAM)

The first two steps are handled by the imx-smda.c kernel module, so I won’t cover them. I’ll start with the assembly code, which has to be generated first.

The assembler

Freescale offers their assembler, but I decided to write my own in Perl. It’s simple and useful for writing short routines, and its output is snippets of C code, which can be inserted directly into the source, as I’ll show later. It’s released under GPLv2, and you can download it from this link.

The sample code below does nothing useful. For a couple of memory related examples, please see another post of mine.

To try it out quickly, just untar it on some UNIX system (Linux included, of course), change directory to sdma_asm, and go

$ ./sdma_asm.pl looptry.asm
                             | start:
0000 0804 (0000100000000100) |     ldi r0, 4
0001 7803 (0111100000000011) |     loop exit, 0
0002 5c05 (0101110000000101) |     st r4, (r5, 0) # Address r5
0003 1d01 (0001110100000001) |     addi r5, 1
0004 1c10 (0001110000010000) |     addi r4, 0x10
                             | exit:
0005 0300 (0000001100000000) |     done 3
0006 1c40 (0001110001000000) |     addi r4, 0x40
0007 0b00 (0000101100000000) |     ldi r3, 0
0008 4b00 (0100101100000000) |     cmpeqi r3, 0 # Always true
0009 7df6 (0111110111110110) |     bt start # Always branches

------------ CUT HERE -----------

static const int sdma_code_length = 5;
static const u32 sdma_code[5] = {
 0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
};

The output should be pretty obvious. In particular, note that there’s a C declaration of a const array called sdma_code, which I’ll show how to use below. The first part of the output is a plain assembly listing, with the address, hex code and binary representation of the opcodes. There are a few simple syntax rules to observe:

Anything after a ‘;’ or ‘#’ sign is ignored (comments)
Empty lines are ignored, of course
A label starts the line, and is followed by a colon sign, ‘:’
Everything is case-insensitive, including labels (all code is lowercased internally)
The first alphanumeric string is considered the opcode, unless it’s a label
Everything following an opcode (comments excluded) is considered the arguments
All registers are noted as r0, r1, … r7 in the argument fields, and not as plain numbers, unlike the way shown in the reference manual. This makes a clear distinction between registers and values. It’s “st r7, (r0,9)” and not “~~st 7, (0,9)~~“.
Immediate arguments can be represented as decimal numbers (digits only), possibly negative (with a plain ‘-’ prefix). Positive hexadecimal numbers are allowed with the classic C “0x” prefix.
Labels are allowed for loops, as the first argument. The label is understood to be the first statement after the loop, so the label is the point reached when the loop is finished. See the example above. The second argument may not be omitted.
Other than loops, labels are accepted only for branch instructions, where the jump is relative. Absolute jump addresses can’t be generated automatically for jmp and jsr because the absolute address is not known during assembly.

A few words about why labels are not allowed for absolute jumps: It would be pretty simple to tell the Perl script the origin address, and allow absolute addressed jumps. I believe absolute jumps within a custom script should be avoided at any cost, so that the object code can be stored and run anywhere vacant. This is why I wasn’t keen on implementing this.

A simple test function

This is a simple function, which loads a custom script and runs it a few times. I added it, and a few additional functions (detailed later) to the Linux kernel’s SDMA driver, imx-sdma.c, and called it at the end of sdma_probe(). This is the simplest, yet not most efficient way to try things out: The operation takes place once when the module is inserted into the kernel, and then a reboot is necessary, since the module can’t be removed from the kernel. But with the reboot being fairly quick on an embedded system, it’s pretty OK.

So here’s the tryrun() function. Mind you, it’s called after the SDMA subsystem has been initialized, with one argument, the pointer to the sdma_engine structure (there’s only one for the entire system).

static int tryrun(struct sdma_engine *sdma)
{
 const int channel = 1;
 struct sdma_channel *sdmac = &sdma->channel[channel];
 static const u32 sdma_code[5] = {
  0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
 };

 const int origin = 0xe00; /* In data space terms (32 bits/address) */

 struct sdma_context_data *context = sdma->context;

 int ret;
 int i;

 sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

 ret = sdma_request_channel(sdmac);

 if (ret) {
   printk(KERN_ERR "Failed to request channel\n");
   return ret;
 }

 sdma_disable_channel(sdmac);
 sdma_config_ownership(sdmac, false, true, false);

 memset(context, 0, sizeof(*context));

 context->channel_state.pc = origin * 2; /* In program space addressing... */
 context->gReg[4] = 0x12345678;
 context->gReg[5] = 0xe80;

 ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                          0x800 + (sizeof(*context) / 4) * channel);
 if (ret) {
   printk(KERN_ERR "Failed to load context\n");
   return ret;
 }

 for (i=0; i<4; i++) {
   ret = sdma_run_channel(&sdma->channel[1]);
   printk(KERN_WARNING "*****************************\n");
   sdma_print_mem(sdma, 0xe80, 128);

   if (ret) {
     printk(KERN_ERR "Failed to run script!\n");
     return ret;
   }
 }
 return 0; /* Success! */
}

Copying the code into SDMA memory

First, note that sdma_code is indeed copied from the output of the assembler, when it’s executed on looptry.asm as shown above. The assembler adds the “static” modifier as well as an sdma_code_length variable which were omitted, but otherwise it’s an exact copy.

The first thing the function actually does, is calling sdma_write_datamem() to copy the code into SDMA space (and I don’t check the return value, sloppy me). This is a function I’ve added, but its clearly derived from sdma_load_context(), which is part of imx-sdma.c:

static int sdma_write_datamem(struct sdma_engine *sdma, void *buf,
                              int size, u32 address)
{
 struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
 void *buf_virt;
 dma_addr_t buf_phys;
 int ret;

 buf_virt = dma_alloc_coherent(NULL, size, &buf_phys, GFP_KERNEL);
 if (!buf_virt)
 return -ENOMEM;

 bd0->mode.command = C0_SETDM;
 bd0->mode.count = size / 4;
 bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
 bd0->buffer_addr = buf_phys;
 bd0->ext_buffer_addr = address;

 memcpy(buf_virt, buf, size);

 ret = sdma_run_channel(&sdma->channel[0]);

 dma_free_coherent(NULL, size, buf_virt, buf_phys);

 return ret;
}

The sdma_write_datamem()’s principle of operation is pretty simple: First a buffer is allocated, with its address in virtual space given in buf_virt and its physical address is buf_phys. Both addresses are related to the application processor, of course.

Then the buffer descriptor is set up. This piece of memory is preallocated globally for the entire sdma engine (in application processor’s memory space), which isn’t the cleanest way to do it, but since these operations aren’t expected to happen in parallel processes, this is OK. The sdma_buffer_descriptor structure is defined in imx-smda.c itself, and is initialized according to section 52.23.1 in the Reference Manual. Note that this calling convention interfaces with the script running on channel 0, and not with any hardware interface. This chunk is merely telling the script what to do. In particular, the C0_SETDM command tells it to copy from application memory space to SDMA data memory space (see section 53.23.1.2).

Note that in the function’s arguments, “size” is given in bytes, but address in SDMA data address space (that is, in 32-bit quanta). This is why “size” is divided by four to become the element count (mode.count).

Just before kicking off, the input buffer’s data is copied into the dedicated buffer with a plain memcpy() command.

And then sdma_run_channel() (part of imx-sdma.c) is called to make channel 0 runnable. This function merely sets the HE bit of channel 0, and waits (sleeping) for the interrupt to arrive, or errors on timeout after a second.

At this point we have the script loaded into SDMA RAM (at data address 0xe00).

Some housekeeping calls on channel 1

Up to this point, nothing was done on the channel we’re going to use, which is channel #1. Three calls to functions defined in imx-sdma.c prepare the channel for use:

sdma_request_channel() sets up the channel’s buffer descriptor and data structure, and enables the clock global to the entire sdma engine, actions which I’m not sure are necessary. It also sets up the channel’s priority and the Linux’ wait queue (used when waiting for interrupt).
sdma_disable_channel() clears the channel’s HE flag
sdma_config_ownership() clears HO, sets EO and DO for the channel, so the channel is driven (“owned”) by the processor (as opposed to driven by external events).

Setting up the context

Even though imx-sdma.c has a sdma_load_context() function, it’s written for setting up the context as suitable for running the channel 0 script. To keep things simpler, we’ll set up the context directly.

After zeroing the entire structure, three registers are set in tryrun(): The program counter, r4 and r5. Note that the program counter is given the address to which the code was copied, multiplied by 2, since the program counter is given in program memory space. The two other registers are set merely as an initial state for the script. The structure is then copied into the per-channel designated slot with sdma_write_datamem().

Again, note that the “context” data structure, which is used as a source buffer from which the context is copied into SDMA memory, is allocated globally for the entire SDMA engine. It’s not even protected by a mutex, so in a real project you should allocate your own piece of memory to hold the sdma_context structure.

Running the script

In the end, we have a loop of four subsequent runs of the script, without updating the context, so from the second time and on, the script continues after the “done 3″ instruction. This is possible, because the script jumps to the beginning upon resumption (the three last lines in the assembly code, see above).

Each call to sdma_run_channel() sets channel 1′s HE flag, making it do its thing and then trigger off an interrupt with the DONE instruction, which in turn wakes up the process telling it the script has finished. sdma_print_mem() merely makes a series of printk’s, consisting of hex dumps of data from the SDMA memory. As used, it’s aimed on the region which the script is expected to alter, but the same function can be used to verify that the script is indeed in its place, or look at the memory. The function goes

static int sdma_print_mem(struct sdma_engine *sdma, int start, int len)
{
 int i;
 u8 *buf;
 unsigned char line[128];
 int pos = 0;

 len = (len + 15) & 0xfff0;

 buf = kzalloc(len, GFP_KERNEL);

 if (!buf)
   return -ENOMEM;

 sdma_fetch_datamem(sdma, buf, len, start);

 for (i=0; i<len; i++) {
   if ((i % 16) == 0)
     pos = sprintf(line, "%04x ", i);

   pos += sprintf(&line[pos], "%02x ", buf[i]);

   if ((i % 16) == 15)
     printk(KERN_WARNING "%s\n", line);
 }

 kfree(buf);

 return 0;
}

and it uses this function (note that the instruction is C0_GETDM):

static int sdma_fetch_datamem(struct sdma_engine *sdma, void *buf,
                              int size, u32 address)
{
 struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
 void *buf_virt;
 dma_addr_t buf_phys;
 int ret;

 buf_virt = dma_alloc_coherent(NULL, size,
                               &buf_phys, GFP_KERNEL);
 if (!buf_virt)
   return -ENOMEM;

 bd0->mode.command = C0_GETDM;
 bd0->mode.count = size / 4;
 bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
 bd0->buffer_addr = buf_phys;
 bd0->ext_buffer_addr = address;

 ret = sdma_run_channel(&sdma->channel[0]);

 memcpy(buf, buf_virt, size);

 dma_free_coherent(NULL, size, buf_virt, buf_phys);

 return ret;
}

Dumping context

This is the poor man’s debugger, but it’s pretty useful. A “done 3″ function can be seen as a breakpoint, and the context dumped to the kernel log with this function:

static int sdma_print_context(struct sdma_engine *sdma, int channel)
{
 int i;
 struct sdma_context_data *context;
 u32 *reg;
 unsigned char line[128];
 int pos = 0;
 int start = 0x800 + (sizeof(*context) / 4) * channel;
 int len = sizeof(*context);
 const char *regnames[22] = { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
                              "mda", "msa", "ms", "md",
                              "pda", "psa", "ps", "pd",
                              "ca", "cs", "dda", "dsa", "ds", "dd" };

 context = kzalloc(len, GFP_KERNEL);

 if (!context)
   return -ENOMEM;

 sdma_fetch_datamem(sdma, context, len, start);

 printk(KERN_WARNING "pc=%04x rpc=%04x spc=%04x epc=%04x\n",
   context->channel_state.pc,
   context->channel_state.rpc,
   context->channel_state.spc,
   context->channel_state.epc
 );

 printk(KERN_WARNING "Flags: t=%d sf=%d df=%d lm=%d\n",
   context->channel_state.t,
   context->channel_state.sf,
   context->channel_state.df,
   context->channel_state.lm
 );       

 reg = &context->gReg[0];

 for (i=0; i<22; i++) {
   if ((i % 4) == 0)
     pos = 0;

   pos += sprintf(&line[pos], "%s=%08x ", regnames[i], *reg++);

   if (((i % 4) == 3) || (i == 21))
     printk(KERN_WARNING "%s\n", line);
 }

 kfree(context);

 return 0;
}

Clashes with Linux’ SDMA driver

Playing around with the SDMA subsystem directly is inherently problematic, since the assigned driver may take contradicting actions, possibly leading to a system lockup. Running custom scripts using the existing driver isn’t possible, since it has no support for that as of kernel 2.6.38. On the other hand, there’s a good chance that the SDMA driver wasn’t enabled at all when the kernel was compiled, in which case there is no chance for collisions.

The simplest way to verify if the SDMA driver is currently present in the kernel, is to check in /proc/interrupts whether interrupt #6 is taken (it’s the SDMA interrupt).

The “imx-sdma” pseudodevice is always registered on the platfrom pseudobus (I suppose that will remain in the transition to Open Firmware), no matter the configuration. It’s the driver which may not be present. The “i.MX SDMA support” kernel option (CONFIG_IMX_SDMA) may not be enabled (it can be a module). Note that it depends on the general “DMA Engine Support” (CONFIG_DMADEVICES), which may not be enabled to begin with.

Anyhow, for playing with the SDMA module, it’s actually better when these are not enabled. In the long run, maybe there’s a need to expand imx-sdma.c, so it supports custom SDMA scripting. The question remaining is to what extent it should manage the SDMA RAM. Well, the real question is if there’s enough community interest in custom SDMA scripting at all.

Posted Under: ARM,NXP (Freescale)
This post was written by eli on October 26, 2011 Comments (5)

Freescale i.MX51 SDMA tutorial (part III)

This is part III of a brief tutorial about the i.MX51′s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into part goes as follows:

Part I: Introduction, addressing and the memory map
Part II: Contexts, Channels, Scripts and their execution
Part III: Events and Interrupts (this page)
Part IV: Running custom SDMA scripts in Linux

Events

Even though an SDMA script can be kicked off (or made eligible for running, to be precise) by the application processor, regardless of any external events, there’s a lot of sense in letting the peripheral kick off the script(s) directly, so the application processor doesn’t have to be bothered with an interrupt every time.

So the system has 48 predefined SDMA events, listed in section 3.3 of the Reference Manual. Each of these events can turn one or several channels eligible for executing by automatically setting their EP flag. Which of the channels will have its EP flag set is determined by the SDMA event’s CHNENBL register. There are 48 such registers, one for each SMDA register, with each of its 32 bits corresponding to an SDMA channel: If bit i is set, the event linked with the register will set EP[i]. Note that these registers have unknown values on powerup, so if event driven SDMA is enabled, all registers must be initialized, or hell breaks loose.

In a normal flow, EP[i] is zero when an event is about to set this flag: If it was set by a previous event, the respective SDMA script should have finished, and hence cleared the flag before the next event occurred. Since attempting to set EP[i] when it’s already set may indicate that the event came too early (or the script is too late), there’s an CHNERR[i] flag, which latches such errors, so that the application processor can make itself informed about such a condition. This can also trigger an interrupt, if the respective bit in INTRMASK is set. The application processor can read these flags (and reset them at the same time) in the EVTERR register.

I’d like to draw special attention to events #14 and #15, which are driven by external pins, namely GPIO1_4 and GPIO1_5. These two make it possible for an external chip (e.g. an FPGA) request service without involving the application processor. A rising edge on these lines creates an event when the IOMUX is set to ALT1 (SDMA_EXT_EVENT) on the relevant pins. Note that setting the IOMUX to just GPIO won’t do it.

It’s important to note, that the combination of the EP[i] flag being cleared by the script itself with the edge-triggered nature of the event signal creates an inevitable risk for a race condition: There is no rigorous way for the script to make sure that a “DONE 4″ instruction, which was intended to clear a previous event won’t clear one that just arrived to create another. The CHNERR[i] flag will indicate that the event arrived before the previous one was cleared, but in some implementations, that can actually be a legal condition. This can be solved by emulating a level-triggered event with a constantly toggling event line, when the external hardware wants servicing. This will make CHNERR[i] go high for sure, but otherwise it’s fine.

This possible race condition is not a design bug of the SDMA subsystem. Rather, it was designed with SDMA script which finish faster than the next event in mind. The “I need service” kind of design was not considered.

Interrupts

By executing a “DONE 3″ command, the SDMA scripts can generate interrupts on the application processor by setting the HI[i] flag, where i is the channel number of the currently running script. This will assert interrupt #6 on the application processor, which handles it like any other interrupt.

The H[i] flags can be read by the application processor in the INTR register (see section 52.12.3.2 in the Reference Manual). An interrupt handler should scan this register to determine which channel requests an interrupt. There is no masking mechanism for individual H[i]‘s. The global interrupt #6 can be disabled, but an individual channel can’t be masked from generating interrupts.

If any of the INTRMASK bits is set, the EVTERR register should also be scanned, or at least cleared, since CHNERR[i] conditions generate interrupts which are indistinguishable from H[i] interrupts.

“DONE 3″, which is the only instruction available for setting HI[i] also clears HE[i], so it was clearly designed to work with scripts kicked off directly by the application processor. In order to issue an interrupt from a script, which is kicked off by an event, a little trick can be used: According to section 52.21.2 in the Reference Manual (the detail for the DONE instruction), “DONE 3″ means “clear HE, set HI for the current channel and reschedule”. In other words, make the current channel ineligible of execution unless HO[i] is set, and set HI[i] so an interrupt is issued. But event-driven channels do have HO[i] set, so clearing HE[i] has no significance whatsoever. According to table 52-4, the context will be saved, and then restored immediately. So there will be a slight waste of time with context writes and reads, but since the most likely instruction following this “DONE 3″ is a “DONE 4″ (that is, clear EP[i], the event-driven script has finished), the impact is rather minimal. Anyhow, I still haven’t tried this for real, but I will soon.

So much for part III. You may want to go on with Part IV: Running custom SDMA scripts in Linux

Posted Under: ARM,NXP (Freescale)
This post was written by eli on October 26, 2011 Comments (0)

Freescale i.MX51 SDMA tutorial (part II)

This is part II of a brief tutorial about the i.MX51′s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into part goes as follows:

Part I: Introduction, addressing and the memory map
Part II: Contexts, Channels, Scripts and their execution (this page)
Part III: Events and Interrupts
Part IV: Running custom SDMA scripts in Linux

Contexts and channels

The SDMA’s purpose is to service requests from hardware or from the application processor. In a way, it’s like a processor with no idle task, just interrupts. But the way the service is performed is different from interrupt handling.

Let’s assume that all scripts (those SDMA programs) are already present in the SDMA’s memory space. They may reside in the on-chip ROM or they’ve been loaded into RAM. How are they executed?

The answer lies in the contexts: Some of the SDMA’s RAM space is allocated for containing an array of structures. There are 32 such structures, each occupying 128 bytes (or 32 32-bit words), so all in all this block takes up 4 kB of memory (there’s a 96-byte variant as well, but we’ll leave it for now).

These structures do what their name implies: They contain the context of a certain execution thread. In other words, they contain everything that needs to be stored to resume execution at some point, as if it was never stopped. Since the SDMA core doesn’t have a stack, this information has to go to a fixed place. This includes the program counter, the registers and flags. Section 52.13.4 in the Reference Manual describes this structure in detail.

As mentioned, there’s an array of 32 of these structures. It means that the SDMA subsystem can maintain 32 contexts, or if you like, resemble a multitasking system with 32 independent threads. Or in SDMA terms: The SDMA core supports 32 DMA channels. This kinda connects with the common concept of DMA channels: Each channel has a certain purpose and particular flow.

The method to kick off a channel, so it will execute a certain script, is to write directly to the channel’s context structure, and then set up some flags to make it runnable. This is demonstrated in part IV. Since the context includes the program counter register, this controls where the execution starts. Other registers can be used to pass information to the script (that is, the SDMA “program”). What each register means upon such an invocation is up to the script’s API.

A script’s life cycle (scheduling)

So there are 32 context, each corresponding to 32 channels. What makes a context load into the registers, making its channel’s script execute? It’s time to talk about the scheduler. It’s described in painstaking detail in the Reference Manual, so let’s stick to the main points.

The scheduler’s main function is to decide which channel is the most eligible to spend time on the processor core. This decision is relevant only when the SDMA core isn’t running anything at all (a.k.a. “sleeping”) or when the currently running script voluntarily yields the processor. The SDMA core’s execution is non-preemptive, so the scheduler can’t force any script to stop running. In other words, if any script is (mistakenly) caught in an infinite loop, all DMA activity is as good as dead, most possibly leading to a complete system hangup. Nothing can force a script to stop running (expect for a reset or the debugger). Just a small thing to bear in mind when writing those scripts.

The SDMA core has a special instruction for yielding the processor, with the mnemonic “done”, which takes a parameter for choosing its variant. Two variants of this instructions have earned their own mnemonics, “yield” and “yieldge”. While “done” variant #3 (usually called just “done”) always yields the processor, the two others yield it if there are other channels ready for executing with higher priority (or higher-or-equal priority for “yieldge”). But never mind the details. The overall picture is that the script runs until it issues a command saying “you must stop me now” (as in “done”) or “you may stop me now” (as in the two other variants).

Yielding only means that the registers are stored back into the context structure (with optimizations to speed this process up) and that another context may be loaded instead of it. Depending on which variant of “done” was used, plus some other factors, the scheduler may or may not reschedule the same channel automatically at a later time. That is, the context may be reloaded into the registers. So unless designed otherwise, the opcode directly after the “done” instructions will be executed at some later time. Hence a carefully written script never “ends”, it just gives up the processor until the next time the relevant channel is scheduled.

Channel eligibility

Now let’s look at what makes a channel eligible for execution. Leaving priority issues aside, let’s ask what makes a certain channel a candidate for having its context pushed into the SDMA core.

In some cases, the setup is that the channel becomes eligible for execution without any other condition. This is the case for offload memory copy, for example. In other cases, the channel’s eligibility depends on some hardware event, typically some peripheral requesting service. The latter scenario resembles old-school interrupt handlers, only the interrupt isn’t serviced by the application processor, but wakes up a service thread (channel) in the SDMA core. And exactly as waking up a thread in a modern operating system doesn’t cause immediate execution, but rather sets some flag to make the thread eligible for getting a processor time slice, so does the SDMA channel wakeup work: It’s just a flag telling the scheduler to push the channel’s context into the SDMA’s core when it sees fit.

The Reference Manual sums this up in section 52.4.3.5, saying the channel i is eligible to run if and only if the following expression is logical ’1′:

(HE[i] or HO[i]) and (EP[i] or EO[i])

where HE[i], HO[i], EP[i], and EO[i] are flags belonging to the i’th channel. Let’s take them one by one:

HE[i] stands for “Host Enable”, and is set and reset by the application processor by writing to registers. It’s also cleared by the “done” instruction, so it’s suitable for a scenario where the host kicks off a channel, and the script quits it.
EP[i] stands for “External Peripheral”, and is set when an external peripheral wants service (more about that mechanism later on). It’s cleared by one of the “done” variants, so this is the flag used when a peripheral kicks off a channel, and the script quits.
HO[i] stands for “Host override”, and is controlled solely by a register written to by the application processor. Its purpose is to make the left hand of the expression always true, when we want the channel’s eligibility be controlled by the peripheral only.
EO[i] stands for “External override”, and is like HO[i] in the way it’s handled. This flag is set when we want the channel’s eligibility controlled by the host only.

There are four registers in the application processor’s memory space, which are used to alter these flags: STOP_STAT, HSTART, EVTOVR and HOSTOVR. They are outlined in sections 52.12.3.3-52.12.3.7 in the Reference Manual.

The full truth is that there’s also a DO[i] flag mentioned (controlled by the DSPOVR register), but it must be held ’1′ on i.MX51 devices, so let’s ignore it.

So if our case is the application processor controlling the i’th SDMA channel for offload operation, it sets EO[i], clears HO[i], and then sets HE[i] whenever it wants to have the script running. The script may clear HE[i] with a “done” instruction, or the application processor may clear it when appropriate. For example, the script can trigger an interrupt on the application processor, which clears the flag (even though I can’t see when this would be right way to do it).

In the case of channels being started by a peripheral, the application processor sets HO[i] and clears EO[i]. Certain events (as discussed next) set the EP[i] flag directly, and the script’s “done” instruction clears it.

Keep in mind that the script may not run continuously: It should execute “yield” instructions every now and then to give other channels a chance to use the SDMA core, but since neither HE[i] nor EP[i] are affected by yields, the script will keep running until it’s, well, done.

There is a possibility to reset the SDMA core or force a reschedule with the SDMA’s RESET register, but that’s really something for emergencies (e.g. a runaway script).

So much for part II. You may want to go on with Part III: Events and Interrupts

Posted Under: ARM,NXP (Freescale)
This post was written by eli on October 25, 2011 Comments (0)

Freescale i.MX SDMA tutorial (part I)

This is part I of a brief tutorial about the i.MX51′s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

Freescale’s Linux drivers for DMA also vary significantly across different kernel releases. It looks like they had two competing sets of code, and couldn’t make up their minds which one to publish.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into part goes as follows:

Part I: Introduction, addressing and the memory map (this page)
Part II: Contexts, Channels, Scripts and their execution
Part III: Events and Interrupts
Part IV: Running custom SDMA scripts in Linux

NOTE: For more information, in particular on SDMA for i.MX6 and i.MX7, there’s a follow-up post written by Jonah Petri.

Introduction

Behind all the nice words, the SDMA subsystem is just a small and simple RISC processor core, with its private memory space and some specialized functional units. It works side-by-side with the main ARM processor (the application processor henceforth), and pretty much detached from it. Special registers allow the application processor to control the SDMA’s core, and special commands on the SDMA’s core allow it to access the application processor’s memory space and send it interrupts. But in their natural flow, each of these two don’t interact.

The underlying idea behind the SDMA core is that instead of hardwiring the DMA subsystem’s capabilities and possible behaviors, why not write small programs (scripts henceforth), which perform the necessary memory operations? By doing so, the possible DMA operations and variants are not predefined by the chip’s vendor; the classic DMA operations are still possible and available with vendor-supplied scripts, but the DMA subsystem can be literally programmed to do a lot of other things. Offload RAID xoring is an example of something than can be taken off the main processor, as the data is being copied from disk buffers to the peripherals with DMA.

Scripts are kicked off either by some internal event (say, some peripheral has data to offer) or directly by the main processor’s software (e.g. an offload memcpy). The SDMA processor’s instruction set is simple, all opcodes occupying exactly 16 bits in program memory. Its assembler can be acquired from Freescale, or you can download my mini-assembler, which is suitable for small projects (in part IV).

Chapter 52 in the Reference Manual is dedicated to the SDMA, but unfortunately it’s not easy reading. In the hope to clarify a few things, I’ve written down the basics. Please keep in mind that the purpose of my own project was to perform memory-to-memory transfers triggered autonomously by an external device, so I’ve given very little attention to the built-in scripts and handling DMA from built-in peripherals.

Quirky memory issues

I wouldn’t usually start the presentation of a processor with its memory map and addressing, but in this case it’s necessary, as it’s a major source of confusion.

The SDMA core processor has its own memory space, which is completely detached from the application processor’s. There are two modes of access to the memory space: Instruction mode and data mode.

Instruction mode is used in the context of jumps, branches and when calling built-in subroutines which were written with program memory in mind. In this mode, the address points at a 16-bit word (which matches the size of an opcode), so the program counter is incremented (by one) between each instruction (except for jumps, of course).

Data mode is used when reading from the SDMA’s memory (e.g. loading registers) or writing to it. This should not be confused with the application processor’s memory (the one Linux sees, for example), which is not directly accessible by the SDMA core. In data mode, addressing works on 32-bit words, so incrementing the data mode address (by one) means moving forward four bytes.

Instruction mode and data mode addressing points at exactly the same physical memory space. It’s possible to write data to RAM in data mode, and then execute it as a script, the latter essentially reading from RAM in instruction mode. It’s important to note, that different addresses will be used for each. This is best explained with a simple example:

Suppose that we want to run a routine (script) written by ourselves. To do so, it has to be copied into the internal RAM first. How to do that is explained in part IV, but let’s assume that we want to execute our script with a JMP instruction to 0x1800. This is 12 kB from the zero-address of the memory map, since the 0x1800 address is given in 16-bit quanta (2 bytes per address count). After the script is loaded in its correct place, we’ll be able to read the first instruction (as a piece as data) as follows: Set one of the SDMA’s processor’s registers to the value 0x0c00, and then load from the address pointed by that register. The address, 0x0c00, is given in 32-bit quanta (4 bytes per address count), so it hits exactly the same place: 12 kB from zero-address. And since we’re reading 32 bits, we’ll read the first instruction as well as the second at the same time.

Let’s say it loud and clear:

Instruction mode addresses are always double their data mode equivalents.

As for endianess, the SDMA core thinks Big Endian all the way through. That means, that when reading two assembly opcodes from memory in data mode, we get a 32-bit word, for which the first instruction is on bits [31:16] and the instruction following it on bits [15:0].

The memory map

Since we’re at it, and since the Reference Manual has this information spread all over, here’s a short outline of what’s mapped where, in data addresses.

0x0000-0x03ff: 4 kB of internal ROM with boot code and standard routines
0x0400-0x07ff: 4 kB of reserved space. No access at all should take place here
0x0800-0x0bff: 4 kB of internal RAM, containing the 32 channels’ contexts (each context is 32 words of 4 bytes each, when SMSZ is set in the CHN0ADDR register). More about this in part II. For the details, see Section 52.13.4 in the Reference Manual. When SMSZ is clear, this segment is 3 kB only (see 52.4.4).
0x0c00-0x0fff: 4 kB of internal RAM, free for end-user application scripts and data.
0x1000-0x6fff: Peripherals 1-6 memory space
0x7000-0x7fff: SDMA registers, as accessed directly by the SDMA core (as detailed in section 52.14 of the reference manual)
0x8000-0xffff: Peripherals 7-14 memory space (not accessible in program memory space)

The two regions of peripherals memory space is the preferred way to access peripherals (unlike the implementation in Linux drivers using SDMA script) as discussed in another post of mine.

And once again: The memory map above is given in data addresses. The memory map in program memory space is the same, only all addresses are double.

So much for part I. You may want to go on with Part II: Contexts, Channels, Scripts and their execution

Posted Under: ARM,NXP (Freescale)
This post was written by eli on October 25, 2011 Comments (1)

WordPress: Displaying C-like hexadecimal prefix “0x” correctly

This is a small, but annoying thing about WordPress. They obviously didn’t consider the “0x” hexadecimal notation. What they did consider, was that if someone says “2x2″ that surely means “2 times 2″, so why not making that “x” in the middle fancy? Well, maybe because that makes “0x123″, which is a hexadecimal number, look weird.

The fix is in the core PHP files of WordPress, so this probably needs to be fixed every time WordPress is updated.

In wp-includes/formatting.php, in the wptexturize() function, probably around lines 50-51 there’s something like:

$dynamic_characters = array('/\'(\d\d(?:&#8217;|\')?s)/', '/(\s|\A|")\'/', '/(\d+)"/', '/(\d+)\'/', '/(\S)\'([^\'\s])/', '/(\s|\A)"(?!\s)/', '/"(\s|\S|\Z)/', '/\'([\s.]|\Z)/', '/\b(\d+)x(\d+)\b/');
$dynamic_replacements = array('&#8217;$1','$1&#8216;', '$1&#8243;', '$1&#8242;', '$1&#8217;$2', '$1&#8220;$2', '&#8221;$1', '&#8217;$1', '$1&#215;$2')

The last entry in both arrays replaces two numbers with an “x” between them with a fancy “times” symbol (Unicode character #215). So just remove those two entries from both arrays (marked in red above). Remove the commas as well, of course.

Maybe the 100% correct way to fix this, is to use a better regular expression, instead of ‘/\b(\d+)x(\d+)\b/’. I’m not sure about regular expressions in PHP, but in Perl I would try ‘/\b([1-9]\d*)x(\d+)\b/’, so it wouldn’t match the “0x” notation. It wouldn’t match “02 x 2 = 4″ or any other number prefixed with zeros, but this is not something normal people write anyhow.

Posted Under: Internet,Server admin,Software
This post was written by eli on October 24, 2011 Comments (0)

When request_irq() fails with -EINVAL

It may help investigating the interrupt descriptors. For a 2.6.38 kernel, putting this in a kernel module load supplies some information (includes, declarations and code mixed below. Organize properly in your own module)

#include <linux/irq.h>
#include <linux/interrupt.h>
#include <asm/irq.h>

int i;
struct irq_desc *desc;

for_each_irq_desc(i, desc) {
 if (!desc)
   continue;
 printk(KERN_INFO "%d: status=%08x, chip=%08x, handle_irq=%08x\n",
        i, (u32) desc->status, (u32) desc->chip, (u32) desc->handle_irq );
 }

This dumps some meta information about all possible IRQs on the system. Also be sure to look at /proc/interrupts.

Have a look in include/linux/irq.h for the meaning of the flags in desc->status and possibly include/linux/irqdesc.h for the irq_desc structure.

request_irq() may very well fail because the IRQ_NOREQUEST flag was set in status. On ARM architecture, this can be fixed by calling set_irq_flags(irq, IRQF_VALID) assuming that you have a fairly good idea of what you’re doing.

Note that set_irq_chip_and_handler() is usually called before validating an IRQ, so that Linux knows what to do with the interrupt as it happens. Looking at chip and handle_irq in the dump may give a clue about how necessary this is. Searching for the value of handle_irq in /proc/kallsyms (with a simple grep) tells who handles each interrupt.

The “chip” structure is a container for information and methods specific to the interrupt’s owner. In old days, these belonged to peripheral chips, but a “chip” is many times just a group of interrupts having a common way of handling them (setting trigger type, masking etc.).

A final note: It looks like the API is changing vividly in this area, so don’t expect things to be exactly the same on other kernels.

Posted Under: ARM,Linux kernel
This post was written by eli on October 17, 2011 Comments (0)

Armadeus APF51 / Freescale i.MX51: A kit for reverse engineering the EIM bus

What we have here

As one can guess from my notes about the i.MX51′s external bus and the oscilloscope shots I’ve published, I made myself a small dissection kit for watching the bus’ lines activity with a digital oscilloscope.

This is a good time to mention, that the kit was done quickly and dirty, so the code below should not be taken as an example of proper coding for FPGA nor the Linux kernel. Seriously, this is just lab code.

Anyhow, this little kit consists of two parts

Verilog code and UCF for programming the FPGA. Except for blinking the LED (at 1 Hz), it also wires all EIM-bus related signals to the FPGA pin headers on the development board, so they can be sampled easily with oscilloscope’s probes. You can download the bitfile directly if your board has the LX9 FPGA, or implement it from the sources below.
A kernel module, which performs a single bus operation when it’s loaded. It’s explained further below. If you happen to be running on a 2.6.38.1 Linux kernel on your board (in particular the 2.6.38.1 which comes preloaded on the board), you may try using the precompiled kernel module. Or do it the “right way” and compile the module from the sources below.

The Verilog code below pretty much explains itself. And as the comments in the UCF say, the “debug_pins_outer” pin vector runs from pin #38 downwards continuously, on even pins only, on the outer FPGA pin header. This may sound complicated, but it simply means that out of the two rows of this pin header, only the row reached easily with a probe is used. And since pin #40 (in the corner) isn’t attached to the FPGA, debug_outer_pins[0] is connected to pin #38, debug_outer_pins[1] to #36 and so on.

As for the “debug_pin_inner” it goes more or less the same. Going from pin #3 for debug_inner_pins[0] and up on odd pin numbers, only the inner pin row of the inner pin header is used for easy physical access.

This may look like a weird choice of pin assignments, but this was the only way to get the vectors assigned on the pin headers without any gaps between them, so it’s easy to reach any signal in the vectors just by counting pins on the pin header.

Please make sure that the two “FPGA bank” jumpers are installed on your board, or nothing will appear on the pin headers. These jumpers were installed on the board as I got it, so just check it’s OK.

It’s also worth to note that debug_pins_outer[4] happens to be connected to a pin which is shared with a pushbutton on the board. Since the line is pulled up with a 10 kOhm resistor, this line may have some timing skew.

Simple use

Assuming that both the bitfile and the kernel module are in the currect directory, first load the FPGA if you haven’t done so already:

# load_fpga armaled.bit

A green LED should start blinking as a result of this. Note that according to Armadeus’ wiki page on the FPGA loader, armaled.bit should not be on the on-board flash. Copy it to /tmp first (which is on RAM) or load it from an net drive (e.g. NFS) like I did.

And then, to kick off a bus cycle, load the module and catch it on the oscilloscope:

# insmod eimtest.ko

And then unload the module, so you can load it again for the next try:

# rmmod eimtest

The relevant bus parameters can be set directly when loading the module. For example, to add an extra bus wait state, disable continuous bus clock, run at 1/4 bus rate and use bus address OxABC0, go:

# insmod eimtest.ko WSC=2 BCD=3 BCM=0 addr=0xabc0

A list of kernel module parameters, which in turn changes the bus parameters, is found in the kernel module’s source. Anything declared with “module_param” can be set. The defaults are given in the variable declarations. Setting the address and data is also possible, but be sure not to exceed the address 0xFFFC, or you’ll get a kernel oops. Also note that addresses not aligned to 32-bit words will produce several bus cycles.

The Verilog code

Note that the direct wire connections have a variable delay. This results in some unknown skew (1-2ns, I suppose) between the outputs.

module armaled
 (
 input ext_clk,
 output reg led,
 output irq,

 input [15:0] imx51_da,

 input imx51_cs1,
 input imx51_cs2,
 input imx51_adv,
 input imx51_we,
 input imx51_eb0,
 input imx51_eb1,
 input imx51_oe,
 input imx51_dtack,
 input imx51_wait,
 input imx51_bclk,
 input imx51_clko,

 output [13:0] debug_pins_inner,
 output [12:0] debug_pins_outer
 );

 reg [27:0] counter;

 assign     irq = 0;

 assign     debug_pins_outer[0] = imx51_bclk;
 assign     debug_pins_outer[1] = imx51_clko;
 assign     debug_pins_outer[2] = imx51_oe;
 assign     debug_pins_outer[3] = imx51_cs1;
 assign     debug_pins_outer[4] = imx51_cs2;
 assign     debug_pins_outer[5] = imx51_adv;
 assign     debug_pins_outer[6] = imx51_we;
 assign     debug_pins_outer[7] = imx51_eb0;
 assign     debug_pins_outer[8] = imx51_eb1;
 assign     debug_pins_outer[9] = imx51_dtack;
 assign     debug_pins_outer[10] = imx51_wait;
 assign     debug_pins_outer[12:11] = imx51_da[15:14];

 assign     debug_pins_inner = imx51_da[13:0];

 always @(posedge ext_clk)
 begin
 if (counter >= 47500000)
 begin
 led <= !led;
 counter <= 0;        
 end
 else
 counter <= counter + 1;
 end

endmodule

The UCF file

NET "ext_clk" TNM_NET = "TN_ext_clk";
TIMESPEC "TS_ext_clk" = PERIOD "TN_ext_clk" 10.4 ns HIGH 50 %;

NET "led" LOC="G14" | IOSTANDARD=LVCMOS33;# IO_L41P_GCLK9_IRDY1_M1RASN_1
#NET "button" LOC="G15" | IOSTANDARD=LVCMOS33;# IO_L41N_GCLK8_M1CASN_1
NET "ext_clk" LOC="N8" | IOSTANDARD=LVCMOS33;# = BCLK, IO_L29P_GCLK3_2
NET "irq" LOC="P3" | IOSTANDARD=LVCMOS33;# FPGA_INITB

# Debug pins.

# The "inner" set starts from pin #3, running on odd pins only (effectively
# covering the pins convenient to attach a scope's probe to)

NET "debug_pins_inner[0]" LOC="L2" | IOSTANDARD=LVCMOS33;# IO_L39P_M3LDQS_3
NET "debug_pins_inner[1]" LOC="J2" | IOSTANDARD=LVCMOS33;# IO_L41P_GCLK27_M3DQ4_3
NET "debug_pins_inner[2]" LOC="K4" | IOSTANDARD=LVCMOS33;# IO_L43P_GCLK23_M3RASN_3
NET "debug_pins_inner[3]" LOC="K5" | IOSTANDARD=LVCMOS33;# IO_L45P_M3A3_3
NET "debug_pins_inner[4]" LOC="C2" | IOSTANDARD=LVCMOS33;# IO_L83P_3
NET "debug_pins_inner[5]" LOC="D4" | IOSTANDARD=LVCMOS33;# IO_L53P_M3CKE_3
NET "debug_pins_inner[6]" LOC="K3" | IOSTANDARD=LVCMOS33;# IO_L40P_M3DQ6_3
NET "debug_pins_inner[7]" LOC="H3" | IOSTANDARD=LVCMOS33;# IO_L42P_GCLK25_TRDY2_M3UDM_3
NET "debug_pins_inner[8]" LOC="G2" | IOSTANDARD=LVCMOS33;# IO_L44P_GCLK21_M3A5_3
NET "debug_pins_inner[9]" LOC="F3" | IOSTANDARD=LVCMOS33;# IO_L46P_M3CLK_3
NET "debug_pins_inner[10]" LOC="D3" | IOSTANDARD=LVCMOS33;# IO_L54P_M3RESET_3
NET "debug_pins_inner[11]" LOC="E2" | IOSTANDARD=LVCMOS33;# IO_L52P_M3A8_3
NET "debug_pins_inner[12]" LOC="K13" | IOSTANDARD=LVCMOS33;# IO_L44P_A3_M1DQ6_1
NET "debug_pins_inner[13]" LOC="H13" | IOSTANDARD=LVCMOS33;# IO_L42P_GCLK7_M1UDM_1

# The "outer" set starts from pin #38, running on even pins only (effectively
# covering the pins convenient to attach a scope's probe to). Note that the
# vectors runs from high board pin number to low.

NET "debug_pins_outer[0]" LOC="B15" | IOSTANDARD=LVCMOS33;# IO_L1N_A24_VREF_1
NET "debug_pins_outer[1]" LOC="C15" | IOSTANDARD=LVCMOS33;# IO_L33N_A14_M1A4_1
NET "debug_pins_outer[2]" LOC="D15" | IOSTANDARD=LVCMOS33;# IO_L35N_A10_M1A2_1
NET "debug_pins_outer[3]" LOC="E15" | IOSTANDARD=LVCMOS33;# IO_L37N_A6_M1A1_1
NET "debug_pins_outer[4]" LOC="G15" | IOSTANDARD=LVCMOS33;# IO_L41N_GCLK8_M1CASN_1
NET "debug_pins_outer[5]" LOC="J15" | IOSTANDARD=LVCMOS33;# IO_L43N_GCLK4_M1DQ5_1
NET "debug_pins_outer[6]" LOC="L15" | IOSTANDARD=LVCMOS33;# IO_L45N_A0_M1LDQSN_1
NET "debug_pins_outer[7]" LOC="G12" | IOSTANDARD=LVCMOS33;# IO_L30N_A20_M1A11_1
NET "debug_pins_outer[8]" LOC="F12" | IOSTANDARD=LVCMOS33;# IO_L31N_A18_M1A12_1
NET "debug_pins_outer[9]" LOC="H11" | IOSTANDARD=LVCMOS33;# IO_L32N_A16_M1A9_1
NET "debug_pins_outer[10]" LOC="G13" | IOSTANDARD=LVCMOS33;# IO_L34N_A12_M1BA2_1
NET "debug_pins_outer[11]" LOC="J13" | IOSTANDARD=LVCMOS33;# IO_L36N_A8_M1BA1_1
NET "debug_pins_outer[12]" LOC="K11" | IOSTANDARD=LVCMOS33;# IO_L38N_A4_M1CLKN_1

# i.MX51 related pins

NET "imx51_cs1" LOC="R11" | IOSTANDARD=LVCMOS33;# EIM_CS1
NET "imx51_cs2" LOC="N9" | IOSTANDARD=LVCMOS33;# EIM_CS2
NET "imx51_adv" LOC="R9" | IOSTANDARD=LVCMOS33;# EIM_LBA
NET "imx51_we" LOC="R6" | IOSTANDARD=LVCMOS33;# EIM_RW
NET "imx51_eb0" LOC="P7" | IOSTANDARD=LVCMOS33;
NET "imx51_eb1" LOC="P13" | IOSTANDARD=LVCMOS33;
NET "imx51_oe" LOC="R7" | IOSTANDARD=LVCMOS33;
NET "imx51_dtack" LOC="N4" | IOSTANDARD=LVCMOS33;
NET "imx51_wait" LOC="R4" | IOSTANDARD=LVCMOS33;
NET "imx51_bclk" LOC="N12" | IOSTANDARD=LVCMOS33; # Hardwired to N8
NET "imx51_clko" LOC="N7" | IOSTANDARD=LVCMOS33;

NET "imx51_da[7]" LOC="P11" | IOSTANDARD=LVCMOS33;# EIM_DA7
NET "imx51_da[6]" LOC="M11" | IOSTANDARD=LVCMOS33;# EIM_DA6
NET "imx51_da[5]" LOC="N11" | IOSTANDARD=LVCMOS33;# EIM_DA5
NET "imx51_da[13]" LOC="R10" | IOSTANDARD=LVCMOS33;# EIM_DA13
NET "imx51_da[12]" LOC="L9" | IOSTANDARD=LVCMOS33;# EIM_DA12
NET "imx51_da[11]" LOC="M10" | IOSTANDARD=LVCMOS33;# EIM_DA11
NET "imx51_da[10]" LOC="M8" | IOSTANDARD=LVCMOS33;# EIM_DA10
NET "imx51_da[9]" LOC="K8" | IOSTANDARD=LVCMOS33;# EIM_DA9
NET "imx51_da[8]" LOC="L8" | IOSTANDARD=LVCMOS33;# EIM_DA8
NET "imx51_da[0]" LOC="N6" | IOSTANDARD=LVCMOS33;# EIM_DA0
NET "imx51_da[4]" LOC="P5" | IOSTANDARD=LVCMOS33;# EIM_DA4
NET "imx51_da[3]" LOC="R5" | IOSTANDARD=LVCMOS33;# EIM_DA3
NET "imx51_da[2]" LOC="L6" | IOSTANDARD=LVCMOS33;# EIM_DA2
NET "imx51_da[1]" LOC="L5" | IOSTANDARD=LVCMOS33;# EIM_DA1
NET "imx51_da[15]" LOC="M5" | IOSTANDARD=LVCMOS33;# EIM_DA15
NET "imx51_da[14]" LOC="N5" | IOSTANDARD=LVCMOS33;# EIM_DA14

The kernel module

It currently reads one word from the bus. A write operation is obtained by commenting and uncommenting in the region marked in red.

#include <linux/version.h>
#include <linux/platform_device.h>
#include <linux/delay.h>
#include <linux/gpio.h>
#include <linux/io.h>
#include <asm/io.h>
#include <mach/iomux-mx51.h>
#include <mach/fpga.h>
#include <mach/hardware.h>

MODULE_DESCRIPTION("EIM interface test module");
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Eli Billauer");

#define EIMTEST ""

static int PSZ = 0;
static int AUS = 1;
static int BCS = 0;
static int BCD = 0;
static int BL = 0;
static int FL = 1; // Cover RFL and WFL alike
static int WC = 0;
static int ADH = 0;
static int WSC = 1;
static int ADVA = 0; // RADVA and WADVA
static int ADVN = 0; // RADVN and WADVN
static int OEA = 0;
static int CSA = 0; // RCSA and WCSA
static int RL = 0;
static int BEA = 0;
static int BE = 1;
static int WEA = 0;
static int INTPOL = 1; // Interrupt polarity
static int INTEN = 0; // Interrupt enable
static int GBCD = 0; // Burst clock divisor
static int BCM = 1; // Burst clock mode (set continuous here)
static int addr = 0x00001234;
static int data = 0xFFFF5555;

module_param(PSZ, int, 0);
module_param(AUS, int, 0);
module_param(BCS, int, 0);
module_param(BCD, int, 0);
module_param(BL, int, 0);
module_param(FL, int, 0);
module_param(WC, int, 0);
module_param(ADH, int, 0);
module_param(WSC, int, 0);
module_param(ADVA, int, 0);
module_param(ADVN, int, 0);
module_param(OEA, int, 0);
module_param(CSA, int, 0);
module_param(RL, int, 0);
module_param(BEA, int, 0);
module_param(BE, int, 0);
module_param(WEA, int, 0);
module_param(INTPOL, int, 0);
module_param(INTEN, int, 0);
module_param(GBCD, int, 0);
module_param(BCM, int, 0);
module_param(data, int, 0);
module_param(addr, int, 0);

static u32 readreg(int offset) {
 return __raw_readl( MX51_IO_ADDRESS(MX51_WEIM_BASE_ADDR) + offset);
}

static void writereg(int offset, u32 val) {
 __raw_writel(val, MX51_IO_ADDRESS(MX51_WEIM_BASE_ADDR) + offset);
}

static u32 bitfield(int shift, int bits, int val) {
 return ((val & ( ( 1 << bits ) - 1 ) ) << shift);
}

static void eimtest_cleanup_module(void) {

}

static int eimtest_init_module(void)
{
 int result = 0;
 void __iomem *cs2_base;
 u32 GCR1, GCR2, RCR1, RCR2, WCR1, WEIMCR;

 iomux_v3_cfg_t iomux_cs2 = MX51_PAD_EIM_CS2__EIM_CS2;

 mxc_iomux_v3_setup_pad(iomux_cs2);

 GCR1 = 0x0111008f |
 bitfield(28, 4, PSZ) |
 bitfield(23, 1, AUS) |
 bitfield(14, 2, BCS) |
 bitfield(12, 2, BCD) |
 bitfield(11, 1, WC) |
 bitfield(8, 3, BL) |
 bitfield(5, 1, FL) |
 bitfield(4, 1, FL);

 GCR2 = bitfield(0, 2, ADH);

 RCR1 =
 bitfield(24, 6, WSC) |
 bitfield(20, 3, ADVA) |
 bitfield(16, 3, ADVN) |
 bitfield(12, 3, OEA) |
 bitfield(4, 3, CSA);

 RCR2 =
 bitfield(8, 2, RL) |
 bitfield(4, 3, BEA) |
 bitfield(3, 1, BE);

 WCR1 =
 bitfield(30, 1, !BE) |
 bitfield(24, 6, WSC) |
 bitfield(21, 3, ADVA) |
 bitfield(18, 3, ADVN) |
 bitfield(15, 3, BEA) |
 bitfield(9, 3, WEA) |
 bitfield(3, 3, CSA);

 WEIMCR =
 bitfield(5, 1, INTPOL) |
 bitfield(4, 1, INTEN) |
 bitfield(1, 2, GBCD) |
 bitfield(0, 1, BCM);

 writereg(0x30, GCR1);
 writereg(0x34, GCR2);
 writereg(0x38, RCR1);
 writereg(0x3c, RCR2);
 writereg(0x40, WCR1);
 writereg(0x90, WEIMCR);

 printk(KERN_WARNING EIMTEST "CS2GCR1=%08x, CS2GCR2=%08x\n",
 readreg(0x30),
 readreg(0x34)
 );
 printk(KERN_WARNING EIMTEST "CS2RCR1=%08x, CS2RCR2=%08x\n",
 readreg(0x38),
 readreg(0x3c)
 );
 printk(KERN_WARNING EIMTEST "CS2WCR1=%08x, CS2WCR2=%08x\n",
 readreg(0x40),
 readreg(0x44)
 );

 printk(KERN_WARNING EIMTEST "WEIM Config register WCR=%08x\n",
 readreg(0x90));

 printk(KERN_WARNING EIMTEST "WEIM IP Access register WIAR=%08x\n",
 readreg(0x94));

 printk(KERN_WARNING EIMTEST "CCM_CBCDR=%08x\n",
 __raw_readl(MX51_IO_ADDRESS(0x73fd4014)));

 cs2_base = ioremap(MX51_CS2_BASE_ADDR, SZ_64K);

 if (!cs2_base) {
 printk(KERN_WARNING EIMTEST "Failed to obtain I/O space\n");
 return -ENODEV;
 }

 // Uncomment as necessary:

 //__raw_writel(data, cs2_base + addr);
 printk(KERN_WARNING EIMTEST "Read data=%08x\n",
 __raw_readl(cs2_base + addr));

 iounmap(cs2_base);

 return result;
}

module_init(eimtest_init_module);
module_exit(eimtest_cleanup_module);

The Makefile

This is a more-or-less standard Makefile for compiling a kernel. Please note that /path/to must be changed (twice) to where your Armadeus buildroot is, because both the crosscompiler and Linux kernel are referenced.

export CROSS_COMPILE=/path/to/armadeus-4.0/buildroot/output/build/staging_dir/usr/bin/arm-unknown-linux-uclibcgnueabi-

ifneq ($(KERNELRELEASE),)
obj-m    := eimtest.o

else
KDIR := /path/to/armadeus-4.0/buildroot/output/build/linux-2.6.38.1
PWD := $(shell pwd)

default:
 $(MAKE) CROSS_COMPILE=$(CROSS_COMPILE) -C $(KDIR) SUBDIRS=$(PWD) modules

clean:
 @rm -f *.ko *.o modules.order Module.symvers *.mod.? *~
 @rm -rf .tmp_versions module.target
 @rm -f .eimtest.*
endif

So that’s it. Hope it’s helpful!

Posted Under: ARM,FPGA,Linux kernel,NXP (Freescale)
This post was written by eli on October 15, 2011 Comments (1)

Oscilloscope views of the i.MX51′s EIM bus in action

These are a few oscilloscope samples, some of which are pretty crude, showing Freescale’s i.MX51 accessing its address/data bus.

I worked with an Armadeus APF51 board, which has a 16-bit multiplexed bus connected to the Xilinx Spartan-6 FPGA. The FPGA was used to wire bus signals to a pin header, so 1-2 ns skews between signals are possible.

I wrote some code for the FPGA and processor on the board, for the sake of making these samples, which is available in another post of mine. I also wrote a general post about the EIM bus, which may come handy.

A simple write cycle

With the default settings mentioned here, detailed registers in hex follow:

CS2GCR1=019100bf, CS2GCR2=00000000                                             
CS2RCR1=01000000, CS2RCR2=00000008                                             
CS2WCR1=01000000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000021                                              
WEIM IP Access register WIAR=00000014                                          
CCM_CBCDR=59ab7180

Traces from top to bottom (CH4 to CH1): BCLK, WE, CS2 and ADV (trigger on falling edge of CS2).

The BCLK doesn’t look much like a clock, and the signals are cluttered since the clock frequency is 95 MHz, the oscilloscope’s bandwidth is 200 MHz and the signals are picked up with simple probes from the FPGA pin header, so there’s a lot of crosstalk and other issues. But it’s good enough to see the general picture.

You’ll have to believe me that the address is present on the multiplexed address/data lines while the ADV is low (one clock cycle) and that the two other clock cycles carry the two data halves of the 32 bit word (the data width is only 16 bits). Honestly. I checked it out.

What can be seen barely in the scope image is that the bus signals switch on BLK’s falling edges, and that they should be sampled on BCLK’s rising edges. But hey, that exactly what the datasheet says in section 4.6.7.3, table 53.

With non-continuous clock

The same as above, now with BCM=0, so the BCLK toggles only when the bus is working:

CS2GCR1=019100bf, CS2GCR2=00000000                                             
CS2RCR1=01000000, CS2RCR2=00000008                                             
CS2WCR1=01000000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=00000014

Nothing really interesting about this, actually.

Delaying the assertion of WE

Returning to the continuous clock, let’s delay WE by one WEIM clock (which happens to be one BCLK) by setting WEA=1

CS2GCR1=019100bf, CS2GCR2=00000000
CS2RCR1=01000000, CS2RCR2=00000008
CS2WCR1=01000200, CS2WCR2=00000000
WEIM Config register WCR=00000021
WEIM IP Access register WIAR=00000014

And nothing really happened here, including the other signals, which are not shown. Except that WE was indeed asserted later.

Adding a wait state

With the “simple write cycle” as the starting point, setting WWSC=2 (its default is 1) an extra wait state cycle is added:

CS2GCR1=019100bf, CS2GCR2=00000000
CS2RCR1=02000000, CS2RCR2=00000008
CS2WCR1=02000000, CS2WCR2=00000000
WEIM Config register WCR=00000021
WEIM IP Access register WIAR=00000014

Again, you’ll have to believe me that the first 16-bit data word is on the bus on both the second and third BCLK cycle. That is, the waitstate dwells on the first piece of data.

By the way, the waitstate count for read bursts was changed here as well, but that’s irrelevant. It’s just something my test kit did.

Bus clock division

To get a cleaner look, the next scope traces will be done with BCD=3, so the clock is divided by four. Continuous BCLK is also disabled by setting BCM=0, or otherwise there is no phase relation between BCLK and the bus signals.

So just by making these two changes relative to the “simple write cycle” we have

CS2GCR1=019130bf, CS2GCR2=00000000                                             
CS2RCR1=01000000, CS2RCR2=00000008                                             
CS2WCR1=01000000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=00000014

The time sweep is slower in this scope image, of course.

Bus clock division + adding a wait state

With the last trace as the starting point, setting WWSC=2 (its default is 1) an extra wait state cycle is added:

CS2GCR1=019130bf, CS2GCR2=00000000                                             
CS2RCR1=02000000, CS2RCR2=00000008                                             
CS2WCR1=02000000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=00000014

So we have four BCLKs instead of three, as one should expect.

A read cycle

Keeping the bus division (BCD=3 and BCM=0), and reverting everything else to the original setting, we’ll have a look on a read cycle. There’s no point in sampling WE anymore, so the probe moves to the OE signal instead. All in all, the traces from top to bottom (CH4 to CH1) are from now on: BCLK, OE, CS2 and ADV (trigger on falling edge of CS2).

CS2GCR1=019130bf, CS2GCR2=00000000                                             
CS2RCR1=01000000, CS2RCR2=00000008                                             
CS2WCR1=01000000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=00000014

As expected, there are two clock cycles with OE low. This is where the processor expects to get some data.

Delaying OE assertion

With the previous example as a starting point, setting OEA=2 yields the following:

CS2GCR1=019130bf, CS2GCR2=00000000                                             
CS2RCR1=01002000, CS2RCR2=00000008                                             
CS2WCR1=01000000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=00000014

This may come as a surprise: The OE’s assertion point was delayed by two WEIM clocks, which happens to be half a BCLK cycle. And nothing else changed.

Delaying ADV assertion

With “A read cycle” as a starting point, setting RADVA=2 yields the following:

CS2GCR1=019130bf, CS2GCR2=00000000                                             
CS2RCR1=01200000, CS2RCR2=00000008                                             
CS2WCR1=01400000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=0000001

What we can see here, is that the ADV signal was delayed, but not shortened. While OE’s deassertion point didn’t move, ADV’s did move as a result of delaying the assertion. What is not visible in this scope image, is that the processor keeps driving the address on the address/data lines as long as ADV is asserted, leaving less time for data (as evident by the shortened OE).

Delaying ADV assertion and deassertion

Setting RADVN=2 on top of the previous example, we have a two WEIM clock delay on both the assertion and deassertion, so the deassertion is delayed by 4 WEIM clocks, which is one BCLK. Or in simple words, the first data cycle is completely wiped out:

CS2GCR1=019130bf, CS2GCR2=00000000                                             
CS2RCR1=01220000, CS2RCR2=00000008                                             
CS2WCR1=01480000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=0000001

I don’t know if this setting is legal, but it was pretty evident that the data read by the processor during these cycles wasn’t consistent, not even the 16 LSB, which are read during the buried cycle.

Making it OK

Just to have a happy ending, let’s add a wait state. This will pull out the overridden data cycle and make the whole bus operation normal again.

So with RADVA=RADVN=2 and RWSC=2 (with the default as RWSC=1, this means a wait state) we have

CS2GCR1=019130bf, CS2GCR2=00000000                                             
CS2RCR1=02220000, CS2RCR2=00000008                                             
CS2WCR1=02480000, CS2WCR2=00000000                                             
WEIM Config register WCR=00000020                                              
WEIM IP Access register WIAR=0000001

So all in all there’s a longer ADV assertion, which is compensated with a wait state, so there’s time for both data cycles.

Posted Under: ARM,NXP (Freescale)
This post was written by eli on October 15, 2011 Comments (2)

i.MX51 EIM bus clarified

These are my notes as I made my way in grasping how the EIM bus works. Unfortunately, the information in the reference manual was far from complete, so except for the list of acronyms, this page consists of things I found out by reverse engineering the bus.

The actual bus cycle outlines and timings are given in section 4.6.7 of the datasheet. There are also timing diagrams in section 63.8 of the reference manual, which include the internal AXI bus as well, which may be a bit confusing.

I worked with an Armadeus APF51 board, which has a 16-bit multiplexed bus connected to the Xilinx Spartan-6 FPGA. My preferences and observations are related to this board.

I wrote some code for the FPGA and processor on the board, for the sake of this reverse engineering, which is available in another post of mine. I’ve also published some oscilloscope shots during the process, which you may look at here.

EMI, EIM and WEIM

Freescale’s nomenclature regarding the external bus is somewhat inconsistent, so here’s a quick clarification.

EMI is the external memory interface, which is the general crossconnect for making AXI masters talk with their slaves, internal or external. EIM is the external interface module, which is embodied as the Wireless External Interface Module (WEIM). Pin names in the datasheet and schematics are used in EIM terms, but the reference manual uses WEIM nomenclature. Section 4.6.7.1 in the datasheet contains a table which connects between the different names used for the same signals. It’s a must to look at.

Parameter acronyms

Section 63 in the reference manual covers the external memory interface. As it uses acronyms pretty extensively, sometimes with forward reference, I ended up making a cheat sheet.

Here’s a list of bus parameter acronyms, along with the values I chose by default for my tests with the bus (which are pretty much my final preferences). The acronyms themselves were taken directly from chapter 63.4.3 in the Reference Manual.

From Chip Select x General Configuration Register 1

PSZ = 0, Page Size
WP = 0, Write Protect
GBC = 1, Gap Between Chip Selects. That is, the gap between asserting one CS pin and then asserting another pin. Not to be confused with CSREC. At least 1 in sync mode.
AUS = 1, Address Unshifted.
CSREC = 1, CS Recovery, minimal unused clock cycles on the bus between operations on the same CS pin (back to back access). At least 1 in sync mode.
SP = 0, Supervisor Protect
DSZ = 1, Data Port Size (16 bit on Armadeus)
BCS = 0, Bus Clock Start, for fine shifting of the burst clock
BCD = 0, Bus Clock Divider (zero means don’t divide). Note that for BCM = 1 (BCLK running continuously), BCD only controls the bus signals’ rate, and not the BCLK signal itself. More about this is the “gotcha” section.
WC = 0, Write Continuous
BL= 0, Burst Length. When the internal DMA functional unit accesses the EIM bus, BL must be set to cover the longest burst possibly required (typically 32 bytes), or data is corrupted when the SDMA engine induces a burst longer than BL allows.
CREP = 1, Configuration Register Enable Polarity
CRE = 0, Configuration Register Enable (disabled in my case)
RFL/WFL = 1, Read/Write Fix Latency. Whether WAIT should be ignored (it is for RFL= WFL =1)
MUM = 1. Multiplexed Mode. If data and address are multiplexed on the same lines. True in this case.
SRD/SWD = 1, Synchronous Read/Write Data. Whether bus operations are synchronous. They are, of course.
CSEN = 1, CS Enable. If this isn’t set, attempting to write to the relevant region ends up with a bus error (and an oops in the Linux kernel).

From Chip Select x General Configuration Register 2

DAP = 0, Data Acknowledge polarity, irrelevant in sync mode
DAE = 0, Data Acknowledge Enable, irrelevant in sync mode
DAPS = 0, Data Acknowledge Polling Start, irrelevant in sync mode
ADH = 0, Address Hold Time

From Chip Select x Read/Write Configuration Register 1 and 2

RWSC/WWSC = 1, Read/Write Wait State Control. The number of wait states on a bus transaction, given in BCLK cycles (as opposed to WEIM cycles). Must be at least 1.
RADVA/WADVA = 0, Read/Write ADV Assertion. Tells when ADV is asserted in WEIM cycles. Note that while ADV is asserted, the address is present on the multiplexed address/data lines, no matter what, even at the cost of some or all data not appearing on the bus at all.
RADVN/WADNV = 0, Read/Write ADV Negation. How many extra WEIM cycles ADV stays asserted. The formula given in the reference manual says that by default, ADV is asserted for a BCLK worth’s of time, starting as required by RADVA/WADVA, and whatever is given by RADVA/WADVA is added to that.
RAL/WAL = 0, Read/Write ADV Low. When this bit is set, RADVN/WADVN are ignored, and ADV is asserted during the entire bus operation.
RCSA/WCSA = 0, Read/Write CS Assertion. The number of WEIM clocks the CS’s assertion is delayed.
RCSN/WCSN = 0, Read/Write CS Negation. Ignored in sync mode.
RBEA /WBEA= 0, Read/Write BE Assertion. The number of WEIM clocks to delay BE assertion.
RBEN/WBEN=0, Read/Write BE Negation. Ignored in sync mode.
RBE = 1, Read BE enable
WBED = 0, Write BE disable
OEA = 0, (Read) OE Assertion. How many WEIM clock cycles to delay the OE signal assertion. Note that unlike other delay parameters, OEA is relative to the first data clock cycle, so OE will mean “expecting data on data lines on this clock cycle” for OEA=0. It works this way in multiplexed mode, at least.
OEN = 0, (Read) OE Negation. Ignored in sync mode.
APR = 0, (Read) Asynchronous Page Read. Must be held zero in sync mode.
PAT = 0, (Read) Page Access Time. Ignored when APR=0 (and hence ignored in sync mode)
RL = 0, Read Latency.
WEA = 0, WE Assertion. How many WEIM clock cycles to delay WE assertion
WEN = 0, WEN Negation. Ignored in sync mode
WBCDD = 0, Write Burst Clock Divisor Decrement

And finally, from the WEIM Configuration Register

WDOG_LIMIT = 0, Memory Watchdog. Not really necessary in sync mode
WDOG_EN = 0, Memory Watchdog Enable.
INTPOL = 1, Interrupt Polarity
INTEN = 0, Interrupt Enable
BCM = 1, Burst clock mode. When asserted, the BCLK runs continuously, instead of only during bus transactions.
GBCD = 0, General Burst Clock Divisor. Used globally for all CS spaces instead of each space’s BCD when BCM=1. See warning below.

Some gotcha notes

Clock division with BCD and GBCD is a messy issue, and clock division is best avoided when running the clock continuously (BCM=1). The thing is that while GBCD indeed controls the division of the BCLK signal itself, the bus signals are governed by the clock divided by the individual BCD. So if BCD != GBCD for a specific CS region, the bus signals are completely unrelated to BCLK. But even if BCD and GBCD are equal, there is no guaranteed phase relation between them (as has been observed) because they’re generated by two unrelated clock dividers. So the BCLK signal is useless unless BCD = GBCD = 0.
Most delays are given in WEIM clocks, not BCLK clocks. This makes no difference as long as BCLK runs at WEIM rate (BCD=0), but if BCLK is divided, even for the sake of getting clear transitions on an oscilloscope, this needs to be taken into account.
For most parameters, delays in assertions make the signal’s assertion duration shorter. It’s not a time shift, as the deassertion doesn’t move.
All signals, except OE, are asserted at the same first clock cycle of the bus access, unless delayed by the parameters below. This includes WE and BE, which one could mistakenly expect to be asserted when data is available. OE is indeed asserted when data is due to be supplied, and its assertion timing parameter works relatively to that clock cycle.
Delaying and/or extending the ADV signal with RADVA/WADVA and RADVN/WADVN on a multiplexed bus causes address to be present during the relevant time periods without changing other timings, with precedence to address. So time slots which would otherwise be used for data transmission are overridden with address on the bus, possibly eliminating data presence on the bus completely. This can be compensated with wait states, but note that wait states count in BCLK cycles, while the ADV adjustments count WEIM cycles.
The OE signal is somewhat useless as an output-enable when BCLK runs at 95 MHz: If used directly to drive tri-state buffers, the round-trip from its assertion to when data is expected is ridiculously short: The data-to-rising clock setup time is 2 ns, according to the datasheet (section 4.6.7.3, table 53, parameter WE18). OE is asserted on the falling edge of the clock just before the rising edge, for which the data is sampled, with a delay of up to 1.75 ns (same table, WE10). At a clock cycle of 10.5 ns (95 MHz), this half-clock gap between these two events is 5.25 ns, leaving 5.25 – 2 – 1.75 = 1.5 ns for the bus slave to take control of the bus. Not realistic, to say the least. So the bus slave must deduce from WE whether the bus cycle is read or write, and drive the bus according to predefined timing. As for bursts, I’m not on the clear on whether bursts can be stalled in the middle and how OE behaves if that is possible. The timing diagram in section 63.8.7 of the Reference Manual does not imply that OE may get high in the middle of a burst. On the other hand, it shows OE going down together with ADV, which surely isn’t the case as I observed (maybe because I ran on a multiplexed bus?).

Write data cycles

Reminder: This entire post relates to a 16-bit address/data multiplexed EIM bus.

The simplest write data cycle (as defined by parameter settings above) consists of three BCLK cycles. On the first one, the lower 16 bits of the address is present on the bus, and ADV is held low. On the two following clock cycles, ADV is high and the 32-bit word is transferred over the data lines. CS and WE are held low during the entire cycle (three BCLKs). And no, the upper 16 bits of the address are never presented on the bus.

For BCD=0 (BCLK = WEIM clock), the master toggles its signals on the falling edge of BCLK, and samples signals from the slave on its rising edge. This holds true for all bus signals.

Data is sent in little endian order: A 32-bit word is sent with its lower 16-bit part (bits 15:0) in the first clock cycle, and the higher 16 bits (bits 31:16) in the second cycle. The 16-bit words are sent naturally (that is, each DA[15:0] is a consistent 16-bit word).

With AUS=1 (address unshifted) the address’ lower 16 bits appear naturally on the address cycle. For example, writing to offset Ox30 (to the CS2 address range) sets bits DA[4]=1 and DA[5]=1 (only) in the address clock cycle.

With AUS=0 (address shifted according to port size) the address shown on the bus is shifted by one bit, since there are two bytes in the port size’s width. Hence writing to offset Ox60 sets bits DA[4]=1 and DA[5]=1 (only) in the address clock cycle.

As said above (and verified with scope), WADVA and WADVN don’t just move around the ADV signal’s assertion, but also the times at which the address is given on the bus, possibly overriding time which would otherwise be used to transfer data. It’s the user’s responsibility to make sure (possibly with wait states) that there is enough time for data on the bus.

Wait states, as set by WWSC extend the first data cycle, so the lower 16 bits of data are held on the bus for a longer time. If WWSC=0, only the upper 16 bits are shown (the first data cycle is skipped) but this is an illegal setting anyhow. Again, note that WWSC counts BCLK cycles, as opposed to almost every other timing parameter.

For BCD=0 (only) the data lines’ levels are held with the last written value until the next bus operation. This feature (which AFAIK is not guaranteed by spec) is used by the FPGA bitstream loader: A word is written, and the FPGA’s clock is toggled afterwards to sample the data which is left unchanged (which is maybe why you can’t load the bitstream file from the on-board flash, as indicated in Armadeus’ wiki page). When BCD>0, the data lines go to zero after the cycle is ended.

Read data cycles

Read data cycles are in essence the same as write cycles, only the bus slave is expected to drive the data lines in the same time slots for which the master drove them on a bus write operation. On read cycles, the master samples the data lines on rising BCLK edges, which is symmetric to the slave sampling the same lines on write cycles. The endianess is the same of course.

OE is asserted on the same WEIM cycle for which ADV is deasserted, which is one BCLK cycle after CS’s assertion with the default parameters given above. WE is not asserted in write cycles, of course.

As mentioned in the “gotcha notes” above, the OE line is pretty useless in high bus rates, as it’s asserted on the falling edge of BCLK coming just before the rising edge on which the data is sampled. This gives by far too little time for the slave to respond. So the slave should figure out the correct behavior and timing according to WE and CS.

Using the WAIT signal

In order to use the WAIT signal for adding wait states on the fly, the respective RFL/WFL parameter need to be zeroed. If RFL=0, also set RWSC=2 (instead of the minimal 1), or four useless and unused wait states will be added to each bus cycle, most likely due to an illegal condition in the master’s bus state machine. This is not necessary for writes (i.e. it’s OK to have WFL=0 and WWSC=1).

WAIT is active low. When the master samples WAIT low (on a rising BCLK edge) it considers the following rising BCLK edge as a wait state. It’s or course legal to assert WAIT for consecutive BCLK cycles to achieve long waits. If WAIT is not asserted, the bus runs according to RWSC/WWSC. Each BCLK cycle is considered independently, so when a 32-bit word is transmitted on two BCLK cycles, wait states can be inserted between 16-bit words, resulting in expected behavior. There is no need to consider the fact that these 16-bit words form a 32-bit word when dealing with wait state behavior.

As one would expect, if WAIT is asserted with respect to a bus cycle that wouldn’t occur anyhow (i.e. the last BCLK cycle in a transmission), it’s ignored.

In read cycles, all this boils down to that if the master sampled WAIT low on BCLK rising edge n, no data will be sampled from data lines on rising edge n+1, and the entire bus operation is extended by another BCLK cycle. RWSC must be set to a minimum of 2, since WAIT is ignored on the first bus cycle (on which ADV is asserted) , so the first chance to request a wait state is on the second cycle, which must be a wait state anyhow. If RWSC=1 and RFL=0, the master will insert this wait state anyhow, but misbehave as just mentioned above. Even though counterintuitive, the master may very well sample data from the bus on a BCLK rising edge for which WAIT is asserted. This will make the following bus cycle a wait state, as one can deduce from the mechanism. But it may come intuitively unnatural that an asserted WAIT and valid data are sampled on the same BCLK.

For write cycles, if the master samples WAIT asserted on a rising edge of BCLK, it will behave as usual on the falling BCLK immediately following it, but will not update data lines on the falling BCLK edge afterwards (and hold the internal state machine accordingly). This follows the overall scheme of wait states described above. Unlike read bus operations, this holds true for the ADV cycle as well, so it’s possible to get wait states on the first data transaction by asserting wait on the first BCLK cycle. For WWSC=1, this means in practice to have WAIT asserted while there is no bus activity, because there’s half a clock cycle between the assertion of CS, ADV and other bus signals, and the sampling of WAIT in this case. In order to give the slave time to assert WAIT depending on the bus operation’s nature, WWSC has to be increased to 2 at least.

Bus frequency

The bus EIM bus frequency is derived by default from PLL2 (at least on Armadeus), which is a 665 MHz clock, divided according to the emi_slow_podf field in the CBCDR register (see 7.3.3.6 in the reference manual). On the Armadeus platform, this field is set to 6 by default, so the clock is divided by 7, yielding a bus clock of 95 MHz. To change it, the following code snippet applies:

#define MXC_CCM_CBCDR 0x14
u32 temp_clk;
const emi_slow_podf = 7;

temp_clk = __raw_readl( MX51_IO_ADDRESS(MX51_CCM_BASE_ADDR)+ MXC_CCM_CBCDR );

__raw_writel( (temp_clk & (~0x1c00000)) | (emi_slow_podf << 22),
               MX51_IO_ADDRESS(MX51_CCM_BASE_ADDR)+ MXC_CCM_CBCDR )

The above code reduces the EMI clock to 83.125 MHz (divide by 8).

Note that there’s a little remark in section 4.6.7.3 of the datasheet (Table 53, WEIM Bus Timing Parameters), in footnote 4, saying “The lower 16 bits of the WEIM bus are limited to 90 MHz”. Indeed, running 95 MHz has proven to have rare bus malfunctions (after half a gigabyte of data or so, probably causing some local heating), taking the form of sporadic bus cycle missed, causing injection of bogus data or data being missed.

Posted Under: ARM,FPGA,Linux kernel,NXP (Freescale)
This post was written by eli on October 14, 2011 Comments (6)

The FPGA+ARM Armadeus APF51 board: Buildroot notes

Scope

This post was spun off the main post regarding setting up the Armadeus board for Embedded Linux on ARM and Xilinx Spartan-6 FPGA. It covers my own little war story as I set up the Buildroot SDK, so I could have my own cross compiler and Linux kernel to work with.

As kernel.org happened to be down, this was a (forced) opportunity to look a bit deeper into how things work under the hood. These are my notes, reflecting my own experience, and isn’t a substitute for any official documentation.

Setting up the build environment

Downloaded armadeus-4.0.tar.bz2 from the SourceForge page. There are binaries there too (Linux kernel, rootfs and U-boot image). Opened the tarball, changed directory, and followed the instructions and deduced that the apf51_defconfig is valid even though not documented:

$ make apf51_defconfig

that downloaded buildroot-2010.11.tar.bz2 from http://buildroot.uclibc.org/downloads/buildroot-2010.11.tar.bz2 ran some script (a lot of patches applied) and then opened the config menu.

Config menu settings:

Under Target Options > Armadeus Device Support > Size of a single RAM chip, change it to 512 MB (to match my board). On second thought, the kernel detects all 512MB anyhow (despite this parameter set at 256 MB) so it looks like there’s no need to change this.
Under Build options, set Number of jobs to run simultaneously to the number of cores in the compiling computer (8 for a quadcore with hyperthreading). So it finishes today.

and then simply went “make”. The computer spits a lot of mumbo-jumbo. Loads, really.

In retrospective, I would consider running “make source” to download whatever needs to be downloaded. In my own build, this turned out to be where problems occured.

Note: Always run make from armadeus-4.0 and not from “buildroot”. Doing the latter will work properly for several components, but will fail every now and then with weird errors. It’s very easy to make this mistake, in particular when kickstarting the build after download failures.

After the build finishes, the interesting stuff is in

buildroot/output/images — Root, kernel and bootloader images.
buildroot/output/build/staging_dir — Things to run on the host computer, such as the cross-compiler (e.g. buildroot/output/build/staging_dir/usr/arm-unknown-linux-uclibcgnueabi/bin/gcc)

Issues resulting from kernel.org being down

The build got stuck at downloading the Linux kernel. kernel.org happened to be down for maintenance, and Armadeus’ FTP site didn’t supply it. So I fetched linux-2.6.38.1.tar.bz2 from a Linux mirror and put the file in /path/to/armadeus-4.0/buildroot/downloads and ran make again.

And then a git clone for linux-firmware.git failed, since it’s from kerner.org as well. So I went for the safe “trust anyone” strategy, and went for any repository I could find. So instead of the failed

git clone git://git.kernel.org/pub/scm/linux/kernel/git/dwmw2/linux-firmware.git /home/eli/armadeus/build/armadeus-4.0/buildroot/output/build/firmware-af5222c5ded5d944267acfbd001571409bea7eeb

I went

$ git clone https://github.com/mdamt/linux-firmware.git /home/eli/armadeus/build/armadeus-4.0/buildroot/output/build/firmware-af5222c5ded5d944267acfbd001571409bea7eeb

which wasn’t really helpful, since it didn’t satisfy some Make rule. Running “make -d” I found out that the Make rules would be happy with a tarball, so I went

$ tar -czf ../../downloads/firmware-af5222c5ded5d944267acfbd001571409bea7eeb.tar.gz firmware-af5222c5ded5d944267acfbd001571409bea7eeb

and ran “make” again after removing the directory from which I had created the tarball.

Just a silly mistake

When attempting to build the kernel I got

drivers/usb/Kconfig:169: can't open file "drivers/armadeus/Kconfig"

which was a direct result of the broken link of the “armadeus” subdirectory in “output/build/linux-2.6.38.1/drivers/”. But that was just a silly mistake: I ran “make” from “buildroot” and not from the armadeus-4.0 directory, so the Makefile setting up the necessary environment variable was never set.

Doing it all over again

The idea is to rerun everything offline, i.e. without any downloads from the internet. One can’t always rely on that servers will be up and ready… The correct way to do this is to define the download directories before starting off, but I didn’t bother.

My motivation for doing this was that after kickstarting the build so many times, it crossed my mind that I may have messed up something without noticing. So I figured it would be best to rerun it all in one go.

So first of all, move the already built directory to finished-armadeus-4.0. It’s some 3.5 GB, but we’ll need only the download directories.

$ tar -xjf armadeus-4.0.tar.bz2
$ cd armadeus-4.0
$ cp -r ../finished-armadeus-4.0/downloads/ .
$ cp -r ../finished-armadeus-4.0/buildroot/downloads/ buildroot/
$ make apf51_defconfig

Which brings us back to the configuration menu, after which a simple “make” does the work.

It’s also worth to mention, that according to the docs/README file, “make source” will download all necessary sources, which is a good starter.

Posted Under: ARM,FPGA,Linux kernel,NXP (Freescale)
This post was written by eli on October 6, 2011 Comments (1)

« Older Entries

Newer Entries »

Popular Posts

Latest Posts

Archives

Running custom scripts

The assembler

A simple test function

Copying the code into SDMA memory

Some housekeeping calls on channel 1

Setting up the context

Running the script

Dumping context

Clashes with Linux’ SDMA driver

Events

Interrupts

Contexts and channels

A script’s life cycle (scheduling)

Channel eligibility

Introduction

Quirky memory issues

The memory map

What we have here

Simple use

The Verilog code

The UCF file

The kernel module

The Makefile

A simple write cycle

With non-continuous clock

Delaying the assertion of WE

Adding a wait state

Bus clock division

Bus clock division + adding a wait state

A read cycle

Delaying OE assertion

Delaying ADV assertion

Delaying ADV assertion and deassertion

Making it OK

EMI, EIM and WEIM

Parameter acronyms

Some gotcha notes

Write data cycles

Read data cycles

Using the WAIT signal

Bus frequency

Scope

Setting up the build environment

Issues resulting from kernel.org being down

Other problems

Just a silly mistake

Doing it all over again

Quick links

Categories

Meta