Unless your embedded application happens to be a router, there’s some application-dependent electronics you need to talk with. If some SoC device covers your needs, that’s always nice, but what about that specific piece of electronics? And what if your application includes a part that needs to be run on an FPGA?
Making a processor talk with an FPGA is pretty doable, as long as there’s no heavy I/O, and the processor doesn’t run a sophisticated operating system. But if you picked Linux (probably to support some USB device, storage and/or network), the task of getting high-bandwidth data running between the processor and the FPGA can turn into a project in its own right.
Xilinx addresses this issue partly with its Zynq-7000 FPGA-ARM combo, making the ARM’s internal AXI bus directly available to FPGA logic. Whether this new generation of devices is going to have a different fate than the Virtex-2 Pro and Virtex-4 FX FPGAs, which had PowerPC cores built in and direct PLB bus access, remains to be seen. Many seem to believe that putting the processor core inside the FPGA doesn’t necessarily make things easier. Anyhow, with the first engineering samples of a completely new architecture due in the first half of 2012, it’s not clear when the Zynq-7000 solution will be alive and kicking.
As many have found out, running Linux on an embedded processor may be difficult, but it’s not a significant obstacle. Getting the Linux-running processor to access a register or two on the FPGA is not an impossible mission either. But when data needs to be transported fast and efficiently, things start to get tricky: The FPGA needs to be bus master capable, so it can transport the data over DMA. The Linux kernel driver needs to be written correctly to orchestrate DMA accesses at a high rate without using up too much CPU. The asynchronous nature of the data transfer creates corner cases, in particular when the data arrives in anything other than chunks of a constant size. In short, the distance between the “Hello, world” application and the actual workhorse is sometimes longer than it may seem at first.
Xillybus offers a simple solution for systems having a PCI or PCIe bus. As this is not usually the case for embedded processors, this doesn’t necessarily help.
On the other hand, a special port of Xillybus to Freescale’s i.MX51 is already available. Using DMA for transferring data over the external bus lines, data rates of 35 MByte/s and above are possible, with minimal use of the ARM Cortex-A8 processor. The application designer meets the same simple and intuitive interface as in the PCIe version: The FPGA engineer faces a simple and standard FIFO or RAM interface. The programmer writes simple user space applications which interact with device files, as I/O is usually done in Linux systems.
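To give a feel for what "interacting with device files" means on the programmer's side, here is a minimal sketch of reading a fixed amount of data from such a file with plain POSIX calls. The function name read_all() and any device path passed to it are my own assumptions for illustration, not names taken from the Xillybus documentation.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read up to len bytes from the file at path into buf, retrying on
 * short reads. Returns the number of bytes actually read (less than
 * len only if end-of-file was reached), or -1 on error.
 * read_all() is a hypothetical helper for this sketch.
 */
static ssize_t read_all(const char *path, void *buf, size_t len)
{
    size_t done = 0;
    int fd;

    assert(path && buf);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    while (done < len) {
        ssize_t rc = read(fd, (char *)buf + done, len - done);

        if (rc < 0) {
            if (errno == EINTR)
                continue; /* Interrupted by a signal, try again */
            close(fd);
            return -1;
        }
        if (rc == 0)
            break; /* End of file */
        done += (size_t)rc;
    }

    close(fd);
    return (ssize_t)done;
}
```

A caller would simply pass the path of whatever device file the driver exposes, along with a buffer. The retry loop matters because a read() from a stream-like device may return fewer bytes than requested.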
The demo version is available for the Armadeus APF51 board, which forms, together with its development docking board, a jump start kit for evaluating Xillybus on an embedded ARM platform. As the Xillybus evaluation kit is pretty much like the real thing, and the board’s design is straightforward, taking evaluation to real-life implementation is within arm’s reach.
As voicemail messages often go here in Israel: The Hebrew message will be followed by an English one.
Hebrew
A few years ago, I wrote an arrangement of the familiar birthday song “Hayom Yom Huledet” for a male choir (actually, a barbershop quartet). Looking back (or listening back, to be precise), the main resemblance to barbershop is that the melody is with the second tenor, along with the traditional balance between the voices, but that American schmaltz is not to be found among the notes. Maybe because it’s a song in Hebrew, and the arranger is Israeli…
Either way, the arrangement is released under the Creative Commons CC0 license, which means you can do whatever comes to mind with it. Including, of course, making electronic or paper copies, performing, putting on a show, recording, singing off key and blaming me in the end.
You can download the sheet music at this link, and also a short audio clip in which my duplicates and I sing (well OK, whisper) the arrangement.
English
A few years ago, I made a small arrangement of the Israeli birthday song for a TTBB male choir. Or just a plain male quartet. It’s kinda barbershop in the sense that the Lead has the melody, and in the way the voices should be balanced. In retrospect, it doesn’t have the American feel to it, but heck, it’s an Israeli song arranged by an Israeli…
You can download the sheet music directly using this link. For an audio clip of myself multiplied singing (well, whispering) this, click here.
I’ve released it under Creative Commons CC0, or if you like, to the public domain. In simple words, that means that you can do whatever you want with it, with no need to ask anyone for permission. Including, of course, making electronic or paper copies, performing, recording, singing off key and blaming me for everything. As long as you have fun.
This is part IV of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6, is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.
This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer.
Running custom scripts
I’ll try to show the basics of getting a simple custom script to run on the SDMA core. Since there’s a lot of supporting infrastructure involved, I’ll show my example as a hack on the drivers/dma/imx-sdma.c Linux kernel module, as of version 2.6.38. I’m not going to explain the details of kernel hacking, so without experience in that field, it will be pretty difficult to try this out yourself.
The process of running an application-driven custom script consists of the following steps:
- Initialize the SDMA module
- Initialize the SDMA channel and clear its HE flag
- Copy the SDMA assembly code from application space memory to SDMA memory space RAM.
- Set up the channel’s context
- Enable the channel’s HE flag (so the script runs pretty soon)
- Wait for interrupt (assuming that the script ends with a “DONE 3”)
- Possibly copy back the context to application processor space, to inspect the registers upon termination, and verify that their values are as expected.
- Possibly copy SDMA memory to application processor space in order to inspect if the script worked as expected (if the script writes to SDMA RAM)
The first two steps are handled by the imx-sdma.c kernel module, so I won’t cover them. I’ll start with the assembly code, which has to be generated first.
The assembler
Freescale offers their assembler, but I decided to write my own in Perl. It’s simple and useful for writing short routines, and its output is snippets of C code, which can be inserted directly into the source, as I’ll show later. It’s released under GPLv2, and you can download it from this link.
The sample code below does nothing useful. For a couple of memory related examples, please see another post of mine.
To try it out quickly, just untar it on some UNIX system (Linux included, of course), change directory to sdma_asm, and go
    $ ./sdma_asm.pl looptry.asm
                                 | start:
    0000 0804 (0000100000000100) |     ldi r0, 4
    0001 7803 (0111100000000011) |     loop exit, 0
    0002 5c05 (0101110000000101) |     st r4, (r5, 0) # Address r5
    0003 1d01 (0001110100000001) |     addi r5, 1
    0004 1c10 (0001110000010000) |     addi r4, 0x10
                                 | exit:
    0005 0300 (0000001100000000) |     done 3
    0006 1c40 (0001110001000000) |     addi r4, 0x40
    0007 0b00 (0000101100000000) |     ldi r3, 0
    0008 4b00 (0100101100000000) |     cmpeqi r3, 0 # Always true
    0009 7df6 (0111110111110110) |     bt start # Always branches
    ------------ CUT HERE -----------

    static const int sdma_code_length = 5;
    static const u32 sdma_code[5] = {
        0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
    };
The output should be pretty obvious. In particular, note that there’s a C declaration of a const array called sdma_code, which I’ll show how to use below. The first part of the output is a plain assembly listing, with the address, hex code and binary representation of the opcodes. There are a few simple syntax rules to observe:
- Anything after a ‘;’ or ‘#’ sign is ignored (comments)
- Empty lines are ignored, of course
- A label starts the line, and is followed by a colon sign, ‘:’
- Everything is case-insensitive, including labels (all code is lowercased internally)
- The first alphanumeric string is considered the opcode, unless it’s a label
- Everything following an opcode (comments excluded) is considered the arguments
- All registers are noted as r0, r1, … r7 in the argument fields, and not as plain numbers, unlike the way shown in the reference manual. This makes a clear distinction between registers and values. It’s “st r7, (r0,9)” and not “st 7, (0,9)”.
- Immediate arguments can be represented as decimal numbers (digits only), possibly negative (with a plain ‘-’ prefix). Positive hexadecimal numbers are allowed with the classic C “0x” prefix.
- Labels are allowed for loops, as the first argument. The label is understood to be the first statement after the loop, so the label is the point reached when the loop is finished. See the example above. The second argument may not be omitted.
- Other than loops, labels are accepted only for branch instructions, where the jump is relative. Absolute jump addresses can’t be generated automatically for jmp and jsr because the absolute address is not known during assembly.
A few words about why labels are not allowed for absolute jumps: It would be pretty simple to tell the Perl script the origin address, and allow absolute addressed jumps. I believe absolute jumps within a custom script should be avoided at any cost, so that the object code can be stored and run anywhere vacant. This is why I wasn’t keen on implementing this.
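As for the relative jumps that labels do support, the listing above hints at how the offset is calculated: “bt start” sits at address 0009 and assembles to 0x7df6, whose low byte 0xf6 is -10, which is the distance from the following instruction (address 000a) back to address 0000. Likewise, “loop exit, 0” at address 0001 carries 3 in its low byte, the number of instructions up to the “exit” label. The sketch below reproduces that arithmetic. The function name branch_disp() and the signed 8-bit range check are my assumptions, deduced from the listing rather than taken from the assembler’s source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Calculate the 8-bit displacement field for a relative branch located
 * at branch_addr which targets target_addr (both in program space).
 * The displacement is counted from the instruction following the
 * branch, as deduced from the listing. Returns 0 on success and -1 if
 * the target is out of reach (signed 8-bit range is an assumption).
 */
static int branch_disp(unsigned int branch_addr, unsigned int target_addr,
                       uint8_t *disp)
{
    int delta = (int)target_addr - ((int)branch_addr + 1);

    assert(disp);

    if (delta < -128 || delta > 127)
        return -1;

    *disp = (uint8_t)(delta & 0xff); /* Two's complement representation */
    return 0;
}
```

Since only the distance between the branch and its target matters, the result is the same wherever the script is loaded, which is what makes relocatable object code possible.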
A simple test function
This is a simple function, which loads a custom script and runs it a few times. I added it, and a few additional functions (detailed later) to the Linux kernel’s SDMA driver, imx-sdma.c, and called it at the end of sdma_probe(). This is the simplest, yet not most efficient way to try things out: The operation takes place once when the module is inserted into the kernel, and then a reboot is necessary, since the module can’t be removed from the kernel. But with the reboot being fairly quick on an embedded system, it’s pretty OK.
So here’s the tryrun() function. Mind you, it’s called after the SDMA subsystem has been initialized, with one argument, the pointer to the sdma_engine structure (there’s only one for the entire system).
    static int tryrun(struct sdma_engine *sdma)
    {
        const int channel = 1;
        struct sdma_channel *sdmac = &sdma->channel[channel];
        static const u32 sdma_code[5] = {
            0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
        };
        const int origin = 0xe00; /* In data space terms (32 bits/address) */
        struct sdma_context_data *context = sdma->context;
        int ret;
        int i;

        sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

        ret = sdma_request_channel(sdmac);
        if (ret) {
            printk(KERN_ERR "Failed to request channel\n");
            return ret;
        }

        sdma_disable_channel(sdmac);
        sdma_config_ownership(sdmac, false, true, false);

        memset(context, 0, sizeof(*context));
        context->channel_state.pc = origin * 2; /* In program space addressing... */
        context->gReg[4] = 0x12345678;
        context->gReg[5] = 0xe80;

        ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                                 0x800 + (sizeof(*context) / 4) * channel);
        if (ret) {
            printk(KERN_ERR "Failed to load context\n");
            return ret;
        }

        for (i=0; i<4; i++) {
            ret = sdma_run_channel(&sdma->channel[1]);
            printk(KERN_WARNING "*****************************\n");
            sdma_print_mem(sdma, 0xe80, 128);
            if (ret) {
                printk(KERN_ERR "Failed to run script!\n");
                return ret;
            }
        }

        return 0; /* Success! */
    }
Copying the code into SDMA memory
First, note that sdma_code is indeed copied from the output of the assembler, when it’s executed on looptry.asm as shown above. The assembler adds the “static” modifier as well as an sdma_code_length variable which were omitted, but otherwise it’s an exact copy.
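Comparing the listing with the array shows how the 16-bit opcodes end up in 32-bit words: Two opcodes per word, with the one at the lower address in the upper half (0x0804 and 0x7803 become 0x08047803). A minimal sketch of that packing follows. The function name pack_opcodes() is mine, and padding an odd number of opcodes with zero is an assumption, since the example happens to have an even count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack n 16-bit SDMA opcodes into 32-bit words, two per word, with the
 * opcode at the lower address in the upper 16 bits. If n is odd, the
 * last word's lower half is zero (an assumption of this sketch).
 * Returns the number of 32-bit words written.
 */
static size_t pack_opcodes(const uint16_t *ops, size_t n, uint32_t *words)
{
    size_t nwords = (n + 1) / 2;
    size_t i;

    assert(ops && words);

    for (i = 0; i < nwords; i++) {
        uint32_t hi = ops[2 * i];
        uint32_t lo = (2 * i + 1 < n) ? ops[2 * i + 1] : 0;

        words[i] = (hi << 16) | lo;
    }

    return nwords;
}
```

Feeding it the ten opcodes from the listing yields exactly the five words of sdma_code.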
The first thing the function actually does is to call sdma_write_datamem() to copy the code into SDMA space (and I don’t check the return value, sloppy me). This is a function I’ve added, but it’s clearly derived from sdma_load_context(), which is part of imx-sdma.c:
    static int sdma_write_datamem(struct sdma_engine *sdma, void *buf,
                                  int size, u32 address)
    {
        struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
        void *buf_virt;
        dma_addr_t buf_phys;
        int ret;

        buf_virt = dma_alloc_coherent(NULL, size, &buf_phys, GFP_KERNEL);
        if (!buf_virt)
            return -ENOMEM;

        bd0->mode.command = C0_SETDM;
        bd0->mode.count = size / 4;
        bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
        bd0->buffer_addr = buf_phys;
        bd0->ext_buffer_addr = address;

        memcpy(buf_virt, buf, size);

        ret = sdma_run_channel(&sdma->channel[0]);

        dma_free_coherent(NULL, size, buf_virt, buf_phys);

        return ret;
    }
The principle of operation of sdma_write_datamem() is pretty simple: First a buffer is allocated, with its address in virtual space given in buf_virt and its physical address in buf_phys. Both addresses are related to the application processor, of course.
Then the buffer descriptor is set up. This piece of memory is preallocated globally for the entire SDMA engine (in the application processor’s memory space), which isn’t the cleanest way to do it, but since these operations aren’t expected to happen in parallel processes, this is OK. The sdma_buffer_descriptor structure is defined in imx-sdma.c itself, and is initialized according to section 52.23.1 in the Reference Manual. Note that this calling convention interfaces with the script running on channel 0, and not with any hardware interface. This chunk is merely telling the script what to do. In particular, the C0_SETDM command tells it to copy from application memory space to SDMA data memory space (see section 52.23.1.2).
Note that in the function’s arguments, “size” is given in bytes, but address in SDMA data address space (that is, in 32-bit quanta). This is why “size” is divided by four to become the element count (mode.count).
Just before kicking off, the input buffer’s data is copied into the dedicated buffer with a plain memcpy() command.
And then sdma_run_channel() (part of imx-sdma.c) is called to make channel 0 runnable. This function merely sets the HE bit of channel 0, and waits (sleeping) for the interrupt to arrive, or errors on timeout after a second.
At this point we have the script loaded into SDMA RAM (at data address 0xe00).
Some housekeeping calls on channel 1
Up to this point, nothing was done on the channel we’re going to use, which is channel #1. Three calls to functions defined in imx-sdma.c prepare the channel for use:
- sdma_request_channel() sets up the channel’s buffer descriptor and data structure, and enables the clock which is global to the entire SDMA engine, actions which I’m not sure are necessary. It also sets up the channel’s priority and the Linux wait queue (used when waiting for interrupt).
- sdma_disable_channel() clears the channel’s HE flag
- sdma_config_ownership() clears HO, sets EO and DO for the channel, so the channel is driven (“owned”) by the processor (as opposed to driven by external events).
Setting up the context
Even though imx-sdma.c has a sdma_load_context() function, it’s written for setting up the context as suitable for running the channel 0 script. To keep things simpler, we’ll set up the context directly.
After zeroing the entire structure, three registers are set in tryrun(): The program counter, r4 and r5. Note that the program counter is given the address to which the code was copied, multiplied by 2, since the program counter is given in program memory space. The two other registers are set merely as an initial state for the script. The structure is then copied into the per-channel designated slot with sdma_write_datamem().
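The address arithmetic in tryrun() is easy to get wrong, so here is a small sketch that spells it out: Data space addresses count 32-bit words, program space addresses count 16-bit instructions, and byte sizes must be divided by four before they go into a buffer descriptor. The helper names are hypothetical, and the context size is assumed to be 128 bytes, which is what sizeof(*context) is expected to evaluate to.

```c
#include <assert.h>
#include <stdint.h>

#define CONTEXT_BASE  0x800u /* Data address of channel 0's context, as in tryrun() */
#define CONTEXT_BYTES 128u   /* Assumed size of one context structure */

/* Program space is addressed in 16-bit units, data space in 32-bit units */
static uint32_t data_to_program_addr(uint32_t data_addr)
{
    return data_addr * 2;
}

/* Convert a size in bytes to a count of 32-bit words */
static uint32_t bytes_to_words(uint32_t nbytes)
{
    assert(nbytes % 4 == 0);
    return nbytes / 4;
}

/* Data space address of the context slot belonging to a channel */
static uint32_t context_data_addr(unsigned int channel)
{
    assert(channel < 32);
    return CONTEXT_BASE + bytes_to_words(CONTEXT_BYTES) * channel;
}
```

With the values used in tryrun(), the script loaded at data address 0xe00 starts at program address 0x1c00, and the context of channel 1 lands at data address 0x820.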
Again, note that the “context” data structure, which is used as a source buffer from which the context is copied into SDMA memory, is allocated globally for the entire SDMA engine. It’s not even protected by a mutex, so in a real project you should allocate your own piece of memory to hold the sdma_context_data structure.
Running the script
In the end, we have a loop of four successive runs of the script, without updating the context, so from the second time and on, the script continues after the “done 3” instruction. This is possible, because the script jumps to the beginning upon resumption (the three last lines in the assembly code, see above).
Each call to sdma_run_channel() sets channel 1’s HE flag, making it do its thing and then trigger an interrupt with the DONE instruction, which in turn wakes up the process, telling it the script has finished. sdma_print_mem() merely makes a series of printk’s, consisting of hex dumps of data from the SDMA memory. As used, it’s aimed at the region which the script is expected to alter, but the same function can be used to verify that the script is indeed in its place, or to look at any other part of the memory. The function goes
    static int sdma_print_mem(struct sdma_engine *sdma, int start, int len)
    {
        int i;
        u8 *buf;
        unsigned char line[128];
        int pos = 0;

        len = (len + 15) & 0xfff0;

        buf = kzalloc(len, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;

        sdma_fetch_datamem(sdma, buf, len, start);

        for (i=0; i<len; i++) {
            if ((i % 16) == 0)
                pos = sprintf(line, "%04x ", i);

            pos += sprintf(&line[pos], "%02x ", buf[i]);

            if ((i % 16) == 15)
                printk(KERN_WARNING "%s\n", line);
        }

        kfree(buf);

        return 0;
    }
and it uses this function (note that the instruction is C0_GETDM):
    static int sdma_fetch_datamem(struct sdma_engine *sdma, void *buf,
                                  int size, u32 address)
    {
        struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
        void *buf_virt;
        dma_addr_t buf_phys;
        int ret;

        buf_virt = dma_alloc_coherent(NULL, size,
                                      &buf_phys, GFP_KERNEL);
        if (!buf_virt)
            return -ENOMEM;

        bd0->mode.command = C0_GETDM;
        bd0->mode.count = size / 4;
        bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
        bd0->buffer_addr = buf_phys;
        bd0->ext_buffer_addr = address;

        ret = sdma_run_channel(&sdma->channel[0]);

        memcpy(buf, buf_virt, size);

        dma_free_coherent(NULL, size, buf_virt, buf_phys);

        return ret;
    }
Dumping context
This is the poor man’s debugger, but it’s pretty useful. A “done 3” instruction can be seen as a breakpoint, and the context can be dumped to the kernel log with this function:
    static int sdma_print_context(struct sdma_engine *sdma, int channel)
    {
        int i;
        struct sdma_context_data *context;
        u32 *reg;
        unsigned char line[128];
        int pos = 0;
        int start = 0x800 + (sizeof(*context) / 4) * channel;
        int len = sizeof(*context);
        const char *regnames[22] = { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
                                     "mda", "msa", "ms", "md",
                                     "pda", "psa", "ps", "pd",
                                     "ca", "cs", "dda", "dsa", "ds", "dd" };

        context = kzalloc(len, GFP_KERNEL);
        if (!context)
            return -ENOMEM;

        sdma_fetch_datamem(sdma, context, len, start);

        printk(KERN_WARNING "pc=%04x rpc=%04x spc=%04x epc=%04x\n",
               context->channel_state.pc,
               context->channel_state.rpc,
               context->channel_state.spc,
               context->channel_state.epc
            );

        printk(KERN_WARNING "Flags: t=%d sf=%d df=%d lm=%d\n",
               context->channel_state.t,
               context->channel_state.sf,
               context->channel_state.df,
               context->channel_state.lm
            );

        reg = &context->gReg[0];

        for (i=0; i<22; i++) {
            if ((i % 4) == 0)
                pos = 0;

            pos += sprintf(&line[pos], "%s=%08x ", regnames[i], *reg++);

            if (((i % 4) == 3) || (i == 21))
                printk(KERN_WARNING "%s\n", line);
        }

        kfree(context);

        return 0;
    }
Clashes with Linux’ SDMA driver
Playing around with the SDMA subsystem directly is inherently problematic, since the assigned driver may take contradicting actions, possibly leading to a system lockup. Running custom scripts using the existing driver isn’t possible, since it has no support for that as of kernel 2.6.38. On the other hand, there’s a good chance that the SDMA driver wasn’t enabled at all when the kernel was compiled, in which case there is no chance for collisions.
The simplest way to verify whether the SDMA driver is currently present in the kernel is to check in /proc/interrupts whether interrupt #6 is taken (it’s the SDMA interrupt).
The “imx-sdma” pseudodevice is always registered on the platform pseudobus (I suppose that will remain in the transition to Open Firmware), no matter the configuration. It’s the driver which may not be present. The “i.MX SDMA support” kernel option (CONFIG_IMX_SDMA) may not be enabled (it can be a module). Note that it depends on the general “DMA Engine Support” (CONFIG_DMADEVICES), which may not be enabled to begin with.
Anyhow, for playing with the SDMA module, it’s actually better when these are not enabled. In the long run, maybe there’s a need to expand imx-sdma.c, so it supports custom SDMA scripting. The question remaining is to what extent it should manage the SDMA RAM. Well, the real question is if there’s enough community interest in custom SDMA scripting at all.
This is part III of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6, is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.
This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer.
Events
Even though an SDMA script can be kicked off (or made eligible for running, to be precise) by the application processor, regardless of any external events, there’s a lot of sense in letting the peripheral kick off the script(s) directly, so the application processor doesn’t have to be bothered with an interrupt every time.
The system has 48 predefined SDMA events, listed in section 3.3 of the Reference Manual. Each of these events can make one or several channels eligible for execution by automatically setting their EP flag. Which of the channels will have its EP flag set is determined by the SDMA event’s CHNENBL register. There are 48 such registers, one for each SDMA event, with each of its 32 bits corresponding to an SDMA channel: If bit i is set, the event linked with the register will set EP[i]. Note that these registers have unknown values on powerup, so if event-driven SDMA is enabled, all of these registers must be initialized, or all hell breaks loose.
In a normal flow, EP[i] is zero when an event is about to set this flag: If it was set by a previous event, the respective SDMA script should have finished, and hence cleared the flag before the next event occurred. Since attempting to set EP[i] when it’s already set may indicate that the event came too early (or the script is too late), there’s a CHNERR[i] flag, which latches such errors, so that the application processor can be informed about such a condition. This can also trigger an interrupt, if the respective bit in INTRMASK is set. The application processor can read these flags (and reset them at the same time) in the EVTERR register.
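The interplay between events, CHNENBL, EP[i] and CHNERR[i] can be summarized with a small software model. This is only a sketch of the behavior as described above, running on an ordinary computer. It doesn’t touch any hardware register, and the structure and function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_EVENTS 48

/* Software model of the event logic (hypothetical, for illustration) */
struct sdma_event_model {
    uint32_t chnenbl[NUM_EVENTS]; /* One channel enable mask per event */
    uint32_t ep;                  /* EP[i] flags, bit i for channel i */
    uint32_t chnerr;              /* CHNERR[i] flags, latched */
};

/* Unlike the real registers, the model starts from a known state */
static void model_init(struct sdma_event_model *m)
{
    memset(m, 0, sizeof(*m));
}

/* An event arrives: latch errors for channels already pending, set EP */
static void model_raise_event(struct sdma_event_model *m, unsigned int event)
{
    uint32_t mask;

    assert(event < NUM_EVENTS);

    mask = m->chnenbl[event];
    m->chnerr |= m->ep & mask; /* EP[i] was already set: too early */
    m->ep |= mask;
}

/* The script on the given channel executes "DONE 4", clearing EP[i] */
static void model_done4(struct sdma_event_model *m, unsigned int channel)
{
    assert(channel < 32);
    m->ep &= ~(UINT32_C(1) << channel);
}

/* Reading EVTERR returns the latched flags and resets them */
static uint32_t model_read_evterr(struct sdma_event_model *m)
{
    uint32_t val = m->chnerr;

    m->chnerr = 0;
    return val;
}
```

Raising the same event twice without a “DONE 4” in between leaves the EP flags as they were, and sets the CHNERR bits of all channels enabled for that event.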
I’d like to draw special attention to events #14 and #15, which are driven by external pins, namely GPIO1_4 and GPIO1_5. These two make it possible for an external chip (e.g. an FPGA) to request service without involving the application processor. A rising edge on these lines creates an event when the IOMUX is set to ALT1 (SDMA_EXT_EVENT) on the relevant pins. Note that setting the IOMUX to just GPIO won’t do it.
It’s important to note that the combination of the EP[i] flag being cleared by the script itself and the edge-triggered nature of the event signal creates an inevitable risk of a race condition: There is no rigorous way for the script to make sure that a “DONE 4” instruction, which was intended to clear a previous event, won’t clear one that has just arrived. The CHNERR[i] flag will indicate that the event arrived before the previous one was cleared, but in some implementations, that can actually be a legal condition. This can be solved by emulating a level-triggered event with a constantly toggling event line, for as long as the external hardware wants servicing. This will make CHNERR[i] go high for sure, but otherwise it’s fine.
This possible race condition is not a design bug of the SDMA subsystem. Rather, it was designed with SDMA scripts that finish before the next event arrives in mind. The “I need service” kind of design was not considered.
Interrupts
By executing a “DONE 3” command, the SDMA scripts can generate interrupts on the application processor by setting the HI[i] flag, where i is the channel number of the currently running script. This will assert interrupt #6 on the application processor, which handles it like any other interrupt.
The HI[i] flags can be read by the application processor in the INTR register (see section 52.12.3.2 in the Reference Manual). An interrupt handler should scan this register to determine which channel requests an interrupt. There is no masking mechanism for individual HI[i] flags. The global interrupt #6 can be disabled, but an individual channel can’t be masked from generating interrupts.
If any of the INTRMASK bits is set, the EVTERR register should also be scanned, or at least cleared, since CHNERR[i] conditions generate interrupts which are indistinguishable from HI[i] interrupts.
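Scanning the flags boils down to walking through the 32 bits of the value read from INTR, and calling some per-channel handler for each bit that is set. The sketch below shows this on a plain integer. How the register is actually read and acknowledged is deliberately left out, and the names dispatch_channels() and record_channel() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*channel_handler_t)(unsigned int channel, void *arg);

/*
 * Call handler once for each channel whose bit is set in intr, which is
 * assumed to be the value read from the INTR register. Returns the
 * number of channels that requested an interrupt.
 */
static unsigned int dispatch_channels(uint32_t intr,
                                      channel_handler_t handler, void *arg)
{
    unsigned int channel;
    unsigned int count = 0;

    for (channel = 0; channel < 32; channel++) {
        if (intr & (UINT32_C(1) << channel)) {
            if (handler)
                handler(channel, arg);
            count++;
        }
    }

    return count;
}

/* Sample handler: Records the channels seen in a bitmap pointed to by arg */
static void record_channel(unsigned int channel, void *arg)
{
    assert(channel < 32 && arg);
    *(uint32_t *)arg |= UINT32_C(1) << channel;
}
```

In a real interrupt handler, the per-channel function would typically wake up whoever is waiting for that channel’s script to finish.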
“DONE 3”, which is the only instruction available for setting HI[i], also clears HE[i], so it was clearly designed to work with scripts kicked off directly by the application processor. In order to issue an interrupt from a script which is kicked off by an event, a little trick can be used: According to section 52.21.2 in the Reference Manual (the detail for the DONE instruction), “DONE 3” means “clear HE, set HI for the current channel and reschedule”. In other words, make the current channel ineligible for execution unless HO[i] is set, and set HI[i] so an interrupt is issued. But event-driven channels do have HO[i] set, so clearing HE[i] has no significance whatsoever. According to table 52-4, the context will be saved, and then restored immediately. So there will be a slight waste of time with context writes and reads, but since the most likely instruction following this “DONE 3” is a “DONE 4” (that is, clear EP[i], the event-driven script has finished), the impact is rather minimal. Anyhow, I still haven’t tried this for real, but I will soon.
So much for part III. You may want to go on with Part IV: Running custom SDMA scripts in Linux
This is part II of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6, is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.
This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer.
Contexts and channels
The SDMA’s purpose is to service requests from hardware or from the application processor. In a way, it’s like a processor with no idle task, just interrupts. But the way the service is performed is different from interrupt handling.
Let’s assume that all scripts (those SDMA programs) are already present in the SDMA’s memory space. They may reside in the on-chip ROM or they’ve been loaded into RAM. How are they executed?
The answer lies in the contexts: Some of the SDMA’s RAM space is allocated for containing an array of structures. There are 32 such structures, each occupying 128 bytes (or 32 32-bit words), so all in all this block takes up 4 kB of memory (there’s a 96-byte variant as well, but we’ll leave it for now).
These structures do what their name implies: They contain the context of a certain execution thread. In other words, they contain everything that needs to be stored to resume execution at some point, as if it was never stopped. Since the SDMA core doesn’t have a stack, this information has to go to a fixed place. This includes the program counter, the registers and flags. Section 52.13.4 in the Reference Manual describes this structure in detail.
As mentioned, there’s an array of 32 of these structures. It means that the SDMA subsystem can maintain 32 contexts, or if you like, resemble a multitasking system with 32 independent threads. Or in SDMA terms: The SDMA core supports 32 DMA channels. This kinda connects with the common concept of DMA channels: Each channel has a certain purpose and particular flow.
The method to kick off a channel, so it will execute a certain script, is to write directly to the channel’s context structure, and then set up some flags to make it runnable. This is demonstrated in part IV. Since the context includes the program counter register, this controls where the execution starts. Other registers can be used to pass information to the script (that is, the SDMA “program”). What each register means upon such an invocation is up to the script’s API.
A script’s life cycle (scheduling)
So there are 32 contexts, one for each of the 32 channels. What makes a context load into the registers, making its channel’s script execute? It’s time to talk about the scheduler. It’s described in painstaking detail in the Reference Manual, so let’s stick to the main points.
The scheduler’s main function is to decide which channel is the most eligible to spend time on the processor core. This decision is relevant only when the SDMA core isn’t running anything at all (a.k.a. “sleeping”) or when the currently running script voluntarily yields the processor. The SDMA core’s execution is non-preemptive, so the scheduler can’t force any script to stop running. In other words, if any script is (mistakenly) caught in an infinite loop, all DMA activity is as good as dead, most likely leading to a complete system hangup. Nothing can force a script to stop running (except for a reset or the debugger). Just a small thing to bear in mind when writing those scripts.
The SDMA core has a special instruction for yielding the processor, with the mnemonic “done”, which takes a parameter for choosing its variant. Two variants of this instruction have earned their own mnemonics, “yield” and “yieldge”. While “done” variant #3 (usually called just “done”) always yields the processor, the two others yield it only if there are other channels ready for execution with higher priority (or higher-or-equal priority for “yieldge”). But never mind the details. The overall picture is that the script runs until it issues a command saying “you must stop me now” (as in “done”) or “you may stop me now” (as in the two other variants).
Yielding only means that the registers are stored back into the context structure (with optimizations to speed this process up) and that another context may be loaded instead of it. Depending on which variant of “done” was used, plus some other factors, the scheduler may or may not reschedule the same channel automatically at a later time. That is, the context may be reloaded into the registers. So unless designed otherwise, the opcode directly after the “done” instruction will be executed at some later time. Hence a carefully written script never “ends”, it just gives up the processor until the next time the relevant channel is scheduled.
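The difference between the three flavors can be put as a small decision function. This is a sketch of the rule as described above, not of the actual scheduler logic in the Reference Manual. The names are made up, and it’s assumed that a larger number means a higher priority, with -1 meaning that no other channel is pending.

```c
#include <assert.h>

enum done_variant {
    DONE_ALWAYS,  /* "done 3": Always gives up the core */
    DONE_YIELD,   /* "yield": Only for a higher priority channel */
    DONE_YIELDGE  /* "yieldge": For a higher or equal priority channel */
};

/*
 * Returns 1 if the running script gives up the core, 0 if it carries
 * on. cur_prio is the running channel's priority, and best_other_prio
 * is the highest priority among the other pending channels, or -1 if
 * there are none (larger number = higher priority is an assumption).
 */
static int gives_up_core(enum done_variant variant, int cur_prio,
                         int best_other_prio)
{
    assert(cur_prio >= 0 && best_other_prio >= -1);

    switch (variant) {
    case DONE_YIELD:
        return best_other_prio > cur_prio;
    case DONE_YIELDGE:
        return best_other_prio >= cur_prio;
    case DONE_ALWAYS:
    default:
        return 1;
    }
}
```

So a long-running script can sprinkle “yield” instructions along its flow, and lose the core only when something more urgent is actually waiting.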
Channel eligibility
Now let’s look at what makes a channel eligible for execution. Leaving priority issues aside, let’s ask what makes a certain channel a candidate for having its context pushed into the SDMA core.
In some cases, the setup is that the channel becomes eligible for execution without any other condition. This is the case for offload memory copy, for example. In other cases, the channel’s eligibility depends on some hardware event, typically some peripheral requesting service. The latter scenario resembles old-school interrupt handlers, only the interrupt isn’t serviced by the application processor, but wakes up a service thread (channel) in the SDMA core. And exactly as waking up a thread in a modern operating system doesn’t cause immediate execution, but rather sets some flag to make the thread eligible for getting a processor time slice, so does the SDMA channel wakeup work: It’s just a flag telling the scheduler to push the channel’s context into the SDMA’s core when it sees fit.
The Reference Manual sums this up in section 52.4.3.5, saying that channel i is eligible to run if and only if the following expression evaluates to logical ‘1’:
(HE[i] or HO[i]) and (EP[i] or EO[i])
where HE[i], HO[i], EP[i], and EO[i] are flags belonging to the i’th channel. Let’s take them one by one:
- HE[i] stands for “Host Enable”, and is set and reset by the application processor by writing to registers. It’s also cleared by the “done” instruction, so it’s suitable for a scenario where the host kicks off a channel, and the script quits it.
- EP[i] stands for “External Peripheral”, and is set when an external peripheral wants service (more about that mechanism later on). It’s cleared by one of the “done” variants, so this is the flag used when a peripheral kicks off a channel, and the script quits.
- HO[i] stands for “Host override”, and is controlled solely by a register written to by the application processor. Its purpose is to make the left-hand side of the expression always true, when we want the channel’s eligibility to be controlled by the peripheral only.
- EO[i] stands for “External override”, and is like HO[i] in the way it’s handled. This flag is set when we want the channel’s eligibility controlled by the host only.
There are four registers in the application processor’s memory space, which are used to alter these flags: STOP_STAT, HSTART, EVTOVR and HOSTOVR. They are outlined in sections 52.12.3.3-52.12.3.7 in the Reference Manual.
The full truth is that there’s also a DO[i] flag mentioned (controlled by the DSPOVR register), but it must be held ‘1’ on i.MX51 devices, so let’s ignore it.
So if our case is the application processor controlling the i’th SDMA channel for offload operation, it sets EO[i], clears HO[i], and then sets HE[i] whenever it wants to have the script running. The script may clear HE[i] with a “done” instruction, or the application processor may clear it when appropriate. For example, the script can trigger an interrupt on the application processor, which clears the flag (even though I can’t see when this would be the right way to do it).
In the case of channels being started by a peripheral, the application processor sets HO[i] and clears EO[i]. Certain events (as discussed next) set the EP[i] flag directly, and the script’s “done” instruction clears it.
Keep in mind that the script may not run continuously: It should execute “yield” instructions every now and then to give other channels a chance to use the SDMA core, but since neither HE[i] nor EP[i] is affected by yields, the script will keep running until it’s, well, done.
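To make the two setups above concrete, here is a minimal C model of the eligibility expression. It is only a sketch: the struct and the function names are made up for illustration and don't correspond to any register layout or driver API, and the DO[i] flag is left out, as in the text.

```c
#include <stdbool.h>

/* Hypothetical model of one channel's flags (not a register layout). */
struct sdma_chan_flags {
    bool he; /* Host Enable: set by the host, cleared by "done" */
    bool ho; /* Host Override: controlled by the host only */
    bool ep; /* External Peripheral: set by an event, cleared by a "done" variant */
    bool eo; /* External Override: controlled by the host only */
};

/* (HE[i] or HO[i]) and (EP[i] or EO[i]) */
bool sdma_chan_eligible(const struct sdma_chan_flags *f)
{
    return (f->he || f->ho) && (f->ep || f->eo);
}

/* Effect of the "done" variant that clears HE[i] (host-started channel). */
void sdma_done_clears_he(struct sdma_chan_flags *f)
{
    f->he = false;
}

/* Effect of the "done" variant that clears EP[i] (peripheral-started channel). */
void sdma_done_clears_ep(struct sdma_chan_flags *f)
{
    f->ep = false;
}
```

With EO set and HO clear, the channel follows HE alone; with HO set and EO clear, it follows EP alone. A yield touches none of these flags, which is why the channel stays eligible after it.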
There is a possibility to reset the SDMA core or force a reschedule with the SDMA’s RESET register, but that’s really something for emergencies (e.g. a runaway script).
So much for part II. You may want to go on with Part III: Events and Interrupts
This is part I of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6, is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.
Freescale’s Linux drivers for DMA also vary significantly across different kernel releases. It looks like they had two competing sets of code, and couldn’t make up their minds which one to publish.
This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into parts goes as follows:
NOTE: For more information, in particular on SDMA for i.MX6 and i.MX7, there’s a follow-up post written by Jonah Petri.
Introduction
Behind all the nice words, the SDMA subsystem is just a small and simple RISC processor core, with its private memory space and some specialized functional units. It works side-by-side with the main ARM processor (the application processor henceforth), and pretty much detached from it. Special registers allow the application processor to control the SDMA’s core, and special commands on the SDMA’s core allow it to access the application processor’s memory space and send it interrupts. But in their natural flow, the two don’t interact.
The underlying idea behind the SDMA core is that instead of hardwiring the DMA subsystem’s capabilities and possible behaviors, why not write small programs (scripts henceforth), which perform the necessary memory operations? By doing so, the possible DMA operations and variants are not predefined by the chip’s vendor; the classic DMA operations are still possible and available with vendor-supplied scripts, but the DMA subsystem can be literally programmed to do a lot of other things. Offload RAID xoring is an example of something that can be taken off the main processor, as the data is being copied from disk buffers to the peripherals with DMA.
Scripts are kicked off either by some internal event (say, some peripheral has data to offer) or directly by the main processor’s software (e.g. an offload memcpy). The SDMA processor’s instruction set is simple, all opcodes occupying exactly 16 bits in program memory. Its assembler can be acquired from Freescale, or you can download my mini-assembler, which is suitable for small projects (in part IV).
Chapter 52 in the Reference Manual is dedicated to the SDMA, but unfortunately it’s not easy reading. In the hope of clarifying a few things, I’ve written down the basics. Please keep in mind that the purpose of my own project was to perform memory-to-memory transfers triggered autonomously by an external device, so I’ve given very little attention to the built-in scripts and handling DMA from built-in peripherals.
Quirky memory issues
I wouldn’t usually start the presentation of a processor with its memory map and addressing, but in this case it’s necessary, as it’s a major source of confusion.
The SDMA core processor has its own memory space, which is completely detached from the application processor’s. There are two modes of access to the memory space: Instruction mode and data mode.
Instruction mode is used in the context of jumps, branches and when calling built-in subroutines which were written with program memory in mind. In this mode, the address points at a 16-bit word (which matches the size of an opcode), so the program counter is incremented (by one) between each instruction (except for jumps, of course).
Data mode is used when reading from the SDMA’s memory (e.g. loading registers) or writing to it. This should not be confused with the application processor’s memory (the one Linux sees, for example), which is not directly accessible by the SDMA core. In data mode, addressing works on 32-bit words, so incrementing the data mode address (by one) means moving forward four bytes.
Instruction mode and data mode addressing points at exactly the same physical memory space. It’s possible to write data to RAM in data mode, and then execute it as a script, the latter essentially reading from RAM in instruction mode. It’s important to note, that different addresses will be used for each. This is best explained with a simple example:
Suppose that we want to run a routine (script) written by ourselves. To do so, it has to be copied into the internal RAM first. How to do that is explained in part IV, but let’s assume that we want to execute our script with a JMP instruction to 0x1800. This is 12 kB from the zero-address of the memory map, since the 0x1800 address is given in 16-bit quanta (2 bytes per address count). After the script is loaded in its correct place, we’ll be able to read the first instruction (as a piece of data) as follows: Set one of the SDMA processor’s registers to the value 0x0c00, and then load from the address pointed to by that register. The address, 0x0c00, is given in 32-bit quanta (4 bytes per address count), so it hits exactly the same place: 12 kB from zero-address. And since we’re reading 32 bits, we’ll read the first instruction as well as the second at the same time.
Let’s say it loud and clear:
Instruction mode addresses are always double their data mode equivalents.
As for endianness, the SDMA core thinks Big Endian all the way through. That means that when reading two assembly opcodes from memory in data mode, we get a 32-bit word, for which the first instruction is on bits [31:16] and the instruction following it on bits [15:0].
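The address arithmetic and the opcode packing can be summed up in a few C helpers. This is just a sketch of the rules stated above; the function names are my own invention and not part of any SDMA API.

```c
#include <stdint.h>

/* Instruction-mode addresses count 16-bit words (2 bytes each). */
uint32_t sdma_instr_to_byte_offset(uint32_t instr_addr)
{
    return instr_addr * 2u;
}

/* Data-mode addresses count 32-bit words (4 bytes each). */
uint32_t sdma_data_to_byte_offset(uint32_t data_addr)
{
    return data_addr * 4u;
}

/* Instruction-mode address of the first opcode in a data-mode word. */
uint32_t sdma_data_to_instr(uint32_t data_addr)
{
    return data_addr * 2u;
}

/* Data-mode address of the 32-bit word containing an opcode. */
uint32_t sdma_instr_to_data(uint32_t instr_addr)
{
    return instr_addr / 2u;
}

/* Big Endian packing: the first opcode sits on bits [31:16]. */
uint16_t sdma_first_opcode(uint32_t word)
{
    return (uint16_t)(word >> 16);
}

/* ...and the opcode following it on bits [15:0]. */
uint16_t sdma_second_opcode(uint32_t word)
{
    return (uint16_t)(word & 0xffffu);
}
```

For the example above, sdma_data_to_instr(0x0c00) gives 0x1800, and both addresses translate to a byte offset of 12 kB.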
The memory map
Since we’re at it, and since the Reference Manual has this information spread all over, here’s a short outline of what’s mapped where, in data addresses.
- 0x0000-0x03ff: 4 kB of internal ROM with boot code and standard routines
- 0x0400-0x07ff: 4 kB of reserved space. No access at all should take place here
- 0x0800-0x0bff: 4 kB of internal RAM, containing the 32 channels’ contexts (each context is 32 words of 4 bytes each, when SMSZ is set in the CHN0ADDR register). More about this in part II. For the details, see Section 52.13.4 in the Reference Manual. When SMSZ is clear, this segment is 3 kB only (see 52.4.4).
- 0x0c00-0x0fff: 4 kB of internal RAM, free for end-user application scripts and data.
- 0x1000-0x6fff: Peripherals 1-6 memory space
- 0x7000-0x7fff: SDMA registers, as accessed directly by the SDMA core (as detailed in section 52.14 of the reference manual)
- 0x8000-0xffff: Peripherals 7-14 memory space (not accessible in program memory space)
The two regions of peripheral memory space are the preferred way to access peripherals (unlike the implementation in Linux drivers using SDMA script) as discussed in another post of mine.
And once again: The memory map above is given in data addresses. The memory map in program memory space is the same, only all addresses are double.
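For reference, the same map can be written down as a C lookup table, in data addresses. This is an illustrative sketch only: the names are hypothetical, and sdma_context_addr() assumes that the 32 contexts are laid out consecutively, 32 words each (SMSZ set), which is what the size of the context segment suggests.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per segment of the memory map, in data (32-bit word) addresses. */
struct sdma_region {
    uint16_t first;
    uint16_t last;
    const char *name;
};

static const struct sdma_region sdma_map[] = {
    { 0x0000, 0x03ff, "ROM" },
    { 0x0400, 0x07ff, "reserved" },
    { 0x0800, 0x0bff, "context RAM" },
    { 0x0c00, 0x0fff, "script/data RAM" },
    { 0x1000, 0x6fff, "peripherals 1-6" },
    { 0x7000, 0x7fff, "SDMA registers" },
    { 0x8000, 0xffff, "peripherals 7-14" },
};

/* Returns the name of the segment a data address falls into. */
const char *sdma_region_name(uint16_t data_addr)
{
    size_t i;

    for (i = 0; i < sizeof(sdma_map) / sizeof(sdma_map[0]); i++)
        if (data_addr >= sdma_map[i].first && data_addr <= sdma_map[i].last)
            return sdma_map[i].name;
    return NULL; /* Not reached: the table covers all 16-bit addresses */
}

/* Assumption: contexts are consecutive, 32 words per channel (SMSZ set). */
uint16_t sdma_context_addr(unsigned int channel)
{
    return (uint16_t)(0x0800u + 32u * channel);
}
```

Doubling any of these addresses gives the program memory space equivalent.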
So much for part I. You may want to go on with Part II: Contexts, Channels, Scripts and their execution
This is a small, but annoying thing about WordPress. They obviously didn’t consider the “0x” hexadecimal notation. What they did consider, was that if someone says “2x2” that surely means “2 times 2”, so why not make that “x” in the middle fancy? Well, maybe because that makes “0x123”, which is a hexadecimal number, look weird.
The fix is in the core PHP files of WordPress, so this probably needs to be fixed every time WordPress is updated.
In wp-includes/formatting.php, in the wptexturize() function, probably around lines 50-51 there’s something like:
$dynamic_characters = array('/\'(\d\d(?:’|\')?s)/', '/(\s|\A|")\'/', '/(\d+)"/', '/(\d+)\'/', '/(\S)\'([^\'\s])/', '/(\s|\A)"(?!\s)/', '/"(\s|\S|\Z)/', '/\'([\s.]|\Z)/', '/\b(\d+)x(\d+)\b/');
$dynamic_replacements = array('’$1','$1‘', '$1″', '$1′', '$1’$2', '$1“$2', '”$1', '’$1', '$1×$2');
The last entry in each array replaces two numbers with an “x” between them with a fancy “times” symbol (Unicode character #215). So just remove the last entry from both arrays. Remove the commas preceding them as well, of course.
Maybe the 100% correct way to fix this is to use a better regular expression, instead of '/\b(\d+)x(\d+)\b/'. I’m not sure about regular expressions in PHP, but in Perl I would try '/\b([1-9]\d*)x(\d+)\b/', so it wouldn’t match the “0x” notation. It wouldn’t match “02x2” or any other number prefixed with zeros, but this is not something normal people write anyhow.
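To check what the suggested pattern would and wouldn't catch, here is a small C test using POSIX regcomp(). Note that this is an analogue, not the PHP or Perl expression itself: POSIX extended regular expressions have neither \b nor \d, so the word boundaries are spelled out explicitly. The function name is made up.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* POSIX ERE stand-in for /\b([1-9]\d*)x(\d+)\b/: the boundaries are written
 * as "start/end of string or a non-word character". */
#define TIMES_PATTERN "(^|[^[:alnum:]_])[1-9][0-9]*x[0-9]+($|[^[:alnum:]_])"

/* Returns true if the text contains something like "2x2", but not "0x123". */
bool looks_like_times(const char *text)
{
    regex_t re;
    bool found;

    if (regcomp(&re, TIMES_PATTERN, REG_EXTENDED | REG_NOSUB) != 0)
        return false; /* Pattern failed to compile */

    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

With this, “2x2” and “a 10x20 grid” are matched, while “0x123” and “02x2” are left alone.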
It may help to investigate the interrupt descriptors. For a 2.6.38 kernel, putting this in a kernel module’s load routine supplies some information (includes, declarations and code are mixed below; organize them properly in your own module):
#include <linux/irq.h>
#include <linux/interrupt.h>
#include <asm/irq.h>
int i;
struct irq_desc *desc;

for_each_irq_desc(i, desc) {
    if (!desc)
        continue;
    printk(KERN_INFO "%d: status=%08x, chip=%08x, handle_irq=%08x\n",
           i, (u32) desc->status, (u32) desc->chip, (u32) desc->handle_irq );
}
This dumps some meta information about all possible IRQs on the system. Also be sure to look at /proc/interrupts.
Have a look in include/linux/irq.h for the meaning of the flags in desc->status and possibly include/linux/irqdesc.h for the irq_desc structure.
request_irq() may very well fail because the IRQ_NOREQUEST flag was set in status. On ARM architecture, this can be fixed by calling set_irq_flags(irq, IRQF_VALID) assuming that you have a fairly good idea of what you’re doing.
Note that set_irq_chip_and_handler() is usually called before validating an IRQ, so that Linux knows what to do with the interrupt as it happens. Looking at chip and handle_irq in the dump may give a clue about how necessary this is. Searching for the value of handle_irq in /proc/kallsyms (with a simple grep) tells who handles each interrupt.
The “chip” structure is a container for information and methods specific to the interrupt’s owner. In the old days, these belonged to peripheral chips, but a “chip” is often just a group of interrupts having a common way of handling them (setting trigger type, masking etc.).
A final note: It looks like the API is changing rapidly in this area, so don’t expect things to be exactly the same on other kernels.
What we have here
As one can guess from my notes about the i.MX51’s external bus and the oscilloscope shots I’ve published, I made myself a small dissection kit for watching the activity on the bus’ lines with a digital oscilloscope.
This is a good time to mention that the kit was made quick and dirty, so the code below should not be taken as an example of proper coding for FPGA nor the Linux kernel. Seriously, this is just lab code.
Anyhow, this little kit consists of two parts:
- Verilog code and UCF for programming the FPGA. Except for blinking the LED (at 1 Hz), it also wires all EIM-bus related signals to the FPGA pin headers on the development board, so they can be sampled easily with oscilloscope probes. You can download the bitfile directly if your board has the LX9 FPGA, or implement it from the sources below.
- A kernel module, which performs a single bus operation when it’s loaded. It’s explained further below. If you happen to be running on a 2.6.38.1 Linux kernel on your board (in particular the 2.6.38.1 which comes preloaded on the board), you may try using the precompiled kernel module. Or do it the “right way” and compile the module from the sources below.
The Verilog code below pretty much explains itself. And as the comments in the UCF say, the “debug_pins_outer” pin vector runs from pin #38 downwards continuously, on even pins only, on the outer FPGA pin header. This may sound complicated, but it simply means that out of the two rows of this pin header, only the row reached easily with a probe is used. And since pin #40 (in the corner) isn’t attached to the FPGA, debug_pins_outer[0] is connected to pin #38, debug_pins_outer[1] to #36 and so on.
As for “debug_pins_inner”, it goes more or less the same. Going from pin #3 for debug_pins_inner[0] and up on odd pin numbers, only the inner pin row of the inner pin header is used for easy physical access.
This may look like a weird choice of pin assignments, but this was the only way to get the vectors assigned on the pin headers without any gaps between them, so it’s easy to reach any signal in the vectors just by counting pins on the pin header.
Please make sure that the two “FPGA bank” jumpers are installed on your board, or nothing will appear on the pin headers. These jumpers were installed on the board as I got it, so just check it’s OK.
It’s also worth noting that debug_pins_outer[4] happens to be connected to a pin which is shared with a pushbutton on the board. Since the line is pulled up with a 10 kOhm resistor, this line may have some timing skew.
Simple use
Assuming that both the bitfile and the kernel module are in the current directory, first load the FPGA if you haven’t done so already:
# load_fpga armaled.bit
A green LED should start blinking as a result of this. Note that according to Armadeus’ wiki page on the FPGA loader, armaled.bit should not be on the on-board flash. Copy it to /tmp first (which is in RAM) or load it from a network drive (e.g. NFS) like I did.
And then, to kick off a bus cycle, load the module and catch it on the oscilloscope:
# insmod eimtest.ko
And then unload the module, so you can load it again for the next try:
# rmmod eimtest
The relevant bus parameters can be set directly when loading the module. For example, to add an extra bus wait state, disable continuous bus clock, run at 1/4 bus rate and use bus address 0xABC0, go:
# insmod eimtest.ko WSC=2 BCD=3 BCM=0 addr=0xabc0
A list of kernel module parameters, which in turn change the bus parameters, is found in the kernel module’s source. Anything declared with “module_param” can be set. The defaults are given in the variable declarations. Setting the address and data is also possible, but be sure not to exceed the address 0xFFFC, or you’ll get a kernel oops. Also note that addresses not aligned to 32-bit words will produce several bus cycles.
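The two address restrictions can be expressed as simple checks. The sketch below is not part of the kernel module (which does no such checking); the names are hypothetical, and the 64 kB window size is taken from the SZ_64K given to ioremap() in the module's source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the window mapped by the module (SZ_64K in its ioremap() call). */
#define EIM_WINDOW_SIZE 0x10000u

/* A 32-bit access at addr touches bytes addr..addr+3, so the last
 * address that stays inside the window is 0xFFFC. */
bool eim_addr_in_window(uint32_t addr)
{
    return addr <= EIM_WINDOW_SIZE - 4u;
}

/* Only addresses aligned to 32-bit words give a single bus operation. */
bool eim_addr_aligned(uint32_t addr)
{
    return (addr & 3u) == 0;
}
```

Both the default address (0x1234) and the 0xabc0 used above pass both checks.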
The Verilog code
Note that the direct wire connections have a variable delay. This results in some unknown skew (1-2ns, I suppose) between the outputs.
module armaled
  (
   input         ext_clk,
   output reg    led,
   output        irq,
   input [15:0]  imx51_da,
   input         imx51_cs1,
   input         imx51_cs2,
   input         imx51_adv,
   input         imx51_we,
   input         imx51_eb0,
   input         imx51_eb1,
   input         imx51_oe,
   input         imx51_dtack,
   input         imx51_wait,
   input         imx51_bclk,
   input         imx51_clko,
   output [13:0] debug_pins_inner,
   output [12:0] debug_pins_outer
   );

   reg [27:0]    counter;

   assign irq = 0;

   // Bus control signals go to the outer pin header
   assign debug_pins_outer[0] = imx51_bclk;
   assign debug_pins_outer[1] = imx51_clko;
   assign debug_pins_outer[2] = imx51_oe;
   assign debug_pins_outer[3] = imx51_cs1;
   assign debug_pins_outer[4] = imx51_cs2;
   assign debug_pins_outer[5] = imx51_adv;
   assign debug_pins_outer[6] = imx51_we;
   assign debug_pins_outer[7] = imx51_eb0;
   assign debug_pins_outer[8] = imx51_eb1;
   assign debug_pins_outer[9] = imx51_dtack;
   assign debug_pins_outer[10] = imx51_wait;

   // The address/data lines are split between the two pin headers
   assign debug_pins_outer[12:11] = imx51_da[15:14];
   assign debug_pins_inner = imx51_da[13:0];

   // LED blinker: toggle the LED each time the counter reaches its limit
   always @(posedge ext_clk)
     begin
        if (counter >= 47500000)
          begin
             led <= !led;
             counter <= 0;
          end
        else
          counter <= counter + 1;
     end
endmodule
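As a sanity check of the counter limit in the Verilog code above, here is the blink rate calculation in C. The function name is made up, and the 95 MHz figure is an assumption based on the bus clock frequency I measured, since ext_clk is fed by BCLK.

```c
/* The counter runs from 0 to limit inclusive, so the LED toggles once
 * every (limit + 1) clock cycles. A full blink period is two toggles. */
double led_blink_hz(double clk_hz, unsigned long limit)
{
    return clk_hz / (2.0 * ((double)limit + 1.0));
}
```

Calling led_blink_hz(95e6, 47500000UL) gives a hair under 1 Hz, which agrees with the LED's blink rate mentioned earlier.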
The UCF file
NET "ext_clk" TNM_NET = "TN_ext_clk";
TIMESPEC "TS_ext_clk" = PERIOD "TN_ext_clk" 10.4 ns HIGH 50 %;
NET "led" LOC="G14" | IOSTANDARD=LVCMOS33;# IO_L41P_GCLK9_IRDY1_M1RASN_1
#NET "button" LOC="G15" | IOSTANDARD=LVCMOS33;# IO_L41N_GCLK8_M1CASN_1
NET "ext_clk" LOC="N8" | IOSTANDARD=LVCMOS33;# = BCLK, IO_L29P_GCLK3_2
NET "irq" LOC="P3" | IOSTANDARD=LVCMOS33;# FPGA_INITB
# Debug pins.
# The "inner" set starts from pin #3, running on odd pins only (effectively
# covering the pins convenient to attach a scope's probe to)
NET "debug_pins_inner[0]" LOC="L2" | IOSTANDARD=LVCMOS33;# IO_L39P_M3LDQS_3
NET "debug_pins_inner[1]" LOC="J2" | IOSTANDARD=LVCMOS33;# IO_L41P_GCLK27_M3DQ4_3
NET "debug_pins_inner[2]" LOC="K4" | IOSTANDARD=LVCMOS33;# IO_L43P_GCLK23_M3RASN_3
NET "debug_pins_inner[3]" LOC="K5" | IOSTANDARD=LVCMOS33;# IO_L45P_M3A3_3
NET "debug_pins_inner[4]" LOC="C2" | IOSTANDARD=LVCMOS33;# IO_L83P_3
NET "debug_pins_inner[5]" LOC="D4" | IOSTANDARD=LVCMOS33;# IO_L53P_M3CKE_3
NET "debug_pins_inner[6]" LOC="K3" | IOSTANDARD=LVCMOS33;# IO_L40P_M3DQ6_3
NET "debug_pins_inner[7]" LOC="H3" | IOSTANDARD=LVCMOS33;# IO_L42P_GCLK25_TRDY2_M3UDM_3
NET "debug_pins_inner[8]" LOC="G2" | IOSTANDARD=LVCMOS33;# IO_L44P_GCLK21_M3A5_3
NET "debug_pins_inner[9]" LOC="F3" | IOSTANDARD=LVCMOS33;# IO_L46P_M3CLK_3
NET "debug_pins_inner[10]" LOC="D3" | IOSTANDARD=LVCMOS33;# IO_L54P_M3RESET_3
NET "debug_pins_inner[11]" LOC="E2" | IOSTANDARD=LVCMOS33;# IO_L52P_M3A8_3
NET "debug_pins_inner[12]" LOC="K13" | IOSTANDARD=LVCMOS33;# IO_L44P_A3_M1DQ6_1
NET "debug_pins_inner[13]" LOC="H13" | IOSTANDARD=LVCMOS33;# IO_L42P_GCLK7_M1UDM_1
# The "outer" set starts from pin #38, running on even pins only (effectively
# covering the pins convenient to attach a scope's probe to). Note that the
# vectors runs from high board pin number to low.
NET "debug_pins_outer[0]" LOC="B15" | IOSTANDARD=LVCMOS33;# IO_L1N_A24_VREF_1
NET "debug_pins_outer[1]" LOC="C15" | IOSTANDARD=LVCMOS33;# IO_L33N_A14_M1A4_1
NET "debug_pins_outer[2]" LOC="D15" | IOSTANDARD=LVCMOS33;# IO_L35N_A10_M1A2_1
NET "debug_pins_outer[3]" LOC="E15" | IOSTANDARD=LVCMOS33;# IO_L37N_A6_M1A1_1
NET "debug_pins_outer[4]" LOC="G15" | IOSTANDARD=LVCMOS33;# IO_L41N_GCLK8_M1CASN_1
NET "debug_pins_outer[5]" LOC="J15" | IOSTANDARD=LVCMOS33;# IO_L43N_GCLK4_M1DQ5_1
NET "debug_pins_outer[6]" LOC="L15" | IOSTANDARD=LVCMOS33;# IO_L45N_A0_M1LDQSN_1
NET "debug_pins_outer[7]" LOC="G12" | IOSTANDARD=LVCMOS33;# IO_L30N_A20_M1A11_1
NET "debug_pins_outer[8]" LOC="F12" | IOSTANDARD=LVCMOS33;# IO_L31N_A18_M1A12_1
NET "debug_pins_outer[9]" LOC="H11" | IOSTANDARD=LVCMOS33;# IO_L32N_A16_M1A9_1
NET "debug_pins_outer[10]" LOC="G13" | IOSTANDARD=LVCMOS33;# IO_L34N_A12_M1BA2_1
NET "debug_pins_outer[11]" LOC="J13" | IOSTANDARD=LVCMOS33;# IO_L36N_A8_M1BA1_1
NET "debug_pins_outer[12]" LOC="K11" | IOSTANDARD=LVCMOS33;# IO_L38N_A4_M1CLKN_1
# i.MX51 related pins
NET "imx51_cs1" LOC="R11" | IOSTANDARD=LVCMOS33;# EIM_CS1
NET "imx51_cs2" LOC="N9" | IOSTANDARD=LVCMOS33;# EIM_CS2
NET "imx51_adv" LOC="R9" | IOSTANDARD=LVCMOS33;# EIM_LBA
NET "imx51_we" LOC="R6" | IOSTANDARD=LVCMOS33;# EIM_RW
NET "imx51_eb0" LOC="P7" | IOSTANDARD=LVCMOS33;
NET "imx51_eb1" LOC="P13" | IOSTANDARD=LVCMOS33;
NET "imx51_oe" LOC="R7" | IOSTANDARD=LVCMOS33;
NET "imx51_dtack" LOC="N4" | IOSTANDARD=LVCMOS33;
NET "imx51_wait" LOC="R4" | IOSTANDARD=LVCMOS33;
NET "imx51_bclk" LOC="N12" | IOSTANDARD=LVCMOS33; # Hardwired to N8
NET "imx51_clko" LOC="N7" | IOSTANDARD=LVCMOS33;
NET "imx51_da[7]" LOC="P11" | IOSTANDARD=LVCMOS33;# EIM_DA7
NET "imx51_da[6]" LOC="M11" | IOSTANDARD=LVCMOS33;# EIM_DA6
NET "imx51_da[5]" LOC="N11" | IOSTANDARD=LVCMOS33;# EIM_DA5
NET "imx51_da[13]" LOC="R10" | IOSTANDARD=LVCMOS33;# EIM_DA13
NET "imx51_da[12]" LOC="L9" | IOSTANDARD=LVCMOS33;# EIM_DA12
NET "imx51_da[11]" LOC="M10" | IOSTANDARD=LVCMOS33;# EIM_DA11
NET "imx51_da[10]" LOC="M8" | IOSTANDARD=LVCMOS33;# EIM_DA10
NET "imx51_da[9]" LOC="K8" | IOSTANDARD=LVCMOS33;# EIM_DA9
NET "imx51_da[8]" LOC="L8" | IOSTANDARD=LVCMOS33;# EIM_DA8
NET "imx51_da[0]" LOC="N6" | IOSTANDARD=LVCMOS33;# EIM_DA0
NET "imx51_da[4]" LOC="P5" | IOSTANDARD=LVCMOS33;# EIM_DA4
NET "imx51_da[3]" LOC="R5" | IOSTANDARD=LVCMOS33;# EIM_DA3
NET "imx51_da[2]" LOC="L6" | IOSTANDARD=LVCMOS33;# EIM_DA2
NET "imx51_da[1]" LOC="L5" | IOSTANDARD=LVCMOS33;# EIM_DA1
NET "imx51_da[15]" LOC="M5" | IOSTANDARD=LVCMOS33;# EIM_DA15
NET "imx51_da[14]" LOC="N5" | IOSTANDARD=LVCMOS33;# EIM_DA14
The kernel module
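It currently reads one word from the bus. A write operation is obtained by commenting and uncommenting the lines following the “Uncomment as necessary” remark, near the end of the init function.
#include <linux/module.h>
#include <linux/init.h>
#include <linux/version.h>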
#include <linux/platform_device.h>
#include <linux/delay.h>
#include <linux/gpio.h>
#include <linux/io.h>
#include <asm/io.h>
#include <mach/iomux-mx51.h>
#include <mach/fpga.h>
#include <mach/hardware.h>
MODULE_DESCRIPTION("EIM interface test module");
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Eli Billauer");
#define EIMTEST ""
static int PSZ = 0;
static int AUS = 1;
static int BCS = 0;
static int BCD = 0;
static int BL = 0;
static int FL = 1; // Cover RFL and WFL alike
static int WC = 0;
static int ADH = 0;
static int WSC = 1;
static int ADVA = 0; // RADVA and WADVA
static int ADVN = 0; // RADVN and WADVN
static int OEA = 0;
static int CSA = 0; // RCSA and WCSA
static int RL = 0;
static int BEA = 0;
static int BE = 1;
static int WEA = 0;
static int INTPOL = 1; // Interrupt polarity
static int INTEN = 0; // Interrupt enable
static int GBCD = 0; // Burst clock divisor
static int BCM = 1; // Burst clock mode (set continuous here)
static int addr = 0x00001234;
static int data = 0xFFFF5555;
module_param(PSZ, int, 0);
module_param(AUS, int, 0);
module_param(BCS, int, 0);
module_param(BCD, int, 0);
module_param(BL, int, 0);
module_param(FL, int, 0);
module_param(WC, int, 0);
module_param(ADH, int, 0);
module_param(WSC, int, 0);
module_param(ADVA, int, 0);
module_param(ADVN, int, 0);
module_param(OEA, int, 0);
module_param(CSA, int, 0);
module_param(RL, int, 0);
module_param(BEA, int, 0);
module_param(BE, int, 0);
module_param(WEA, int, 0);
module_param(INTPOL, int, 0);
module_param(INTEN, int, 0);
module_param(GBCD, int, 0);
module_param(BCM, int, 0);
module_param(data, int, 0);
module_param(addr, int, 0);
static u32 readreg(int offset) {
    return __raw_readl( MX51_IO_ADDRESS(MX51_WEIM_BASE_ADDR) + offset);
}

static void writereg(int offset, u32 val) {
    __raw_writel(val, MX51_IO_ADDRESS(MX51_WEIM_BASE_ADDR) + offset);
}

/* Mask val down to "bits" bits and shift it into place */
static u32 bitfield(int shift, int bits, int val) {
    return ((val & ( ( 1 << bits ) - 1 ) ) << shift);
}

static void eimtest_cleanup_module(void) {
}

static int eimtest_init_module(void)
{
    int result = 0;
    void __iomem *cs2_base;
    u32 GCR1, GCR2, RCR1, RCR2, WCR1, WEIMCR;
    iomux_v3_cfg_t iomux_cs2 = MX51_PAD_EIM_CS2__EIM_CS2;

    mxc_iomux_v3_setup_pad(iomux_cs2);

    /* Compose the CS2 register values from the module parameters */
    GCR1 = 0x0111008f |
        bitfield(28, 4, PSZ) |
        bitfield(23, 1, AUS) |
        bitfield(14, 2, BCS) |
        bitfield(12, 2, BCD) |
        bitfield(11, 1, WC) |
        bitfield(8, 3, BL) |
        bitfield(5, 1, FL) |
        bitfield(4, 1, FL);

    GCR2 = bitfield(0, 2, ADH);

    RCR1 =
        bitfield(24, 6, WSC) |
        bitfield(20, 3, ADVA) |
        bitfield(16, 3, ADVN) |
        bitfield(12, 3, OEA) |
        bitfield(4, 3, CSA);

    RCR2 =
        bitfield(8, 2, RL) |
        bitfield(4, 3, BEA) |
        bitfield(3, 1, BE);

    WCR1 =
        bitfield(30, 1, !BE) |
        bitfield(24, 6, WSC) |
        bitfield(21, 3, ADVA) |
        bitfield(18, 3, ADVN) |
        bitfield(15, 3, BEA) |
        bitfield(9, 3, WEA) |
        bitfield(3, 3, CSA);

    WEIMCR =
        bitfield(5, 1, INTPOL) |
        bitfield(4, 1, INTEN) |
        bitfield(1, 2, GBCD) |
        bitfield(0, 1, BCM);

    writereg(0x30, GCR1);
    writereg(0x34, GCR2);
    writereg(0x38, RCR1);
    writereg(0x3c, RCR2);
    writereg(0x40, WCR1);
    writereg(0x90, WEIMCR);

    /* Read the registers back and dump them to the kernel log */
    printk(KERN_WARNING EIMTEST "CS2GCR1=%08x, CS2GCR2=%08x\n",
           readreg(0x30),
           readreg(0x34)
           );
    printk(KERN_WARNING EIMTEST "CS2RCR1=%08x, CS2RCR2=%08x\n",
           readreg(0x38),
           readreg(0x3c)
           );
    printk(KERN_WARNING EIMTEST "CS2WCR1=%08x, CS2WCR2=%08x\n",
           readreg(0x40),
           readreg(0x44)
           );
    printk(KERN_WARNING EIMTEST "WEIM Config register WCR=%08x\n",
           readreg(0x90));
    printk(KERN_WARNING EIMTEST "WEIM IP Access register WIAR=%08x\n",
           readreg(0x94));
    printk(KERN_WARNING EIMTEST "CCM_CBCDR=%08x\n",
           __raw_readl(MX51_IO_ADDRESS(0x73fd4014)));

    cs2_base = ioremap(MX51_CS2_BASE_ADDR, SZ_64K);

    if (!cs2_base) {
        printk(KERN_WARNING EIMTEST "Failed to obtain I/O space\n");
        return -ENODEV;
    }

    // Uncomment as necessary:
    //__raw_writel(data, cs2_base + addr);
    printk(KERN_WARNING EIMTEST "Read data=%08x\n",
           __raw_readl(cs2_base + addr));

    iounmap(cs2_base);
    return result;
}
module_init(eimtest_init_module);
module_exit(eimtest_cleanup_module);
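To predict what the module is going to write to the registers for a given set of parameters, the bit field composition can be reproduced in user space. The following is a sketch that mirrors the shifts and widths in the module above; the struct and function names are hypothetical, and only the registers that show up in my register dumps are covered.

```c
#include <stdint.h>

/* Same helper as in the module: mask val to "bits" bits, then shift. */
uint32_t bitfield(int shift, int bits, uint32_t val)
{
    return (val & ((1u << bits) - 1u)) << shift;
}

/* Hypothetical container for the module parameters used below. */
struct eim_params {
    uint32_t PSZ, AUS, BCS, BCD, BL, FL, WC, WSC, ADVA, ADVN,
             OEA, CSA, BE, BEA, WEA, INTPOL, INTEN, GBCD, BCM;
};

/* Defaults copied from the module's variable declarations. */
const struct eim_params eim_defaults = {
    .AUS = 1, .FL = 1, .WSC = 1, .BE = 1, .INTPOL = 1, .BCM = 1,
};

uint32_t eim_gcr1(const struct eim_params *p)
{
    return 0x0111008f |
        bitfield(28, 4, p->PSZ) | bitfield(23, 1, p->AUS) |
        bitfield(14, 2, p->BCS) | bitfield(12, 2, p->BCD) |
        bitfield(11, 1, p->WC) | bitfield(8, 3, p->BL) |
        bitfield(5, 1, p->FL) | bitfield(4, 1, p->FL);
}

uint32_t eim_rcr1(const struct eim_params *p)
{
    return bitfield(24, 6, p->WSC) | bitfield(20, 3, p->ADVA) |
        bitfield(16, 3, p->ADVN) | bitfield(12, 3, p->OEA) |
        bitfield(4, 3, p->CSA);
}

uint32_t eim_wcr1(const struct eim_params *p)
{
    return bitfield(30, 1, !p->BE) | bitfield(24, 6, p->WSC) |
        bitfield(21, 3, p->ADVA) | bitfield(18, 3, p->ADVN) |
        bitfield(15, 3, p->BEA) | bitfield(9, 3, p->WEA) |
        bitfield(3, 3, p->CSA);
}

uint32_t eim_weimcr(const struct eim_params *p)
{
    return bitfield(5, 1, p->INTPOL) | bitfield(4, 1, p->INTEN) |
        bitfield(1, 2, p->GBCD) | bitfield(0, 1, p->BCM);
}
```

With the defaults, eim_gcr1() gives 0x019100bf and eim_weimcr() gives 0x21. Setting BCD=3 and BCM=0 turns these into 0x019130bf and 0x20, respectively.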
The Makefile
This is a more-or-less standard Makefile for compiling a kernel module. Please note that /path/to must be changed (twice) to where your Armadeus buildroot is, because both the crosscompiler and Linux kernel are referenced.
export CROSS_COMPILE=/path/to/armadeus-4.0/buildroot/output/build/staging_dir/usr/bin/arm-unknown-linux-uclibcgnueabi-
ifneq ($(KERNELRELEASE),)
obj-m := eimtest.o
else
KDIR := /path/to/armadeus-4.0/buildroot/output/build/linux-2.6.38.1
PWD := $(shell pwd)
default:
	$(MAKE) CROSS_COMPILE=$(CROSS_COMPILE) -C $(KDIR) SUBDIRS=$(PWD) modules
clean:
	@rm -f *.ko *.o modules.order Module.symvers *.mod.? *~
	@rm -rf .tmp_versions module.target
	@rm -f .eimtest.*
endif
So that’s it. Hope it’s helpful!
These are a few oscilloscope samples, some of which are pretty crude, showing Freescale’s i.MX51 accessing its address/data bus.
I worked with an Armadeus APF51 board, which has a 16-bit multiplexed bus connected to the Xilinx Spartan-6 FPGA. The FPGA was used to wire bus signals to a pin header, so 1-2 ns skews between signals are possible.
I wrote some code for the FPGA and processor on the board, for the sake of making these samples, which is available in another post of mine. I also wrote a general post about the EIM bus, which may come in handy.
A simple write cycle
With the default settings mentioned here, detailed registers in hex follow:
CS2GCR1=019100bf, CS2GCR2=00000000
CS2RCR1=01000000, CS2RCR2=00000008
CS2WCR1=01000000, CS2WCR2=00000000
WEIM Config register WCR=00000021
WEIM IP Access register WIAR=00000014
CCM_CBCDR=59ab7180

Traces from top to bottom (CH4 to CH1): BCLK, WE, CS2 and ADV (trigger on falling edge of CS2).
The BCLK doesn’t look much like a clock, and the signals are cluttered since the clock frequency is 95 MHz, the oscilloscope’s bandwidth is 200 MHz and the signals are picked up with simple probes from the FPGA pin header, so there’s a lot of crosstalk and other issues. But it’s good enough to see the general picture.
You’ll have to believe me that the address is present on the multiplexed address/data lines while the ADV is low (one clock cycle) and that the two other clock cycles carry the two data halves of the 32 bit word (the data width is only 16 bits). Honestly. I checked it out.
What can barely be seen in the scope image is that the bus signals switch on BCLK’s falling edges, and that they should be sampled on BCLK’s rising edges. But hey, that’s exactly what the datasheet says in section 4.6.7.3, table 53.
With non-continuous clock
The same as above, now with BCM=0, so the BCLK toggles only when the bus is working:
CS2GCR1=019100bf, CS2GCR2=00000000
CS2RCR1=01000000, CS2RCR2=00000008
CS2WCR1=01000000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=00000014

Nothing really interesting about this, actually.
Delaying the assertion of WE
Returning to the continuous clock, let’s delay WE by one WEIM clock (which happens to be one BCLK) by setting WEA=1
CS2GCR1=019100bf, CS2GCR2=00000000
CS2RCR1=01000000, CS2RCR2=00000008
CS2WCR1=01000200, CS2WCR2=00000000
WEIM Config register WCR=00000021
WEIM IP Access register WIAR=00000014

And nothing really happened here, including the other signals, which are not shown. Except that WE was indeed asserted later.
Adding a wait state
With the “simple write cycle” as the starting point, setting WWSC=2 (its default is 1) adds an extra wait state cycle:
CS2GCR1=019100bf, CS2GCR2=00000000
CS2RCR1=02000000, CS2RCR2=00000008
CS2WCR1=02000000, CS2WCR2=00000000
WEIM Config register WCR=00000021
WEIM IP Access register WIAR=00000014

Again, you’ll have to believe me that the first 16-bit data word is on the bus on both the second and third BCLK cycle. That is, the waitstate dwells on the first piece of data.
By the way, the waitstate count for read bursts was changed here as well, but that’s irrelevant. It’s just something my test kit did.
Bus clock division
To get a cleaner look, the next scope traces will be done with BCD=3, so the clock is divided by four. Continuous BCLK is also disabled by setting BCM=0, or otherwise there is no phase relation between BCLK and the bus signals.
So just by making these two changes relative to the “simple write cycle” we have
CS2GCR1=019130bf, CS2GCR2=00000000
CS2RCR1=01000000, CS2RCR2=00000008
CS2WCR1=01000000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=00000014

The time sweep is slower in this scope image, of course.
Bus clock division + adding a wait state
With the last trace as the starting point, setting WWSC=2 (its default is 1) adds an extra wait state cycle:
CS2GCR1=019130bf, CS2GCR2=00000000
CS2RCR1=02000000, CS2RCR2=00000008
CS2WCR1=02000000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=00000014

So we have four BCLKs instead of three, as one should expect.
A read cycle
Keeping the bus division (BCD=3 and BCM=0), and reverting everything else to the original setting, we’ll have a look at a read cycle. There’s no point in sampling WE anymore, so the probe moves to the OE signal instead. All in all, the traces from top to bottom (CH4 to CH1) are from now on: BCLK, OE, CS2 and ADV (trigger on falling edge of CS2).
CS2GCR1=019130bf, CS2GCR2=00000000
CS2RCR1=01000000, CS2RCR2=00000008
CS2WCR1=01000000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=00000014

As expected, there are two clock cycles with OE low. This is where the processor expects to get some data.
Delaying OE assertion
With the previous example as a starting point, setting OEA=2 yields the following:
CS2GCR1=019130bf, CS2GCR2=00000000
CS2RCR1=01002000, CS2RCR2=00000008
CS2WCR1=01000000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=00000014

This may come as a surprise: OE’s assertion point was delayed by two WEIM clocks, which with BCD=3 happens to be half a BCLK cycle. And nothing else changed.
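The arithmetic behind that half cycle is simple, but easy to get wrong when BCD changes. Here is a minimal sketch, assuming only what was said above: the delay fields count WEIM clocks, and BCLK is the WEIM clock divided by BCD + 1. The function name is made up for this example.

```python
from fractions import Fraction

def weim_clocks_to_bclk(weim_clocks, bcd):
    """Express a delay given in WEIM clocks as a fraction of a BCLK cycle."""
    return Fraction(weim_clocks, bcd + 1)

print(weim_clocks_to_bclk(2, bcd=3))   # 1/2
print(weim_clocks_to_bclk(2, bcd=0))   # 2
```

So the same OEA=2 that costs half a BCLK cycle with the clock divided by four would amount to two full BCLK cycles with no division at all.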
Delaying ADV assertion
With “A read cycle” as a starting point, setting RADVA=2 yields the following:
CS2GCR1=019130bf, CS2GCR2=00000000
CS2RCR1=01200000, CS2RCR2=00000008
CS2WCR1=01400000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=0000001

What we can see here is that the ADV signal was delayed, but not shortened: ADV’s deassertion point moved along with its assertion, while OE’s deassertion point didn’t move. What is not visible in this scope image is that the processor keeps driving the address on the address/data lines as long as ADV is asserted, leaving less time for data (as evident from the shortened OE).
Delaying ADV assertion and deassertion
Setting RADVN=2 on top of the previous example gives a two WEIM clock delay on both the assertion and the deassertion, so the deassertion is delayed by 4 WEIM clocks in total, which is one BCLK. Or in simple words, the first data cycle is completely wiped out:
CS2GCR1=019130bf, CS2GCR2=00000000
CS2RCR1=01220000, CS2RCR2=00000008
CS2WCR1=01480000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=0000001

I don’t know if this setting is legal, but it was pretty evident that the data read by the processor during these cycles wasn’t consistent, not even the 16 LSBs, which are read during the buried cycle.
Making it OK
Just to have a happy ending, let’s add a wait state. This will pull out the overridden data cycle and make the whole bus operation normal again.
So with RADVA=RADVN=2 and RWSC=2 (the default is RWSC=1, so this adds one wait state) we have
CS2GCR1=019130bf, CS2GCR2=00000000
CS2RCR1=02220000, CS2RCR2=00000008
CS2WCR1=02480000, CS2WCR2=00000000
WEIM Config register WCR=00000020
WEIM IP Access register WIAR=0000001

So all in all there’s a longer ADV assertion, which is compensated with a wait state, so there’s time for both data cycles.
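For reference, the register values in these last examples can be rebuilt from the field settings. The following Python sketch does that, with loud caveats: the bit offsets are my assumption of the CSxRCR1 and CSxWCR1 layouts, and the write-side field names (WADVA, WADVN) are inferred from the dump rather than from anything set deliberately here. The `pack` helper is made up for this example.

```python
# Assumed field offsets (lowest bit of each field); verify against the
# reference manual before relying on them.
RCR1_FIELDS = {"RWSC": 24, "RADVA": 20, "RADVN": 16, "OEA": 12}
WCR1_FIELDS = {"WWSC": 24, "WADVA": 21, "WADVN": 18}

def pack(layout, **fields):
    """OR together the given field values at their assumed offsets."""
    value = 0
    for name, field_value in fields.items():
        value |= field_value << layout[name]
    return value

print(f"{pack(RCR1_FIELDS, RWSC=1, OEA=2):08x}")             # 01002000
print(f"{pack(RCR1_FIELDS, RWSC=2, RADVA=2, RADVN=2):08x}")  # 02220000
print(f"{pack(WCR1_FIELDS, WWSC=2, WADVA=2, WADVN=2):08x}")  # 02480000
```

The results match the CS2RCR1 and CS2WCR1 values in the dumps above, which at least suggests the assumed offsets are consistent with what the test kit wrote.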