While porting Xillybus to Virtex-5, I ran into nasty trouble. At first it looked like the MSI interrupt delivery mechanism was at fault, but then it turned out that the core locked up completely and refused to send any further TLPs after the first few. I also noticed that the PCIe core had the “Fatal Error Detected” flag set in its status register (or more precisely, Xillybus banged me in the head with the bad news). Eventually, I found myself resetting the core with a debounced pushbutton connected to sys_reset_n at a very specific point in the host’s boot process to make the system work. Using just PERST_B, as the user guide suggests, simply didn’t work.
All this was with version 1.15 of the PCIe endpoint block plus, which was introduced in ISE 13.2. Quite by chance, I tried ISE 13.1, which comes with version 1.14 of the core. And guess what, suddenly PERST_B connected to sys_reset_n did the job, and the Fatal Error vanished.
I have to admit I’m quite amazed by this.
Posted under: FPGA, PCI express. This post was written by eli on December 20, 2011.
In Verilog there’s a bit shift operator, which isn’t used a lot, since FPGA designers prefer to state exact bit vectors. But sometimes bit shifting makes the code significantly more readable. Too bad that Xilinx’ XST synthesizer doesn’t get it right in a specific case.
Namely, the following statement is perfectly legal:
always @(posedge clk)
  reduce <= 1 + (end_offset >> (6 + rcb_is_128_bytes - format_shift) );
But it turns out that the XST synthesizer in Xilinx ISE 13.2 gets confused by the calculation of the shift amount, and creates something wrong. I can’t even tell what it did, but it was wrong.
So the rule is simple: It’s fine for the shift amount to be a register (even a combinatorial one) or a wire, but not an inline calculation. So this is fine:
always @(format_shift or rcb_is_128_bytes)
  if (rcb_is_128_bytes)
    case (format_shift)
      0: shifter <= 7;
      1: shifter <= 6;
      default: shifter <= 5;
    endcase
  else
    case (format_shift)
      0: shifter <= 6;
      1: shifter <= 5;
      default: shifter <= 4;
    endcase

always @(posedge clk)
  reduce <= 1 + (end_offset >> shifter );
(assuming that format_shift goes from zero to 2).
Actually, I would bet that it’s equally fine to calculate the number of shifts and put the result in a wire. I went for the case statement hoping that the synthesizer will take the hint that not all values that fit into the registers are possible, and will hence avoid implementing impossible shift values.
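To check that the case statements really produce the same shift amount as the inline expression, the arithmetic can be modeled in plain C. This is only a sketch of the numbers, not of what XST does: the function names are made up for illustration, and format_shift is assumed to stay within 0 to 2, as stated above.

```c
#include <assert.h>

/* Shift amount as written inline in the first Verilog snippet. */
int shift_inline(int rcb_is_128_bytes, int format_shift)
{
    assert(format_shift >= 0 && format_shift <= 2);
    return 6 + rcb_is_128_bytes - format_shift;
}

/* Shift amount as produced by the two case statements. */
int shift_lookup(int rcb_is_128_bytes, int format_shift)
{
    assert(format_shift >= 0 && format_shift <= 2);
    if (rcb_is_128_bytes) {
        switch (format_shift) {
        case 0: return 7;
        case 1: return 6;
        default: return 5;
        }
    }
    switch (format_shift) {
    case 0: return 6;
    case 1: return 5;
    default: return 4;
    }
}

/* Returns 1 if both forms agree for every legal input combination. */
int shift_forms_agree(void)
{
    int rcb, fs;

    for (rcb = 0; rcb <= 1; rcb++)
        for (fs = 0; fs <= 2; fs++)
            if (shift_inline(rcb, fs) != shift_lookup(rcb, fs))
                return 0;
    return 1;
}

/* Software model of the registered value: 1 + (end_offset >> shifter). */
unsigned reduce_model(unsigned end_offset, int rcb_is_128_bytes, int format_shift)
{
    return 1 + (end_offset >> shift_lookup(rcb_is_128_bytes, format_shift));
}
```

Running shift_forms_agree() over the six legal combinations is a cheap sanity check before blaming the synthesizer; a simulation testbench comparing both Verilog forms would be the natural next step.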
Needless to say, I know about this because something went horribly wrong all of a sudden. I believe XST version 12.2 handled the shift calculation OK. And then people ask me why I don’t like upgrades.
There are several ways to stop these pingbacks, 95% of which are spam.
My method may not be optimal, but has the elegance of simplicity. Simply edit the part at the end of wp-trackback.php (in WordPress’ root directory) that reads
if ( !empty($tb_url) && !empty($title) ) {
  header('Content-Type: text/xml; charset=' . get_option('blog_charset') );

  if ( !pings_open($tb_id) )
    trackback_response(1, 'Sorry, trackbacks are closed for this item.');

  $title = wp_html_excerpt( $title, 250 ).'...';
  $excerpt = wp_html_excerpt( $excerpt, 252 ).'...';

  [ ... snipped ... ]

  trackback_response(0);
}
and turn it into
trackback_response(1, 'Sorry, trackbacks are closed for this item.');
(don’t remove the trailing “?>”)
So this gives the correct response to the client trying to send trackbacks, and doesn’t bother you anymore.
Just lucky?
I’ve been approached a few times with requests to design the FPGA part of an FPGA-to-PC link over Ethernet. The purpose of the link is typically transporting a large amount of data to the PC. The application varies from continuous data acquisition to frame grabbing or transport of a raw video image stream. What these applications have in common, is that the client expects a reliable, easy-to-implement data channel. Just send the packets to the broadcast MAC address, and you’re done.
When doubting the reliability of this solution, I usually get the “I know from previous projects that it works” argument. I can’t argue with their previous success. But there’s a subtle difference between “it works” and “it’s guaranteed to work”. To a serious FPGA engineer, that’s the difference between “maybe I was lucky this time” and “this is a project I’m ready to release”.
Ethernet is inherently unreliable
The most important thing to know about Ethernet (at any data rate) is that it was never meant to be reliable. As a physical layer for networks, the underlying assumption is that a protocol layer detects packet drops and issues retransmissions as necessary. Put simply, this means that an Ethernet chip that drops a packet every now and then is considered 100% OK.
Since packet losses cause a certain data rate performance hit (e.g. TCP/IP streams will be halted for a short period), efforts are made to keep them to a minimum. For example, the Ethernet 802.3 standard states 10^-10 as the objective for the raw bit error rate on a copper wire Gigabit Ethernet link (1000BASE-T, see section 40.1.1). At 10^9 bits per second, that amounts to one bit error every 10 seconds on average, so a packet drop every 10 seconds is considered within the standard’s objectives. Packet drops may also occur on the operating system level: The network stack may take the freedom to drop packets just because they didn’t arrive at a good time. This happens less when the computer is generally idle (i.e. in the lab) but may become more evident under load (that is, in real-life use).
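The arithmetic behind the 10-second figure is simple enough to put in a few lines of C. This is a rough sketch with a made-up function name, and it assumes a fully loaded link where bit errors are independent and each one costs exactly one packet:

```c
#include <assert.h>

/*
 * Mean time between bit errors, in seconds, for a given line rate
 * (bits per second) and raw bit error rate. Assumes a saturated link
 * and independent errors, which is a simplification.
 */
double seconds_between_bit_errors(double bits_per_second, double bit_error_rate)
{
    assert(bits_per_second > 0.0 && bit_error_rate > 0.0);
    return 1.0 / (bits_per_second * bit_error_rate);
}
```

Calling seconds_between_bit_errors(1e9, 1e-10) gives roughly 10 seconds for a Gigabit link, which adds up to several thousand lost packets per day on a link that is fully within the standard's objective.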
Prototype vs. production
Engineers are often misled to think that the link is reliable because they can’t see any packet drops on the particular prototype they’re working on. It’s also easy to overlook sporadic packet drops during the development stages. The problem becomes serious when reaching the production stage, when a no-errors system needs to be put on the table. Even worse, production copies of the system may suddenly start to fail once in a few hours or so. The QA tests may not spot these issues, so the complaints may come from end-users feeling there’s something wrong with their devices, which the vendor has no clue about. I mean, imagine your car’s dashboard going crazy for a second once a month, and the vendor insisting that this is impossible. Would you stay with that car?
Working around it
The natural way to work around Ethernet packet drops is either accepting the data loss or implementing a retransmission mechanism.
Living with the data loss is possible in e.g. one-shot data acquisition applications, when the trigger is recurrent. Say, if a single frame is grabbed from a video stream, and it’s OK to fail on the first attempt, that’s fine. As long as nobody feels the unexpected delay of 1/30th of a second.
Retransmissions may be significantly trickier, in particular if the data goes from the FPGA to the PC. The thing is that it will take some time for the PC to respond to the lost packet, and that time is not bounded. For example, in today’s Linux implementations, the analysis of network packets is done in a tasklet context, and not by the interrupt service routine. Since tasklets are deferred work, scheduled much like a high-priority process, the latency until the packets are analyzed closely enough to detect a packet loss depends on how busy the computer is at that time.
One could hack the Ethernet card’s device driver to check a special field in each packet (say, a counter). Let’s say that the packet interrupt is handled within 10 μs, and that the packet loss is reported back to the FPGA in no time. At 1 Gbit/s, 10 μs of traffic is 10 kbits, so the FPGA has to store at least 10 kbits worth of previous packets to support a Gigabit link. Actually, that’s fine. A single block RAM in a Xilinx FPGA is more or less of that size. Too bad it’s not realistic.
And that’s because the underlying assumption of 10 μs response time is problematic, since any other kernel component can turn off interrupts while minding its own business (typically holding a spinlock). This other component could be completely unrelated to the Ethernet application (a sound card driver?) and not be active at all when the link with the FPGA is tested in QA. And it could be something not happening very often, so the sudden high latency becomes a rare bug to handle.
So taking a more realistic approach, it’s more like storing several megabytes of data to make sure all packets stay in memory until their safe arrival has been confirmed. This involves a controller for an external memory (a DDR SDRAM, I suppose) and some nontrivial state machine for keeping track of the packet traffic. While meeting the requirement of being reliable, it’s not really easy to implement. Well, easy to implement on a computer, which is the natural user of an Ethernet link. Not an FPGA.
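To put numbers on the two scenarios above, here is a small C sketch that computes how much data must be retained for retransmission, given the link rate and the worst-case time until the host reports a loss. The function name is invented for illustration, and the 50 ms latency in the usage note is an arbitrary assumption, not a measured figure.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bits sent during latency_us microseconds at the given
 * line rate. This is the minimum that must be kept in memory so that
 * any packet can still be retransmitted when the loss report arrives.
 */
uint64_t retransmit_buffer_bits(uint64_t bits_per_second, uint64_t latency_us)
{
    /* Guard against overflow in the multiplication below. */
    assert(latency_us == 0 || bits_per_second <= UINT64_MAX / latency_us);
    return bits_per_second * latency_us / 1000000u;
}
```

With a Gigabit link, retransmit_buffer_bits(1000000000ULL, 10) yields the 10 kbits mentioned above, while an assumed worst-case latency of 50 ms (50000 μs) yields 50 Mbits, or 6.25 MBytes, which is already external memory territory.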
The right way to do it
The correct thing to do is to use a link which was originally intended for communication between a computer and its peripheral. Several interfaces exist, but today the most appealing one is PCI Express, as it’s expected to be supported for many years ahead. Being the successor of good old PCI, its packet relay interface guarantees true reliability, which is assured by the PCIe fabric’s hardware.
The PCIe solution is often avoided because of the complexity of setting up the FPGA logic for transmission over the bus. This is no excuse in situations where Xillybus fits the scenario, as it provides a simple interface on both sides for transmission of data between an FPGA and its Linux host. If that’s not the case, the two choices are either to suck it up and write the PCIe interface yourself, or revert to using Ethernet, hoping for the best.
Summary
I always say that it’s fairly easy to get an FPGA doing what you want on a lab prototype. There are a lot of engineers not asking for too much money out there, who will do that for you. But when it comes to designing the system so it’s guaranteed to work, that’s a whole different story. That includes using correct design techniques in the FPGA logic’s HDL, constraining the timing correctly, ensuring that the timing requirements of the surrounding hardware are met, as defined in their datasheets etc. The difference isn’t seen at the prototype level, so quick and dirty FPGA work gets away with it in the beginning. It’s during the later stages, when temperature checks are run and the electronics is being duplicated, that things start to happen.
At that point people tend to blame the FPGA for being an unreliable solution. Others adopt mystic rules such as the 90%-10% rule, basically saying that the real work starts in the last phase.
But the truth is that if you got it right in the first place, there’s no reason why things should go wrong during production. If the prototype was done professionally, turning it into a product is really not a big deal. And having said that, yes, sometimes people do everything wrong, and just turn out lucky.
As for using Ethernet as a reliable link, it all boils down to whether you want to gamble on it.
The goal
Use my Microsoft mouse’s two extra buttons to do Copy and Paste on a Fedora 12 machine (Gnome 2.28.2). The problem is that different applications have different keystrokes. But the nice thing is that almost all applications will respond properly to Alt-e c for copy (that is, open the “Edit” menu, and choose “C”). CTRL-C is problematic, because pressing it accidentally over a shell will halt the execution of whatever is running.
I should note that this doesn’t work with Xemacs. I’ll come to that some day, I suppose.
Doing it
Assuming the necessary packages are installed (see below), run xev and press the relevant mouse buttons over the window to detect which button generates what event.
Then set up xbindkeys’ initial configuration file (and wipe any existing settings, if present) with
xbindkeys --defaults > /home/eli/.xbindkeysrc
and edit the file, commenting out all examples, just to be safe.
Then add the following snippet to the same file:
# Copy with mouse
"xte 'keydown Alt_L' 'key e' 'keyup Alt_L' 'usleep 10000' 'key c'"
release + b:8
# Paste with mouse
"xte 'keydown Alt_L' 'key e' 'keyup Alt_L' 'usleep 10000' 'key p'"
release + b:9
Note that the relevant mouse buttons were found to be 8 and 9, using xev. The “release” keyword makes sure that the script is run upon release events, and not when the button is still pressed, because the target applications won’t execute the command while the button is pressed. The 10ms sleep between opening the menu and the command was necessary for Firefox to catch the command. Tweaking is the name of the game.
And finally, if xbindkeys isn’t running, just go
$ killall xbindkeys ; xbindkeys
Note that there’s no need to rerun xbindkeys after modifying its configuration file, as it’s reloaded automatically.
To have xbindkeys executed on each login (on a Gnome system) add the two following lines to /etc/gdm/PreSession/Default:
su -l $USER -c "killall xbindkeys"
su -l $USER -c xbindkeys
Just a list of relevant utilities
(Not all were used for the original purpose)
- xev — Creates a small window, and prints out events related to it. Useful for mapping the codes of mouse events, for example
- xbindkeys — A daemon which catches key or mouse events, and executes shell commands accordingly (never tried it, though. yum install xbindkeys)
- xmodmap — Utility for modifying the mapping of keys, e.g. reverse the mouse buttons or certain keys on the keyboard (but not their function)
- xte and xautomation — Create fake key presses for running X applications automatically (yum install xautomation)
In short: I’ve known for a long time that OpenGL wasn’t working on my graphics card, and the display was indeed sluggish.
The problem is most easily shown by going
$ glxinfo | grep OpenGL
OpenGL vendor string: Mesa Project
OpenGL renderer string: Software Rasterizer
OpenGL version string: 2.1 Mesa 7.7.1-DEVEL
OpenGL shading language version string: 1.20
OpenGL extensions:
which is the symptom described in http://www.x.org/wiki/radeonBuildHowTo: the software rasterizer is in use.
A much more interesting output came from
# LIBGL_DEBUG=verbose glxinfo | grep openGL
libGL: OpenDriver: trying /usr/lib64/dri/r600_dri.so
libGL error: dlopen /usr/lib64/dri/r600_dri.so failed (/usr/lib64/dri/r600_dri.so: cannot open shared object file: No such file or directory)
libGL error: unable to load driver: r600_dri.so
libGL error: driver pointer missing
libGL: OpenDriver: trying /usr/lib64/dri/swrast_dri.so
Aha! Looks like yet another punishment for running 64 bit! And indeed, the relevant file isn’t to be found on my system, so the software rasterizer is loaded instead.
According to this forum answer, I need to upgrade libdrm_radeon.
$ yum provides '*/r600_dri.so'
gave the answer: It’s in mesa-dri-drivers-experimental-7.6-0.13.fc12.x86_64. Do I want to use something experimental? Hmmm…
But checking up Mesa’s site, it looks like what they consider experimental is what is usually considered production. Judging from the bugs they fix afterwards, that is.
# yum install mesa-dri-drivers-experimental
Logged out and in again (restart X), and tried:
$ glxinfo | grep OpenGL
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: Mesa DRI R600 (RV710 954F) 20090101 TCL DRI2
OpenGL version string: 1.5 Mesa 7.7.1-DEVEL
OpenGL extensions
Yay! Went to System > Preferences > Desktop effects and enabled 3D acceleration. And some silly effects, to see it’s actually working.
Following http://phoronix.com/forums/showthread.php?20186-Software-Rasterizer-with-and-without-KMS I added “eli” to the “video” group. Not clear if this was necessary.
Update (June, 2014): When upgrading to kernel 3.12, Google Chrome complained about the GPU thread being stuck, and timed out after 10 seconds (e.g. on Facebook’s main page). The solution was to upgrade libdrm to 2.4.54 by compiling from sources, using
./configure --prefix=/usr --libdir=/usr/lib64/
This was probably needed because of an update in the kernel’s drm module.
Upgrading Mesa to 10.1.4 turned out to be a pain in the bottom because of lots of dependencies that needed to be downloaded. All in all, it had to be reverted by reinstalling the packages from the yum repo. It improved nothing, and on top of that, windows didn’t redraw properly (for example, after issuing a command on gnome-terminal, nothing was updated until the window was moved with the mouse).
These are a couple of examples of SDMA assembly code, which performs data copy using the DMA functional unit. The first one shows how to copy data from application memory space to SDMA memory. The second example copies data from one application memory chunk to another, and hence works as an offload memcpy().
To actually use this code and generally understand what’s going on here, I’d warmly suggest reading a previous post of mine about SDMA assembly code, which also explains how to compile the code and gives the context for the C functions given below.
Gotchas
- Never let either the source address or the destination address cross a 32-byte boundary during a burst from or to the internal FIFO. Even though I haven’t seen this restriction in the official documentation, several unexplained misbehaviors have surfaced when allowing this to happen, in particular when accessing EIM. So just don’t.
- When accessing EIM, the EIM’s maximal burst length must be set to allow 32 bytes in one burst with the BL parameter, or data gets corrupted.
Application space memory to SDMA space
The assembly code goes
$ ./sdma_asm.pl app2sdma.asm
| # Always in context (not altered by script):
| #
| # r4 : Physical address to source in AP memory space
| # r6 : Address in SDMA space to copy to
| # r7 : Number of DWs to copy
| #
| # Both r4 and r6 must be DW aligned.
| # Note that prefetching is allowed, so up to 8 useless DWs may be read.
|
| # First, load the status registers into SDMA space
| start:
0000 6c20 (0110110000100000) | stf r4, 0x20 # To MSA, prefetch on, address is nonfrozen
0001 008f (0000000010001111) | mov r0, r7
0002 018e (0000000110001110) | mov r1, r6
0003 7803 (0111100000000011) | loop postloop, 0
0004 622b (0110001000101011) | ldf r2, 0x2b # Read 32 bits from MD with prefetch
0005 5a01 (0101101000000001) | st r2, (r1, 0) # Address in r1
0006 1901 (0001100100000001) | addi r1, 1
| postloop:
0007 0300 (0000001100000000) | done 3
0008 0b00 (0000101100000000) | ldi r3, 0
0009 4b00 (0100101100000000) | cmpeqi r3, 0 # Always true
000a 7df5 (0111110111110101) | bt start # Always branches
------------ CUT HERE -----------
static const int sdma_code_length = 6;
static const u32 sdma_code[6] = {
0x6c20008f, 0x018e7803, 0x622b5a01, 0x19010300, 0x0b004b00, 0x7df50000,
};
Note that the arguments for stf and ldf are given as numbers, and not following the not-so-helpful notation used in the Reference Manual.
The basic idea behind the assembly code is that each DW (Double Word, 32 bits) is read automatically by the functional unit from application space memory, and then fetched from the FIFO into r2. Then the register is written to SDMA memory with a plain “st” opcode.
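The relation between the listing and the sdma_code[] array can be read off the output above: two 16-bit opcodes go into each 32-bit word, the first one in the upper half, and an odd trailing opcode is padded with zero. The following C sketch reproduces that packing. It is not the assembler's actual code (which is in Perl), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack 16-bit SDMA opcodes into 32-bit words, first opcode in the
 * upper half. An odd trailing opcode is padded with 0x0000.
 * Returns the number of 32-bit words written to out.
 */
size_t pack_opcodes(const uint16_t *ops, size_t n_ops, uint32_t *out)
{
    size_t i, n_words = (n_ops + 1) / 2;

    assert(ops != NULL && out != NULL);
    for (i = 0; i < n_words; i++) {
        uint32_t hi = ops[2 * i];
        uint32_t lo = (2 * i + 1 < n_ops) ? ops[2 * i + 1] : 0;
        out[i] = (hi << 16) | lo;
    }
    return n_words;
}

/* Packs the 11 opcodes of the listing and compares with its sdma_code[]. */
int packed_listing_matches(void)
{
    static const uint16_t ops[11] = {
        0x6c20, 0x008f, 0x018e, 0x7803, 0x622b, 0x5a01,
        0x1901, 0x0300, 0x0b00, 0x4b00, 0x7df5,
    };
    static const uint32_t expected[6] = {
        0x6c20008f, 0x018e7803, 0x622b5a01,
        0x19010300, 0x0b004b00, 0x7df50000,
    };
    uint32_t words[6];
    size_t i;

    if (pack_opcodes(ops, 11, words) != 6)
        return 0;
    for (i = 0; i < 6; i++)
        if (words[i] != expected[i])
            return 0;
    return 1;
}
```

This also explains why sdma_code_length is 6 for a script of 11 opcodes.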
The relevant tryrun() function to test this is:
static int tryrun(struct sdma_engine *sdma)
{
  dma_addr_t src_phys;
  void *src_virt;
  const int channel = 1;
  struct sdma_channel *sdmac = &sdma->channel[channel];

  static const u32 sdma_code[6] = {
    0x6c20008f, 0x018e7803, 0x622b5a01, 0x19010300, 0x0b004b00, 0x7df50000,
  };

  static const u32 sample_data[8] = {
    0x12345678, 0x11223344, 0xdeadbeef, 0xbabecafe,
    0xebeb0000, 0, 0xffffffff, 0xabcdef00 };

  const int origin = 0xe00; // In data space terms (32 bits/address)

  struct sdma_context_data *context = sdma->context;

  int ret;

  src_virt = dma_alloc_coherent(NULL,
                                4096, // 4096 bytes, just any buffer size
                                &src_phys, GFP_KERNEL);
  if (!src_virt) {
    printk(KERN_ERR "Failed to allocate source buffer memory\n");
    return -ENOMEM;
  }

  memset(src_virt, 0, 4096);
  memcpy(src_virt, sample_data, sizeof(sample_data));

  sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

  ret = sdma_request_channel(sdmac);
  if (ret) {
    printk(KERN_ERR "Failed to request channel\n");
    return ret;
  }

  sdma_disable_channel(sdmac);
  sdma_config_ownership(sdmac, false, true, false);

  memset(context, 0, sizeof(*context));
  context->channel_state.pc = origin * 2; // In program space addressing...
  context->gReg[4] = src_phys;
  context->gReg[6] = 0xe80;
  context->gReg[7] = 3; // Number of DWs to copy

  ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                           0x800 + (sizeof(*context) / 4) * channel);
  if (ret) {
    printk(KERN_ERR "Failed to load context\n");
    return ret;
  }

  ret = sdma_run_channel(&sdma->channel[1]);

  sdma_print_mem(sdma, 0xe80, 128);

  if (ret) {
    printk(KERN_ERR "Failed to run script!\n");
    return ret;
  }

  return 0; /* Success! */
}
Note that the C code snippet, which is part of the output of the assembler compilation, actually appears in the tryrun() function.
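Two address calculations in tryrun() are worth spelling out: the program counter is origin * 2 because data space counts 32-bit words while program space counts 16-bit opcodes, and each channel's context sits at 0x800 plus the context size (in 32-bit words) times the channel number. A minimal C sketch of both follows, with hypothetical helper names. The context size is passed as a parameter; the 128 bytes used in the usage note is an assumption on my side, so take sizeof(struct sdma_context_data) from the actual kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Data-space address of channel 0's context, as used in tryrun(). */
#define SDMA_CONTEXT_BASE 0x800u

/* Data space counts 32-bit words; program space counts 16-bit opcodes. */
uint32_t sdma_data_to_program_addr(uint32_t data_addr)
{
    return data_addr * 2;
}

/* Data-space address of a channel's context; context_bytes is its size. */
uint32_t sdma_context_addr(uint32_t context_bytes, unsigned channel)
{
    assert(context_bytes % 4 == 0);
    return SDMA_CONTEXT_BASE + (context_bytes / 4) * channel;
}
```

For example, sdma_data_to_program_addr(0xe00) is 0x1c00, which is the value written to channel_state.pc above. With an assumed 128-byte context, sdma_context_addr(128, 1) comes out as 0x820 for channel 1.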
Fast memcpy()
Assembly goes
$ ./sdma_asm.pl copydma.asm
| # Should be set up at invocation
| #
| # r0 : Number of DWs to copy (is altered as script runs)
| # r1 : Source address (DW aligned)
| # r2 : Destination address (DW aligned)
|
0000 6920 (0110100100100000) | stf r1, 0x20 # To MSA, prefetch on, address is nonfrozen
0001 6a04 (0110101000000100) | stf r2, 0x04 # To MDA, address is nonfrozen
0002 0c08 (0000110000001000) | ldi r4, 8 # Number of DWs to copy each round
| copyloop:
0003 04d8 (0000010011011000) | cmphs r4, r0 # Is 8 larger or equal to the number of DWs left to copy?
0004 7d03 (0111110100000011) | bt lastcopy # If so, jump to last transfer label
0005 6c18 (0110110000011000) | stf r4, 0x18 # Copy 8 words from MSA to MDA address.
0006 2008 (0010000000001000) | subi r0, 8 # Decrement counter
0007 7cfb (0111110011111011) | bf copyloop # Always branches, because r0 > 0
| lastcopy:
0008 6818 (0110100000011000) | stf r0, 0x18 # Copy 8 or less DWs (r0 is always > 0)
| exit:
0009 0300 (0000001100000000) | done 3
000a 0b00 (0000101100000000) | ldi r3, 0
000b 4b00 (0100101100000000) | cmpeqi r3, 0 # Always true
000c 7dfc (0111110111111100) | bt exit # Endless loop, just to be safe
------------ CUT HERE -----------
static const int sdma_code_length = 7;
static const u32 sdma_code[7] = {
0x69206a04, 0x0c0804d8, 0x7d036c18, 0x20087cfb, 0x68180300, 0x0b004b00, 0x7dfc0000,
};
For a frozen (constant) source address (e.g. when reading from a FIFO) the first stf should be done with argument 0x30 rather than 0x20. For a frozen destination address, the second stf has the argument 0x14 instead of 0x04.
This script should be started with r0 > 0. It may be OK to have r0=0, but I’m not sure about that (or whether it’s a problem not to read any data after a prefetch, as possibly related to section 52.22.1 in the Reference Manual).
The endless loop to “exit” should never be needed. It’s there just in case the script is rerun by mistake, so it responds with a “done” right away. And the example above is not really optimal: To make a for-sure branch, I could have gone “bt exit” and “bf exit” immediately after it, making this in two opcodes instead of three. Wasteful me.
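To see how the copy script splits a transfer into bursts, its loop can be mirrored in C. This is only a model of the control flow in the listing above (the function name and the output array are invented for illustration); it touches no hardware.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model the burst lengths the copy script issues for r0 DWs.
 * Stores up to max_bursts lengths in bursts[] (which may be NULL if
 * max_bursts is 0) and returns the total number of bursts.
 */
size_t copy_burst_plan(unsigned r0, unsigned *bursts, size_t max_bursts)
{
    const unsigned r4 = 8;   /* DWs per full burst: "ldi r4, 8" */
    size_t n = 0;

    assert(r0 > 0);          /* the script is meant to start with r0 > 0 */
    while (!(r4 >= r0)) {    /* cmphs r4, r0 ; bt lastcopy */
        if (n < max_bursts)
            bursts[n] = r4;  /* stf r4, 0x18 */
        n++;
        r0 -= 8;             /* subi r0, 8 */
    }
    if (n < max_bursts)
        bursts[n] = r0;      /* lastcopy: stf r0, 0x18 */
    n++;
    return n;
}
```

For the 18 DWs requested in the tryrun() function for this script, the model yields three bursts of 8, 8 and 2 DWs. A count that is an exact multiple of 8 ends with a full burst of 8 rather than an empty one, which is why the last stf always has something to copy.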
The tryrun() function for this case then goes
static int tryrun(struct sdma_engine *sdma)
{
  dma_addr_t buf_phys;
  u8 *buf_virt;
  const int channel = 1;
  struct sdma_channel *sdmac = &sdma->channel[channel];

  static const u32 sdma_code[7] = {
    0x69206a04, 0x0c0804d8, 0x7d036c18, 0x20087cfb, 0x68180300, 0x0b004b00, 0x7dfc0000,
  };

  static const u32 sample_data[8] = {
    0x12345678, 0x11223344, 0xdeadbeef, 0xbabecafe,
    0xebeb0000, 0, 0xffffffff, 0xabcdef00 };

  const int origin = 0xe00; // In data space terms (32 bits/address)

  struct sdma_context_data *context = sdma->context;

  int ret;

  buf_virt = dma_alloc_coherent(NULL, 4096,
                                &buf_phys, GFP_KERNEL);
  if (!buf_virt) {
    printk(KERN_ERR "Failed to allocate source buffer memory\n");
    return -ENOMEM;
  }

  memset(buf_virt, 0, 4096);
  memcpy(buf_virt, sample_data, sizeof(sample_data));

  sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

  ret = sdma_request_channel(sdmac);
  if (ret) {
    printk(KERN_ERR "Failed to request channel\n");
    return ret;
  }

  sdma_disable_channel(sdmac);
  sdma_config_ownership(sdmac, false, true, false);

  memset(context, 0, sizeof(*context));
  context->channel_state.pc = origin * 2; // In program space addressing...
  context->gReg[0] = 18; // Number of DWs to copy
  context->gReg[1] = buf_phys;
  context->gReg[2] = buf_phys + 0x40;

  ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                           0x800 + (sizeof(*context) / 4) * channel);
  if (ret) {
    printk(KERN_ERR "Failed to load context\n");
    return ret;
  }

  ret = sdma_run_channel(&sdma->channel[1]);

  do {
    int i;
    const int len = 0xa0;
    unsigned char line[128];
    int pos = 0;

    for (i=0; i<len; i++) {
      if ((i % 16) == 0)
        pos = sprintf(line, "%04x ", i);
      pos += sprintf(&line[pos], "%02x ", buf_virt[i]);
      if ((i % 16) == 15)
        printk(KERN_WARNING "%s\n", line);
    }
  } while (0);

  if (ret) {
    printk(KERN_ERR "Failed to run script!\n");
    return ret;
  }

  return 0; /* Success! */
}
The memory’s content is printed out here from tryrun() directly, since the dumped memory is in application space.
Unless your embedded application happens to be a router, there’s some application-dependent electronics you need to talk with. If some SoC device covers your needs, that’s always nice, but what about that specific piece of electronics? And what if your application includes a part that needs to be run on an FPGA?
Making a processor talk with an FPGA is pretty doable, as long as there’s no heavy I/O, and the processor doesn’t run a sophisticated operating system. But if you picked Linux (probably to support some USB device, storage and/or network), the task of getting high-bandwidth data running between the processor and the FPGA can turn into a project in its own right.
Xilinx addresses this issue partly with its Zynq-7000 FPGA-ARM combo, making the ARM’s internal AXI bus directly available to FPGA logic. Whether this new generation of devices is going to have a different fate than the Virtex-2 Pro and Virtex-4 FX FPGAs, which had PowerPC cores built-in and direct PLB bus access, remains to be seen. Many seem to believe that putting the processor core inside the FPGA doesn’t necessarily make things easier. Anyhow, with the first engineering samples of a completely new architecture due in the first half of 2012, it’s not clear when the Zynq-7000 solution will be live and kicking.
As many have found out, running Linux on an embedded processor may be difficult, but not a significant obstacle. Getting the Linux-running processor to access a register or two on the FPGA is not an impossible mission either. But when data needs to be transported fast and efficiently, things start to get tricky: The FPGA needs to be bus master capable, so it can transport the data over DMA. The Linux kernel driver needs to be written correctly to orchestrate DMA accesses at a high rate without using up too much CPU. The asynchronous nature of the data transfer creates corner cases, in particular when the data arrives in anything other than chunks of a constant size. In short, the distance between the “Hello, world” application and the actual workhorse is sometimes larger than it may seem at first.
Xillybus offers a simple solution for systems having a PCI or PCIe bus. As this is not usually the case for embedded processors, this doesn’t necessarily help.
On the other hand, a special port of Xillybus to Freescale’s i.MX51 is already available. Using DMA for transferring data over the external bus lines, data rates of 35 MByte/s and above are possible, with a minimal use of the ARM Cortex A8 processor. The application designer meets the same simple and intuitive interface, as in the PCIe version: The FPGA engineer faces a simple and standard FIFO or RAM interface. The programmer writes simple user space applications which interact with device files, as I/O is usually done in Linux systems.
The demo version is available for the Armadeus APF51 board, which forms, together with its development docking board, a jump start kit for evaluating Xillybus on an embedded ARM platform. As the Xillybus evaluation kit is pretty much like the real thing, and the board’s design is straightforward, taking the evaluation to a real-life implementation is within arm’s reach.
As voicemail messages often go here in Israel: The Hebrew message will be followed by an English one.
Hebrew (translated)
A few years ago, I wrote an arrangement of the well-known birthday song “Hayom Yom Huledet” for a male choir (actually, a barbershop quartet). Looking back (or rather, listening back), the main resemblance to barbershop is that the melody is with the second tenor, along with the traditional balance between the voices, but that American schmaltz isn’t found among the notes. Maybe because it’s a song in Hebrew, and the arranger is Israeli…
Either way, the arrangement is released under the Creative Commons CC0 license, which means you can do whatever you feel like with it. Including, of course, making electronic or paper copies, performing, recording, singing off key and blaming me in the end.
You can download the sheet music at this link, as well as a short audio clip in which my duplicates and I sing (well OK, whisper) the arrangement.
English
A few years ago, I made a small arrangement of the Israeli birthday song for a TTBB male choir. Or just a plain male quartet. It’s kinda barbershop in the sense that the Lead has the melody, and in the way the voices should be balanced. In retrospect, it doesn’t have the American feel to it, but heck, it’s an Israeli song arranged by an Israeli…
You can download the sheet music directly using this link. For an audio clip of myself multiplied singing (well, whispering) this, click here.
I’ve released it under Creative Commons CC0, or if you like, to the public domain. In simple words, that means that you can do whatever you want with it, with no need to ask anyone for permission. Including, of course, making electronic or paper copies, performing, recording, singing off key and blaming me for everything. As long as you have fun.
This is part IV of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6, is exactly the same, except for the registers’ addresses and the relevant chapters in the Reference Manual.
This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into parts goes as follows:
Running custom scripts
I’ll try to show the basics of getting a simple custom script to run on the SDMA core. Since there’s a lot of supporting infrastructure involved, I’ll show my example as a hack on the drivers/dma/imx-sdma.c Linux kernel module per version 2.6.38. I’m not going to explain the details of kernel hacking, so without experience in that field, it will be pretty difficult to try this out yourself.
The process of running an application-driven custom script consists of the following steps:
- Initialize the SDMA module
- Initialize the SDMA channel and clear its HE flag
- Copy the SDMA assembly code from application space memory to SDMA memory space RAM.
- Set up the channel’s context
- Enable the channel’s HE flag (so the script runs pretty soon)
- Wait for interrupt (assuming that the script ends with a “DONE 3”)
- Possibly copy back the context to application processor space, to inspect the registers upon termination, and verify that their values are as expected.
- Possibly copy SDMA memory to application processor space in order to inspect if the script worked as expected (if the script writes to SDMA RAM)
The first two steps are handled by the imx-sdma.c kernel module, so I won’t cover them. I’ll start with the assembly code, which has to be generated first.
The assembler
Freescale offers their assembler, but I decided to write my own in Perl. It’s simple and useful for writing short routines, and its output is snippets of C code, which can be inserted directly into the source, as I’ll show later. It’s released under GPLv2, and you can download it from this link.
The sample code below does nothing useful. For a couple of memory related examples, please see another post of mine.
To try it out quickly, just untar it on some UNIX system (Linux included, of course), change directory to sdma_asm, and go
$ ./sdma_asm.pl looptry.asm
| start:
0000 0804 (0000100000000100) | ldi r0, 4
0001 7803 (0111100000000011) | loop exit, 0
0002 5c05 (0101110000000101) | st r4, (r5, 0) # Address r5
0003 1d01 (0001110100000001) | addi r5, 1
0004 1c10 (0001110000010000) | addi r4, 0x10
| exit:
0005 0300 (0000001100000000) | done 3
0006 1c40 (0001110001000000) | addi r4, 0x40
0007 0b00 (0000101100000000) | ldi r3, 0
0008 4b00 (0100101100000000) | cmpeqi r3, 0 # Always true
0009 7df6 (0111110111110110) | bt start # Always branches
------------ CUT HERE -----------
static const int sdma_code_length = 5;
static const u32 sdma_code[5] = {
0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
};
The output should be pretty obvious. In particular, note that there’s a C declaration of a const array called sdma_code, which I’ll show how to use below. The first part of the output is a plain assembly listing, with the address, hex code and binary representation of the opcodes. There are a few simple syntax rules to observe:
- Anything after a ‘;’ or ‘#’ sign is ignored (comments)
- Empty lines are ignored, of course
- A label starts the line, and is followed by a colon sign, ‘:’
- Everything is case-insensitive, including labels (all code is lowercased internally)
- The first alphanumeric string is considered the opcode, unless it’s a label
- Everything following an opcode (comments excluded) is considered the arguments
- All registers are noted as r0, r1, … r7 in the argument fields, and not as plain numbers, unlike the way shown in the reference manual. This makes a clear distinction between registers and values. It’s “st r7, (r0,9)” and not “st 7, (0,9)”.
- Immediate arguments can be represented as decimal numbers (digits only), possibly negative (with a plain ‘-’ prefix). Positive hexadecimal numbers are allowed with the classic C “0x” prefix.
- Labels are allowed for loops, as the first argument. The label is understood to be the first statement after the loop, so the label is the point reached when the loop is finished. See the example above. The second argument may not be omitted.
- Other than loops, labels are accepted only for branch instructions, where the jump is relative. Absolute jump addresses can’t be generated automatically for jmp and jsr because the absolute address is not known during assembly.
A few words about why labels are not allowed for absolute jumps: It would be pretty simple to tell the Perl script the origin address, and allow absolute addressed jumps. I believe absolute jumps within a custom script should be avoided at any cost, so that the object code can be stored and run anywhere vacant. This is why I wasn’t keen on implementing this.
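The relative encoding can be checked against the listing above: “bt start” at address 0009 assembles to 0x7df6, and the low byte 0xf6 is -10 as a signed 8-bit number, which is the distance from the instruction following the branch (address 000a) back to start (address 0000). Here is a minimal sketch of that calculation in C. Note that the “relative to the next instruction” rule and the 0x7d upper byte are inferred from the listing, not quoted from the Reference Manual, and that both function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 8-bit displacement of a relative branch.
 * Assumption (inferred from the listing): the displacement is counted
 * from the address of the instruction following the branch.
 * Returns 0 on success, -1 if the target is out of reach. */
static int branch_displacement(unsigned branch_addr, unsigned target_addr,
                               uint8_t *disp)
{
	int delta = (int)target_addr - (int)(branch_addr + 1);

	assert(disp != NULL);

	if (delta < -128 || delta > 127)
		return -1;

	*disp = (uint8_t)(delta & 0xff);
	return 0;
}

/* Compose the 16-bit word for "bt". 0x7d is the upper byte that
 * appears for "bt" in the listing. */
static uint16_t encode_bt(uint8_t disp)
{
	return (uint16_t)((0x7d << 8) | disp);
}
```

Since only the distance enters the opcode, the same object code works at any origin, which is exactly why relative branches keep the script relocatable.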
A simple test function
This is a simple function, which loads a custom script and runs it a few times. I added it, and a few additional functions (detailed later) to the Linux kernel’s SDMA driver, imx-sdma.c, and called it at the end of sdma_probe(). This is the simplest, yet not most efficient way to try things out: The operation takes place once when the module is inserted into the kernel, and then a reboot is necessary, since the module can’t be removed from the kernel. But with the reboot being fairly quick on an embedded system, it’s pretty OK.
So here’s the tryrun() function. Mind you, it’s called after the SDMA subsystem has been initialized, with one argument, the pointer to the sdma_engine structure (there’s only one for the entire system).
static int tryrun(struct sdma_engine *sdma)
{
	const int channel = 1;
	struct sdma_channel *sdmac = &sdma->channel[channel];
	static const u32 sdma_code[5] = {
		0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
	};
	const int origin = 0xe00; /* In data space terms (32 bits/address) */
	struct sdma_context_data *context = sdma->context;
	int ret;
	int i;

	sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

	ret = sdma_request_channel(sdmac);
	if (ret) {
		printk(KERN_ERR "Failed to request channel\n");
		return ret;
	}

	sdma_disable_channel(sdmac);
	sdma_config_ownership(sdmac, false, true, false);

	memset(context, 0, sizeof(*context));
	context->channel_state.pc = origin * 2; /* In program space addressing... */
	context->gReg[4] = 0x12345678;
	context->gReg[5] = 0xe80;

	ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
				 0x800 + (sizeof(*context) / 4) * channel);
	if (ret) {
		printk(KERN_ERR "Failed to load context\n");
		return ret;
	}

	for (i=0; i<4; i++) {
		ret = sdma_run_channel(&sdma->channel[1]);
		printk(KERN_WARNING "*****************************\n");
		sdma_print_mem(sdma, 0xe80, 128);
		if (ret) {
			printk(KERN_ERR "Failed to run script!\n");
			return ret;
		}
	}

	return 0; /* Success! */
}
Copying the code into SDMA memory
First, note that sdma_code is indeed copied from the output of the assembler, when it’s executed on looptry.asm as shown above. The assembler adds the “static” modifier as well as an sdma_code_length variable which were omitted, but otherwise it’s an exact copy.
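It may help to see how the array relates to the listing: each 32-bit element holds two 16-bit instructions, with the lower program address in the upper half (0x0804 and 0x7803 become 0x08047803). The following is just a sketch for checking the array against the listing on any machine. The function name is mine, and it’s not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static const uint32_t sdma_code[5] = {
	0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
};

/* Return the 16-bit instruction at program address 'pc', counted from
 * the beginning of the script. Even addresses sit in the upper half
 * of the 32-bit word, odd addresses in the lower half. */
static uint16_t sdma_opcode_at(const uint32_t *code, size_t pc)
{
	uint32_t word;

	assert(code != NULL);

	word = code[pc / 2];

	return (pc & 1) ? (uint16_t)(word & 0xffff) : (uint16_t)(word >> 16);
}
```

For example, sdma_opcode_at(sdma_code, 5) gives 0x0300, which is the “done 3” at address 0005 in the listing. The two-instructions-per-word packing is also the reason for multiplying the data space address by two when setting the program counter.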
The first thing the function actually does is call sdma_write_datamem() to copy the code into SDMA space (and I don’t check the return value, sloppy me). This is a function I’ve added, but it’s clearly derived from sdma_load_context(), which is part of imx-sdma.c:
static int sdma_write_datamem(struct sdma_engine *sdma, void *buf,
			      int size, u32 address)
{
	struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
	void *buf_virt;
	dma_addr_t buf_phys;
	int ret;

	buf_virt = dma_alloc_coherent(NULL, size, &buf_phys, GFP_KERNEL);
	if (!buf_virt)
		return -ENOMEM;

	bd0->mode.command = C0_SETDM;
	bd0->mode.count = size / 4;
	bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
	bd0->buffer_addr = buf_phys;
	bd0->ext_buffer_addr = address;

	memcpy(buf_virt, buf, size);

	ret = sdma_run_channel(&sdma->channel[0]);

	dma_free_coherent(NULL, size, buf_virt, buf_phys);

	return ret;
}
The principle of operation of sdma_write_datamem() is pretty simple: First a buffer is allocated, with its address in virtual space given in buf_virt and its physical address in buf_phys. Both addresses relate to the application processor, of course.
Then the buffer descriptor is set up. This piece of memory is preallocated globally for the entire sdma engine (in application processor’s memory space), which isn’t the cleanest way to do it, but since these operations aren’t expected to happen in parallel processes, this is OK. The sdma_buffer_descriptor structure is defined in imx-sdma.c itself, and is initialized according to section 52.23.1 in the Reference Manual. Note that this calling convention interfaces with the script running on channel 0, and not with any hardware interface. This chunk is merely telling the script what to do. In particular, the C0_SETDM command tells it to copy from application memory space to SDMA data memory space (see section 53.23.1.2).
Note that in the function’s arguments, “size” is given in bytes, but address in SDMA data address space (that is, in 32-bit quanta). This is why “size” is divided by four to become the element count (mode.count).
Just before kicking off, the input buffer’s data is copied into the dedicated buffer with a plain memcpy() command.
And then sdma_run_channel() (part of imx-sdma.c) is called to make channel 0 runnable. This function merely sets the HE bit of channel 0, and waits (sleeping) for the interrupt to arrive, or errors on timeout after a second.
At this point we have the script loaded into SDMA RAM (at data address 0xe00).
Some housekeeping calls on channel 1
Up to this point, nothing was done on the channel we’re going to use, which is channel #1. Three calls to functions defined in imx-sdma.c prepare the channel for use:
- sdma_request_channel() sets up the channel’s buffer descriptor and data structure, and enables the clock global to the entire sdma engine, actions which I’m not sure are necessary. It also sets up the channel’s priority and the Linux’ wait queue (used when waiting for interrupt).
- sdma_disable_channel() clears the channel’s HE flag
- sdma_config_ownership() clears HO, sets EO and DO for the channel, so the channel is driven (“owned”) by the processor (as opposed to driven by external events).
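In terms of bit operations, the ownership setting described in the last item boils down to something like the sketch below. It’s only a model: the three override registers are represented by plain variables in a made-up structure with one bit per channel, and no hardware is touched. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Plain-variable stand-ins for the per-channel override bits
 * (hypothetical structure, one bit per channel) */
struct ownership_bits {
	uint32_t host;  /* HO bits */
	uint32_t event; /* EO bits */
	uint32_t dsp;   /* DO bits */
};

/* Make 'channel' processor-driven: clear HO, set EO and DO, so that
 * only the HE flag is left for deciding when the channel runs. */
static void make_host_driven(struct ownership_bits *regs, unsigned channel)
{
	uint32_t mask;

	assert(regs != 0);
	assert(channel < 32); /* One bit per channel in a 32-bit word */

	mask = UINT32_C(1) << channel;

	regs->host &= ~mask;
	regs->event |= mask;
	regs->dsp |= mask;
}
```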
Setting up the context
Even though imx-sdma.c has a sdma_load_context() function, it’s written for setting up the context as suitable for running the channel 0 script. To keep things simpler, we’ll set up the context directly.
After zeroing the entire structure, three registers are set in tryrun(): The program counter, r4 and r5. Note that the program counter is given the address to which the code was copied, multiplied by 2, since the program counter is given in program memory space. The two other registers are set merely as an initial state for the script. The structure is then copied into the per-channel designated slot with sdma_write_datamem().
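The two address calculations in tryrun() can be summed up in a short sketch. The 0x800 base is taken from the code above. The context size of 32 words (128 bytes) is my assumption about sizeof(struct sdma_context_data), so check it against your kernel’s imx-sdma.c. The macro and function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SDMA_CONTEXT_BASE  0x800u /* Channel 0's context slot, as in tryrun() */
#define SDMA_CONTEXT_WORDS 32u    /* Assumption: sizeof(context) / 4 */

/* Data space counts 32-bit words, program space counts 16-bit
 * instructions, so the same location has twice the address there. */
static uint32_t data_to_program_addr(uint32_t data_addr)
{
	return data_addr * 2;
}

/* Data space address of the context slot belonging to 'channel' */
static uint32_t context_slot_addr(unsigned channel)
{
	assert(channel < 32);

	return SDMA_CONTEXT_BASE + SDMA_CONTEXT_WORDS * channel;
}
```

With these numbers, the script at data address 0xe00 starts at program address 0x1c00, and the context for channel 1 lands at data address 0x820.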
Again, note that the “context” data structure, which is used as a source buffer from which the context is copied into SDMA memory, is allocated globally for the entire SDMA engine. It’s not even protected by a mutex, so in a real project you should allocate your own piece of memory to hold the sdma_context_data structure.
Running the script
In the end, we have a loop of four subsequent runs of the script, without updating the context, so from the second run on, the script continues after the “done 3” instruction. This is possible, because the script jumps to the beginning upon resumption (the three last lines in the assembly code, see above).
Each call to sdma_run_channel() sets channel 1’s HE flag, making it do its thing and then trigger off an interrupt with the DONE instruction, which in turn wakes up the process, telling it the script has finished. sdma_print_mem() merely makes a series of printk’s, consisting of hex dumps of data from the SDMA memory. As used, it’s aimed at the region which the script is expected to alter, but the same function can be used to verify that the script is indeed in its place, or to look at any other part of the memory. The function goes
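To get an idea of what the dumps should contain, here’s a small model of the script in plain C. It reflects my reading of the assembly code, and in particular it assumes that the loop runs r0 times (that is, four). It’s not an SDMA emulator, it ignores flags and all other registers, and all the names in it are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_BASE      0xe80u /* r5's initial value in tryrun() */
#define SIM_MEM_WORDS 32     /* 128 bytes, the region dumped in tryrun() */

struct sim_state {
	uint32_t r4, r5;
	uint32_t mem[SIM_MEM_WORDS]; /* Data memory from SIM_BASE and on */
	int resumed;                 /* Set after the first "done 3" */
};

static void sim_init(struct sim_state *s)
{
	assert(s != NULL);

	memset(s, 0, sizeof(*s));
	s->r4 = 0x12345678; /* Same initial context as in tryrun() */
	s->r5 = SIM_BASE;
}

/* Model one run of the channel, up to the next "done 3".
 * Returns -1 if the script would write outside the modeled region. */
static int sim_run(struct sim_state *s)
{
	int i;

	assert(s != NULL);

	if (s->resumed)
		s->r4 += 0x40; /* addi r4, 0x40 and then "bt start" */

	for (i = 0; i < 4; i++) { /* ldi r0, 4 followed by the loop */
		uint32_t idx = s->r5 - SIM_BASE;

		if (idx >= SIM_MEM_WORDS)
			return -1;

		s->mem[idx] = s->r4; /* st r4, (r5, 0) */
		s->r5 += 1;          /* addi r5, 1 */
		s->r4 += 0x10;       /* addi r4, 0x10 */
	}

	s->resumed = 1; /* done 3 */
	return 0;
}
```

If this model is right, the first run writes 0x12345678, 0x12345688, 0x12345698 and 0x123456a8 to data addresses 0xe80 to 0xe83, and each following run fills the next four words, with the values starting 0x80 above the previous run’s first value.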
static int sdma_print_mem(struct sdma_engine *sdma, int start, int len)
{
	int i;
	u8 *buf;
	unsigned char line[128];
	int pos = 0;

	len = (len + 15) & 0xfff0;

	buf = kzalloc(len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	sdma_fetch_datamem(sdma, buf, len, start);

	for (i=0; i<len; i++) {
		if ((i % 16) == 0)
			pos = sprintf(line, "%04x ", i);

		pos += sprintf(&line[pos], "%02x ", buf[i]);

		if ((i % 16) == 15)
			printk(KERN_WARNING "%s\n", line);
	}

	kfree(buf);

	return 0;
}
and it uses this function (note that the instruction is C0_GETDM):
static int sdma_fetch_datamem(struct sdma_engine *sdma, void *buf,
			      int size, u32 address)
{
	struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
	void *buf_virt;
	dma_addr_t buf_phys;
	int ret;

	buf_virt = dma_alloc_coherent(NULL, size,
				      &buf_phys, GFP_KERNEL);
	if (!buf_virt)
		return -ENOMEM;

	bd0->mode.command = C0_GETDM;
	bd0->mode.count = size / 4;
	bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
	bd0->buffer_addr = buf_phys;
	bd0->ext_buffer_addr = address;

	ret = sdma_run_channel(&sdma->channel[0]);

	memcpy(buf, buf_virt, size);

	dma_free_coherent(NULL, size, buf_virt, buf_phys);

	return ret;
}
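The row formatting in sdma_print_mem() is easy to try out in user space before going anywhere near the kernel. Below is a sketch with snprintf() instead of sprintf(), so an undersized line buffer is reported rather than overrun. The function names are made up, and unlike the 0xfff0 mask in the original, the rounding here doesn’t cut off lengths above 0xffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Round up to a multiple of 16, so the dump consists of full rows */
static size_t round_up_16(size_t len)
{
	return (len + 15) & ~(size_t)15;
}

/* Format one row of 16 bytes as "oooo xx xx ... xx " into 'line'.
 * Returns the number of characters written, or -1 if 'line' is
 * too small. */
static int format_hex_row(char *line, size_t line_size, size_t offset,
                          const unsigned char *row)
{
	int pos, n;
	size_t i;

	assert(line != NULL && row != NULL);

	pos = snprintf(line, line_size, "%04zx ", offset);
	if (pos < 0 || (size_t)pos >= line_size)
		return -1;

	for (i = 0; i < 16; i++) {
		n = snprintf(line + pos, line_size - (size_t)pos,
			     "%02x ", row[i]);
		if (n < 0 || (size_t)n >= line_size - (size_t)pos)
			return -1;
		pos += n;
	}

	return pos;
}
```

A full row takes 53 characters plus the terminating null, so the 128 byte line buffer in sdma_print_mem() has plenty of room.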
Dumping context
This is the poor man’s debugger, but it’s pretty useful. A “done 3” instruction can be seen as a breakpoint, and the context dumped to the kernel log with this function:
static int sdma_print_context(struct sdma_engine *sdma, int channel)
{
	int i;
	struct sdma_context_data *context;
	u32 *reg;
	unsigned char line[128];
	int pos = 0;
	int start = 0x800 + (sizeof(*context) / 4) * channel;
	int len = sizeof(*context);
	const char *regnames[22] = { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
				     "mda", "msa", "ms", "md",
				     "pda", "psa", "ps", "pd",
				     "ca", "cs", "dda", "dsa", "ds", "dd" };

	context = kzalloc(len, GFP_KERNEL);
	if (!context)
		return -ENOMEM;

	sdma_fetch_datamem(sdma, context, len, start);

	printk(KERN_WARNING "pc=%04x rpc=%04x spc=%04x epc=%04x\n",
	       context->channel_state.pc,
	       context->channel_state.rpc,
	       context->channel_state.spc,
	       context->channel_state.epc
		);

	printk(KERN_WARNING "Flags: t=%d sf=%d df=%d lm=%d\n",
	       context->channel_state.t,
	       context->channel_state.sf,
	       context->channel_state.df,
	       context->channel_state.lm
		);

	reg = &context->gReg[0];

	for (i=0; i<22; i++) {
		if ((i % 4) == 0)
			pos = 0;

		pos += sprintf(&line[pos], "%s=%08x ", regnames[i], *reg++);

		if (((i % 4) == 3) || (i == 21))
			printk(KERN_WARNING "%s\n", line);
	}

	kfree(context);

	return 0;
}
Clashes with Linux’ SDMA driver
Playing around with the SDMA subsystem directly is inherently problematic, since the assigned driver may take contradicting actions, possibly leading to a system lockup. Running custom scripts using the existing driver isn’t possible, since it has no support for that as of kernel 2.6.38. On the other hand, there’s a good chance that the SDMA driver wasn’t enabled at all when the kernel was compiled, in which case there is no chance for collisions.
The simplest way to verify if the SDMA driver is currently present in the kernel, is to check in /proc/interrupts whether interrupt #6 is taken (it’s the SDMA interrupt).
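If you’d rather do this check from a program than by eye, something like the following sketch will do. It just looks for a line that begins with the interrupt number followed by a colon, which is how lines in /proc/interrupts start. The function name is made up.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if a line for interrupt number 'irq' appears in the
 * stream, 0 otherwise. Lines that don't begin with a number
 * (the CPU header, "Err:" and such) are skipped. */
static int irq_listed(FILE *f, int irq)
{
	char line[1024];

	assert(f != NULL);

	while (fgets(line, sizeof(line), f)) {
		char *end;
		long n = strtol(line, &end, 10); /* Skips leading spaces */

		if (end != line && *end == ':' && n == irq)
			return 1;
	}

	return 0;
}
```

To use it, open /proc/interrupts with fopen() and call irq_listed() on the file pointer with 6 as the second argument.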
The “imx-sdma” pseudodevice is always registered on the platform pseudobus (I suppose that will remain in the transition to Open Firmware), no matter the configuration. It’s the driver which may not be present. The “i.MX SDMA support” kernel option (CONFIG_IMX_SDMA) may not be enabled (it can be a module). Note that it depends on the general “DMA Engine Support” (CONFIG_DMADEVICES), which may not be enabled to begin with.
Anyhow, for playing with the SDMA module, it’s actually better when these are not enabled. In the long run, maybe there’s a need to expand imx-sdma.c, so it supports custom SDMA scripting. The question remaining is to what extent it should manage the SDMA RAM. Well, the real question is if there’s enough community interest in custom SDMA scripting at all.