The truth is, there is no need to do this manually: lspci does the work for us. But looking into the configuration table once and for all helps demystify the issue. So here we go.
According to the PCIe spec (section 7.8), the max_payload_size the card can take is given in the PCIe Device Capabilities Register (Offset 0x04 in the PCI Express Capability structure), bits 2-0. Basically, take that three-bit field as a number, add 7 to it, and you have the log-2 of the number of bytes allowed.
The actual value used is set by the host in the Device Control Register (Offset 0x08 in the PCI Express Capability structure). It's the same drill, but with bits 7-5 instead. So in C it would go something like the sketch below, with devcap and devctl standing for the two registers' raw values:
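max_payload_supported = 128 << (devcap & 0x07);        /* Device Capabilities, bits 2-0 */
max_payload_in_effect = 128 << ((devctl >> 5) & 0x07); /* Device Control, bits 7-5 */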
OK, so how can we find these registers? How do we find the structure? Let's start by dumping the hexadecimal representation of the 256-byte configuration space. Using lspci -xxx on a Linux machine we get the dump for all devices, but we'll focus on one specific device.
The first important thing to know about lspci -xxx output on a little-endian machine (x86 processors included) is that PCI and PCIe work in big endian, and that the data is shown as little-endian DWs (32-bit unsigned ints). So the way to look at the output is in groups of four bytes each, taking each group as a little-endian unsigned int whose bit map matches the spec.
For example, according to the spec, bits 15-0 of the word mapped at 00h are the Vendor ID, and bits 31-16 are the Device ID. So we take the first four bytes as a little-endian 32-bit integer, and get 0x123410ee. Bits 15-0 are indeed 0x10ee, Xilinx's vendor ID, and bits 31-16 are 0x1234, which is the Device ID I made up for a custom device. So far so good.
Now we need to find the PCI Express Capability structure. It's one of the structures in a linked list (would you believe that?), and it's identified by a Cap ID of 0x10.
The pointer to the list is at bits 7-0 of the configuration word at 0x34. In our little-endian representation above, it's simply the byte at 0x34, which says 0x40. The capabilities hence start at 0x40.
From here on, we can travel along the list of capability structures. Each starts 32-bit aligned, with the header always having the Capability ID on bits 7-0 (appears as the first byte above), and a pointer to the next structure in bits 15-8 (the second byte).
So we start at offset 0x40, finding it's of Cap ID 0x01, and that the byte at offset 0x41 tells us that the next entry is at offset 0x48. Moving on to offset 0x48 we find Cap ID 0x05 and the next entry at 0x58. The entry at 0x58 is with Cap ID 0x10 (!!!), and it's the last one (pointer to next is zero).
So we found our structure at 0x58. The Device Capabilities Register is hence at 0x5c (offset 0x04) and reads 0x00288fc2. The Device Control Register is at 0x60 (offset 0x08), and reads 0x00002810.
So we learn from bits 2-0 of the Device Capabilities Register (having the value 2) that the device supports a max_payload_size of 512 bytes. But bits 7-5 (having the value 0) of the Device Control Register tell us that the effective maximal payload is only 128 bytes.
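For the record, here's a sketch in C of the same walk, reading the configuration space from sysfs (the device's sysfs path below is just an example, and reading all 256 bytes may require root):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
  uint8_t cfg[256];
  uint32_t devcap, devctl;
  uint8_t pos;
  FILE *f = fopen("/sys/bus/pci/devices/0000:01:00.0/config", "rb");

  if (!f) {
    perror("fopen");
    return 1;
  }
  if (fread(cfg, 1, sizeof(cfg), f) != sizeof(cfg)) {
    fprintf(stderr, "Failed to read the full configuration space (root needed?)\n");
    return 1;
  }
  fclose(f);

  /* Walk the capability list, starting at the pointer held at 0x34 */
  for (pos = cfg[0x34]; pos && pos < 0xf0; pos = cfg[pos + 1]) {
    if (cfg[pos] != 0x10) /* Looking for the PCI Express capability */
      continue;

    memcpy(&devcap, &cfg[pos + 4], 4); /* Device Capabilities Register */
    memcpy(&devctl, &cfg[pos + 8], 4); /* Device Control Register */

    /* A little-endian host is assumed, as in the discussion above */
    printf("Supported max_payload_size: %d bytes\n", 128 << (devcap & 7));
    printf("Effective max_payload_size: %d bytes\n", 128 << ((devctl >> 5) & 7));
    return 0;
  }

  fprintf(stderr, "No PCI Express capability found\n");
  return 1;
}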
Getting the info with lspci
As I mentioned above, we didn’t really need to find the addresses by hand. lspci -v gives us, for the specific device:
# lspci -v
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
Subsystem: Xilinx Corporation Generic FPGA core
Flags: bus master, fast devsel, latency 0, IRQ 42
Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
Capabilities: [40] Power Management version 3
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
Capabilities: [58] Express Endpoint IRQ 0
Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00
So the address to the PCI Express capabilities structure is given to us, but not the internal details (maybe some newer version of lspci does). And by the way, the size=128 above has nothing to do with maximal payload: It’s the size of the memory space allocated to the device by BIOS (BAR address space, if we’re into it).
For the details, including the maximal payload, we use the lspci -vv option.
So there we have it, black on white: the device supports a 512-byte MaxPayload, but further down, MaxPayload is given as 128 bytes.
Impact on performance
A 128-byte maximal payload is not good news if one wants to get the most out of the bandwidth. By the way, switches are not permitted to split packets (but the Root Complex is allowed) so this number actually tells us how much overhead each TLP (Transaction Layer Packet) carries. I talk about the TLP structure in another post.
Let's make a quick calculation: each packet comes with a header of 3 DWs (a DW is a 32-bit word, right?) when using 32-bit addressing, and a header of 4 DWs for 64-bit addressing. Let's be nice and assume 32-bit addressing, so the header is 3 DWs.
TLPs may optionally carry a one-DW TLP digest (ECRC), which is generally a stupid idea if you trust the switching chipsets not to mess up your data. Otherwise, the Data Link layer’s CRC should be enough. So we’ll assume no TLP digest.
The Data Link layer overhead is a bit more difficult to estimate, because it has its own housekeeping packets. But since most acknowledge and flow control packets go in the opposite direction and hence don’t interfere with a unidirectional bulk data transmission, we’ll focus on the actual data added to each TLP: It consists of a 2-byte header (partially filled with a TLP sequence number) and a 4-byte LCRC.
So assuming a 3-DW header, the overhead is 12 bytes for the TLP header plus another 6 bytes from the Data Link layer. All in all, we have 18 bytes, which takes up ~12% of the traffic when the TLP carries a 128-byte payload, but only ~3.4% with a 512-byte payload.
For a 1x configuration, which has 2.5 Gbps on the wires and an effective 2.0 Gbps (8b/10b coding), we could dream about 250 MBytes/sec. But when the TLPs are 128 bytes long each, our upper limit goes down to some ~219 MBytes/sec. With 512-byte TLPs it's ~241 MBytes/sec. Does it matter at all? I suppose it depends. In benchmark testing, it's important to know these limits, or you start thinking something is wrong when it's actually the packet network limiting the speed.
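For the skeptic, a quick sketch of the arithmetic in C (the 18 bytes being the per-TLP overhead counted above, and 250 MBytes/sec the raw 1x rate):

static double max_rate(double payload_bytes)
{
  /* Fraction of the link carrying actual payload, times the raw 250 MB/s */
  return 250e6 * payload_bytes / (payload_bytes + 18.0);
}

/* max_rate(128) gives ~219 MBytes/sec, and max_rate(512) gives ~241 MBytes/sec */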
The textbook says that if one wants to compile a module against a kernel, the headers must be there. Those who run distribution kernels are urged to apt-get or yum-install something, and their trouble is over. People like me, who cook their own food and download vanilla kernels, need to handle this themselves.
In the old times, I used to have the kernel compiled on the disk, but nowadays the kernel subdirectory takes a few Gigabytes after compilation, so one has to do a “make clean” sooner or later. Which should be OK, since the following comment can be found in the kernel source’s Makefile, in its very root directory:
# make clean Delete most generated files
# Leave enough to build external modules
But unfortunately, it's not possible to compile any module after "make clean". It blows up with a nasty error message. (Ehm, see the update at the bottom.)
As it turns out, “make clean” removes two header files, whose absence kills the possibility to compile a module: include/generated/asm-offsets.h and include/generated/bounds.h. As a matter of fact, the removal of these two files is the only change in the “include” subdirectory.
So a quick workaround is to make a copy of the “include” subdirectory, run “make clean” and then restore “include” to its pre-make-clean state.
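Something along these lines (a sketch; run it at the kernel tree's root, and pick whatever name and location you like for the copy):

$ cp -a include ../include.pre-clean
$ make clean
$ rm -rf include
$ mv ../include.pre-clean include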
Which makes you wonder why those files are removed in the first place. Someone overlooked this issue? No, no and no. Is there any real reason to remove these files? I don't know. Has this issue been fixed since 2.6.35? I'll check it sometime.
If you know something I don’t, please comment below.
Update (Oct 1st, 2015):
$ make prepare scripts
on the machine that the kernel runs on solves this on v3.16.
Update (May 4th, 2016): Maybe this?
$ make modules_prepare
Will try next time. See Documentation/kbuild/modules.txt on your favorite kernel tree.
It may look like a stupid idea, since CRC32 has already been implemented in several Perl modules which can be downloaded from CPAN. But all the implementations I found involve an XS part, which essentially means that a compilation of C code has to take place during the installation.
In other words, these modules can't just be attached to a bundle of Perl code; they have to be installed on a machine where that's permitted, or at least where compilation capabilities are available. Which can turn into a mess if the target is a Windows-running computer that barely has Perl installed.
So I adapted one of the implementations of the most commonly used CRC32 calculation, and wrote it in pure Perl. It's really not an efficient way to obtain a CRC, and I wouldn't try it on long sequences. Its advantage is also its disadvantage: it's all in Perl.
Before giving the function away, I’d just point out that if one tries the following snippet
#!/usr/bin/perl
use warnings;
use strict;
require String::CRC32;
require Digest::CRC;
my $str = "This is just any string.";
my $refcrc = String::CRC32::crc32($str);
my $refcrc2 = Digest::CRC::crc32($str);
my $mycrc = mycrc32($str);
then $refcrc, $refcrc2 and $mycrc will have the same value (the last of which is calculated by the function at the end of this post).
And while we're at it, I'd also point out that for any string $str, the following code
my $append = mycrc32($str) ^ 0xffffffff;
my $res = mycrc32($str.pack("V", $append));
yields a CRC in $res which equals 0xffffffff. This is a well-known trick with CRCs: append the CRC to the string for which it was calculated, and get a fixed value. And by the way, the CRC is done in a little-endian setting, so the parallel Linux kernel operation would be crc32_le(~0, string, len), only that crc32_le returns the value XORed with 0xffffffff (or is it the Perl version being upside down? I don't know). Also note that the initial value of the CRC is set to ~0, which is effectively 0xffffffff again.
OK, time to show the function. It's released into the public domain (Creative Commons' CC0, if you like), so feel free to make any use of it.
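A minimal pure-Perl sketch of such a function goes like this: a plain bitwise implementation of the standard reflected CRC-32 (polynomial 0xedb88320, initial value ~0, result XORed with 0xffffffff), which is the same variant that String::CRC32::crc32 computes:

sub mycrc32 {
  my ($input) = @_;
  my $crc = 0xffffffff;   # Initial value, effectively ~0

  foreach my $byte (unpack('C*', $input)) {
    $crc ^= $byte;
    for (1..8) {
      if ($crc & 1) {
        $crc = ($crc >> 1) ^ 0xedb88320;   # The reflected CRC-32 polynomial
      } else {
        $crc >>= 1;
      }
    }
  }
  return $crc ^ 0xffffffff;   # Final XOR, matching String::CRC32 and zlib's crc32()
}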
It all starts with this: I’m not ready to return from my character device’s release() method before I know that the underlying hardware has acknowledged the shutdown. It is actually expected to do so quickly, so I relied on a wait_event_interruptible() call within the release method to do the short wait for the acknowledge.
And it actually worked well, until specific cases where I hit CTRL-C while a read() was blocking. I'm not exactly sure why, but if the signal arrived while read() was blocking on wait_event_interruptible(), the signal wouldn't be cleared, so release() was called with the signal still pending, as was quite evident from this little snippet in the release() code:
if (signal_pending(current))
printk(KERN_WARNING "Signal is pending on release()...\n");
… which ended up with the wait_event_interruptible() in the release() method returning immediately, yelling that the hardware didn’t respond.
Blocking indefinitely is out of the question, so the simple workaround is to sleep another 100ms if wait_event_interruptible() returns prematurely, and then check if the hardware is done. That should be far more than needed for the hardware, and a fairly small time penalty for the user.
So the waiting part in release() now goes:
if (wait_event_interruptible(fstr->wait, (!fstr->flag)))
msleep(100);
The cute trick here is that the sleep takes place only in the event of an interrupt, so in a normal release() call we quit much faster.
It looks like inferring RAMs and ROMs is the weak spot of XST. This is the second bug I've found using this synthesizer, this time on XST M.63c, coming with ISE Release 12.2. The previous bug was ROM creation from a case statement. But hey, that was two years ago.
This time the code says (irrelevant parts eliminated):
reg [3:0] writeidx;
reg [31:0] buf_w0;
reg [31:0] buf_w1;
reg buf_wen;
reg buf_wen_d;
reg [31:0] buffer[0:15];
reg [3:0] counter;
if (buf_wen)
begin
buffer[writeidx] <= buf_w0;
writeidx <= writeidx + 1;
end
else if (buf_wen_d)
begin
buffer[writeidx] <= buf_w1;
writeidx <= writeidx + 1;
end
The slightly nasty thing about this clause is that “buffer” is an inferred distributed RAM (i.e. implemented in slices) because it’s small, and there’s an “if” statement which controls what is written to it. This messed things up. I’ll forgive the synthesizer for failing to optimize away RAM elements that clearly have a constant value of zero, since their input is always zero. What I can’t leave alone is that it created wrong logic. In particular, it completely ignored the existence of buf_w0, and generated code as if only the buf_w1 assignment existed. As a matter of fact, buf_w0 wasn’t even mentioned in the synthesis report. There was no warning about its disappearance. Like a good-old Soviet elimination. I was lucky enough to read the synthesis warnings to learn that a register, which drives buf_w0, was optimized out, and I couldn’t understand why. Until I checked what happened in FPGA editor, and saw that buf_w0 had gone up in smoke.
And here’s the silly workaround that fixed it. The code is logically equivalent, of course, but feeds XST with what I really want: A mux. Hurray. Not.
if (buf_wen || buf_wen_d)
begin
buffer[writeidx] <= buf_wen ? buf_w0 : buf_w1;
writeidx <= writeidx + 1;
end
Update: The project is up and running, available for a large range of FPGAs. Click here to visit its home page.
Over the years in which I’ve worked on FPGA projects, I’ve always been frustrated by the difficulty of communicating with a PC. Or an embedded processor running a decent operating system, for that matter. It’s simply amazing that even though the FPGA’s code is implemented on a PC and programmed from the PC, it’s so complicated to move application data between the PC and the FPGA. Having a way to push data from the PC to the FPGA and pull other data back would be so useful not only for testing, but it could also be the actual application.
So wouldn’t it be wonderful to have a standard FIFO connection on the FPGA, and having the data written to the FIFO showing up in a convenient, readable way on the PC? Maybe several FIFOs? Maybe with configurable word widths? Well, that’s exactly the kind of framework I’m developing these days.
Xillybus usage example
The usage idea is simple: A configurable core on the FPGA, which exposes standard FIFO interface lines to its user. On the other side we have a Linux-running PC or embedded processor, where there is a /dev device file for each FIFO in the FPGA. When a byte is written to the FIFO in the FPGA, it’s soon readable from the device file. Data is streamed naturally and seamlessly from the FIFO on the FPGA to a simple file descriptor in a userspace Linux application. No hassle with the I/O. Just simple FPGA design on one side, and a simple application on the Linux machine.
Ah, and the same goes for the reverse direction, of course.
The transport is a PCI Express connection. With certain Spartan-6 and Virtex 5/6 devices, this boils down to connecting seven pins from the FPGA to the processor’s PCI Express port, or to a PCIe switch. Well, not exactly. A clock cleaner is most probably necessary. But it’s seven FPGA pins anyhow, with reference designs to copy from. It’s quite difficult to get this wrong.
No kernel programming will be necessary either. All that is needed is to compile a certain kernel module against the headers of the running Linux kernel. In many environments, this merely consists of typing "make" at the shell prompt. Plus copying a file or two.
So all in all, the package consists of a healthy chunk of Verilog code, which does the magic of turning plain FIFO interfaces into TLPs on the PCIe bus, and a kernel module on the Linux machine’s side, which talks with the hardware and presents the data so it can be read with simple file access.
If you could find an IP core like this useful, by all means have a look at the project's home page.
I know, I know. I have a very old cellular phone. But since I have enough electronic toys, I couldn’t care less about turning my phone into one. And it happens to be a good one.
Everything was OK until it failed to start. Or more precisely, it started, and then restarted itself.
And again. And again. It turned out that a defective MicroSD flash card caused it to go crazy. So I replaced the card, and everything looked fine again. But then it had a horrible relapse: it went back to this restarting pattern again, but this time taking out the MicroSD card didn't help. What turned out to be really bad was that it was impossible to connect it to a computer through USB for backup, because it would restart all the time.
It wasn't a power supply thing. I learned that from the fact that when the phone was started without a SIM, it asked me whether it should start even so. And it didn't restart as long as I didn't press any button in response to that question.
So it was clear that the phone did something that went wrong a few seconds after being powered on. So the trick was to prevent it from getting on with its booting process, but still allow a USB connection.
Connecting the USB cord while in any of the pre-start menus turned out useless (Use without SIM? Exit from Flight mode?). So I looked a bit at the codes.
What did eventually work was to use the *#06# code, which is used to check the IMEI. The phone showed me the serial number and didn't restart, and when I plugged in the USB cord, I got the usual menu allowing me to choose a mode. From there on it was a lot of playing around, trying and retrying, until I finally recovered my phone list.
This also made it possible to reprogram the handset with Nokia’s Phoenix software, which didn’t work otherwise. Neither did the Green-*-3 three finger salute for a deep reset nor the infamous *#7370# code for the same purpose. These two never did anything, even when the phone appeared to be sane.
I should point out that this trick may well have solved a very specific issue with my own phone's internal mess-up; still, I thought it was best to have it written down for a rainy day.
lspci is quite well-known. What is less known is that it can be used to get the tree structure of bridges and endpoints with
$ lspci -tv
lspci can also be used to dump any card's configuration space in hex using the -x or -xxxx flag, possibly along with -v for cleartext info. The little trick with -x is that the byte ordering is wrong on little-endian systems, and reflects the host's byte ordering rather than PCI's native big endian.
setpci is useful to look at specific registers in the card’s configuration space (and possibly alter them). For example, look at two registers of a PCI card picked by Vendor/Product IDs (no need to be root on my computer):
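Something along these lines, say (a sketch, using the made-up 10ee:1234 Vendor/Device ID from the configuration space discussion above; COMMAND and STATUS are register names setpci recognizes):

$ setpci -d 10ee:1234 COMMAND
$ setpci -d 10ee:1234 STATUS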
__devexit_p() is a macro that turns into NULL when the module is compiled into the kernel itself (as opposed to a loadable module) and the kernel isn't hotplug-capable. Since the module can't ever exit, this gives the compiler an opportunity to optimize away the function altogether, I suppose.
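A typical usage sketch, with the mydriver_* names being placeholders (this applies to kernels old enough to still have __devexit; newer ones have dropped it):

static void __devexit mydriver_remove(struct pci_dev *pdev)
{
  /* Unmap registers, release regions etc. */
}

static struct pci_driver mydriver_pci = {
  .name     = "mydriver",
  .id_table = mydriver_pci_ids,          /* Defined elsewhere */
  .probe    = mydriver_probe,            /* Defined elsewhere */
  .remove   = __devexit_p(mydriver_remove), /* NULL if the remove function is compiled out */
};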
Talking with your PCI device in 5 steps
(or: How to initialize a PCI device in the probe function)
Enable the device: pci_enable_device(pdev);
Check that it’s mapped as you expect it (optional): if (!(pci_resource_flags(pdev, the_bar_you_want) & IORESOURCE_MEM)) { fail here }
Declare yourself the exclusive owner of the device (give or take /dev/mem): pci_request_regions(pdev, “your-driver-name”);
Get a handle for I/O operations. Don’t let the function’s name confuse you. It applies to memory BARs as well as I/O BARs: pointer = pci_iomap(pdev, the_bar_you_want, length);
And when this is successful, reading and writing is possible with iowrite32(i, pointer) or ioread32(pointer).
"pointer" is just a pointer to any type (considered void by the kernel), and there are 16- and 8-bit operations as well. Keep in mind that the PCI bus is big endian, so when putting a 32-bit number on the bus and reading it as 32 bits on an x86 machine, you get it twisted around.
It's also common to allow the device to be a master on the bus (necessary for DMA access and MSI interrupts): pci_set_master(pdev);
That was important for me, because I went for a heavy setup process involving DMA transfers in the very beginning. The following encouraging sentence was found in Documentation/pci.txt, section 1:
The probe function always gets called from process context, so it can sleep.
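Putting the five steps together, a minimal probe() could look roughly like the sketch below (MY_BAR is a placeholder for the BAR number, and the error unwinding is kept to the bare minimum):

static void __iomem *registers;

static int mydev_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
  int rc;

  rc = pci_enable_device(pdev);
  if (rc)
    return rc;

  if (!(pci_resource_flags(pdev, MY_BAR) & IORESOURCE_MEM)) {
    rc = -ENODEV;
    goto no_regions;
  }

  rc = pci_request_regions(pdev, "mydev");
  if (rc)
    goto no_regions;

  registers = pci_iomap(pdev, MY_BAR, 0); /* 0 = map the whole BAR */
  if (!registers) {
    rc = -EIO;
    goto failed_iomap;
  }

  pci_set_master(pdev); /* Needed for DMA and MSI */

  iowrite32(1, registers); /* Write to the first register, just to show off */

  return 0;

failed_iomap:
  pci_release_regions(pdev);
no_regions:
  pci_disable_device(pdev);
  return rc;
}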
When PCIe card isn’t responsive
If the PCIe interface logic does nothing about requests sent to a legal BAR address, it’s OK with iowrite32() operations (since they’re posted, so nobody bothers to check what happened with them), but ioread32() will make my computer freeze. It’s a complete kernel crash without even an oops message. It looks like the processor locks on waiting for the completion packet, which never arrives.
Conclusion: Messing with the FPGA while the host PC is on will most likely hang the entire system, if the host attempts to do something with that interface.
This is actually surprising, because section 2.8 of the PCI Express spec 1.1 is very clear about a mandatory timeout mechanism, and that “The Completion Timeout timer must expire if a Request is not completed in 50 ms.” Is this my G31 chipset not meeting spec?
DMA through PCI calls?
Several writeouts resembling the kernel’s own DMA-API-HOWTO.txt are out there. I go for this one, because it says a few words about using the PCI-specific functions, which are used extensively in the kernel code.
__get_free_page(s) vs. alloc_page(s)
That's confusing. Both get you either one page of PAGE_SIZE memory (4096 bytes in most cases) or several of them (the -s versions), but __get_free_pages returns the actual (virtual) address, like kmalloc does, while alloc_pages returns page information via a struct page pointer. For DMA, the former's memory chunk is mapped with e.g. pci_map_single(), but the latter with dma_map_page(). Some more details here.
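A sketch of the two flavors side by side (pdev being the usual struct pci_dev pointer, and the DMA direction picked arbitrarily):

/* Flavor 1: __get_free_pages() returns a kernel virtual address */
unsigned long vaddr = __get_free_pages(GFP_KERNEL, 0); /* order 0 = one page */
dma_addr_t bus_addr = pci_map_single(pdev, (void *) vaddr,
                                     PAGE_SIZE, PCI_DMA_FROMDEVICE);

/* Flavor 2: alloc_pages() returns a struct page pointer */
struct page *pg = alloc_pages(GFP_KERNEL, 0);
dma_addr_t bus_addr2 = dma_map_page(&pdev->dev, pg, 0,
                                    PAGE_SIZE, DMA_FROM_DEVICE);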
MSI interrupt handlers
Unlike with classic IRQ handlers, which may be called just because a shared interrupt fired, there's no reason to check whether an interrupt is really pending. Yet another good reason to use MSI.
Should the "flag" entry be marked as "volatile" in the privdata structure? I mean, an interrupt will change its value, so "volatile" is the old way to handle this. The answer is no, no, and again no. One should trust the kernel's API to handle the volatility issue, since the kernel's underlying opinion (which seems pretty justified) is that "volatile" is inherently bad.
Mutexes and spinlocks
Things I always heard people say about mutexes, and which are still worth emphasizing:
Use as fine a mutex granularity as possible, as long as things are structured enough not to get messed up into a deadlock.
Set up a logical who's-locked-first scheme, and write it as a comment in the code. It goes something like mutex_rd -> mutex_wr -> spinlock_rd -> spinlock_wr and so on. So it's absolutely clear that if mutex_wr is taken, mutex_rd must not be asked for, but spinlock_wr may be. One is allowed to skip mutexes in the list, but not to go backwards.
Whenever possible, release mutexes before going to sleep (waiting for events in particular, and even more so if the waiting has no timeout). For example, waiting for data in a read() method handler with the mutex taken may sound reasonable if the mutex is only related to read() operations, but what if the device is open()ed a second time during this sleep? All of a sudden the open() blocks for no apparent reason.
… but this makes things a bit tricky: we're woken up because our condition was met (say, data arrived), but while we were trying to take the mutex for handling the event, the data was read() by some other process, and the state data structure was also completely changed. So after regaining the mutex, the overall situation must be reassessed from scratch (see the sketch after this list).
And of course, no sleeping with spinlocks on.
Use the wait_event_* functions' return values to tell why we were woken up, and don't rely on the condition tested for that. In particular if we sleep without the mutex, because the flag may be altered by another process between the moment the event was triggered and the moment our process actually gets the CPU.
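A sketch of how the last few points play out in a read() method (dev->mutex, dev->wait and dev->data_available being made-up fields in the driver's private data):

while (1) {
  if (mutex_lock_interruptible(&dev->mutex))
    return -ERESTARTSYS;

  if (dev->data_available)
    break; /* Move on, with the mutex held */

  mutex_unlock(&dev->mutex); /* Don't sleep with the mutex taken */

  if (wait_event_interruptible(dev->wait, dev->data_available))
    return -ERESTARTSYS; /* Rely on the return value: a signal arrived */

  /* Loop: another process may have grabbed the data meanwhile */
}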
This one really pissed me off. I installed Cygwin recently on an XP machine. All kinds of .cyg000(mumbo jumbo) directories started to show up, and I had no idea why.
Then I got it: The rm command in recent Cygwin (versions > 1.7?) is “friendly”: Rather than unlinking the files, they are moved to a “hidden” directory. Isn’t it wonderful that Cygwin is nice to me? Not. I want the utilities to do what I expect them to do, nothing else. One day, I hope, people will realize that making “improvements” in software is just a way to make things break. In this case, it was my “make clean” not doing what I expected.
I don't know if there is an environment variable to fiddle with. If you do, please comment below. Impatient as I was to get rid of these pests, I downloaded the clean GNU version of rm for Windows (a non-Cygwin one) and kissed the old one goodbye. As simple as that.
It all started with a blue screen (Windows, right?), which seems to have had something to do with Firefox. Anyhow, after that crash Thunderbird told me it couldn't open the abook.mab file, and hence my contacts were lost. Which means it won't autocomplete email addresses as I type them, which I'm not ready to accept.
The solution was given in a forum thread in the form of a PHP script. Perl would be the correct language to do this, but since it saved my bottom, who am I to complain.
Since it’s so good, I hope it’s OK that I’ll repeat it here. It’s by the Ubuntu forum user mikerobinson:
<?php
error_reporting(E_ALL);
$abook = file_get_contents('abook.mab-1.bak');
preg_match_all('/\((.*)\)/Ums', $abook, $matches);
$matches = $matches[1];
foreach ($matches as $key => $match) {
$entry = explode('=', $match);
if (isset($entry[1]) && strlen($entry[1]) > 4 && !isset($skipnext)) {
$entry[1] = str_replace("\\\n", '', $entry[1]);
$entry[1] = str_replace('\\\\', '', $entry[1]);
$entry[1] = str_replace('\\', ')', $entry[1]); // the backslashes SHOULD be at the end of each line
// Unicode characters
if (strstr($entry[1],'$')) {
$entry[1] = str_replace('$', "\\x", $entry[1]);
$entry[1] = preg_replace("#(\\\x[0-9A-F]{2})#e", "chr(hexdec('\\1'))", $entry[1]);
}
$matches[$key] = utf8_decode($entry[1]);
if (strstr($entry[1],'@')) $skipnext = true;
}
else {
unset($matches[$key]);
unset($skipnext);
if (strstr($entry[1],'@')) $skipnext = true;
}
unset($entry);
}
$previous = null;
foreach ($matches as $match) {
if (strstr($match,'@')) {
if (strtolower($match) != strtolower($previous)) {
if (isset($addy)) $addressbook[] = array($match, end($addy));
else $addressbook[] = array($match, $match);
unset($addy);
$previous = $match;
}
}
else {
$addy[] = $match;
}
}
echo "First Name\tLast Name\tDisplay Name\tNickname\tPrimary Email\tSecondary Email\tScreen Name\tWork Phone\tHome Phone\tFax Number\tPager Number\tMobile Number\tHome Address\tHome Address 2\tHome City\tHome State\tHome ZipCode\tHome Country\tWork Address\tWork Address 2\tWork City\tWork State\tWork ZipCode\tWork Country\tJob Title\tDepartment\tOrganization\tWeb Page 1\tWeb Page 2\tBirth Year\tBirth Month\tBirth Day\tCustom 1\tCustom 2\tCustom 3\tCustom 4\tNotes\t";
foreach ($addressbook as $addy) {
echo "\t\t{$addy[1]}\t\t{$addy[0]}\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n";
}
After saving this script as savior.php, one goes, at the shell prompt (on Linux, right?):
# php savior.php > book.txt
And then manually edit any garbage away from the text file. Keep in mind that a title line should be kept as the first line.
And then go back to Thunderbird, click Tools > Address Book > Tools > Import… and import the file as a tab-separated file. That's it. The new addresses will be in a new folder, but who cares. Autocompletion is back!