Designed to fail: Ethernet for FPGA-PC communication
Just lucky?
I’ve been approached a few times with requests to design the FPGA part of an FPGA-to-PC link over Ethernet. The purpose of the link is typically transporting a large amount of data to the PC. The application varies from continuous data acquisition to frame grabbing or transport of a raw video image stream. What these applications have in common is that the client expects a reliable, easy-to-implement data channel. Just send the packets to the broadcast MAC address, and you’re done.
When doubting the reliability of this solution, I usually get the “I know from previous projects that it works” argument. I can’t argue with their previous success. But there’s a subtle difference between “it works” and “it’s guaranteed to work”. To a serious FPGA engineer, that’s the difference between “maybe I was lucky this time” and “this is a project I’m ready to release”.
Ethernet is inherently unreliable
The most important thing to know about Ethernet (at any data rate) is that it was never meant to be reliable. As a physical layer for networks, the underlying assumption is that a higher protocol layer detects packet drops and issues retransmissions as necessary. Put simply, this means that an Ethernet chip which drops a packet every now and then is considered 100% OK.
Since packet losses cause a certain data rate performance hit (e.g. TCP/IP streams are halted for a short period), efforts are made to keep them to a minimum. For example, the Ethernet 802.3 standard states 10⁻¹⁰ as the objective for the raw bit error rate on a copper-wire Gigabit Ethernet link (1000BASE-T, see section 40.1.1). That means that a packet drop every 10 seconds is considered within the standard’s objectives. Packet drops may also occur at the operating system level: the network stack may take the liberty of dropping packets simply because they didn’t arrive at a convenient time. This happens less when the computer is generally idle (i.e. in the lab) but may become more evident under load (that is, in real-life use).
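To see where that figure comes from, here’s a back-of-the-envelope calculation (my own sketch in C, not something taken from the standard). At Gigabit rates, a 10⁻¹⁰ raw bit error rate means one flipped bit roughly every ten seconds, and a single flipped bit is enough to fail the frame’s CRC and get the whole packet dropped:

#include <stdio.h>

int main(void)
{
  const double bit_rate = 1e9;   /* Gigabit Ethernet, bits per second */
  const double ber = 1e-10;      /* 802.3 objective for the raw bit error rate */

  double errors_per_second = bit_rate * ber;               /* = 0.1 errors/s */
  double seconds_between_errors = 1.0 / errors_per_second; /* ~10 seconds */

  /* A single bit error fails the Ethernet frame's CRC, so the frame is dropped */
  printf("About one bit error (one dropped packet) every %.0f seconds\n",
         seconds_between_errors);
  return 0;
}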
Prototype vs. production
Engineers are often misled into thinking that the link is reliable because they can’t see any packet drops on the particular prototype they’re working on. It’s also easy to overlook sporadic packet drops during the development stages. The problem becomes serious when reaching the production stage, when a no-errors system needs to be put on the table. Even worse, production copies of the system may suddenly start to fail once every few hours or so. The QA tests may not spot these issues, so the complaints may come from end users feeling there’s something wrong with their devices, which the vendor has no clue about. I mean, imagine your car’s dashboard going crazy for a second once a month, and the vendor insisting that this is impossible. Would you stay with that car?
Working around it
The natural way to work around Ethernet packet drops is either accepting the data loss or implementing a retransmission mechanism.
Living with the data loss is possible in e.g. one-shot data acquisition applications, when the trigger is recurrent. Say, if a single frame is grabbed from a video stream, and it’s OK to fail on the first attempt, that’s fine, as long as nobody notices the unexpected delay of 1/30th of a second.
Retransmissions may be significantly trickier, in particular if the data goes from the FPGA to the PC. The thing is that it takes some time for the PC to respond to a lost packet, and that time is effectively unbounded. For example, in today’s Linux implementations, the analysis of network packets is done in a tasklet context, and not by the interrupt service routine. Since tasklets are merely scheduled like a high-priority process, the latency until the packets are examined closely enough to detect a packet loss depends on how busy the computer is at that time.
One could hack the Ethernet card’s device driver to check a special field in each packet (say, a counter). Let’s say that the packet interrupt is handled within 10 μs, and that the packet loss is reported back to the FPGA in no time. This means the FPGA has to store at least 10 kbits’ worth of previous packets to support a Gigabit link. Actually, that’s fine: a Xilinx FPGA’s internal block RAM is more or less of that size. Too bad it’s not realistic.
And that’s because the underlying assumption of a 10 μs response time is problematic: any other kernel component can turn off interrupts while minding its own business (typically while holding a spinlock). This other component could be completely unrelated to the Ethernet application (a sound card driver?) and not be active at all when the link with the FPGA is tested in QA. And it could be something that doesn’t happen very often, so the sudden high latency becomes a rare bug to chase.
So taking a more realistic approach, it’s more like storing several megabytes of data to make sure all packets stay in memory until their safe arrival has been confirmed. This involves a controller for an external memory (a DDR SDRAM, presumably) and a nontrivial state machine for keeping track of the packet traffic. While meeting the requirement of being reliable, it’s not really easy to implement. Well, it’s easy to implement on a computer, which is the natural user of an Ethernet link. Not on an FPGA.
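To get a feel for the numbers (the latencies below are illustrative guesses of mine, not measured figures), here’s a small C sketch of how the required buffer grows with the worst-case acknowledgment latency on a Gigabit link:

#include <stdio.h>

int main(void)
{
  const double link_rate_bps = 1e9; /* Gigabit Ethernet */
  /* Hypothetical worst-case times until the PC confirms safe arrival */
  const double latencies_s[] = { 10e-6, 1e-3, 10e-3, 100e-3 };
  unsigned i;

  for (i = 0; i < sizeof(latencies_s) / sizeof(latencies_s[0]); i++) {
    /* Data still in flight (not yet acknowledged) must stay buffered */
    double bits = link_rate_bps * latencies_s[i];

    printf("%9.0f us of latency -> %10.0f bits (%8.1f kB) of FPGA buffering\n",
           latencies_s[i] * 1e6, bits, bits / 8.0 / 1024.0);
  }
  return 0;
}

With acknowledgment latencies in the tens of milliseconds, which a loaded system can easily reach, the buffer lands in the megabytes range, hence the external DDR SDRAM.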
The right way to do it
The correct thing to do is to use a link which was originally intended for communication between a computer and its peripherals. Several interfaces exist, but today the most appealing one is PCI Express, as it’s expected to be supported for many years ahead. Being the successor of good old PCI, its packet relay interface guarantees true reliability, which is assured by the PCIe fabric’s hardware.
The PCIe solution is often avoided because of the complexity of setting up the FPGA logic for transmission over the bus. This is no excuse in situations where Xillybus fits the scenario, as it provides a simple interface on both sides for transmission of data between an FPGA and its Linux host. If that’s not the case, the two choices are either to suck it up and write the PCIe interface yourself, or revert to using Ethernet, hoping for the best.
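To give a sense of what that simple interface looks like on the PC side, here’s a minimal sketch of a host program reading a Xillybus stream with plain file I/O. The device file name below is the one from the demo bundle; a custom IP core configuration may expose differently named streams:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

int main(void)
{
  /* Data written by the FPGA into its FIFO appears on this character device */
  int fd = open("/dev/xillybus_read_32", O_RDONLY);

  if (fd < 0) {
    perror("Failed to open Xillybus device file");
    return 1;
  }

  unsigned char buf[4096];

  while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));

    if (n < 0) {
      perror("read() failed");
      break;
    }
    if (n == 0) /* EOF: the FPGA side closed the stream */
      break;

    /* Do something useful with the n bytes that arrived from the FPGA */
    fwrite(buf, 1, n, stdout);
  }

  close(fd);
  return 0;
}

The FPGA side is symmetric: the application logic just writes into a FIFO, and the IP core, together with the driver, takes care of moving the data over PCIe reliably.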
Summary
I always say that it’s fairly easy to get an FPGA to do what you want on a lab prototype. There are a lot of engineers out there, not asking for too much money, who will do that for you. But when it comes to designing the system so it’s guaranteed to work, that’s a whole different story. That includes using correct design techniques in the FPGA logic’s HDL, constraining the timing correctly, ensuring that the timing requirements of the surrounding hardware are met as defined in their datasheets, etc. The difference isn’t visible at the prototype level, so quick-and-dirty FPGA work gets away with it in the beginning. It’s during the later stages, when temperature checks are run and the electronics is being duplicated, that things start to happen.
At that point, people tend to blame the FPGA for being an unreliable solution. Others adopt mystic rules such as the 90%-10% rule, basically saying that the real work starts in the last phase.
But the truth is that if you got it right in the first place, there’s no reason why things should go wrong during production. If the prototype was done professionally, turning it into a product is really not a big deal. And having said that, yes, sometimes people do everything wrong and just turn out lucky.
As for using Ethernet as a reliable link, it all boils down to whether you want to gamble on it.
Reader Comments
Hello,
I have read your project. It is very useful and helpful for me.
I have designed a board using an FPGA, a Spartan-3E (for the hardware-side algorithms), and a ColdFire module as a controller (MOD5282 module, with a real-time uC/OS as the OS) for data acquisition.
In my design I have interfaced the FPGA with the controller through a bus (data bus, address bus and control lines), and designed tri-state logic in the FPGA for the data interchange.
But now, for a new design with a higher data transfer rate and a larger, more complex hardware design, I want to use a Spartan-6 with the Avnet nano-ITX/Spartan-6 kit, with an embedded OS (Linux), which has a PCIe interface for data interchange.
I am new to this board. I have no idea how to interface my hardware design in the FPGA with PCIe, and how to read data from the PCIe bus on the processor board. If you could guide me, it would be very useful to me. I would like some example module for the controller to interface with PCIe.
This post suggests Xillybus as the solution, of course.
Hi,
Really nice post, thank you. What kind of link would you recommend between two FPGAs on separate boards? I need minimum latency and 400Mbps data rate, while the packets are about 1kB to 4kB, constant size, and no buffer can be kept in either FPGA. Is there any standard solution?
Thanks!
If you want a reliable packet delivery, PCIe is the preferred solution.
Otherwise, I would go for a single data-clock pair plus a couple of LVDS data pairs. 400 Mbps isn’t such a high rate, and can be implemented with plain logic running at 200 MHz and DDR I/O. To make life easier, you may want to work with the FPGA’s silicon SERDES if one is available, or even with one of those external serializer chips. There are plenty of those out there, primarily for use with Camera Link and LCD monitors.
Thank you!
I was going to implement the LVDS line indeed, using the DDR output. Glad that you mentioned it too :)
Thanks again!
Hello,
I just started working on PCIe with a Xilinx Spartan-6 LX75T, and I want to know if there is any packet sniffer (packet generator) for Windows.
Hello,
How can I know that the FPGA board has received data through the LAN? Is any specific IP address used for the FPGA’s LAN connection?
I appreciate you sharing this article. Thanks again. Really cool.