I’ve been approached a few times with requests to design the FPGA part of an FPGA-to-PC link over Ethernet. The purpose of the link is typically transporting a large amount of data to the PC. The application varies from continuous data acquisition to frame grabbing or transport of a raw video image stream. What these applications have in common, is that the client expects a reliable, easy-to-implement data channel. Just send the packets to the broadcast MAC address, and you’re done.
When doubting the reliability of this solution, I usually get the “I know from previous projects that it works” argument. I can’t argue with their previous success. But there’s a subtle difference between “it works” and “it’s guaranteed to work”. To a serious FPGA engineer, that’s the difference between “maybe I was lucky this time” and “this is a project I’m ready to release”.
Ethernet is inherently unreliable
The most important thing to know about Ethernet (in any data rate) is that is was never meant to be reliable. As a physical layer for networks, the underlying assumption is that a protocol layer detects packet drops and issues retransmissions as necessary. Put simply, this means that an Ethernet chip that drops a packet every now and then is considered 100% OK.
Since packet losses cause a certain data rate performance hit (e.g. TCP/IP streams will be halted for a short period), efforts are made to keep them to a minimum. For example, the Ethernet 802.3 standard states 10-10 as the objective for the raw bit error rate on a copper wire Gigabit Ethernet link (1000BASE-T, see section 40.1.1). That means that a packet drop every 10 seconds is considered within the standard’s objectives. Packet drops may also occur on the operating system level: The network stack may take the freedom to drop packets just because they didn’t arrive at a good time. This happens less when the computer is generally idle (i.e. in the lab) but may become more evident under load (that is, in real-life use).
Prototype vs. production
Engineers are often mislead to think that the link is reliable because they can’t see any packet drops on the particular prototype they’re working on. It’s also easy to overlook sporadic packet drops during the development stages. The problem becomes serious when reaching the production stage, when a no-errors system needs to be put on the table. Even worse, production copies of the system may suddenly start to fail once in a few hours or so. The QA tests may not spot these issues, so the complaints may come from end-users feeling there’s something wrong with their devices, which the vendor has no clue about. I mean, imagine your car’s dashboard going crazy for a second once a month, and the vendor insisting on that being impossible. Would you stay with that car?
Working it around
The natural way to work around Ethernet packet drops is either accepting the data loss or implementing a retransmission mechanism.
Living with the data loss is possible in e.g. one-shot data acquisition applications, when the trigger is recurrent. Say, if a single frame is grabbed from a video stream, and it’s OK to fail on the first attempt, that’s fine. As long as nobody feels the unexpected delay of 1/30th of a second.
Retransmissions may be significantly trickier, in particular if the data goes from the FPGA to the PC. The thing is, that it will take some time for the PC to respond on the lost packet, and that time may be unlimited. For example, in today’s Linux implementations, the analysis of network packets is done in a tasklet context, and not by the interrupt service routine. Since tasklets are merely scheduled as a high-priority process, the latency until the packets are analyzed closely enough to detect a packet loss depends on how busy the computer is at that time.
One could hack the Ethernet card’s device driver to check a special field in each packet (say, a counter). Let’s say that the packet interrupt is handled within 10 μs, and that the packet loss is reported back to the FPGA in no time. This means it has to store 10 kbits worth of previous packets (at least) to support a Gigabit link. Actually, that’s fine. A Xilinx FPGA’s internal RAM is more or less of that size. Too bad it’s not realistic.
And that’s because the underlying assumption of 10 μs response time is problematic, since any other kernel component can turn off interrupts while minding its own business (typically holding a spinlock). This other component could be completely unrelated to the Ethernet application (a sound card driver?) and not be active at all when the link with the FPGA is tested in QA. And it could be something not happening very often, so the sudden high latency becomes a rare bug to handle.
So taking a more realistic approach, it’s more like storing several megabytes of data to make sure all packets stay in memory until their safe arrival has been confirmed. This involves a controller for an external memory (a DDR SDRAM, I suppose) and some nontrivial state machine for keeping track of the packet traffic. While meeting the requirement of being reliable, it’s not really easy to implement. Well, easy to implement on a computer, which is the natural user of an Ethernet link. Not an FPGA.
The right way to do it
The correct thing to do, is to use a link which was originally intended for communication between a computer and its peripheral. Several interfaces exist, but today the most appealing one is PCI Express, as it’s expected to be supported for many years ahead. Being the successor of good old PCI, its packet relay interface guarantees true reliability, which is assured by the PCIe fabric’s hardware.
The PCIe solution is often avoided because of the complexity of setting up the FPGA logic for transmission over the bus. This is no excuse in situations where Xillybus fits the scenario, as it provides a simple interface on both sides for transmission of data between an FPGA and its Linux host. If that’s not the case, the two choices are either to suck it up and write the PCIe interface yourself, or revert to using Ethernet, hoping for the best.
I always say that it’s fairly easy to get an FPGA doing what you want on a lab prototype. There are a lot of engineers not asking for too much money out there, who will do that for you. But when it comes to designing the system so it’s guaranteed to work, that’s a whole different story. That includes using correct design techniques in the FPGA logic’s HDL, constraining the timing correctly, ensuring that the timing requirements of the surrounding hardware are met, as defined in their datasheets etc. The difference isn’t seen at the prototype level, so quick and dirty FPGA work gets away with it in the beginning. It’s during the later stages, when temperature checks are run and the electronics is being duplicated, that things start to happen.
At that point people tend to blame the FPGA for being an unreliable solution. Others adopt mystic rules such as the 9o%-10% rule, basically saying that the real work starts in the last phase.
But the truth is that that if you got it right in the first place, there’s no reason why things should go wrong during production. If the prototype was done professionally, turning it into a product is really not a big deal. And having said that, yes, sometimes people do everything wrong, and just turn out lucky.
As for using Ethernet as a reliable link, it all boils down to if you want to gamble on it.