Using MGTs in FPGA designs: Why the data is organized in packets

This post was written by eli on February 7, 2026
Posted Under: FPGA,GTX,PCI express,USB

Introduction

I’ll start with a correction: application logic that transmits data from one FPGA to another is indeed required to organize the data in some kind of packets or frames, but there’s one exception, which I’ll discuss later on: Xillyp2p. Anyhow, let’s take it from the beginning.

Multi-Gigabit Transceivers (MGTs, sometimes also referred to as RocketIO, GTX, GTH, GTY, GTP, GTM, etc.) have long since become the de facto standard for serialized data communication between digital components. The most famous use cases are between a computer and its peripherals (often between the CPU’s companion chip and a peripheral), for example PCIe, SuperSpeed USB (a.k.a. USB 3.x), and SATA. Also related to computers, Gigabit Ethernet (as well as 10GbE) is based upon MGTs, and the DisplayPort protocol can be used for connecting a graphics card to a monitor.

Many FPGAs are equipped with MGTs. These are often used for turning the FPGA into a computer peripheral (with the PCIe protocol, possibly using Xillybus, or with the SuperSpeed USB protocol, possibly using XillyUSB, or as a storage device with SATA). Gigabit Ethernet can also come into play, allowing the FPGA to communicate with a computer through this protocol. Another use of MGTs is for connecting to electronic components, in particular ADC/DAC devices with a very high sampling frequency, hence requiring a high data rate.

But what about communication between FPGAs? At times, there are several FPGAs on a PCB that need to exchange information among themselves, possibly at high rates. In other usage scenarios, there’s a physical distance between the FPGAs. For example, test equipment often has a hand-held probe containing one FPGA that collects information, and a second FPGA that resides inside the table-top unit. If the data rate is high, MGTs on both sides make it possible to avoid heavy, cumbersome and error-prone cabling. In fact, a thin fiber-optic cable is a simple solution when MGTs are used anyhow, and in some scenarios it also offers an extra benefit, besides being lightweight: Electrical isolation. This is particularly important in some medical applications (for electrical safety) or when long cables need to be drawn outdoors (to avoid damage from lightning strikes).

Among the annoying things about MGT communication there’s the fact that the data flow somehow always gets organized in packets (or frames, bursts, pick your name for it), and these packets don’t necessarily align properly with the application data’s natural boundaries. Why is that so?

This post attempts to explain why virtually all protocols (e.g. Interlaken, RapidIO, AMD’s Aurora, and Altera’s SeriaLite) require the application data to be arranged in some kind of packets that are enforced by the protocol. The only exception is Xillyp2p, which presents error-free continuous channels from one FPGA to another (or with packets that are sensible for the application data). This is not to say that packets aren’t used under the hood; it’s just that this packet mechanism is transparent to the application logic.

I’ll discuss a few reasons for the use of packets:

  • Word alignment
  • Error detection and retransmission
  • Clock frequency differences

Reason #1: Word alignment

When working with an MGT, it’s easy to forget that the transmitted data is sent as a serial data stream of bits. The fact that both the transmitting and receiving side have the same data word width might give the false impression that the MGT has some magic way of aligning the word correctly at the receiver side. In reality, there is no such magic. There is no hidden trick allowing the receiver to know which bit is the first or last in a transmitted word. This is something that the protocol needs to take care of, possibly with some help from the MGT’s features.

When 8b/10b encoding is used, the common solution is to transmit a synchronization word, often referred to as a comma, which is the K28.5 symbol. This method takes advantage of the fact that 8b/10b encoding uses 10 bits on the wire for each 8 bits of payload data. This leaves room for a small number of extra codes that can never be confused with regular data. These extra codes are called K-symbols, and K28.5 is one of them.

Hence if the bit sequence for a K28.5 symbol is encountered on the raw data link, it can’t be a data word. Most MGTs in FPGAs have a feature allowing them to automatically align the K28.5 word to the beginning of a word boundary. So word alignment can be ensured by transmitting a comma symbol. The comma symbol is often used to reset the scrambler as well, if such is used.
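To make the comma mechanism concrete, here’s a minimal Python sketch of what the MGT’s comma-alignment hardware does: scanning a raw bit stream for the K28.5 pattern to find the word boundary. The two 10-bit codes are the standard K28.5 encodings for both running disparities; the surrounding stream contents are made up for the example.

```python
# Sketch: finding word alignment by scanning for the K28.5 comma.
# This models in software what the MGT's alignment hardware does.

K28_5_RDNEG = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0]   # K28.5, negative running disparity
K28_5_RDPOS = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]   # K28.5, positive running disparity

def find_alignment(bits):
    """Return the word-boundary offset (0..9) of the first comma found, or None."""
    for offset in range(len(bits) - 9):
        window = bits[offset:offset + 10]
        if window == K28_5_RDNEG or window == K28_5_RDPOS:
            return offset % 10   # word boundary relative to the stream start
    return None

# A stream with 3 arbitrary leading bits, then a comma, then more data:
stream = [1, 0, 1] + K28_5_RDNEG + [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(find_alignment(stream))   # -> 3
```

Note that the same scan would never fire in the middle of regular data: the whole point of the K28.5 pattern is that it can’t occur at any offset in a stream of valid data symbols.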

Each protocol defines when the comma is transmitted. There are many variations on this topic, but they all boil down to two alternatives:

  • Transmitting comma symbols occasionally and periodically. Or possibly, as part of the marker for the beginning of a packet.
  • Transmitting comma symbols only as part of an initialization of the channel. This alternative is adopted by protocols like SuperSpeed USB and PCIe, which have specific patterns for initializing the channel, referred to as Ordered Sets for Training and Recovery. These patterns include comma symbols, among others.

Truth be told, if the second approach is taken, the need for word alignment isn’t a reason by itself for dividing the data into packets, as the alignment takes place once and is preserved afterwards. But the concept of initializing the channel is quite complicated, and is not commonly adopted.

There are other methods for achieving word alignment, in particular when 8b/10b encoding isn’t used. The principles remain the same, though.

Reason #2: Error detection and retransmission

When working with an MGT, bit errors must be taken into account. These errors simply mean that a ’0′ is received for a bit that was transmitted as a ’1′, or vice versa. In some hardware setups such errors may occur relatively often (with a rate of, say, 10⁻⁹, which at gigabit rates usually means about one error per second or more), and with other setups they may practically never occur. If an error in the application data can’t be tolerated, a detection mechanism for these bit errors must be in place at the very least, in order to prevent delivery of incorrectly received data to the application logic. Even if a link appears to be completely error-free judging by long-term experience, this can’t be guaranteed in the long run, in particular as electronic components from different manufacturing batches are used.
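As a quick sanity check of that figure: at a line rate of 1 Gbit/s, a bit error rate of 10⁻⁹ works out to about one bit error per second on average. The trivial arithmetic, in Python:

```python
# Back-of-the-envelope: expected bit errors per second on a link,
# given the line rate and the bit error rate (BER).
line_rate_bps = 1e9     # 1 Gbit/s
ber = 1e-9              # one error per 10^9 bits on average

errors_per_second = line_rate_bps * ber
print(errors_per_second)   # -> 1.0, i.e. roughly one bit error every second
```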

In order to detect errors, some kind of CRC (or other redundant data) must be inserted occasionally in order to allow the receiver to check if the data has arrived correctly. As the CRC is always calculated on a segment (whether it has a fixed length or not), the information must be divided into packets, even if just for the purpose of attaching a CRC to each.
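A minimal Python sketch of the idea: the transmitter appends a CRC to each packet, and the receiver recomputes it to decide whether the packet arrived intact. `zlib.crc32` stands in here for whatever CRC polynomial a real protocol would specify; the packet layout is invented for the example.

```python
import zlib

def make_packet(payload: bytes) -> bytes:
    """Transmitter side: append a 32-bit CRC to the payload."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def check_packet(packet: bytes):
    """Receiver side: return the payload if the CRC matches, or None on a bit error."""
    payload, received_crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    if zlib.crc32(payload) == received_crc:
        return payload
    return None

pkt = make_packet(b"application data")
print(check_packet(pkt))                      # -> b'application data'

corrupted = bytes([pkt[0] ^ 0x01]) + pkt[1:]  # flip one bit "on the wire"
print(check_packet(corrupted))                # -> None
```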

And then we have the question of what to do if an error is detected. There are mainly two possibilities:

  • Requesting a retransmission of the faulty packet. This ensures that an error-free channel is presented to the application logic.
  • Informing the application logic about the error, possibly halting the data flow so that faulty data isn’t delivered. This requires the application logic to somehow recover from this state and restart its operation.

High-end protocols like PCIe, SATA and SuperSpeed USB take the first approach, and ensure that all packets arrive correctly by virtue of a retransmission mechanism.
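As an illustration of the first approach, here’s a toy Python model of a retransmission loop: the transmitter resends a packet until it passes the receiver’s CRC check. The “channel” that corrupts the first attempt, the retry limit, and all names are invented for the example; real protocols use sequence numbers and acknowledge packets rather than this simplistic loop.

```python
import zlib

def crc_ok(payload, crc):
    return zlib.crc32(payload) == crc

def send_with_retransmission(payload, channel, max_attempts=5):
    """Resend the packet until the receiver's CRC check passes."""
    for attempt in range(1, max_attempts + 1):
        rx_payload, rx_crc = channel(payload, zlib.crc32(payload))
        if crc_ok(rx_payload, rx_crc):       # the receiver's ACK condition
            return rx_payload, attempt
    raise RuntimeError("link failed: too many retransmissions")

def flaky_channel_factory():
    """A toy channel that corrupts the first transmission attempt only."""
    state = {"calls": 0}
    def channel(payload, crc):
        state["calls"] += 1
        if state["calls"] == 1:              # flip one bit on the first attempt
            return bytes([payload[0] ^ 0x80]) + payload[1:], crc
        return payload, crc
    return channel

data, attempts = send_with_retransmission(b"packet 42", flaky_channel_factory())
print(data, attempts)   # -> b'packet 42' 2
```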

Gigabit Ethernet takes the second approach — there’s a CRC on the Ethernet packets, but the Ethernet protocol itself doesn’t intervene much if a packet arrives with an incorrect CRC. Such a packet is simply discarded (either by the hardware implementing the protocol or by software), so faulty data doesn’t go further. Even the IP protocol, which is usually one level above, does nothing special about the CRC error and the packet loss that occurred as a result of it. It’s only the TCP protocol that eventually detects the packet loss by virtue of a timeout, and requests retransmission.

What about FPGA-to-FPGA protocols, then? Well, each protocol takes its own approach. Xillyp2p is special in that it requests retransmissions when the physical link is bidirectional, but if the link is unidirectional it only discards the faulty data and halts everything until the application logic resumes operation — a retransmission request is impossible in the latter case.

Reason #3: Clock frequency differences

Clock frequency differences should have been the first topic, because it’s the subtle detail that prevents the solution that most FPGA engineers would consider at first for communication between two FPGAs: One FPGA sends a stream of data words at a regular pace, and the other FPGA receives and processes it. Simple and clean.

But I put it third and last, because it’s the most difficult to deal with, and the explanations became really long. So try to hang on. And if you don’t, here’s the short version: The transmission of data can’t be continuous, because the receiver’s clock might be just a few ppm slower. Hence if the transmitter keeps sending data non-stop, the rate at which the receiver can process the arriving data might be slightly lower than the rate at which it arrives. So to prevent the receiver from being flooded with data, the transmitter must pause the flow of application data every now and then to let the receiver catch up. And if there are pauses, the segments between these pauses are some kind of packets.

And now, to the long explanation, starting with the common case: The data link is bidirectional, and the data content in both directions is tightly related. Even if application data goes in one direction primarily, there is often some kind of acknowledgement and/or status information going the other way. All “classic” protocols for computers (PCIe, USB 3.x and SATA) are bidirectional, for bidirectional data as well as acknowledge packets, and there is usually a similar need when connecting two FPGAs.

The local and CDR clocks

I’ll need to make a small detour now and discuss clocks. Tedious, but necessary.

In most applications, each of the two involved FPGAs uses a different reference clock to drive its MGT, and the same reference clock is often used to drive the logic around it. These reference clocks of the two FPGAs have the same nominal frequency, except for a small tolerance. Small, but it causes big trouble.

Each MGT transmits data based upon its own reference clock (I’ll explain below why it’s always this way). The logic in the logic fabric that produces the data for transmission is usually driven by a clock derived from the same reference clock. In other words, the entire transmission chain is derived from the local reference clock.

The natural consequence is that the data which the MGT receives is based upon the other side’s reference clock. The MGT receiving this data stream locks a local clock oscillator on the data rate of the arriving data stream. This mechanism is referred to as clock data recovery, CDR. The MGT’s logic that handles the arriving data stream is clocked by the CDR clock, and is hence synchronized with this data stream’s bits.

Unlike most other IP blocks in an FPGA, the clocks that are used to interface with the MGT are outputs from the MGT block. In other words, the MGT supplies the clock to the logic fabric, and not the other way around. This is a necessary arrangement, not only because the MGT generates the CDR clock: The main reason is that the MGT is responsible for handling the clocks that run at the bit rate, having a frequency of several GHz, which is far above what the logic fabric can handle. Also, the reference clock used to generate these GHz clocks must be very “clean” (low jitter), so the FPGA’s regular clock resources can’t be used. Frequency dividers inside the MGT generate the clock or clocks used to interface with the logic fabric.

In particular, the data words that are transferred from the logic fabric into the MGT for transmission, as well as data words from the MGT to the logic fabric (received data), are clocked by the outputs of these frequency dividers. The fact that these clocks are used in the interface with the logic fabric makes it possible to apply timing constraints on paths between the MGT’s internal logic and the logic fabric.

For the purpose of this discussion, let’s forget about the clocks inside the MGT, and focus only on those accessible by the logic fabric. It’s already clear that there are two clocks involved, one generated from the local oscillator, based upon the local reference clock (“local” clock), and the CDR clock, which is derived from the arriving data stream. Two clocks, two clock domains.

Clock or clocks used for implementing the protocol

As there are two clocks involved, the question is which clock is used by the logic that processes the data. This is the logic that implements the protocol. The answer is obviously one of the two clocks supplied by the MGT. It’s quite pointless to implement the protocol in a foreign clock domain.

In principle, the logic (in the logic fabric) implementing the protocol could be clocked by both clocks. However, the vast majority of this logic is usually clocked by only one of them: It’s difficult to implement a protocol across two clock domains, so even if both clocks are used, the actual protocol implementation is always clocked by one of the clocks, and the other clock is used by a minimal amount of logic.

In all practical implementations, the protocol is implemented on the local clock’s domain (the clock used for transmission). The choice is almost obvious: Given that one needs to choose one of the two clocks, the choice is naturally inclined towards the local clock, which is always present and always stable.

The logic running on the CDR clock usually does some minimal processing on the arriving data, and then pushes it into the local clock domain. And this brings us naturally to the next topic.

Crossing clock domains

Every FPGA engineer knows (or should know) that a dual-clock FIFO is the first solution to consider when a clock domain crossing is required. And indeed, this is the most common solution for crossing the clock domain from the CDR clock towards the local clock. It’s the natural choice when the only need is to hand over the arriving data to the local clock domain.

Therefore, several protocol implementations are clocked only by the local clock, and only this clock is exposed by the MGT. The dual-clock FIFO is implemented inside the MGT, and is usually called an “elastic buffer”. This way, all interaction with the MGT is done in one clock domain, which simplifies the implementation.

It’s also possible to implement the protocol with both clocks, and perform the clock domain crossing in the logic fabric, most likely with the help of a FIFO IP provided by the FPGA tools.

To reiterate, it boils down to two options:

  • Doing the clock domain crossing inside the MGT with an “elastic buffer”, and clocking the logic fabric only with the local clock.
  • Using both clocks in the logic fabric, and accordingly doing the clock domain crossing in the logic fabric.

Preventing overflow / underflow

As mentioned earlier, the two clocks usually have almost the same frequency, with a difference that results from the oscillators’ frequency tolerance. To illustrate the problem, let’s take an example with a bidirectional link of 1 Gbit/s, where the clock oscillators have a tolerance of 10 ppm each, which is considered pretty good. If the transmitter’s clock frequency is 10 ppm above the nominal frequency, and the receiver’s is 10 ppm below it, there is a 20 ppm difference in the 1 Gbit/s data rate. In other words, the receiver gets 20,000 bits more than it can handle every second: No matter which of the two options mentioned above for clock domain crossing is chosen, there’s a FIFO whose write clock runs 20 ppm faster than the read clock. And soon enough, it overflows.
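The arithmetic of this example, spelled out in Python (the FIFO depth at the end is an arbitrary figure for illustration, not taken from any real MGT):

```python
# Back-of-the-envelope: how fast a clock-domain-crossing FIFO fills up
# when the write clock runs faster than the read clock.
# Numbers match the example in the text: 1 Gbit/s, +/-10 ppm oscillators.

line_rate_bps = 1e9
ppm_difference = 20          # worst case: +10 ppm against -10 ppm

excess_bits_per_second = line_rate_bps * ppm_difference * 1e-6
print(excess_bits_per_second)                     # -> 20000.0 bits/s

# Assumed, for illustration only: a FIFO with 2048 bits of slack
fifo_slack_bits = 2048
print(fifo_slack_bits / excess_bits_per_second)   # -> 0.1024 seconds to overflow
```

In other words, even a generously sized FIFO overflows in a fraction of a second unless the transmitter pauses every now and then.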

It can also be the other way around: If the write clock is slower than the read clock, this FIFO becomes empty every now and then. This scenario needs to be addressed as well.

There are several solutions to this problem, and they all boil down to the transmitter pausing the flow of application data at regular intervals, and inserting some kind of stuffing in between to indicate these pauses. There is no possibility to stop the physical data stream, only to send data words that are discarded by the receiver instead of ending up in the FIFO. Recall that the protocol is almost always clocked by the local clock, which is the clock reading from the FIFO. So, for example, just inserting some idle time between transmitted packets is not a solution in the vast majority of cases: The packets’ boundaries are detected by the logic that reads from the FIFO, not on the side writing to it. Hence most protocols resort to much simpler ways to mark these pauses.

The most famous mechanism is called skip ordered sets, or skip symbols. It’s the common choice when 8b/10b encoding is used. It takes advantage of the fact mentioned above, that when 8b/10b is used, it’s possible to send K-symbols that are distinguishable from the regular data flow. For example, a SuperSpeed USB transmitter emits two K28.1 symbols at regular intervals. The logic before the FIFO at the receiver discards K28.1 symbols rather than writing them into the FIFO.

It’s also common that the logic reading from the FIFO injects K28.1 symbols when the FIFO is empty. This allows a continuous stream of data towards the protocol logic, even if the local clock is faster than the CDR clock. It’s then up to the protocol logic to discard K28.1 symbols.
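To tie the pieces together, here’s a toy Python model of the transmit-to-elastic-buffer path: skip symbols are inserted into the physical stream at regular intervals, and discarded before the FIFO at the receiver. The skip interval and the symbol names are made up for the example; they don’t match any specific protocol.

```python
from collections import deque

SKP = "SKP"
SKP_INTERVAL = 4   # insert a skip symbol after every 4 data words (illustrative)

def transmit(words):
    """Yield the physical stream: data words with skip symbols interleaved."""
    for i, word in enumerate(words):
        if i and i % SKP_INTERVAL == 0:
            yield SKP
        yield word

def receive(stream):
    """Write everything except skip symbols into the elastic buffer."""
    fifo = deque()
    for symbol in stream:
        if symbol != SKP:
            fifo.append(symbol)
    return list(fifo)

data = [f"D{i}" for i in range(10)]
wire = list(transmit(data))
print(wire)           # skip symbols appear on the wire...
print(receive(wire))  # ...but the FIFO sees only the data words
```

The skip symbols give the receiver’s FIFO periodic breathing room: during each SKP, nothing is written into the FIFO, so a slightly slower read clock gets a chance to catch up.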

There are of course other solutions, in particular when 8b/10b isn’t used. The main point is however that the transmitting side can’t just transmit data continuously. At the very least, there must be some kind of pauses. And as already said, when there are pauses, there are packets between them, even if they don’t have headers and CRCs.

But why not transmit with the CDR clock?

This can sound like an appealing solution, and it’s possible at least in theory: Let one side (“master”) transmit data based upon its local clock, just as described above, and let the other side (“slave”) transmit data based upon the CDR clock. In other words, the slave’s transmission clock follows the master’s clock, so they have exactly the same frequency.

First, why it’s a bad idea to use the CDR clock directly for transmission: Jitter. I’ve already used the word jitter above, but now it deserves an explanation: In theory, a clock signal has a fixed time period between each transition. In practice, the time between each such transition varies randomly. It’s a slight variation, but it can have a devastating effect on the data link’s reliability: As each clock transition sets the time point at which a new bit is presented on the physical link, by virtue of changing the voltage between the wires, randomness in this timing has an effect similar to adding noise.

This is why MGTs should always be driven by “clean” reference clocks, meaning oscillators that are a bit more expensive, a bit more carefully placed on the PCB, and have been designed with focus on low jitter.

So what happens if the slave side uses the CDR clock to transmit data? Well, the transmitter’s clock already has a certain amount of jitter, which is the result of the reference clock’s own jitter, plus the jitter added by the PLL that creates the bit-rate clock from it. The CDR creates a clock based upon the arriving data stream, which usually adds a lot of jitter. That too has the same effect as adding noise to its input, because the receiver samples the analog signal using the CDR clock. However, this effect is inevitable. In order to mitigate this effect, the PLL that generates the CDR clock is often tuned to produce as little jitter as possible, while still being able to lock on the master’s frequency.

As the CDR clock has a relatively high jitter due to how it’s created, using it directly to transmit data is equivalent to adding noise to the physical channel, and is therefore a bad idea.

It’s however possible to take a divided version of the CDR clock (most likely the CDR clock as it appears on the MGT’s output port) and drive one of the FPGA’s output pins with it. That output goes to a “jitter cleaner” component on the PCB, which returns the same clock, but with much less jitter. And the latter clock can then be used as a reference clock to transmit data.

I’ve never heard of anyone attempting the trick with a “jitter cleaner”, let alone tried this myself. I suppose a few skip symbols are much easier than playing around with clocks.

But if the link is unidirectional?

If there’s a physical data link only in one direction, the CDR clock can be used on the receiving side to clock the protocol logic without any direct penalty. But it’s still a foreign clock. The MGT at the receiving side still needs a local reference clock in order to lock the CDR on the arriving data stream.

And as things usually turn out, the same local reference clock becomes the reference for all logic on the FPGA. So using the local clock for receiving data often saves a clock domain crossing between the protocol logic and the rest of the logic. It becomes a question of where the clock domain crossing occurs.

Conclusion

If data is transmitted through an MGT, it will most likely end up divided into packets. At least one of the reasons mentioned above will apply.

It’s possible to avoid the encapsulation, stripping, multiplexing and error checking of packets by using Xillyp2p. Unlike other protocol cores, this IP core takes care of these tasks, and presents the application logic with error-free and continuous application data channels. The packet-related tasks aren’t avoided, but rather taken care of by the IP core instead of the application logic.

This is comparable with using raw Ethernet frames vs TCP/IP: There is no way around using packets for getting information across a network. Choosing raw Ethernet frames requires the application to chop up the data into frames and ensure that they arrive correctly. If TCP/IP is chosen, all this is done and taken care of.

One way or another, there will be packets on the wire.
