Single-source utilities
This is the Makefile I use for compiling a lot of simple utility programs, one .c file per utility:
CC= gcc
FLAGS= -Wall -O3 -g -fno-strict-aliasing
ALL= broadclient broadserver multicastclient multicastserver
all: $(ALL)
clean:
rm -f $(ALL)
rm -f `find . -name "*~"`
%: %.c Makefile
$(CC) $< -o $@ $(FLAGS)
The ALL variable contains the list of output files, each have a corresponding *.c file. Just an example above.
The last implicit rule (%: %.c) tells Make how to create an extension-less executable file from a *.c file. It’s almost redundant, since Make attempts to compile the corresponding C file anyhow, if it sees a target file with no extension (try “make –debug=v”). If the rule is removed, and the CFLAGS variable is set to the current value of FLAGS, it will work the same, except that the Makefile itself won’t be dependent on.
IMPORTANT: Put the dynamic library flags (e.g. -lm, not shown in the example above) last in the command line, or “undefined reference” errors may occur on some compilation platforms (Debian in particular). See my other post.
Multiple-source utilites
CC= gcc
ALL= util1 util2
OBJECTS = common.o
HEADERFILES = common.h
LIBFLAGS=-fno-strict-aliasing
FLAGS= -Wall -O3 -g -fno-strict-aliasing
all: $(ALL)
clean:
rm -f *.o $(ALL)
rm -f `find . -name "*~"`
%.o: %.c $(HEADERFILES)
$(CC) -c $(FLAGS) -o $@ $<
$(ALL) : %: %.o Makefile $(OBJECTS)
$(CC) $< $(OBJECTS) -o $@ $(LIBFLAGS)
Note that in this case LIBFLAGS is used only for linking the final executables
Strict aliasing?
It might stand out that the -fno-strict-aliasing flag is the only one with a long name, so there’s clearly something special about it.
Strict aliasing means (in broad strokes) that the compiler has the right to assume that if there are two pointers of different types, they point at different memory regions (and not just not having the same address). In other words, dereferences of pointers (as in *p) of different types are treated as independent non-pointer variables. Reordering and elimination of operations is allowed accordingly.
For example, a struct within a struct. If you have a pointer to the outer struct as well as one to the inner struct, you’re on slippery ice.
The actual definition is actually finer, and some of it is further explained on this page (or just Google for “strict aliasing”). Regardless, Linus Torvalds explains why this flag is used in the Linux kernel here.
The thing is that unless you’re really aware of the detailed rules, there’s a chance that you’ll write code that works on one version of gcc and fails on another. The difference might be where and how this or another compiler decided to optimize the code. Such optimization may involve reordering of operations with mutual dependency or optimizing away things that have no impact unless some pointers are related.
This is true in particular for code that plays with pointer casting. So if the code is written in the spirit of “a pointer is just a pointer, what could go wrong”, -fno-strict-aliasing flag is your friend.
General
As part of a larger project, I was required to set up a PCIe link between a host and some FPGAs through a fiber link, in order to ensure medical-grade electrical isolation of a high-bandwidth video data link + allow for control over the same link.
These are a few jots on carrying a 1x Gen2 PCI Express link over a plain SFP+ fiber optics interface. PCIe is, after all, just one GTX lane going in each direction, so it’s quite natural to carry each Gigabit Transceiver lane on an optical link.
When a general-purpose host computer is used, at least one PCIe switch is required in order to ensure that the optical link is based upon a steady, non-spread spectrum clock. If an FPGA is used as a single endpoint at the other side of the link, it can be connected directly to the SFP+ adapter, with the condition that the FPGA’s PCIe block is set to asynchronous clock mode.
Since my project involved more than one endpoint on the far end (an FPGA and USB 3.0 chip), I went for the solution of one PCIe switch on each end. Avago’s PEX 8606, to be specific.
All in all, there are two issues that really require attention:
- Clocking: Making sure that the clocks on both sides are within the required range (and it doesn’t hurt if they’re clean from jitter)
- Handling the receiver detect issue, detailed below
How each signal is handled
- Tx/Rx lanes: Passed through with fiber. The differential pair is simply connected to the SFP+ respective data input and output.
- PERST: Signaled by turning off laser on the upstream side and issuing PERST to everything on the downstream side on (a debounced) LOS (Loss of Signal).
- Clock: Not required. Keep both clocks clean, and within 250 ppm.
- PRSNT: Generated locally, if this is at all relevant
- All other PCIe signals are not mandatory
Some insights
- It’s as easy (or difficult) as setting up a PCIe switch on both sides. The optical link itself is not adding any particular difficulty.
- Dual clock mode on the PCIe switches is mandatory (hence only certain devices are suitable). The isolated clock goes to a specific lane (pair?), and not all configurations are possible (e.g. not all 1x on PEX8606).
- According to PCIe spec 4.2.6.2, the LTSSM goes to Polling if a receiver has been detected (that is, a load is sensed), but Polling returns to Detect if there is no proper training sequence received from the other end. So apparently there is no problem with a fiber optic transceiver, even though it presents itself as a false load in the absence of a link partner at the other side of the fiber: The LTSSM will just keep looping between Detect and Polling until such partner appears.
- The SFP+ RD pins are transmitters on the PCIe wire pair, and the TD are receivers. Don’t get confused.
- AC coupling: All lane wires must have an 100 nF capacitor in series. External connectors (e.g. PCIe fingers) must have an capacitor on PET side (but must not have one on the ingoing signal).
- Turn off ASPM wherever possible. Most BIOSes and many Linux kernels volunteer doing that automatically, but it’s worth making sure ASPM is never turned on in any usage scenario. A lot of errors are related to the L0s state (which is invoked by ASPM) in both switches and endpoints.
- Not directly related, but it’s often said that the PERST# signal remains asserted 100 ms after the host’s power is stable. The reference for this is section 2.2 of the PCI Express Card Electromechanical Specification (“PERST# Signal”): “On power up, the deassertion of PERST# is delayed 100 ms (TPVPERL) from the power rails achieving specified operating limits.”
PEX 86xx notes
- PEX_NT_RESETn is an output signal (but shouldn’t be used anyhow)
- It seems like the PLX device cares about nothing that happened before the reset: A lousy voltage ramp-up or the absence of clock. All if forgotten and forgiven.
- A fairly new chipset and BIOS are required on the motherboard, say from year 2012 and on, or the switch isn’t handled properly by the host.
- On a Gigabyte Technology Co., Ltd. G31M-ES2L/G31M-ES2L, BIOS FH 04/30/2010, the motherboard’s BIOS stopped the clock short after powering up (it gave up, probably), and that made the PEX clockless, probably, leading to completely weird behavior.
- There’s a difference between the lane numbering a port numbering (the latter used in function numbers of the “virtual” endpoints created with respect to each port). For example, on 8606 running a 2x-1x-1x-1x-1x configuration, lanes 0-1, 4, 5, 6 and 7 are mapped to ports 0, 1, 5, 7 and 9 respectively. Port 4 is lane 1 in an all-1x configuration (with other ports mapped the same).
- The PEX doesn’t detect an SFP+ transceiver as a receiver on the respective PET lane, which prevents bringup of the fiber lane, unless the SerDes X Mask Receiver Not Detected bit is enabled in the relevant register (e.g. bit 16 at address 0x204). The lane still produces its receiver detection pattern, but ignores the fact it didn’t feel any receiver at the other end. See below.
- In dual-clock mode, the switch works even if the main REFCLK is idle, given that the respective lane is unused (needless to say, the other clock must work).
- Read the errata of the device before picking one. It’s available on PLX’ site on the same page that the Data Book is downloaded.
- Connect an EEPROM on custom board designs, and be prepared to use it. It’s a lifesaver.
Why receiver detect is an issue
Before attempting to train a lane, the PCIe spec requires the transmitter to check if there is any receiver on the other side. The spec requires that the receiver should have a single-ended impedance of 40-60 Ohm on each of the P/N wires at DC (and a differential impedance of 80-120 Ohms, but that’s not relevant). The transmitter’s single-ended impedance isn’t specified, only the differential impedance must be 80-120. The coupling capacitor may range between 75-200 nF, and is always on the transmitter’s side (this is relevant only when there’s a plug connection between Tx and Rx).
The transmitter performs a receiver detect by creating an upward common mode pulse of up to 600 mV on both lane wires, and measuring the voltage on these.This pulse lasts for 100 us or so. As the time constant for 50 Ohms combined with 100 nF is 5 us, a charging capacitor’s voltage pattern is expected. Note that the common mode impedance of the transmitter is not defined by the spec, but the transmitter’s designer knows it. Either way, if a flat pulse is observed on the lane wires, there’s no receiver sensed.
Now to SFP+ modules: The SFP+ specification requires a nominal 100 Ohm differential impedance on its receivers, but “does not require any common mode termination at the receiver. If common mode terminations are provided, it may reduce common mode voltage and EMI” (SFF-8431, section 3.4). Also, it requires DC-blocking capacitors on both transmitter and receiver lane wires, so there’s some extra capacitance on the PCIe-to-SFP+ direction (where the SFP+ is the PCIe receiver) which is not expected. But the latter issue is negligible compared with the possible absence of common mode termination.
As the common-mode termination on the receiver is optional, some modules may be detected by the PCIe transmitter, and some may not.
This is what one of the PCIe lane’s wires looks like when the PEX8606 switch is set to ignore the absence of receiver (with the SerDes X Mask Receiver Not Detected bit): It still runs the receiver detect test (the large pulse), but then goes to link training despite that no load was detected (that’s the noisy part after the pulse). In the shown case, the training kept failing (no response on the other side), so it goes back and forth between detection and training.
This capture was done with a plain digital oscilloscope (~ 200 MHz bandwidth).
These are a few jots I wrote down as I wrote some code that generates component.xml files automatically. The XML convention of this file IP-XACT format, a specification by the SPIRIT Consortium which can be downloaded free from IEEE. The “spirit:” prefixes all over the XML file indicates that the keywords are defined in the IP-XACT spec.
Block design files (*.bd), which is the only essential source Vivado needs for defining a block design, are also given by IP-XACT convention, however they serve a different purpose, and have a different format.
An IP-XACT file can be opened directly in Vivado (File -> Open IP-XACT File… or the ipx::open_ipxact_file Tcl command on earlier Vivados) and there are plenty of Tcl commands (try “help ipx::” at Tcl prompt, yes, with two colons).
Structure
Everything is under the <spirit:component> entry.
- Vendor, library version etc
- busInterfaces: Each businterface groups ports (to be listed later on) into interfaces such as AXI, AXI Streaming etc. These interfaces are one of those known to Vivado, and it seems like it’s not possible to add a custom interface in a sensible way.
- model: views and ports, see below
- fileSets: Each fileset lists the files that are relevant for one particular view. The pairing is done by matching the view’s fileSetRef attribute with the fileset’s name attribute.
- description: This is some text that is displayed to the user. It can be long
- parameters
- vendorExtensions (Xilinx taxonomy, basic stuff, note the supportedFamilies entry)
The “model” entry has two subentries:
- views: Different ways to consume the files of the IP: Synthesis in Verilog, synthesis in VHDL, synthesis in any language, files for describing GUI etc. It seems like Vivado is looking at the envIdentifier attribute in particular, and the fileSetRef for linking with a fileset.
A view doesn’t have to contain all files required, but several views are used together for a given scenario. For example, when synthesizing in Verilog, the fileset linked to the view identified with “verilogSource:vivado.xilinx.com:synthesis” (typically named “xilinx_verilogsynthesis”) will probably contain the Verilog files. But if there’s also a fileset linked with a view identified with “:vivado.xilinx.com:synthesis” (typically named “xilinx_anylanguagesynthesis”), its files will be used as well as well. The latter fileset may contain netlists (ngc, edif), which the language-specific fileset may not.
- ports. Enlists the top-level module’s ports. Input ports may have a defaultValue attribute, which defines the value in case nothing is connected to it. All ports appearing in the busInterfaces section must appear here, in which case Vivado includes them in a group. If a port doesn’t belong to any bus interface, it’s exposed as a wire on the block.
Notes
- In the files listed in a fileset, each one is given a fileType attribute. This attribute has to be one listed in the IP-XACT standard section C.8.2 (e.g. verilogSource, vhdlSource, tclSource etc.). Other strings will be rejected by Vivado. For xci, ngc, edif etc, Vivado expects a userFileType attribute instead. One of fileType or userFileType must be present.
- When instantiating the IP in a block design, Vivado expects the top-level module to have the name given in the modelName attribute in the relevant view. This is typically the name of one of the modules of the fileset.
- The entries in vendorExtensions -> taxonomies is where the IP will appear in Vivado’s IP Catalog, when it’s listed by groups. The path is given as a directory path, with slashes (hence the leading slash, marking “root”). It’s fine to invent a name for a new root entry, in which case a new group is generated in the IP Catalog. Vivado accepts taxonomies it doesn’t know of.
- Sub-core’s XCI files may go into a Verilog/VHDL Synthesis fileset, but the last file in the fileset must be in Verilog/VHDL.
So the situation is like this: An email I attempted to send got rejected by the recipient’s mail server because my ISP (Netvision) has a poor spam reputation. And it so happens that I have a shell account (with root, possibly) on a server with an excellent reputation. So how do I use this advantage?
On my Thunderbird oldie, save the message with “Save As…” from the “Sent” folder into an .eml file. Or from “Unsent Mail” folder, if it’s a fresh message which I haven’t even tried to send the normal way (using the “Send Later” feature).
Copy this .eml file to the server with good mail reputation.
On that server, go
$ sendmail -v -t < test.eml
"eli@picky.server.com" <eli@picky.server.com>... Connecting to [127.0.0.1] via relay...
220 theserver.org ESMTP Sendmail 8.14.4/8.14.4; Sat, 18 Jun 2016 11:05:26 +0300
>>> EHLO theserver.org
250-theserver.org Hello localhost.localdomain [127.0.0.1], pleased to meet you
250-ENHANCEDSTATUSCODES
250-PIPELINING
250-8BITMIME
250-SIZE
250-DSN
250-ETRN
250-DELIVERBY
250 HELP
>>> MAIL From:<eli@theserver.org> SIZE=864
250 2.1.0 <eli@theserver.org>... Sender ok
>>> RCPT To:<eli@picky.server.com>
>>> DATA
250 2.1.5 <eli@picky.server.com>... Recipient ok
354 Enter mail, end with "." on a line by itself
>>> .
250 2.0.0 u5I85QQq030607 Message accepted for delivery
"eli@picky.server.com" <eli@picky.server.com>... Sent (u5I85QQq030607 Message accepted for delivery)
Closing connection to [127.0.0.1]
>>> QUIT
221 2.0.0 theserver.org closing connection
The -v flag causes all the verbose output, and the -t flag makes sendmail read the headers. If there’s a Bcc: header, it’s removed before sending.
IMPORTANT: If the processing involves several connections, they will be shown one after the other. In particular, the first leg might be to the local mail server, and only then the relaying out. In that case, the EHLO of the first connection is set by the sendmail executable we’re running, not the sendmail server.
Note that the “MAIL From:” (envelope sender) is the actual user on the Linux machine (user name@the machine’s domain name). The -f flag can be used to used to change this:
# sendmail -v -f 'eli@billauer.co.il' -t < test.eml
To be sure it went fine, look in /var/log/maillog. A successful transmission leaves an entry like this:
Jun 18 11:06:17 theserver sendmail[30611]: u5I85QQq030607: to=<eli@picky.server.com>, ctladdr=<eli@theserver.org> (500/123), delay=00:00:51, xdelay=00:00:48, mailer=esmtp, pri=120985, relay=picky.server.com. [108.86.85.180], dsn=2.0.0, stat=Sent (OK id=1bEBGd-0007kL-DB)
Note the mail ID, which was given by sendmail (marked in red). Finding all related log messages is done simply with e.g. (as root)
# grep u5I85QQq030607 /var/log/maillog
Jun 18 11:05:26 theserver sendmail[30607]: u5I85QQq030607: from=<eli@theserver.org>, size=985,, nrcpts=1, msgid=<57650054.90002@picky.server.com>, proto=ESMTP, daemon=MTA, relay=localhost.localdomain [127.0.0.1]
Jun 18 11:05:29 theserver sendmail[30604]: u5I85NZV030604: to="eli@picky.server.com" <eli@picky.server.com>, ctladdr=eli (500/123), delay=00:00:06, xdelay=00:00:03, mailer=relay, pri=30864, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (u5I85QQq030607 Message accepted for delivery)
Jun 18 11:06:17 theserver sendmail[30611]: u5I85QQq030607: to=<eli@picky.server.com>, ctladdr=<eli@theserver.org> (500/123), delay=00:00:51, xdelay=00:00:48, mailer=esmtp, pri=120985, relay=picky.server.com. [208.76.85.180], dsn=2.0.0, stat=Sent (OK id=1bEBGd-0007kL-DB)
(the successful finale is the last message)
The problem
I needed to implement an FPGA design for an Arria 10 chip with Quartus 15 on a Linux machine. According to Altera’s requirement page, (“Memory recommendations” tab), the computer should have 28-48 GB of RAM. Or, as it says on that page, one can fake it with virtual memory. It turns out the the fitter (quartus_fit) is the process that requires this much memory.
Since I have a desktop with 16 GB and a laptop with 8 GB, I set up a large swap partition on the desktop (see below) and fired off the implementation. For a reason I can’t figure out, the memory just ran out, bringing the computer to a freeze after quartus_fit ate up GB after GB until it reached 15.7GB of used physical RAM: The kernel was still responsive (computer answered to pings) but it seemed like no process was able to run (for example, attempts to connect with ssh got no response whatsoever: The TCP link was established, but no data ran through it). After several minutes of looking at a completely frozen screen, and a hard disk doing almost nothing, I reset the computer.
As for the swap partition, only a few hundred MBs of it was used. Why pages weren’t rushed into swap to avoid this freezing is beyond me. This happened on the desktop running kernel v3.12.20 as well as the laptop with a 3.13.0-35 (Ubuntu 14.04.1).
And by the way, I have a different post that discusses how to measure RAM consumption of a systemd service, and that is applicable to plain cgroups too.
The solution
Since the swapping mechanism didn’t kick in fast enough to prevent quartus_fit from eating up all physical RAM, let cgroups do the job instead. The idea is that one can limit the amount of physical memory used. Everything else goes to swap. Since I didn’t want to mess with my desktop again, I went for a 6 GB limit on my laptop (out of the existing 8 GB). Details follow.
Setting up swap
First thing first, set up a large swap partition. I’m using LVM on the machine, so it was quite easy.
In retrospective, 64 GB is much more than needed (10 GB would have been enough) but I was lucky enough to have this much spare room in the physical volume.
So to create a new logic volume, and format it for swap, it was (vg_main is the physical volume):
# lvcreate --size 64G vg_main -n lv_bigswap
# mkswap /dev/mapper/vg_main-lv_bigswap
# lvdisplay
Turn off old swap, and enable the new one only:
# swapoff -a
# swapon /dev/mapper/vg_main-lv_bigswap
And that’s it. The swap is enabled.
Cgroups
Now to the interesting part. First I needed to install the cgroup tools:
# apt-get install cgroup-bin
(there was no need to reboot, as suggested elsewhere)
Following this guide: Create a group, owned by myself (eli), but this has to be done as root:
# cgcreate -a eli:eli -g:memory:quartus
This creates the /sys/fs/cgroup/memory/quartus/ subdirectory, owned by user “eli” (and everything below too, so I don’t have to be root to control anything related to it).
Note that the name “quartus” is just a name and has nothing to do with the target executable. Which is never “quartus” in my case, because I implement the project by kicking off “make” from “xemacs”.
I could have used cgexec to start a new process, for example (as root, because changing a group isn’t allowed as plain user)
# cgexec -g memory:quartus xemacs
but I went for changing the group for an existing process (root required, again. 4550 happens to be the PID of xemacs):
# cgclassify -g memory:quartus 4550
Now drop the root privileges. They won’t be required anymore.
And indeed, the process has joined the group (as non-root):
$ cat /sys/fs/cgroup/memory/quartus/cgroup.procs
4550
It could also make sense to target a shell process, which would limit anything executed from it. For example, to add the running shell to the group:
$ sudo cgclassify -g memory:vivado $$
Set the memory limit to 6 GiB:
$ cat /sys/fs/cgroup/memory/quartus/memory.limit_in_bytes
18446744073709551615
$ echo 6442450944 > /sys/fs/cgroup/memory/quartus/memory.limit_in_bytes
$ cat /sys/fs/cgroup/memory/quartus/memory.limit_in_bytes
6442450944
And now launch the implementation from that xemacs process (the “Compile”) button.
For the amusement, follow the joining processes with
$ watch cat /sys/fs/cgroup/memory/quartus/cgroup.procs
Needless to say(?) any process that forks from the originating process joins the group automatically, so the limit applies to all processes. And indeed, when the memory use reaches 6 GB, it goes to swap.
This made the whole process considerably slower I suppose (CPU usage went down to almost zero for some periods of time waiting for disk I/O), but it took some 35 minutes to finish a simple implementation, which is all I needed.
A quick summary on how to get my old Fedora 12 to display Emojis when browsing the web (Instagram, for example).
Download the EmojiOneColor font from its Github repo.
Untar the bundle. Don’t run the installation script (maybe it works, but I prefer messing up things myself).
Create a directory named “emoji” (or any other name) in /usr/share/fonts/ and copy EmojiOneColor-SVGinOT.ttf into that directory.
Clear the font cache (as root):
# fc-cache -f
Find Emoji One as a listed font (non-root):
$ fc-list | grep -i emoji
Emoji One Color:style=Regular
That’s it, on my machine. There was no need to add a font configuration script: After restarting Firefox and Google Chrome, both started displaying Emojis instead of those empty boxes (Chrome shows them only in black and white, Firefox in color).
A font configuration file is needed if the browser sticks to the text font even for Emoji characters, ending up with rubbish or empty boxes. In this case, the font configuration file is required to set the Emoji font as default, and fall back on text fonts. All this according to the comments in the bundle I downloaded — I didn’t need this myself, so why mess with the fonts settings?
Short answer: ~/.recently-used.xbel (taken from here, and it actually works on my computer).
It’s an XML files, organized in chronological order, last item most recently accessed.
It was moved into ~/.local/ or ~/.config or some other subdirectory in later revisions of Gnome, though. For example, on Linux Mint 19 it’s ~/.local/share/recently-used.xbel.
Just wanted this written down for my own future use.
I struggled with this a bit, and ended up doing it right by guessing. Even though I should have read the manual to begin with.
So the procedure is simple (cited from manual, page 7, “Pairing”):
- Turn the ignition on.
- Make sure the Bluetooth feature of your phone is turned on.
- Start the pairing procedure on your mobile phone.
- When prompted for a passkey, enter 1234 on your mobile phone
The crucial hint is that nothing is expected to happen in the “Phone” any setup menu, as shown in many video tutorials.
So on an Android phone, open Settings > Bluetooth and make it search for devices. Once it finds it, enter the 1234 passcode (it’s was actually suggested, that and 0000. So it was 1234). Don’t expect anything to happen on the car radio’s side nor the dashboard display until the phone is paired.
I managed to pair two phones (the manual says there are up to five allowed).
Introduction
This is a summary of a few topics that should to be kept in mind when a Multi-Gigabit Tranceiver (MGT) is employed in an FPGA design. It’s not a substitute for reading the relevant user guide, nor a tutorial. Rather, it’s here to point at issues that may not be obvious at first glance.
The terminology and signal names are those used with Xilinx FPGA. The tranceiver is referred to as GTX (Gigabit Transceiver), but other variants of transceivers, e.g. GTH and GTZ, are to a large extent the same components with different bandwidth capabilities.
Overview
GTXs, which are the basic building block for common interface protocols (e.g. PCIe and SATA) are becoming an increasingly popular solution for communication between FPGAs. As the GTX’ instance consists of a clock and parallel data interface, it’s easy to mistake it for a simple channel that moves the data to the other end in a failsafe manner. A more realistic view of the GTX’ is a front end for a modem, with the possible bit errors and a need to synchronize serial-to-parallel data alignment at the receiver. Designing with the GTX also requires attention to classic communication related topics, e.g. the use of data encoding, equalizers and scramblers.
As a result, there are a few application-dependent pieces of logic that needs to be developed to support the channel:
- The possibility of bit errors on the channel must be handled
- The alignment from a bit stream to a parallel word must be taken care of (which bit is the LSB of the parallel word in the serial stream?)
- If the transmitter and receiver aren’t based on a common clock, a protocol that injects and tolerates idle periods on the data stream must be used, or the clock difference will cause data underflows or overflows. Sending the data in packets in a common solution. In the pauses between these packets, special skip symbols must be inserted into the data stream, so that the GTX’ receiver’s clock correction mechanism can remove or add such symbols into the stream presented to the application logic, which runs at a clock slightly different from the received data stream.
- Odds are that a scrambler needs to be applied on the channel. This requires logic that creates the scrambling sequence as well as synchronizes the receiver. The reason is that an equalizer assumes that the bit stream is uncorrelated on the average. Any average correlation between bit positions is considered ISI and is “fixed”. See Wikipedia
Having said the above, it’s not uncommon that no bit errors are ever observed on a GTX channel, even at very high rates, and possibly with no equalization enabled. This can’t be relied on however, as there is in fact no express guarantee for the actual error probablity of the channel.
Clocking
The clocking of the GTXs is an issue in itself. Unlike the logic fabric, each GTX has a limited number of possible sources for its reference clock. It’s mandatory to ensure that the reference clock(s) are present in one of the allowed dedicated inputs. Each clock pin can function as the reference clock of up to 12 particular GTXs.
It’s also important to pay attention to the generation of the serial data clocks for each GTX from the reference clock(s). It’s not only a matter of what multiplication ratios are allowed, but also how to allocate PLL resources and their access to the required reference clocks.
QPLL vs. CPLL
Two types of PLLs are availble for producing the serial data clock, typically running at severtal GHz: QPLLs and CPLLS.
The GTXs are organized in groups of four (“quads”). Each quad shares a single QPLL (Quad PLL), which is instantiated separately (as a GTXE2_COMMON). In addition, each GTX has a dedicated CPLL (Channel PLL), which can generate the serial clock for that GTX only.
Each GTX may select its clock source from either the (common) QPLL or its dedicated CPLL. The main difference between these is that the QPLL covers higher frequencies. High-rate applications are hence forced to use the QPLL. The downside is that all GTXs sharing the same QPLL must have the same data rate (except for that each GTX may divide the QPLL’s clock by a different rate). The CPLL allow for a greater flexibility of the clock rates, as each GTX can pick its clock independently, but with a limited frequency range.
Jitter
Jitter on the reference clock(s) is the silent killer of GTX links. It’s often neglected by designers because “it works anyhow”, but jitter on the reference clock has a disastrous effect on the channel’s quality, which can be by far worse than a poor PCB layout. As both jitter and poor PCB layout (and/or cabling) contribute to the bit error rate and the channel’s instability, the PCB design is often blamed when things go bad. And indeed, playing with the termination resistors or similar black-magic actions sometimes “fix it”. This makes people believe that GTX links are extremely sensitive to every via or curve in the PCB trace, which is not the case at all. It is, on the other hand, very sensitive to the reference clock’s jitter. And with some luck, a poorly chosen reference clock can be compensated for with a very clean PCB trace.
Jitter is commonly modeled as a noise component which is added to the timing of the clock transition, i.e. t=kT+n (n is the noise). Consequently, it is often defined in terms of the RMS of this noise component, or a maximal value which is crossed at a sufficiently low probability. The treatment of an GTX’ reference clock requires a slightly different approach; the RMS figures are not necessarily a relevant measures. In particular, clock sources with excellent RMS jitter may turn out inadequate, while other sources, with less impressive RMS figures may work better.
Since the QPLL or CPLL locks on this reference clock, jitter on the reference clock results in jitter in the serial data clock. The prevailing effect is on the transmitter, which relies on this serial data clock; the receiver is mainly based on the clock it recovers from the incoming data stream, and is therefore less sensitive to jitter.
Some of the jitter – in particular “slow” jitter (based upon low frequency components) is fairly harmless, as the other side’s receiver clock synchronization loop will cancel its effect by tracking the random shifts of the clock. On the other hand, very fast jitter in the reference clock may not be picked up by the QPLL/CPLL, and is hence harmless as well.
All in all, there’s a certain band of frequency components in the clock’s timing noise spectrum, which remains relevant: The band that causes jitter components which are slow enough for the QPLL/CPLL to track and hence present on the serial data clock, and too fast for the receiver’s tracking loop to follow. The measurable expression for this selective jitter requirement is given in terms of phase noise frequency masks, or sometimes as the RMS jitter in bandwidth segments (e.g. PCIe Base spec 2.1, section 4.3.7, or Xilinx’ AR 44549). Such spectrum masks required for GTX published by the hardware vendors. The spectral behavior of clock sources is often more difficult to predict: Even when noise spectra are published in datasheets, they are commonly given only for certain scenarios as typical figures.
8b/10b encoding
Several standardized uses of MGT channels (SATA, PCIe, DisplayPort etc.) involve a specific encoding scheme between payload bytes for transmission and the actual bit sequence on the channel. Each (8-bit) byte is mapped to an 10-bit word, based upon a rather peculiar encoding table. The purposes of this encoding is to ensure a balance between the number of 0′s and 1′s on the physical channel, allowing AC-coupling of the electrical signal. Also, this encoding also ensures frequent toggling between 0′s and 1′s, which ensures the proper bit synchronization at the receiver by virtue of the of the clock recovery loop (“CDR”). Other things that are worth noting about this encoding:
- As there are 1024 possible code words covering 256 possible input bytes, some of the excessive code words are allocated as control characters. In particular, a control character designated K.28.5 is often referred to as “comma”, and is used for synchronization.
- The 8b/10b encoding is not an error correction code despite its redundancy, but it does detect some errors, if the received code word is not decodable. On the other hand, a single bit error may lead to a completely different decoded word, without any indication that an error occurred.
Scrambling
To put it short and concise: If an equalizer is applied, the user-supplied data stream must be random. If the data payload can’t be ensured to be random itself (this is almost always the case), a scrambler must be defined in the communication protocol, and applied in the logic design.
Applying a scrambler on the channel is a tedious task, as it requires a synchronization mechanism between the transmitter and receiver. It’s often quite tempting to skip it, as the channel will work quite well even in the absence of a scrambler, even where it’s needed. However in the long run, occasional channel errors are typically experienced.
The rest of this paragraph attempts to explain the connection between the equalizer and scrambler. It’s not the easiest piece of reading, so it’s fine to skip it, if my word on this is enough for you.
In order to understand why scrambling is probably required, it’s first necessary to understand what an equalizer does.
The problem equalizers solve is the filtering effect of the electrical media (the “channel”) through which the bit stream travels. Both cables and PCBs reduce the strength of the signal, but even worse: The attenuation depends on the frequency, and reflections occur along the metal trace. As a result, the signal doesn’t just get smaller in magnitude, but it’s also smeared over time. A perfect, sharp, step-like transition from -1200 mV to +1200mV at the transmitter’s pins may end up as a slow and round rise from -100mV to +100mV. Because of this slow motion of the transitions at the receiver, the clear boundaries between the bits are broken. Each transmitted bit keeps leaving its traces way after its time period. This is called Inter-Symbol Interference (ISI): The received voltage at the sampling time for the bit at t=0 depends on the bits at t=-T, t=t-2T and so on. Each bit effectively produces noise for the bits coming after it.
This is where the equalizer comes in. The input of this machine is the time samples of the bit at t=0, but also a number of measured voltage samples of the bits before and after it. By making a weighted sum of these inputs, the equalizer manages, to a large extent, to cancel the Inter-Symbol Interference. In a way, it implements a reverse filter of the channel.
So how does the equalizer acquire the coefficients for each of the samples? There are different techniques for training an equalizer to work effectively against the channel’s filtering. For example, cellular phones do their training based upon a sequence of bits on each burst, which is known in advance. But when the data stream runs continuously, and the channel may change slightly over time (e.g. a cable is being bent) the training has to be continuous as well. The chosen method for the equalizers in GTXs is therefore continuous.
The Decision Feedback Equalizer, for example, starts with making a decision on whether each input bit is a ’0′ or ’1′. It then calculates the noise signal for this bit, by subtracting the measured voltage with the expected voltage for a ’0′ or ’1′, whichever was decided upon. The algorithm then slightly alters the weighted sums in a way that removes any statistical correlation between the noise and the previous samples. This works well when the bit sequence is completely random: There is no expected correlation between any input sample, and if such exists, it’s rightfully removed. Also, the adaptation converges into a compromise that works on the average best for all bit sequences.
But what happens if there is a certain statistical correlation between the bits in the data itself? The equalizer will specialize in reducing the ISI for the bit patterns occurring more often, possibly doing very bad on the less occurring patterns. The equalizer’s role is to compensate for the channel’s filtering effect, but instead, it adds an element of filtering of its own, based upon the common bit patterns. In particular, note that if a constant pattern runs through the channel when there’s no data for transmission (zeros, idle packets etc.) the equalizer will specialize in getting that no-data through, and mess up with the actual data.
One could be led to think that the 8b/10b encoding plays a role in this context, but it doesn’t. Even though cancels out DC on the channel, it does nothing about the correlation between the bits. For example, if the payload for transmission consists of zeros only, the encoded words on the channel will be either 1001110100 or 0110001011. The DC on the channel will remain zero, but the statistical correlation between the bits is far from being zero.
So unless the data is inherently random (e.g. an encrypted stream), using an equalizer means that the data which is supplied by the application to the transmitter must be randomized.
The common solution is a scrambler: XORing the payload data by a pseudo-random sequence of bits, generated by a simple state machine. The receiver must XOR the incoming data with the same sequence in order to retrieve the payload data. The comma (K28.5) symbol is often used to synchronize both state machines.
In GTX applications, the (by far) most commonly used scrambler is the G(X)=X^16+X^5+X^4+X^3+1 LFSR, which is defined in a friendly manner in the PCIe standard (e.g. the PCI Express Base Specification, rev. 1.1, section 4.2.3 and in particular Appendix C).
TX/RXUSRCLK and TX/RXUSRCLK2
Almost all signals between the FPGA logic fabric and the GTX are clocked with TXUSRCLK2 (for transmission) and RXUSRCLK2 (for reception). These signals are supplied by the user application logic, without any special restriction, except that the frequency must match the GTX’ data rate so as to avoid overflows or underflows. A common solution for generating this clock is therefore to drive the GTX’ RX/TXOUTCLK through a BUFG.
The logic fabric is required to supply a second clock in each direction, TXUSRCLK and RXUSRCLK (without the “2” suffix). These two clocks are the parallel data clocks in a deeper position of the GTX.
The rationale is that sometimes, it’s desired to let the logic fabric work with a word width which is twice as wide as the actual word width. For example, in a high-end data rate application, the GTX’ word width may be set to 40 bits with 8b/10b, so the logic fabric would interface with the GTX through a 32 bit data vector. But because of the high rate, the clock frequency may still be too high for the logic fabric, in which case the GTX allows halving the clock, and applying the data through a 80 bits word. In this case, the logic fabric supplies the 80-bit word clocked with TXUSRCLK2, and is also required to supply a second clock, TXUSRCLK having twice the frequency, and being phase aligned with TXUSRCLK2. TXUSRCLK is for the GTX’ internal use.
A similar arrangement applies for reception.
Unless the required data clock rate is too high for the logic fabric (which is usually not the case), this dual-clock arrangement is best avoided, as it requires an MMCM or PLL to generate two phase aligned clocks. Except for the lower clock applied to the logic fabric, there is no other reason for this.
Word alignment
On the transmitting side, the GTX receives a vector of bits, which forms a word for transmission. The width of this word is one of the parameters that are set when the GTX is instantiated, and so is whether 8b/10b encoding is applied. Either way, some format of parallel words is transmitted over the channel in a serialized manner, bit after bit. Unless explicitly required, there is nothing in this serial bitstream to indicate the words’ boundaries. Hence the receiver has no way, a-priori, to recover the word alignment.
The receiver’s GTX’ output consists of a parallel vector of bits, typically with the same width as the transmitter. Unless a mechanism is employed by the user logic, the GTX has no way to recover the correct alignment. Without such alignment, the organization into a parallel words arrives wrong at the receiver, and possibly as complete garbage, as an incorrect alignment prevents 8b/10b decoding (if employed).
It’s up to the application logic to implement a mechanism for synchronizing the receiver’s word alignment. There are two methodologies for this: Moving the alignment one bit at a time at the receiver’s side (“bit slipping”) until the data arrives properly, or transmitting a predefined pattern (a “comma”) periodically, and synchronize the receiver when this pattern is detected.
Bit slipping is the less recommended practice, even though simpler to understand. It keeps most of the responsibility in the application logic’s domain: The application logic monitors the arriving data, and issues a bit slip request when it has gathered enough errors to conclude that the alignment is out of sync.
However most well-established GTX-based protocols use commas for alignment. This method is easier in the way that the GTX aligns the word automatically when a comma is detected (if the GTX is configured to do so). If injecting comma characters periodically into the data stream fits well in the protocol, this is probably the preferred solution. The comma character can also be used to synchronize other mechanisms, in particular the scrambler (if employed).
Comma detection may also have false positives, resulting from errors in the raw data channel. As these data channels usually have a very low bit error probability (BER), this possibility can be overlooked in applications where a short-term false alignment resulting from a false comma detected is acceptable. When this is not acceptable, the application logic should monitor the incoming data, and disable the GTX automatic comma alignment through the rxpcommaalignen and/or rxmcommaalignen inputs of the GTX.
Tx buffer, to use or not to use
The Tx buffer is a small dual-clock (“asynchronous”) FIFO in the transmitter’s data path + some logic that makes sure that it starts off in the state of being half full.
The underlying problem, which the Tx buffer potentially solves, is that the serializer inside the GTX runs on a certain clock (XCLK) while the application logic is exposed to another clock (TXUSRCLK). The frequency of these clocks must be exactly the same to prevent overflow or underflow inside the GTX. This is fairly simple to achieve. Ensuring proper timing relationships between these two clocks is however less trivial.
There are hence two possibilies:
- Not requiring a timing relationship between these clock (just the same frequency). Instead, use a dual-clock FIFO, which interfaces between these two clock domains. This small FIFO is referred to as the “Tx buffer”. Since it’s part of the GTX’ internal logic, going this path doesn’t require any additional resources from the logic fabric.
- Make sure that the clocks are aligned, by virtue of a state machine. This state machine is implemented in the logic fabric.
The first solution is simpler and requires less resources from the FPGA’s logic fabric. Its main drawback is the latency of the Tx buffer, which is typically around 30 TXUSRCLK cycles. While this delay is usually negligible from a functional point of view, it’s not possible to predict its exact magnitude. It’s therefore not possible to use the Tx buffer on several parallel lanes of data, if the protocol requires a known alignment between the data in these lanes, or when an extremely low latency is required.
The second solutions requires some extra logic, but there is no significant design effort: This logic that aligns the clocks is included automatically by the IP core generator on Vivado 2014.1 and later, when “Tx/Rx buffer off” mode is chosen.
Xilinx GTX’ documentation is somewhat misleading in that it details the requirements of the state machine to painful detail: There’s no need to read through that long saga in the user guide. As a matter of fact, this logic is included automatically by the IP core generator on Vivado 2014.1, so there’s really no reason to dive into this issue. Only note that gtN_tx_fsm_reset_done_out may take a bit longer to assert after a reset (something like 1 ms on a 10 Gb/s lane).
Rx buffer
The Rx buffer (also called “Rx elastic buffer”) is also a dual-clock FIFO, which is placed in the same clock domain gap as the Tx buffer, and has the same function. Bypassing it requires the same kind of alignment mechanism in the logic fabric.
As with its Tx counterpart, bypassing the Rx buffer makes the latency short and deterministic. It’s however less common that such a bypass is practically justified: While a deterministic Tx latency may be required to ensure data alignment between parallel lanes in order to meet certain standard protocol requirements, there is almost always fairly easy methods to compesate for the unknown latency in user logic. Either way, it’s preferred not to rely on the transmitter to meet requirements on data alignment, and align the data, if required, by virtue of user logic.
Leftover notes
- sysclk_in must be stable when the FPGA wakes up from configuration. A state machine that brings up the transceivers is based upon this clock. It’s referred to as the DRP clock in the wizard (find more imformation at http://www.directics.com/).
- It’s important to declare the DRP clock’s frequency correctly, as certain required delays which are measured in nanoseconds are implemented by dwelling for a number of clocks, which is calculated from this frequency.
- In order to transmit a comma, set the txcharisk to 1 (since it’s a vector, it sets the LSB) and the value of the 8 LSBs of the data to 0xBC, which is the code for K.28.5.
Problem: My LG G4 (Android 5.1, kernel 3.10.49) suddenly ignored my home’s 5 GHz router. It saw the neighbors’ networks all right, but not mine.
Reason for problem: I had activated the phone’s hotspot previously. It seems like that locked the Wifi hardware to the 2.4 GHz band, as it happens to transmit on 2.437 GHz (channel 6). Correction: It seems like the phone doesn’t detect the 5 GHz hotspot if it wasn’t present on boot. Regardless of hotspot activation.
Solution: Reboot the phone. As simple as that. That is, restart Android, with the 5 GHz hotspot present and with Wifi enabled on the phone (before rebooting or enabled soon after boot). Turning the phone off and on again would probably do the same job, but why bother. It has been suggested to turn flight mode on and off, but that didn’t work for me.
Conclusion: Treat your smartphone for what it is: A small, weak and expensive computer with a lot of silly bugs.
Update (Jun 2017): After changing to another 5 GHz channel, the phone detects the hotspot normally. It seems like the previous 5 GHz channel I used wasn’t an allowed frequency in Israel, so the phone ignored it (and sometimes didn’t). Or maybe there’s also some kind of software upgrade that has taken place since.
By the way, I tried to move the G4′s hotspot to 2.452 GHz (channel 9) in the Advanced Settings, but the reception signal on the laptop went down. Go figure.