Outgoing SMTP mail servers considerations

Mail with gmail.com as From address vanishing

It started really bad: Someone asked me why he hasn’t received an answer from me in two weeks, and I had answered his mail the same day I got his.

It turned out that Gmail had thrown the mail into a black hole without any warning. The probable cause is the updated DMARC policy, which has been discussed for a long time now.

Solution: When using gmail.com as a From address, also use smtp.gmail.com as the outgoing SMTP mail server. This ensures that the mail properly reaches gmail.com recipients as well.

Bonus: The sent mails appear in Google’s web interface’s Sent folder as well. I would have done well without this favor.

That was really annoying, frankly speaking.

SPF entries

The topic is explained on this page. It’s important to keep this entry updated if and when I change outgoing servers.

This mechanism helps spam filters tell if the sender of the mail is authentic, by looking for an SPF record in the domain name’s TXT record.

For example, since I’m relaying through my web host’s server, billauer.co.il, the desired SPF record should read:

v=spf1 +a +mx +ip4: +ip4: +ip4: +ip4: +ip4: -all
  • The +a part means to pass the mail if the sender’s IP address matches the A entry of the sending domain
  • The +mx means the same for the MX entry
  • The other IP4 parts say that if these addresses (or address blocks) appear, pass the mail
  • Finally, the -all part at the end says that if none of the previous entries matches, drop the mail

(In Cpanel, go to EMail > Authentication to set this up)
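
To make the reading of such a record concrete, here is a minimal Python sketch that splits an SPF record into its qualifier / mechanism pairs. The record in the example uses the reserved documentation block 192.0.2.0/24 as a stand-in (it is not my real record), and the function name parse_spf is made up for this note. It only tokenizes the record: no DNS lookups, and no evaluation of a sender’s address.

```python
# Minimal SPF tokenizer: a sketch, not a full RFC 7208 evaluator.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def parse_spf(record):
    """Return a list of (result, mechanism, value) tuples for an SPF record."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    parsed = []
    for term in terms[1:]:
        # A missing qualifier means "+" (pass)
        has_qualifier = term[0] in QUALIFIERS
        qualifier = term[0] if has_qualifier else "+"
        body = term[1:] if has_qualifier else term
        mechanism, _, value = body.partition(":")
        parsed.append((QUALIFIERS[qualifier], mechanism.lower(), value))
    return parsed

# Hypothetical record; 192.0.2.0/24 is a documentation-only address block
example = "v=spf1 +a +mx +ip4:192.0.2.0/24 -all"
for result, mechanism, value in parse_spf(example):
    print(result, mechanism, value)
```

The last tuple is the one to watch: with -all, anything that didn’t match an earlier term ends up as “fail”.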

A word about “include records”. I used to have one saying “+include:relay.mailchannels.net”, which means “get the SPF record from relay.mailchannels.net” and add whatever records they have. This makes sense in a way, since their servers are expected to appear on the list. On the other hand, if this record is missing (or their DNS is temporarily out of business), it’s a fatal error. So I’m not happy about this idea. The solution above, copying their records (which is the long list of IP4 address blocks), is suboptimal in that I may miss some new servers or so, but it can’t cause a fatal error.

Checking the current SPF record

$ dig txt billauer.co.il

; <<>> DiG 9.6.2-P2-RedHat-9.6.2-5.P2.fc12 <<>> txt billauer.co.il
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1406
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;billauer.co.il.            IN    TXT

billauer.co.il.        14400    IN    TXT    "v=spf1 +a +mx +ip4: +ip4: +ip4: +ip4: +ip4: -all"

billauer.co.il.        28920    IN    NS    ns2.totalchoicehosting.com.
billauer.co.il.        28920    IN    NS    ns1.totalchoicehosting.com.

ns1.totalchoicehosting.com. 4736 IN    A
ns2.totalchoicehosting.com. 131050 IN    A

;; Query time: 173 msec
;; WHEN: Thu Sep  7 18:23:52 2017
;; MSG SIZE  rcvd: 256

Checking directly on the authoritative DNS is better when monitoring quick changes.


Gtkwave notes


Gtkwave is a simple and convenient viewer of electronic waveforms. It’s a free software tool at its best: A bit rough to start working with, but after a while it becomes clear that the decisions have been made by someone who uses the tool himself. Really recommended.

So here are my jumpstart notes for the next time I’ll need it.


$ gtkwave this.vcd &

Note to self: My own utility for generating VCD files from raw dumps, named dump2vcd, is in the utils repository.
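
For reference, a VCD file is plain text and simple enough to generate by hand. The following Python sketch is not dump2vcd. It is a hypothetical minimal writer (the function name, signal names and scope name are all made up) that handles single-bit signals only, just to show the kind of file Gtkwave expects to read.

```python
def write_vcd(signal_names, samples, timescale="1ns"):
    """Return VCD text for single-bit signals.

    samples is a list of (time, values) pairs, where values holds one
    0/1 entry per signal. Only changed values are written at each time.
    """
    # Short printable identifiers: '!', '"', '#', ... (enough for a few signals)
    ids = [chr(33 + i) for i in range(len(signal_names))]
    lines = ["$timescale %s $end" % timescale, "$scope module dump $end"]
    for ident, name in zip(ids, signal_names):
        lines.append("$var wire 1 %s %s $end" % (ident, name))
    lines += ["$upscope $end", "$enddefinitions $end"]
    previous = [None] * len(signal_names)
    for time, values in samples:
        changes = ["%d%s" % (v, i)
                   for v, i, p in zip(values, ids, previous) if v != p]
        if changes:
            lines.append("#%d" % time)
            lines += changes
        previous = list(values)
    return "\n".join(lines) + "\n"

text = write_vcd(["clk", "data"], [(0, (0, 0)), (5, (1, 0)), (10, (0, 1))])
print(text)
```

Writing the returned text into a file with a .vcd suffix should be enough for opening it with gtkwave as shown above.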

Saving a signal view set

To have a certain set of signals shown, as well as a certain time region, create a “Save File”: In the top menu go to File > Write Save File. The automatic filename extension is .sav.

At the next session, just read the save file with File > Read Save File. Or from the command line e.g.

$ gtkwave this.vcd allview.sav &


Note that hovering the mouse pointer over a certain signal’s trace will make it sticky on transitions.

  • Plain left-click: Sets the “current position”. Values seen at left.
  • Left click and drag: Time difference shown at the top
  • Right-click and drag: Zoom in
  • Middle-button (wheel) click: Set the “base marker”. Then move around the primary marker with plain left-clicks, and the time difference between the base marker and the primary marker will appear at the top.

Other nice stuff

  • Find Next / Previous Edge (plain right/left arrows at the toolbar). Select one or several signals on the Signals list, and click on these icons to jump to the next edge of those signals.
  • If a set of signals are named with index suffixes, e.g. state[0] and state[1], Gtkwave automagically shows them as a vector (e.g. state[1:0]).


Notes on USB 1.1 low-level protocol for FPGA implementation


These are the considerations and design decisions I made when designing a transparent hub for low- and full-speed USB (that is, all covered by USB 1.1, and not high-speed as required by USB 2.0).

A transparent 1:1 hub is a device with one male USB plug, and one female plug. It basically substitutes an extension cable, and should not be visible or have any practical influence on the connected parties. The idea is to repeat the signals from one end to another by virtue of two ULPI PHY frontends, one connected to each side, and both connected to an FPGA in the middle.

This is not the recommended practice if all you want is to sniff the data: In that case, just connecting the D+/D- wires directly to a couple of 3.3V-compatible inputs of the FPGA in parallel to the cable is the way to go. Literally tapping the wire (you might need to suppress the chirps that upgrade the link to high-speed, if that’s an issue). A transparent hub is only required if intervention in the traffic is desired, on top of sniffing capabilities.


  • Full operation is required, including connect / disconnect, suspend and resume. It may appear redundant to support suspend on a system that is never turned off or hibernated, but I’ve seen Linux suspending USB hubs when nothing is connected to its downstream ports. Once something is connected to one of the hub’s ports, it issues a resume (long K, see below) signal towards the host, and the host replies with a long K resume signal before SOF packets are sent. Besides, Windows supports a Selective Suspend feature which may suspend certain idle devices to save energy. These are just examples.
  • Both low and full speed must be supported. The logic will not be able to deduce which speed is in effect based upon pullup resistors from the device.
    Reason I: In the specific project, the board has a pullup resistor only on the D+ wire going to the host. The solution was to swap the D+/D- wires on both sides when a low-speed device is used. As a result, the D+ wire seen by the FPGA will always be pulled up on the device side in either case.
    Reason II: If a (regular) hub is connected as a device to the transparent hub, it always presents itself as a full-speed device with the pullup resistor. Low speed data is sent just by reducing the rate in this case, in both directions (see “SE0 detection when a hub is connected as a device” below)
  • There will be no additional intermediate elements (i.e. hubs) between the connected machines, but there may be hubs inside those.
  • The PHY device on both ends is TI’s TUSB1106 USB transceiver.

Comparison with a regular hub

A transparent hub, like a regular one, is required to repeat a variety of signals going from one end to another. The main challenges are to keep track of which side drives the wires, and prevent accidental signal level transitions (noise and differential signaling imbalance) from generating errors.

The USB spec defines maximal delays and jitter for a hub, which are then used to calculate the overall link jitter and turnaround delay. Since the transparent hub is designed to be the only element between the host and device (more or less, as the machines may have internal hubs), the total jitter and delay may be generated by this single element (again, more or less).

As for jitter, the USB 1.1 spec section 7.1.15 says: “Data receivers are required to decode differential data transitions that occur in a window plus and minus a nominal quarter bit cell from the nominal (centered) data edge position”. This is indeed reflected in the total jitter allowed, as presented in tables 7-2 and 7-3 of the same document (20 ns for full speed).

The turnaround delay for low and full speed is defined in the USB 1.1 spec, section 7.1.19, which sets the timeout at 16-18 bit times (applies to both speeds). This calculation takes 5 hubs into consideration, as well as the device’s delay. In our case, the single element may add a few bit times of delay (I never calculated the exact figure, as my implementation went way below this figure).

The following functionalities are required from a regular hub, but can be omitted from a transparent one:

  • Suspend: A hub is required to put itself into a low-power state (Suspend) in the event of no activity on the bus (like any USB device). This isn’t necessary, as the transparent hub doesn’t consume current from the USB wire.
  • Babble / Loss of Activity (LOA) detection: A regular hub must detect if a device transmits junk on the wires, and by doing so, prevents the other devices (those connected to the same hub) from communicating with the host. The hub should kick in if the device doesn’t release the bus with an EOP before the end of the USB frame (once in 1 ms). A transparent hub doesn’t need to take action, since there are no neighboring devices competing for access. If a device is babbling or otherwise holding the wires, the host will soon detect that and take action.
  • Frame tracking: A regular hub must lock on the USB frames (by detecting SOF packets and maintaining a timer). However the purpose of this lock is to detect babble. Hence this isn’t required either.
  • Reset: A regular hub generates the SE0 signal required to reset a device when it’s hot-plugged to it. The transparent hub merely enables the pullup resistor on the upstream port in response to a hotplug on its downstream port, letting the host issue the reset.

Possible bus events

This is a list of events that need proper relaying to the other side.

The repeater should expect the following events from the host:

  • Packet (low or full speed): transition to K followed by SYNC and packet data
  • Long SE0 for reset (2.5 us, even though it’s expected to be several ms)
  • Keep-alive for low-speed device, 2 low-speed bits of SE0
  • Resume signaling (long K), followed by a low-speed EOP
  • PRE packets. See below.

The repeater should expect the following events from the device:

  • Packet (low or full speed): transition to K followed by SYNC and packet data
  • Long SE0 (2 us) = disconnection
  • Resume signaling (long K), followed by release of the bus (no EOP nor driven J).

Note that since the transparent hub doesn’t keep track of frame timing, it doesn’t detect 3 ms of no traffic, and hence doesn’t know when Suspend is required. Therefore, Suspend and Resume signaling are treated like any other signal, and no timeout mechanism is applied on long K periods.

One thing that might be confusing is that when the upstream port is disconnected, it might catch noise, in particular from the electrical network. Both single-ended inputs may change states at a rate of 50 or 100 Hz (assuming a 50 Hz network), since no termination is attached to either port. This is a non-issue, as there’s nothing to disrupt at that point, but it may be mistaken for a problem.

Voltage transitions

Like any USB element, a transparent hub must correctly detect transitions between J, K and SE0, avoiding false detections.

The TUSB1106 transceiver presents the D+/D- wire pairs’ state by virtue of three logic signals: RCV, VP and VM. RCV represents the differential voltage state (J or K, given the speed mode), while VP and VM represent the single-ended voltages. The purpose of VP and VM is detecting an SE0 state, in which case both are low. The other possibilities are either illegal (i.e. both high, as single-ended ‘1’ isn’t allowed per USB spec) or redundant (if they are different, RCV should be used, as it measures the differential voltage, rather than two single-ended voltages, and is therefore more immune to noise).
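
The decoding described above fits in a few lines. This is a Python sketch for illustration only (the real thing is FPGA logic). The function name is made up, and so are two assumptions: that RCV is high when D+ is above D-, and that the D+/D- wires are not swapped on the board.

```python
def line_state(rcv, vp, vm, lowspeed=False):
    """Decode the bus state from the transceiver's RCV, VP and VM outputs.

    Assumes RCV is 1 when D+ > D- and that D+/D- are not swapped.
    J is D+ high at full speed and D- high at low speed.
    """
    if not vp and not vm:
        return "SE0"                  # both wires low
    if vp and vm:
        return "SE1"                  # both high: illegal per USB spec
    # VP != VM: trust the differential receiver rather than VP/VM
    d_plus_high = bool(rcv)
    if lowspeed:
        return "K" if d_plus_high else "J"
    return "J" if d_plus_high else "K"

print(line_state(rcv=1, vp=1, vm=0))                  # full speed, D+ high
print(line_state(rcv=1, vp=1, vm=0, lowspeed=True))   # same wires, low speed
```

Note that this says nothing about whether an SE0 is real or a momentary artifact of a J/K transition.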

Any USB receiver is in particular sensitive to transitions between J, K and SE0. For example, a transition from J to K after idling means SOP (Start of Packet) and immediately causes the hub repeater to drive the other side. Likewise, a transition into SE0 while a packet is transmitted signals the end of packet (EOP).

The timing of the transitions is crucial, as a DPLL is maintained on the receiving side to lock on the arriving bits. In short, detecting voltage transitions timely and correctly is crucial.

The main difficulty is that switching from J to K essentially involves one of the D+/D- wires going from low to high, and the other one from high to low. Somewhere in the middle, they might both be high or both low. And if both are low, an SE0 condition occurs.

The USB 1.1 spec section 7.1.4 says: “Both D+ and D- may temporarily be less than Vih(min) during differential signal transitions. This period can be up to 14ns (TFST) for full-speed transitions and up to 210ns (TLST) for low-speed transitions. Logic in the receiver must ensure that that this is not interpreted as an SE0.”

So a temporary “detection” of SE0 may be false, and this is discussed next. But before that, can a transition into J or K also be false? Let’s divide it into three cases:

  • A transition from J to K or vice versa: This is detected by a change in the RCV input. The TUSB1106’s datasheet promises no transition in RCV when going into SE0 in its Table 5. So if RCV toggles when the previous state was J or K, it’s not an SE0. We’ll have to trust TI’s transceiver on this. Besides, going from J to K necessarily involves both wires’ voltage changes, but going to SE0 requires only one wire to move. So a properly designed receiver will not toggle RCV before both wires have changed polarity within legal high/low ranges (did I say TI?). This doesn’t contradict that VP and VM may show an SE0 momentarily, as already mentioned.
  • A transition from SE0 to J. This is detected by one of VM or VP going high. This requires one of the wires to reach the high voltage level. This can, in principle, happen momentarily due to single-ended noise (SE0 can be 20 ms long), so requiring that the VM or VP signal remains high for say, 41 ns before declaring the exit of SE0 may be in place. As discussed below, the uses of SE0 are such that it’s never required to time the exit of this state accurately.
  • A transition from SE0 to K. Even though this kind of transition is mentioned in the USB 1.1 spec’s requirements for the hub’s repeater (e.g. see section), there is no situation in the USB protocol for such a transition to occur. I’ve also set a trigger for such an event, and tried several pieces of electronics, none of which generated such a transition. Consequently, there’s no reason to handle this case.

Avoiding false SE0 detection

The SE0 state may appear in the following cases:

  • EOP in either direction, following a packet. The length of the SE0 is approximately 2 bit times (depending on the speed mode), but should be detected after 1 bit time. See “SE0 detection when a hub is connected as a device” below for a possible problem with this.
  • End of a resume signal from host (K state for > 20 ms): A low-speed EOP (two low-speed bit times of SE0).
  • A long SE0 from host (should be detected at 2.5 us, generated as 10 ms): Reset the device
  • A long SE0 from device (should be detected at 2.0 us): Disconnect
  • Keep-alive signaling from host (low-speed only): 2 low-speed bit times of SE0 followed by J.

As mentioned above, a false SE0 may be present for 14 ns on full-speed mode (17% of a bit time) and 210ns in low-speed mode (32% of a bit time). The difference is because the rise and fall times of the D+/D- depend on the speed mode.
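
The percentages follow directly from the nominal bit rates (12 Mbit/s and 1.5 Mbit/s). A quick Python check, with helper names that are mine:

```python
FULL_SPEED_BPS = 12_000_000   # full speed: 12 Mbit/s
LOW_SPEED_BPS = 1_500_000     # low speed: 1.5 Mbit/s

def bit_time_ns(bits_per_second):
    """Nominal duration of one bit, in nanoseconds."""
    return 1e9 / bits_per_second

def fraction_of_bit(duration_ns, bits_per_second):
    """Return duration_ns as a percentage of one bit time."""
    return 100.0 * duration_ns / bit_time_ns(bits_per_second)

print("full-speed bit: %.1f ns" % bit_time_ns(FULL_SPEED_BPS))
print("low-speed bit:  %.1f ns" % bit_time_ns(LOW_SPEED_BPS))
print("TFST = 14 ns:  %.1f%% of a bit" % fraction_of_bit(14, FULL_SPEED_BPS))
print("TLST = 210 ns: %.1f%% of a bit" % fraction_of_bit(210, LOW_SPEED_BPS))
```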

Except for detecting EOP in full-speed mode, it’s fine to ignore any SE0 shorter than 210 ns, as all other possibilities are significantly longer.

For detecting EOP in full-speed mode, the 14ns lower limit is enforced instead. If the speed mode can’t be derived from the pullup resistors (as was my case), it’s known from the packet itself. For example, from timing the second transition in the packet’s SYNC pattern.

The truth is, that looking at the output of the TUSB1106 USB transceiver when connected to real devices, it seems like the transceiver was designed to allow its user to be unaware of these issues. Clearly, an effort has been made to avoid these false SE0s in its design.

The VP and VM ports tend to toggle together, and in some cases (depending what USB device is tested) VP and VM were both high for a few nanoseconds on J/K transitions. This was even more evident in low-speed mode. Most likely, seeing both low at any time is enough to detect an SE0, despite the precautions required in the spec.

Also, the RCV signal toggled somewhere in the middle between VP and VM moving, or along with one of these, but never when SE0 was about to appear. So the unaware user of this chip could simply consider any movement in RCV as a true data toggle, and if VM and VP are low, detect SE0 without fussing too much about it.

Or, one can add these mechanisms like I did, just in case.

Summary of J/K/SE0 transition detection

The cautious way:

  • From J or K: If RCV toggles, register the transition to the opposite J/K symbol immediately. Ignore further transitions for 41 ns (half a full-speed bit) to avoid false transitions due to noise (I bet this is unnecessary — I’ve never seen any glitches on the RCV signal).
    If VP and VM are both zero for 210 ns (or 14 ns, if in the midst of a full-speed packet), register a transition to SE0.
  • From SE0: If either VM or VP is high for 41 ns, register a transition to J.

Lazy man’s implementation (will most likely work fine):

  • From J or K: If RCV toggles, register the transition to the opposite J/K symbol. If VM and VP are both zero, register an SE0 (immediately).
  • From SE0: If either VM or VP is high for 41 ns, register a transition to J (immediately).
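
As a software model of the cautious flavor, here is a Python sketch working on periodic samples of RCV, VP and VM. It’s an illustration under my own assumptions, not the FPGA logic itself: the bus is assumed to start at idle J, J and K are tracked by counting RCV toggles (so no polarity is assumed), and the function and parameter names are made up.

```python
def detect_transitions(samples, dt_ns, se0_min_ns=210,
                       exit_min_ns=41, holdoff_ns=41):
    """Return the list of (time_ns, state) transitions, cautious flavor.

    samples: (rcv, vp, vm) tuples taken every dt_ns, starting from idle J.
    se0_min_ns: 210 in general, 14 in the midst of a full-speed packet.
    """
    state = "J"
    last_rcv = samples[0][0]   # RCV level at the last accepted J/K transition
    holdoff = 0.0              # time left for ignoring RCV toggles
    se0_run = 0.0              # how long VP and VM have both been low
    high_run = 0.0             # how long VP or VM has been high while in SE0
    events = []
    for i, (rcv, vp, vm) in enumerate(samples):
        t = i * dt_ns
        if state == "SE0":
            high_run = high_run + dt_ns if (vp or vm) else 0.0
            if high_run >= exit_min_ns:
                state, last_rcv, high_run = "J", rcv, 0.0
                events.append((t, state))
            continue
        if holdoff > 0:
            holdoff -= dt_ns                  # still ignoring RCV toggles
        elif rcv != last_rcv:
            state = "K" if state == "J" else "J"
            last_rcv, holdoff = rcv, holdoff_ns
            events.append((t, state))
        se0_run = se0_run + dt_ns if not (vp or vm) else 0.0
        if se0_run >= se0_min_ns:
            state, se0_run, holdoff = "SE0", 0.0, 0.0
            events.append((t, state))
    return events

# J, a 20 ns false SE0 during the J-to-K transition, K, a real SE0, back to J
samples = ([(1, 1, 0)] * 5 + [(1, 0, 0), (0, 0, 0)] + [(0, 0, 1)] * 10
           + [(0, 0, 0)] * 30 + [(1, 1, 0)] * 10)
print(detect_transitions(samples, dt_ns=10))
```

The 20 ns during which both VP and VM are low in the example doesn’t produce an SE0 event, while the 300 ns period does. The lazy flavor is what remains when se0_min_ns and holdoff_ns are set to zero.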

PRE packets

Section 8.6.5 states that a PRE packet should be issued by the host when it needs to make a low-speed transaction, when there’s a hub in the middle. The special thing with the PRE packet is that it doesn’t end with an SE0. Rather, there are 4 full-speed bit times of “nothing happening”, which is followed by a low-speed SYNC and packet.

This requires some special handling of this case, which actually isn’t described further down this post (it’s a bit complicated, quite naturally).

One tricky thing is that the state of the wires is always K after the PRE packet. In fact, it’s always K after the PID, because the SYNC itself is like transmitting 0x80, and the 8 bits of the PID always contain an even number of zeros, as half of it is the NOT of the other. So it’s always an odd number of J/K toggles (7 zeros from SYNC, an even number from the PID part), leaving the wires at K.

The USB standard is a bit unclear on what happens after the PRE PID has been transmitted, but it’s quite safe to conclude that it has to switch to J immediately after the PRE packet has completed. The hub is expected to start relaying to the low-speed downstream ports from that moment on (up to 4 full-speed bit times later), and if the upstream port stands at K when that happens, a false beginning of SYNC will be relayed.

And indeed, this is what I’ve observed with real hardware: Immediately after the PRE packet’s last bit, the wires switch to a J state.
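
The parity argument for the wires ending at K can be checked with a few lines of Python. This is just a sketch with made-up function names. It relies on NRZI coding (a zero toggles the wires, a one doesn’t) and on the PID field being four bits followed by their complement, both sent LSB first. No bit stuffing is involved, since there are never six ones in a row within SYNC and PID.

```python
SYNC_BITS = [0, 0, 0, 0, 0, 0, 0, 1]   # 0x80 sent LSB first (KJKJKJKK on the wire)

def pid_bits(pid4):
    """Return the 8 PID bits, LSB first: the 4-bit PID, then its complement."""
    byte = (pid4 & 0xF) | ((~pid4 & 0xF) << 4)
    return [(byte >> i) & 1 for i in range(8)]

def nrzi_final_state(bits, start="J"):
    """NRZI: a 0 toggles the line state, a 1 leaves it as is."""
    state = start
    for bit in bits:
        if bit == 0:
            state = "K" if state == "J" else "J"
    return state

PRE_PID = 0b1100   # PID value of PRE
print(nrzi_final_state(SYNC_BITS + pid_bits(PRE_PID)))   # prints K
```

Running the same over all 16 possible PID values gives K each time, so this isn’t specific to PRE.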


These are the implemented states of the transparent hub. The rationale is explained after the short outline of the states.

  • Disconnected — this is the initial state, and is also possibly invoked by virtue of SE0 timers explained below. The D+/D- pullup resistor is disconnected at the upstream port (to the host), and none of the ports is driven. When a pullup resistor has been sensed for 2.5us on the downstream port, a matching pullup resistor is enabled at the upstream port. Then wait for the upstream port to change from SE0 to J, or it will be interpreted as an SE0 signal. Then switch to the Idle state. The pullup resistor remains enabled on all other states.
  • Idle — None of the ports is driven. If a ‘K’ is sensed on the upstream port, switch to the “Down” state. If ‘K’ is sensed on the downstream port, switch to the “Up” state. In both cases, this is a J-to-K transition, so it takes place immediately when RCV toggles. An SE0 condition on either port switches the state to Host_SE0 or Device_SE0, whichever applies.
  • Down — J/K symbols are repeated from the upstream to the downstream port. This state doesn’t handle SE0. If a ‘J’ symbol is sensed continuously for 8 bit times, go to “Idle”, which isn’t a normal packet termination, but a result of a missed SE0. “Bit times” means low-speed bit times, unless a full-speed sync transition was detected at the first K-to-J transition during this state. The EOP is handled by the Host_SE0 state.
  • Up — J/K symbols are repeated from the downstream to the upstream port. Exactly like “Down”, only in the opposite direction. Same 8 bit timeout mechanism for returning to “Idle”
  • Host_SE0 — This state is invoked from Idle or Down states when an SE0 is sensed on the upstream port (after waiting TLST or TFST to prevent a false SE0). The downstream port is driven with SE0. When the SE0 condition on the upstream port is exited, switch to Host_preidle.
  • Host_preidle — This state drives the downstream port with a J for a single bit’s time (depends on the speed mode) and then switches the state machine to Idle. This completes EOPs and other similar conditions.
  • Device_SE0 — This state is invoked from Idle or Up states when an SE0 is sensed on the downstream port. The upstream port is driven with SE0. When the SE0 condition is exited, switch to Device_preidle.
  • Device_preidle — This state drives the upstream port with a J for a single bit’s time (depends on the speed mode) and then switches the state machine to Idle.

The states above are outlined for illustration. In a real-life implementation, it’s convenient to collapse Down, Host_SE0 and Host_preidle into a single state (an enriched “Down” state). These three states all drive the downstream port. Host_SE0 is basically Down, as it repeats the data from the upstream to the downstream port. The only special thing is that a preidle phase follows it. This SE0 to preidle sequence is easier to implement by adding state flags, rather than adding states.

Likewise, Up, Device_SE0 and Device_preidle can be collapsed into a single state.
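
The outline of the states can be written down as a transition table. The following Python sketch is a simplification for illustration: the event names are made up, timing (SE0 confirmation, the single J bit of the preidle states) is assumed to be handled elsewhere and arrive as events, and I’ve assumed that the device side mirrors the host side (Device_SE0 entered from Idle or Up, Device_preidle ending at Idle).

```python
# (state, event) -> next state. Event names are made up for this sketch.
TRANSITIONS = {
    ("Disconnected", "attached"): "Idle",          # pullup sensed, upstream left SE0
    ("Idle", "upstream_K"): "Down",
    ("Idle", "downstream_K"): "Up",
    ("Idle", "upstream_SE0"): "Host_SE0",
    ("Idle", "downstream_SE0"): "Device_SE0",
    ("Down", "upstream_SE0"): "Host_SE0",          # EOP from host
    ("Down", "j_timeout"): "Idle",                 # 8 bit times of J: missed SE0
    ("Up", "downstream_SE0"): "Device_SE0",        # EOP from device
    ("Up", "j_timeout"): "Idle",
    ("Host_SE0", "se0_end"): "Host_preidle",
    ("Host_preidle", "bit_elapsed"): "Idle",
    ("Device_SE0", "se0_end"): "Device_preidle",
    ("Device_SE0", "disconnect"): "Disconnected",  # downstream SE0 for 2.0 us
    ("Device_preidle", "bit_elapsed"): "Idle",
}

def next_state(state, event):
    """Return the next state; events that don't apply leave the state as is."""
    return TRANSITIONS.get((state, event), state)

def run(events, state="Disconnected"):
    """Feed a list of events and return the list of states visited."""
    trace = [state]
    for event in events:
        state = next_state(state, event)
        trace.append(state)
    return trace

# A device is plugged in, then the host sends one packet ending with an EOP
print(run(["attached", "upstream_K", "upstream_SE0", "se0_end", "bit_elapsed"]))
```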

SE0 counters

Two counters are maintained to keep track of how long an SE0 condition has been continuously active, one for the upstream port, and one for the downstream port. These counters are zeroed in two situations:

  • When one of VP or VM is high, indicating not being in an SE0 condition.
  • When the port in question is driven by the other port. For example, the counter for the downstream port is held at zero in the Host_SE0 state (or a reset signal from the host could have been mistaken for a disconnection). Another possible criterion is the PHY’s Output Enable signal.

These counters serve two purposes:

  • Detect non-false SE0: When the counter goes above the relevant threshold value, the SE0 condition is valid. The state switches to Host_SE0 or Device_SE0 (whichever applies) from Idle or Down/Up. This handles EOP.
  • Detect disconnection on the downstream port: When the counter reaches the value corresponding to 2.0 us, the state switches to Disconnected

Note that the transition to the *_SE0 states, and the way they are terminated by virtue of a single bit’s J, covers all of the protocol’s uses of the SE0 condition, including reset from host and keep-alives.
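
One such counter, modeled in Python. This is a sketch only: the class and method names are mine, time is counted in nanoseconds rather than clock cycles, and the thresholds are the figures used in this post.

```python
class Se0Counter:
    """Tracks how long one port has continuously been in SE0 (in ns)."""

    def __init__(self, valid_ns=210, disconnect_ns=2000):
        self.valid_ns = valid_ns            # 210 ns, or 14 ns inside a full-speed packet
        self.disconnect_ns = disconnect_ns  # only meaningful on the downstream port
        self.elapsed_ns = 0

    def tick(self, vp, vm, driven_by_other_port, dt_ns):
        """Advance by dt_ns; return 'none', 'se0' or 'disconnect'."""
        if vp or vm or driven_by_other_port:
            self.elapsed_ns = 0             # not in SE0, or the SE0 is our own doing
        else:
            self.elapsed_ns += dt_ns
        if self.elapsed_ns >= self.disconnect_ns:
            return "disconnect"
        if self.elapsed_ns >= self.valid_ns:
            return "se0"
        return "none"

counter = Se0Counter()
results = [counter.tick(0, 0, False, 10) for _ in range(200)]
print(results.index("se0") * 10 + 10, "ns until a valid SE0")
print(results.index("disconnect") * 10 + 10, "ns until disconnection")
```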

Determining the bit length

As the speed isn’t determined by pullup resistors, a detection mechanism is applied as follows:

  • A flag, “lowspeed”, is maintained, to indicate the time of a USB bit. When ‘1’, low-speed is assumed (one bit is ~666 ns), and when ‘0’, full-speed (~ 83 ns).
  • The flag is set to ‘1’ in the Idle state.
  • When one of the Up or Down states is invoked by a transition from J to K, a counter measures the time until the first transition back to J. If this occurs within 6.5 full-speed bit times (~542 ns), the flag is cleared to ‘0’.
  • Following transitions (until the next Idle state) are ignored for this purpose.

Rationale: The only situation where full-speed timing is required is full-speed packets. Such packets begin with a SYNC word, which starts with a few J/K togglings on each bit.

The time threshold could have been set slightly above 83 ns, since the first toggle is expected on the bit following the J-to-K toggle that took the state machine out of Idle. However 6.5 full-speed bit’s time is better, as it gracefully handles the unlikely event that Idle would be invoked in the middle of a transmitted packet (due to a false SE0, for example). Since 6 bits is the maximal length of equal bits (before bit stuffing is employed), the bus must toggle after 6 bits. If it doesn’t after 6.5, it’s not full-speed.

The “lowspeed” flag correctly handles other bus events, such as keep-alive signaling, which consists of two low-speed bits of SE0 followed by a bit of J. Since the flag is set on Idle, it remains such when Host_SE0 is invoked, which will correctly terminate with a low-speed bit’s worth of J.
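
The decision rule itself boils down to one comparison. A Python sketch, with names of my own choosing:

```python
FULL_SPEED_BIT_NS = 1e9 / 12e6              # ~83.3 ns
THRESHOLD_NS = 6.5 * FULL_SPEED_BIT_NS      # ~542 ns

def lowspeed_flag(first_k_duration_ns):
    """Return True for low speed, False for full speed.

    first_k_duration_ns is the time from the J-to-K transition that took
    the state machine out of Idle until the first transition back to J.
    """
    return first_k_duration_ns >= THRESHOLD_NS

def bit_time_ns(lowspeed):
    """Bit time to use for timeouts and the preidle J, given the flag."""
    return 8 * FULL_SPEED_BIT_NS if lowspeed else FULL_SPEED_BIT_NS

print(lowspeed_flag(83))    # first SYNC bit of a full-speed packet
print(lowspeed_flag(667))   # first SYNC bit of a low-speed packet
```

A resume signal (long K) never toggles back within the threshold, so it’s treated with low-speed timing, just like a low-speed packet.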

SE0 detection when a hub is connected as a device

There’s a slight twist when a (real) USB hub is connected as the device to the transparent hub, and a low-speed device is connected to one of the (real) hub’s downstream ports. Any (real) hub’s upstream port runs at full-speed (or higher), so the hub repeats the low-speed device’s signals on the upstream port, using full-speed’s rise/fall times, polarity and other attributes.

A practical implication is that the EOP may consist of a full-speed SE0, if it’s generated by the hub (see citations below). It’s also the hub’s responsibility to ensure that false SE0 time periods are limited to the full-speed spec (i.e. TFST rather than TLST). Hence for the specific case of detecting an EOP that was generated by the (real) hub for truncating a packet that went beyond the end of a frame, the method of detecting SE0 outlined above won’t work, because it will ignore a full-speed SE0 in the absence of a full-speed sync pattern on that packet.

This is however a rare corner case, which stems from a misbehaving device connected to the (real) hub.

Citing 11.8.4 of the USB spec:

“The upstream connection of a hub must always be a full-speed connection. [ ... ] When low-speed data is sent or received through a hub’s upstream connection, the signaling is full-speed even though the bit times are low-speed.”

and also:

“Hubs will propagate upstream-directed packets of any speed using full-speed signaling polarity and edge rates.”


“Although a low-speed device will send a low-speed EOP to properly terminate a packet, a hub may truncate a low-speed packet at the EOF1 point with a full-speed EOP. Thus, hubs must always be able to tear down connectivity in response to a full-speed EOP regardless of the data rate of the packet.”

and finally,

“Because of the slow transitions on low-speed ports, when the D+ and D- signal lines are switching between the ‘J’ and ‘K’, they may both be below 2.0V for a period of time that is longer than a full-speed bit time. A hub must ensure that these slow transitions do not result in termination of connectivity and must not result in an SE0 being sent upstream.”

Misalignment of transitions to SE0

Preventing false SE0 comes with a price: The J/K transitions are repeated immediately to the opposite port, but transitions to SE0 must be delayed until they’re confirmed to be such. The spec relates to this issue in section 7.1.14 (Hub Signal Timings, also see Table 7-8): “The EOP must be propagated through a hub in the same way as the differential signaling. The propagation delay for sensing an SE0 must be no less than the greater of the J-to-K, or K-to-J differential data delay (to avoid truncating the last data bit in a packet), but not more than 15ns greater than the larger of these differential delays at full-speed and 200ns at low-speed (to prevent creating a bit stuff error at the end of the packet).”

It’s not clear how the 200 ns requirement can be met given the 210 ns it takes to detect SE0 properly on low-speed. But 10 ns in low-speed mode is probably not much to fuss about.

In theory, it would be possible to delay repeating of the J/K transitions to the opposite port with the relevant expected SE0 delay. This is however not feasible when the speed is deduced from the first SYNC transition (see “Determining the bit length” above).

The chosen solution was to have a digital delay line for all transitions, with the delay of 100 ns. This is equivalent to 1.25 bits delay at full-speed, and slightly longer than a full-speed hub’s delay according to table 7-8, so it’s acceptable. J/K transitions are always delayed by the full delay amount, but transitions to SE0 manipulate the segment in the delay line, so that the output SE0 starts in a time that corresponds to when the SE0 was first seen, not when it was confirmed, in full-speed mode. In low-speed mode, only 100 ns are compensated for, but that’s still within spec.
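
The idea of manipulating the delay line can be modeled in Python as follows. This is a toy model under my own assumptions, not the actual logic: time is counted in abstract sample ticks, the names are made up, and an SE0 that is confirmed late simply overwrites the newest entries in the line.

```python
from collections import deque

class BackdatingDelayLine:
    """Fixed delay line whose newest entries can be rewritten."""

    def __init__(self, delay_ticks, fill="J"):
        self.line = deque([fill] * delay_ticks)

    def tick(self, symbol, backdate_ticks=0):
        """Push symbol, first overwriting up to backdate_ticks newest entries."""
        for i in range(1, min(backdate_ticks, len(self.line)) + 1):
            self.line[-i] = symbol
        self.line.append(symbol)
        return self.line.popleft()

def first_se0_output_tick(delay_ticks, se0_start, confirm_ticks, backdate=True):
    """Tick at which SE0 first leaves the line, for an SE0 that begins on
    the wires at se0_start and takes confirm_ticks samples to confirm."""
    line = BackdatingDelayLine(delay_ticks)
    confirmed_at = se0_start + confirm_ticks - 1
    for t in range(se0_start + confirm_ticks + delay_ticks + 1):
        if t < confirmed_at:
            out = line.tick("J")
        elif t == confirmed_at:
            out = line.tick("SE0", confirm_ticks - 1 if backdate else 0)
        else:
            out = line.tick("SE0")
        if out == "SE0":
            return t
    return None

print(first_se0_output_tick(10, 5, 2))                   # 15: start + delay
print(first_se0_output_tick(10, 5, 2, backdate=False))   # 16: one tick late
print(first_se0_output_tick(10, 5, 21))                  # 25: only 10 ticks recovered
```

The last case corresponds to low-speed mode: when the confirmation time is longer than the delay line, only the line’s length can be compensated for.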

High-speed suppression

Since the transparent hub supports only USB 1.1, the transition into high speed (per USB 2.0) must not occur. This is guaranteed by the suggested implementation, since the mechanism for upgrading into high speed takes place during the long SE0 period that resets the device. Namely, the USB 2.0 spec section states that a high-speed capable device should drive current into the D- wire during the initial reset, while the host drives an SE0 on the wires. This creates a chirp-K state on the wires, which is essentially a slightly negative differential voltage instead of more or less zero. This is the first stage of the transition into high speed.

But since the host has already initiated an SE0 state when the chirp-K arrives from the device, the transparent hub is in the Host_SE0 state, so the voltages at the device’s side are ignored at that time period. The chirp will have no significance and has no way to get passed on to the host. Hence the host will never sense anything special, and the device will give up the speed upgrade attempt.

NXP / Freescale SDMA and the art of accessing peripheral registers


While writing a custom SDMA script for copying data arriving from an eCSPI peripheral into memory, it occurred to me that there is more than one way to fetch the data from the peripheral. This post summarizes my rather decisive finding in this matter. Spoiler: Linux’ driver could have done better.

I’ve written a tutorial on SDMA scripts in general, by the way, which is recommended before diving into this one.

Using the Peripheral DMA Unit

This is the method used by the official eCSPI driver for Linux. That is, the one obtained from Freescale’s / NXP’s Linux git repository. Specifically, spi_imx_sdma_init() in drivers/spi/spi-imx.c sets up the DMA transaction with

	spi_imx->rx_config.direction = DMA_DEV_TO_MEM;
	spi_imx->rx_config.src_addr = res->start + MXC_CSPIRXDATA;
	spi_imx->rx_config.src_addr_width = DMA_SLAVE_BUSWIDTH_1_BYTE;
	spi_imx->rx_config.src_maxburst = spi_imx_get_fifosize(spi_imx) / 2;
	ret = dmaengine_slave_config(master->dma_rx, &spi_imx->rx_config);
	if (ret) {
		dev_err(dev, "error in RX dma configuration.\n");
		goto err;
	}

Since res->start points at the address resource obtained from the device tree (0x2008000 for eCSPI1), this is the very same address used for accessing the peripheral registers (only the software uses the virtual address mapped to the relevant region).

In essence, it means issuing an stf command to set the PSA (Peripheral Source Address), and then reading the data with an ldf command on the PD register. For example, if the physical address (e.g. 0x2008000) is in register r1:

69c3 (0110100111000011) | 	stf	r1, 0xc3	# PSA = r1 for 32-bit frozen peripheral read
62c8 (0110001011001000) | 	ldf	r2, 0xc8	# Read peripheral register into r2

One would expect this to be the correct way, or why does this unit exist? Or why does Linux’ driver use it? On the other hand, if this is the right way, why is there a “DMA mapping”?

Using the Burst DMA Unit

This might sound like a bizarre idea: Use the DMA unit intended for accessing RAM for peripheral registers. I wasn’t sure this would work at all, but it does: If the same address that was fed into PSA for accessing a peripheral goes into MSA instead, the data can be read correctly from MD. After all, the same address space is used by the processor, Peripheral DMA unit and Burst DMA unit, and it turns out that the buses are interconnected (which isn’t obvious).

So the example above changes into

6910 (0110100100010000) | 	stf	r1, 0x10    # To MSA, NO prefetch, address is frozen
620b (0110001000001011) | 	ldf	r2, 0x0b    # Read peripheral register into r2

The motivation for this type of access is using copy mode — a burst of up to 8 read/write operations in a single SDMA command. This is possible only from PSA to PDA, or from MSA to MDA. But there is no burst mode from PSA to MDA. So treating the peripheral register as a memory element works around this.

Spoiler: It’s not such a good idea. The speed results below tell why.

Using the SDMA internal bus mapping

The concept is surprisingly simple: It’s possible to access some peripherals’ registers directly in the SDMA assembly code’s memory space. In other words, to access eCSPI1, one can go just

5201 (0101001000000001) | 	ld	r2, (r1, 0) # Read peripheral register from plain SDMA address space

and achieve the equivalent result of the examples above. But r1 needs to be set to a different address. And this is where it gets a bit confusing.

The base address is fairly easy to obtain. For example, i.MX6’s reference manual lists the address for eCSPI1 as 0x2000 in section 2.4 (“DMA memory map”), where it also says that the relevant section spans 4 kB. Table 55-14 (“SDMA Data Memory Space”) in the same document assigns the region 0x2000-0x2fff to “per2”, declares its size as 16 kB, and in the description it says “peripheral 2 memory space (4 Kbyte peripheral’s address space)”. So what is it? 4 kB or 16 kB?

The answer is both: The address 0x2000 is given in SDMA data address format, meaning that each address points at a 32-bit word. Therefore, the SDMA map region of 0x2000-0x2fff indeed spans 16 kB. But the mapping to the peripheral registers was done in a somewhat creative way: The address offsets of the registers apply directly on the SDMA mapping’s addresses.

For example, let’s consider the ECSPI1_STATREG, which is placed at “Base address + 18h offset”. In the Application Processor’s address space, it’s quite clear that it’s 0x2008000 + 0x18 = 0x2008018. The 0x18 offset means 0x18 (24 in decimal) bytes away from the base.

In the SDMA mapping, the same register is accessed at 0x2000 + 0x18 = 0x2018. At first glance, this might seem obvious, but an 0x18 offset means 24 x 4 = 96 bytes away from the base address. A bit odd, but that’s the way it’s implemented.

So even though each address increment in SDMA data address space moves 4 bytes, they mapped the multiply-by-4 offsets directly, placing the registers 16 bytes apart. Attempting to access addresses like 0x2001 yields nothing noteworthy (in my experiments, they all read zero). I believe that the SDMA submodule was designed in France, by the way.

Almost needless to say, these addresses (e.g. 0x2000) can’t be used to access peripherals with Peripheral / Burst DMA units — these units work with the Application Processor’s bus infrastructure and memory map.
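The two address calculations can be put side by side in a minimal C sketch. The function names are made up for illustration; the base addresses and the 0x18 offset are the ones quoted above for eCSPI1 and ECSPI1_STATREG.

```c
#include <stdint.h>

/* Base addresses of eCSPI1, as quoted in the text above */
#define ECSPI1_AP_BASE   0x2008000u /* Application Processor, byte address */
#define ECSPI1_SDMA_BASE 0x2000u    /* SDMA data space, 32-bit word address */

/* Hypothetical helper: byte address as seen by the Application Processor */
uint32_t ap_reg_addr(uint32_t reg_offset)
{
	return ECSPI1_AP_BASE + reg_offset;
}

/*
 * Hypothetical helper: the address to put in r1 for "ld r2, (r1, 0)".
 * The register's byte offset is added as is, with no division by 4.
 */
uint32_t sdma_reg_addr(uint32_t reg_offset)
{
	return ECSPI1_SDMA_BASE + reg_offset;
}

/* Distance from the region's base in bytes, as each SDMA address is 4 bytes */
uint32_t sdma_byte_distance(uint32_t reg_offset)
{
	return 4 * reg_offset;
}
```

With reg_offset = 0x18, the first function gives 0x2008018, the second 0x2018, and the third 96.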

Speed tests

As all three methods work, the question is how fast each is. So I ran a speed test. I only tested the peripheral read operation (my application didn’t involve writes), but I would expect more or less the same results for writes. The speed tests were carried out by starting the SDMA script from a Linux kernel module, and issuing a printk when the SDMA script was kicked off. When the interrupt arrived at the completion of the script (resulting from a “done 3” opcode, not shown in the table below), another printk was issued. The timestamps in dmesg’s output were used to measure the time difference.

In order to keep the influence of the Linux overhead delays low, the tested command was executed within a hardware loop, so that the overall execution would take a few seconds. A few milliseconds of printk delay hence became fairly negligible.

The results are given in the following table:

               | Peripheral DMA Unit | Burst DMA Unit  | Internal bus mapping | Non-IO command
Assembly code  | stf r1, 0xc3        | stf r1, 0x10    | loop endloop, 0      | loop endloop, 0
               | loop endloop, 0     | loop endloop, 0 | ld r2, (r1, 0)       | addi r5, 2
               | ldf r2, 0xc8        | ldf r2, 0x0b    |                      |
Execution rate | 7.74 Mops/s         | 3.88 Mops/s     | 32.95 Mops/s         | 65.97 Mops/s

Before concluding the results, a word on the rightmost one, which tested the speed of a basic command. The execution rate, almost 66 Mops/s, shows the SDMA machine’s upper limit. Where this came from isn’t all that clear, as I couldn’t find a matching clock rate in any of the three clocks enabled by Linux’ SDMA driver: clk_ahb, clk_ipg and clk_per.

The reference manual’s section 55.4.6 claims that the SDMA core’s frequency is limited to 104 MHz, but calling clk_get_rate() for clk_ahb returned 132 MHz (which is 2 x 66 MHz…). For the two others which the imx-sdma.c driver declares that it uses, clk_ipg and clk_per (the same clock, I believe), clk_get_rate() returned 60 MHz, so it’s not that one. In short, it’s not 100% clear what’s going on, except that the figure is max 66 Mops/s.

By the way, I verified that the hardware loop doesn’t add extra cycles by duplicating the addi command, so it ran 10 times for each loop. The execution rate dropped to exactly 1/10, so there’s definitely no loop overhead.

OK, so now to the conclusions:

  • The clear winner is using the internal bus. Note that the result isn’t all that impressive, after all. With 33 Mops/s, 4 bytes each, there’s a theoretical limit of 132 MB/s for just reading. That doesn’t include doing something with the data. More about that below.
  • Note that reading from the internal bus takes just 2 execution cycles.
  • There is a reason for using the Peripheral DMA Unit, after all: It’s twice as fast compared with the Burst DMA Unit.
  • It probably doesn’t pay off to use the Burst DMA Unit for burst copying from a peripheral to memory, even though I didn’t give it a go: The read is twice as slow, and writing to memory with autoflush is rather quick (see below).
  • The use of the Peripheral DMA Unit in the Linux kernel driver is quite questionable, given the results above. On the other hand, the standard set of scripts isn’t really designed for efficiency anyhow.

Copying data from peripheral to RAM

In this last pair of speed tests, the loop reads one value from the peripheral with Internal bus mapping (the fastest way found) and writes it to the general RAM with an stf command, using autoincrement. This is hence a realistic scenario for bulk copying of data from a peripheral data register into memory that is available to the Application Processor.

The test code had to be modified slightly, so the destination address is brought back to the beginning of the buffer every 1,000,000 write operations, since the buffer size is limited, quite naturally. So when the script begins, r7 contains the number of times to loop until resetting the destination address (that is, r7 = 1000000) and r3 contains the number of such sessions to run (was set to 200). The overhead of this larger loop is literally one in a million.

The assembly code used was:

                             | bigloop:
0000 008f (0000000010001111) | 	mov	r0, r7
0001 6e04 (0110111000000100) | 	stf	r6, 0x04	# MDA = r6, incremental write
0002 7802 (0111100000000010) | 	loop endloop, 0
0003 5201 (0101001000000001) | 	ld	r2, (r1, 0)
0004 6a0b (0110101000001011) | 	stf	r2, 0x0b	# Write 32-bit word, no flush
                             | endloop:
0005 2301 (0010001100000001) | 	subi	r3, 1		# Decrement big loop counter
0006 7cf9 (0111110011111001) | 	bf	bigloop		# Loop until r3 == 0
                             | quit:
0007 0300 (0000001100000000) | 	done 3			# Quit MCU execution

The result was 20.70 Mops/s, that is 20.7 Million pairs of read-writes per second. This sets the realistic hard upper limit for reading from a peripheral to 82.8 MB/s. Note that by deducting the known time it takes to execute the peripheral read, one can estimate that the stf command runs at ~55.5 Mops/s. In other words, it’s a single cycle instruction until an autoflush is forced every 8 writes. However dropping the peripheral read command (leaving only the stf command) yields only 35.11 Mops/s. So it seems like the DMA burst unit takes advantage of the small pauses between accesses to it.
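The estimate for the stf command can be reproduced with a short C sketch. It assumes that the time of one loop iteration is simply the sum of the times of its two commands, which is an approximation; the function names are made up for illustration.

```c
/* Data rate in MB/s for a given rate of 32-bit operations in Mops/s */
double mops_to_mbytes(double mops)
{
	return mops * 4.0;
}

/*
 * Rate of the second command in a two-command loop, given the rate of
 * the whole loop and the rate of the first command when it runs alone.
 * Assumption: time per iteration = sum of the two commands' times.
 */
double remaining_rate(double loop_mops, double known_mops)
{
	return 1.0 / (1.0 / loop_mops - 1.0 / known_mops);
}
```

Calling mops_to_mbytes(20.70) gives 82.8, and remaining_rate(20.70, 32.95) gives roughly 55.7, which is in line with the figure mentioned above.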

I should mention that the Linux system was overall idle while performing these tests, so there was little or no congestion on the physical RAM. The results were repeatable within 0.1% of the execution time.

Note that automatic flush was enabled during this test, so the DMA burst unit received 8 writes (32 bytes) before flushing the data into RAM. When reattempting this test, with explicit flush on each write to RAM (exactly the same assembly code as listed above, with a peripheral read and then stf r7, 0x2b instead of 0x0b), the result dropped to 6.83 Mops/s. Which is tantalizingly similar to the 7.74 Mops result obtained for reading from the Peripheral DMA Unit.

Comparing with non-DMA

Even though not directly related, it’s worth comparing how fast the host accesses the same registers. For example, how much time will this take (in Linux kernel code, of course)?

  for (i=0; i<10000000; i++)
    rc += readl(ecspi_regs + MX51_ECSPI_STAT);

So the results are as follows:

  • Reading from an eCSPI register (as shown above): 4.10 Mops/s
  • The same, but from RAM (non-cacheable, allocated with dma_alloc_coherent): 6.93 Mops/s
  • The same, reading with readl() from a region handled by RAM cache (so it’s considered volatile): 58.14 Mops/s
  • Writing to an eCSPI register (with writel(), loop similar to above): 3.8696 Mops/s

This was carried out on an i.MX6 processor with a clock frequency of 996 MHz.

The figures echo well with those found in the SDMA tests, so it seems like the dominant delays come from i.MX6’s bus bridges. It’s also worth noting the surprisingly slow performance of readl() from cacheable memory, maybe because of the memory barriers.

NXP / Freescale i.MX6 as an SPI slave


Even though SPI is commonly used for controlling rather low-speed peripherals on an embedded system, it can also come in handy for communicating data with an FPGA.

When using the official Linux driver, the host can only be the SPI master. This means, among other things, that transactions are initiated by the host: When the bursts take place is completely decided by software, and so is how long they are. It’s not just about who drives which lines, but also the fact that the FPGA is on the responding side. This may not be a good solution when the data rates are anything but really slow: If the FPGA is slave, it must wait for the host to poll it for data (a bit like a USB peripheral). That can become a bit tricky at the higher end of data rates.

For example, if the FPGA’s FIFO is 16 kbit deep, and is filled at 16 Mbit/s, it takes 1 ms for it to overflow, unless drained by the host. This can be a difficult real-time task for a user-space Linux program (based upon spidev, for example). Not to mention how twisted such a solution will end up, having the processor constantly spinning in a loop collecting data, whether there is data to collect or not.
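The deadline in this example follows from a one-line calculation, sketched here in C. The function name is made up, and since the text doesn’t say whether 16 kbit means 16000 or 16384 bits, the round figure is assumed.

```c
/* Time in ms until a FIFO overflows if it is filled steadily and never drained */
double ms_to_overflow(double depth_bits, double fill_bits_per_sec)
{
	return 1000.0 * depth_bits / fill_bits_per_sec;
}
```

For example, ms_to_overflow(16000.0, 16e6) gives 1 ms, which is the time the host has for draining the FIFO in the worst case.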

Another point is that the SPI clock is always driven by the SPI master, and it’s usually not a free-running one. Rather, bursts of clock edges are presented on the clock wire to advance the data transaction.

Handling a gated clock correctly on an FPGA isn’t easy when it’s controlled by an external device (unless its frequency is quite low). From an FPGA design point of view, it’s by far simpler to drive the SPI clock and handle the timing of the MOSI/MISO signals with respect to it.

And finally: If a good utilization of the upstream (FPGA to host) SPI channel is desired, putting the FPGA as master has another advantage. For example, on i.MX6 Dual/Quad, the SPI clock cycle is limited to a cycle of 15 ns for write transactions, but to 40 ns or 55 ns on read transactions, depending on the pins used. The same figures are true, regardless of whether the host is master or slave (compare sections in the relevant datasheet, IMX6DQCEC.pdf). So if the FPGA needs to send data faster than 25 Mbps, it can only use write cycles, hence it has to be the SPI master.
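The relation between the minimal clock cycle and the maximal bit rate is sketched below in C, assuming one bit per SPI clock cycle. The function name is made up for illustration.

```c
/* Maximal bit rate in Mbit/s for a given minimal SPI clock cycle in ns */
double max_mbit_per_sec(double cycle_ns)
{
	return 1000.0 / cycle_ns;
}
```

The 40 ns limit on read transactions hence translates to 25 Mbit/s, while the 15 ns limit on write transactions allows about 66.7 Mbit/s.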

CS is useless…

This is the “Chip Select” signal, or “Slave Select” (SS) in Freescale / NXP terminology.

The reference manual, along with NXP’s official errata ERR009535, clearly state that deasserting the SPI’s CS wire is not a valid way to end a burst. Citing the description for the SS_CTL field of ECSPIx_CONFIGREG, section 21.7.4 in the i.MX6 Reference Manual:

In slave mode – an SPI burst is completed when the number of bits received in the shift register is equal to (BURST_LENGTH + 1). Only the n least-significant bits (n = BURST_LENGTH[4:0] + 1) of the first received word are valid. All bits subsequent to the first received word in RXFIFO are valid.

So the burst length is fixed. The question is, what value to pick. Short answer: 32 bits (set BURST_LENGTH to 31).

Why 32? First, let’s recall that RXFIFO is 32 bits wide. So what is more natural than packing the incoming data into full 32 bits entries in the RXFIFO, fully utilizing its storage capacity? Well, maybe the natural data alignment isn’t 32 bits, so another packing scheme could have been better. In theory.

That’s where the second sentence in the citation above comes in. What it effectively says is that if BURST_LENGTH + 1 is chosen as anything other than a multiple of 32, the first word which is ever pushed into RXFIFO since the SPI module’s reset will contain less than 32 received bits. All the rest, no matter what BURST_LENGTH is set to, will contain 32 bits of received data. This is really what happens. So in the long run, data is packed into 32 bit words no matter what. Choosing BURST_LENGTH + 1 other than a multiple of 32 will just mess up things on the first word the RXFIFO receives after waking up from reset. Nothing else.

So why not set BURST_LENGTH to anything else than 31? Simply because there’s no reason to do so. We’re going to end up with an SPI slave that shifts bits into RXFIFO as 32 bit words anyhow. The term “burst” has no significance, since deassertions of CS are ignored anyhow. In fact, I’m not sure if there’s any difference between different values satisfying the multiple of 32 rule.

Note that since CS doesn’t function as a frame for bursts, it’s important that the eCSPI module is brought out of reset while there’s no traffic (i.e. clock edges), or it will pack the data in an unaligned and unpredictable manner. Also, if the FPGA accidentally toggles the clock (due to a bug), alignment is lost until the eCSPI is reset and reinitialized.

Bottom line: The SPI slave receiver just counts 32 clock edges, and packs the received data into RXFIFO. Forever. There is no other useful alternative when the host is slave.
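The behavior described in the bottom line can be modeled in a few lines of C. This is only a hypothetical software model for illustration, not the hardware’s actual implementation: the struct and function names are made up, and the MSB-first bit order is the one shown in the Reference Manual’s timing diagrams.

```c
#include <stdint.h>

/* Hypothetical model of the slave receiver's state */
struct spi_rx_model {
	uint32_t shift_reg;  /* Bits collected so far, MSB first */
	unsigned int count;  /* Number of bits collected */
};

/*
 * Feed one sampled bit, i.e. one SPI clock edge. Returns 1 and stores a
 * full word in *word when 32 bits have been collected, otherwise 0.
 * There is no CS input: it plays no role in framing the words.
 */
int spi_rx_clock_edge(struct spi_rx_model *rx, int bit, uint32_t *word)
{
	rx->shift_reg = (rx->shift_reg << 1) | (uint32_t)(bit & 1);

	if (++rx->count < 32)
		return 0;

	*word = rx->shift_reg; /* This is what gets pushed into RXFIFO */
	rx->shift_reg = 0;
	rx->count = 0;
	return 1;
}
```

In this model, a single spurious call shifts all subsequent words by one bit, and nothing but zeroing the struct (the counterpart of resetting the eCSPI module) restores the alignment.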

… but must be taken care of properly

Since the burst length doesn’t depend on the CS signal, it might as well be kept asserted all the time. With the register setting given below, that means holding the pin constantly low. It’s however important to select the correct pin in the CHANNEL_SELECT field of ECSPIx_CONREG: The host will ignore the activity on the SPI bus unless CS is selected. In other words, you can’t terminate a burst with CS, but if it isn’t asserted, bits aren’t sampled.

Another important thing to note, is that the CS pin must be IOMUXed as a CS signal. In the typical device tree for the mainstream Linux SPI master driver, it’s assigned as a GPIO pin. That’s no good for an SPI slave.

So, for example, if the ECSPI entry in the device tree says:

&ecspi1 {
[ ... ]
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_ecspi1_1>;
	status = "okay";

meaning that the IOMUX settings given in pinctrl_ecspi1_1 are applied when the Linux driver related to ecspi1 is probed, then pinctrl_ecspi1_1 should say something like

&iomuxc {
	imx6qdl-var-som-mx6 {
[ ... ]

		pinctrl_ecspi1_1: ecspi1grp {
			fsl,pins = <
				MX6QDL_PAD_DISP0_DAT23__ECSPI1_SS0	0x1f0b1
[ ... ]

The actual labels differ depending on the processor’s variant, which pins were chosen etc. The point is that the _SS0 usage was selected for the pin, and not the GPIO alternative (in which case it would say MX6QDL_PAD_DISP0_DAT23__GPIO5_IO17). The list of IOMUX defines for the i.MX6 DL variant can be found in arch/arm/boot/dts/imx6dl-pinfunc.h.


The timing diagrams for SPI communication in the Reference Manual show only 8 bit examples, with MSB received first. But this applies to 32 bit words as well. But what happens if 4 bytes are sent with the intention of being treated as a string of bytes?

Because the first byte is treated as the MSB of a 32-bit word, it’s going to end up as the last byte when the 32-bit word is copied (by virtue of a single 32-bit read and write) into RAM, whether done by the processor or by SDMA. This ensures that a 32-bit integer is interpreted correctly by the Little Endian processor when transmitted over the SPI bus, but messes up single bytes transmitted.

Where exactly this flipping takes place, I’m not sure, but it doesn’t really matter. Just be aware that if a sequence of bytes is sent over the SPI link, they need to be byte swapped in groups of 4 bytes to appear in the correct order in the processor’s memory.
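A minimal sketch of such a byte swap in plain C follows. The function name is made up for illustration, and the buffer length is assumed to be a multiple of 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Reverse the byte order within each group of 4 bytes, in place.
 * len is given in bytes. Trailing bytes beyond the last full group
 * of 4 are left untouched.
 */
void swap_bytes_in_words(uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 4 <= len; i += 4) {
		uint8_t b0 = buf[i];
		uint8_t b1 = buf[i + 1];

		buf[i] = buf[i + 3];
		buf[i + 1] = buf[i + 2];
		buf[i + 2] = b1;
		buf[i + 3] = b0;
	}
}
```

So if the bytes “ABCD” are sent over the link and end up in memory as “DCBA”, calling this function on the buffer brings them back to the order of transmission. In kernel code, applying be32_to_cpu() on each 32-bit word should have the same effect on a Little Endian processor.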

Register setting

In terms of a Linux kernel driver, the probe of an SPI slave is pretty much the same as the SPI master, with a few obvious differences. For example, the SPI clock’s frequency isn’t controlled by the host, so it probably doesn’t matter so much how the dividers are set (but it’s probably wise to set these dividers to 1, in case the internal clock is used for something).

  ctrl = MX51_ECSPI_CTRL_ENABLE | /* Enable module */
    /* MX51_ECSPI_CTRL_MODE_MASK not set, so it's slave mode */
    /* Both clock dividers set to 1 => 60 MHz, not clear if this matters */
    MX51_ECSPI_CTRL_CS(which_cs) | /* Select CSn */
    (31 << MX51_ECSPI_CTRL_BL_OFFSET); /* Burst len = 32 bits */

  cfg = 0; /* All defaults, in particular, no clock phase / polarity change */

  /* CTRL register always go first to bring out controller from reset */
  writel(ctrl, regs + MX51_ECSPI_CTRL);

  writel(cfg, regs + MX51_ECSPI_CONFIG);

  /*
   * Wait until the changes in the configuration register CONFIGREG
   * propagate into the hardware. It takes exactly one tick of the
   * SCLK clock, but we will wait 10 us to be sure (SCLK is 60 MHz)
   */
  udelay(10);

  /*
   * Turn off DMA requests (revert the register to its defaults)
   * But set the RXFIFO watermark as required by device tree.
   */
  writel([ ... ],
	 regs + MX51_ECSPI_DMA);

  /* Enable interrupt when RXFIFO reaches watermark */
  writel(MX51_ECSPI_INT_RDREN, regs + MX51_ECSPI_INT);

The example above shows the settings that apply when the host reads from the RXFIFO directly. Given the measurements I present in another post of mine, showing ~4 Mops/s with a plain readl() call, it means that at the maximal bus rate of 66 Mbit/s, which is ~2.06 Mops/s (32 bits per read), we have a processor core 50% busy just on readl() calls.
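The 50% figure can be reproduced with the following C sketch, which assumes that each readl() call fetches one 32-bit word from RXFIFO and that nothing else slows the calls down. The function name is made up for illustration.

```c
/*
 * Fraction of a processor core's time spent on readl() calls.
 * bus_mbit: data rate on the SPI bus, in Mbit/s
 * readl_mops: measured rate of back-to-back readl() calls, in Mops/s
 */
double readl_core_load(double bus_mbit, double readl_mops)
{
	double words_mops = bus_mbit / 32.0; /* Million 32-bit words per second */

	return words_mops / readl_mops;
}
```

With the measured 4.10 Mops/s for reading from an eCSPI register, readl_core_load(66.0, 4.10) gives about 0.50.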

So for higher data rates, SDMA is pretty much a must.

The speed test

Eventually, I ran a test. With a dedicated SDMA script, SPI clock running at 112 MHz, 108.6 Mbit/s actual throughput:

# time dd if=/dev/myspi of=/dev/null bs=64k count=500
500+0 records in
500+0 records out
32768000 bytes (33 MB, 31 MiB) copied, 2.41444 s, 13.6 MB/s

real	0m2.434s
user	0m0.000s
sys	0m1.610s

This data rate is, of course, way above the allowed SPI clock frequency of 66 MHz, but it’s not uncommon that real-life results are so much better. I didn’t bother pushing the clock higher.

I ran a long and rigorous test looking for errors on the data transmission line (~ 1 TB of data) and it was completely clean with the 112 MHz, so the SPI slave is reliable. For a production system, I wouldn’t consider exceeding 66 MHz, despite this result. Just to have that said.

But the bottom line is that the SPI slave mode can be used as a simple transmission link of 32-bit words. Often that’s good enough.

Fedora 12: Bringing keyboard autorepeat back

For some weird reason (actually, running an old Atari 800 emulator), my autorepeat was suddenly not working. On any window. That’s the moment one becomes aware of how often it’s used.

Quick solution:

$ xset r on

and it was all fine again. Or

$ xset q

to see all settings.

Altera NIOS II jots

About this post

These are things I wrote down at different stages of introducing myself to Nios II and its environment. Nothing really consistent nor necessarily the right way to do things.


  • Open Qsys. Follow this post.
  • Went for Nios II classic, used Nios/e (no hardware multiplication, as the target device doesn’t have it). Set instruction cache to 2 kB, and no data cache
  • Add 16 kB on-chip memory (Basic > On-Chip Memory > On-Chip Memory (RAM or ROM) ). Data width 32 bits, set hex file to raminit.hex (to be found at verilog/raminit.hex)
  • Attach memory to processor’s Avalon master
  • Attach peripherals
  • Connect clk_0’s clock output to all clock inputs (including processor’s).
  • Same with reset
  • Assign base addresses automatically: System > Assign Base Addresses
  • Enter the CPU configuration, and assign the Reset and Exception Vectors to the onchip memory (this issues an offset to the addresses, per the peripheral’s offsets).
  • Build the Qsys project. Among all (Verilog) files, it generates a processor.sopcinfo file.


  • Launch Nios II Software Build Tools for Eclipse (from Qsys or Quartus)
  • Pick a path for the workspace
  • Pick File > New > Nios II Application and BSP from Template. Assign the SOPC information file as processor.sopcinfo as generated before, and pick the “Hello World” template. There’s also a much smaller “Minimal Hello World” which allows communication with the JTAG UART.
  • Build the project. Eh, it failed. Not enough memory (printf is heavy. There’s a thinner version, but doesn’t matter now)
  • Go back to Qsys, and make on-chip memory 40960 bytes large (40kB, fitter fails if it’s 48 kB). Re-run Assign Base Addresses.
  • Build the Qsys project again
  • Regenerate the BSP: In Eclipse, right-click the BSP project, pick Nios II > Generate BSP (NOT from the top menu’s Nios II, there is no such option there!). Or alternatively, within a NIOS2 shell (see below), and from the BSP project’s home directory, go
    nios2-bsp-generate-files --settings settings.bsp --bsp-dir .
  • Rebuild: Project > Clean… and clean all, with the rebuild option set.
  • To add a lot of files to a project: Right-click the project, pick Import…, General > File System. Click Browse… and navigate to the directory where the files are and pick the directory. Then choose the desired files. Pick Advanced below, and pick “Add links” (it works).
  • To add an existing file to the project: Right-click the project, New > File > Advanced, check “Link to file in the file system” and pick the file. Then right-click the file (or several files) and pick “Add to Nios II build”
  • To remove a file, first right-click it, and pick “Remove from Nios II build”. Then right-click and delete. Failing to remove the file first will make the build system continue to look for it.
  • Creating a new application, based upon an existing BSP, and including the relevant source file sets it all up.
  • To compile manually, right-click the project, go to Nios > Nios command shell… (that opens a shell window) and type “make”
  • It’s also possible to copy the relevant elements in the PATH variable, and compile with “make” outside this shell window. Or set up the environment, as shown here.
  • I had a stubborn linking error with alt_main.c having an undefined reference to ‘main’ because I didn’t read my own note above about how to add a file to a project. It turned out that the Makefile doesn’t include any of the C source files (C_SRCS assigned to nothing in the Makefile). I ended up adding these entries manually. That allowed at least a manual build with the command shell, as mentioned in the bullet above.
  • The Eclipse project seems to consist of the Makefile, the .cproject XML file containing mostly useless mumbo-jumbo, and the .project XML file, which contains information about source files and build targets. There’s also .settings/language.settings.xml, which also seems not to contain anything relevant.
  • When creating a custom component, and an interrupt is required, be sure to associate the interrupt sender interface with an “Associated Addressable Interface” (e.g. associatedAddressablePoint set to avalon_slave_0 in the component’s tcl file). Otherwise, the interrupt will not be assigned an entry nor controller, so *_IRQ and *_IRQ_INTERRUPT_CONTROLLER_ID end up assigned with -1 in system.h.
  • For a shell prompt (“NIOS2 shell”) with all paths set up properly, go e.g.
    /path/to/altera/15.1/nios2eds/nios2_command_shell.sh

Running against hardware

Note: Quartus’ programmer and the “Run” environment on Eclipse are mutually exclusive, competing for the USB bitblaster.

  • Make sure you’ve quit Quartus’ programmer (actually not necessary. Just be sure that the blue LED on the USB Blaster is off).
  • Also make sure to “terminate launch” on the Eclipse side before attempting to reprogram the FPGA (pressing the red stop-like button on the Nios Console is enough).
  • Pick the “hello” project (that is, not the BSP) and go to top menu: Run > Run configurations…, pick Target Connection tab. Both a processor and a byte stream device should be listed (the latter is the jtaguart). Refresh to make sure it’s actually there.
  • If it says “Connected system ID hash not found on target at the expected base address” at the top, select “Ignore mismatched system ID” and “Ignore mismatched system timestamp”. This happens when there’s no system ID peripheral in the Qsys design.
  • The “Hello world from NIOS!!” should appear in the Nios II console
  • The base addresses etc. are listed in system.h inside the BSP (hello_bsp in my case).
  • This program printed out “Hello world” as well as blinked the LEDs:
    #include <stdio.h>
    #include <unistd.h>
    #include <io.h>
    #include <system.h>
    #include <altera_avalon_pio_regs.h>
    int main()
    {
      int i;
      printf("Hello from Nios II!\n");
      while (1) {
        [ ... ]
      }
      return 0;
    }
  • To generate a hex file, right-click the project (“hello”) and pick Make Targets > Build…, choose mem_init_generate and click the Build button. The juicy part in the process was
    elf2hex hello.elf 0x00010000 0x00019fff --width=32 --little-endian-mem --create-lanes=0 mem_init/raminit.hex
  • Alternatively, go (skip to the “make” statement if already in a NIOS shell)
    /path/to/altera/15.1/nios2eds/nios2_command_shell.sh make mem_init_generate
  • It’s noteworthy that the tools spotted my choice of the file name, even though it’s not located where Quartus expects it.
  • Giving the hex file to Quartus resulted in a lot of lines saying
    Warning (113015): Width of data items in "raminit.hex" is greater than the memory width. Wrapping data items to subsequent addresses. Found 1280 warnings, reporting 10
        Warning (113009): Data at line (2) of memory initialization file "raminit.hex" is too wide to fit in one memory word. Wrapping data to subsequent addresses.
        Warning (113009): Data at line (3) of memory initialization file "raminit.hex" is too wide to fit in one memory word. Wrapping data to subsequent addresses.
        Warning (113009): Data at line (4) of memory initialization file "raminit.hex" is too wide to fit in one memory word. Wrapping data to subsequent addresses.

    But this is most probably OK, as the processor worked immediately after FPGA configuration.

  • Redirect printf() and other stdout to UART: By default, the standard output goes to the JTAG UART. To change this, right-click the BSP project, pick Nios II > BSP Editor. Pick the “Main” tab, navigate to “hal > common” (it usually starts there anyhow) and change the stdout target to the desired UART. And regenerate the BSP.

Remote Update from EPCQ flash on Altera Cyclone IV


This post relates to Altera (or should I say Intel FPGA?) Cyclone IV FPGAs loaded from an EPCQ flash in Active Serial x 1 (AS x 1) mode. Things written below are probably relevant to other Altera FPGAs as well, but keep in mind that Cyclone IV FPGAs have several peculiarities you won’t find on other Altera device families.

“Remote Update” is the feature in some Altera FPGAs, which allows application logic / software to safely update the bitstream from which the FPGA is loaded. The trick is to always have a Factory (“Golden”) bitstream image on the flash, and update the “Application” image only. When power goes up, the FPGA ends up with the Application bitstream if it’s OK, or the Factory bitstream if it’s absent or corrupt.

Since the bitstreams carry a CRC, it’s guaranteed that only valid bitstreams are used. It’s therefore safe to overwrite a previous Application bitstream image: If something goes wrong in the middle of writing, it won’t be deemed a valid bitstream, so the FPGA will end up with the Factory bitstream.

The basics

To implement a remote update feature on an FPGA design, there are two functional elements needed:

  • The ability to write data into the configuration flash with user-designed logic / software. This is discussed in this post.
  • The logic / software that makes sure the FPGA ends up with the right configuration (and, in particular, prevents an endless configuration loop as explained next)

Note that the Remote Update IP Core has nothing to do with flash programming: Its function is merely to allow the FPGA’s logic to issue a reconfiguration, and offer some information on how and why the current bitstream was loaded.

When an FPGA powers up, it always configures from a constant address of the flash, which is zero on EPCQ flashes. In other words, the FPGA always powers up from the Factory bitstream, no matter what. It’s the user application logic / software’s duty to force the configuration of the Application bitstream when adequate. This means that during normal operation, there are always two configurations of the FPGA at powerup, one for the Factory bitstream, and one for the Application. This doubles the configuration time, of course.

How it happens: The FPGA is powered up, and loads the Factory bitstream from a fixed address. Through the Remote Update IP Core, the logic / software in the FPGA sets the address of the Application image at the flash, from which the FPGA should configure itself. It can then trigger a reconfiguration of the FPGA.

The FPGA’s configuration state machine attempts to load a bitstream from the flash at the given address. If the bitstream’s magic words are in place and the CRC is OK, it starts running on the new bitstream. If not, it loads the Factory bitstream again as a fallback.

By virtue of a register of the Remote Update IP Core, the software / logic in the Factory bitstream detects that it was loaded due to a failure, and takes action (or no action) accordingly. It may try another address at the flash, or refrain from another reconfiguration altogether (i.e. stay with the “Golden Image”). The decision depends on the design requirements, but the crucial point here is to prevent an endless loop of configurations.

Some reading

This post is not a user guide or a substitute for these two must-read documents:


This Nios II code implements the loading of the Application bitstream. It’s written so that it can be used on any bitstream, as it does nothing when run from an Application bitstream. It’s also safe for use with JTAG configuration (it won’t issue a reconfiguration in that case).

void do_remote_update(void) {
  alt_u32 app_bitstream_addr = 0x100000;

  alt_u32 mode = IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0) & 3;
  alt_u32 config_reason = IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0x64);

  if ((mode == 0) && (config_reason == 0)) {
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x30, 0); // Turn off watchdog
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x40, app_bitstream_addr);

    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x74, 1); // Trigger reconfiguration

    while (1); // Wait briefly until configuration takes place
  }
}

do_remote_update() should be called first thing in the Nios II code entry. If the function returns, the FPGA is either running on the Application bitstream, or the Factory (“Golden”) bitstream with a good reason not to reconfigure (i.e. a previous failure to load the Application bitstream or after a JTAG bitstream load).

Please refer to the “Programming the flash with NIOS software” section in this post on how to generate the image of the Application bitstream.

The code above works with the following setting:

  • The FPGA’s NCONFIG pin is tied high. This will not work if the NCONFIG pin is driven by some power supply watch logic or the like, because config_reason won’t be zero if NCONFIG triggered the configuration.
  • REMOTE_UPDATE_0_BASE is the base address in NIOS’ address space of a Remote Update IP core, which has the “writing configuration parameters” option enabled.
  • The application bitstream image is loaded at flash address 0x100000 (i.e. can be read with epcs_read_buffer() using this address)
  • The Golden image is at address zero, of course.

If loading the application image fails once, no other attempts are made. This is the straightforward thing to do if there’s no additional image to try from. There’s no sensible reason to try the same image again, unless the PCB designer has done a really bad job.

How this function works, briefly:

  • It verifies that the configuration mode is 0, that is Factory mode. If we’re in Application mode, the function returns.
  • It verifies that the trigger for configuration was a powerup by checking config_reason, or it returns. This prevents an endless loop of configurations in the case of a fallback into the Factory bitstream in the event of a failed attempt to load the Application bitstream.
    Note that if the configuration was triggered as a result of an assertion of the FPGA’s NCONFIG pin, or on a JTAG configuration, config_reason will read 0x10.
  • The watchdog is disabled, so the Application bitstream doesn’t have to deal with it
  • The Application bitstream’s address is set
  • A configuration is forced by writing to the dedicated register
  • An endless while (1) loop keeps execution from going any further (not that it would get far anyway).

General notes

  • It’s important to observe that the Factory / Application terminology used in the docs isn’t just for the sake of clarity: The Remote Update IP Core exposes different registers, depending on which of the two modes it considers itself to be in. In particular, when in Application mode, there is very little the logic can do, except for jumping back to Factory mode or resetting the watchdog.
  • When generating the Remote Update block (most likely in Qsys), be sure to check “Add support for writing configuration parameters”. Or you’ll keep wondering why writing to the NIOS registers has no effect at all.
  • Also be sure to set configuration mode to remote for the FPGA project. There should be a line as follows in the QSF file:
    set_global_assignment -name STRATIXIII_UPDATE_MODE REMOTE
  • When setting the boot address register, use the actual boot address with the two LSBs forced to zero. When it’s read back after a configuration as the previous boot address, it’s shifted two bits to the left. The docs are a bit confusing about this too. Go figure.
  • The watchdog is enabled by default, so unless it’s tended to in the application bitstream, it must be explicitly turned off before firing off reconfiguration.
  • The watchdog timer runs on the internal configuration clock, which is 10 MHz unless the external CLKUSR is applied.
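The boot address quirk mentioned in the list above can be put into two small helper functions. This is a sketch based on what I observed, not on the docs: the function names are made up, and the shift by two bits matches the test results in this post (0x100000 written, 0x400000 read back as the previous boot address).

```c
#include <stdint.h>

/* The value to write to the boot address register at offset 0x40:
 * the flash address with its two LSBs forced to zero. */
uint32_t ru_boot_addr_to_write(uint32_t flash_addr)
{
  return flash_addr & ~(uint32_t) 3;
}

/* The value expected when this address is read back later on as a
 * previous boot address: shifted two bits to the left. */
uint32_t ru_boot_addr_as_previous(uint32_t written)
{
  return written << 2;
}
```

So when comparing a previous boot address with the one that was written, shift one of them first.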

Accessing registers

In short, the register map is a mess. Out of the long list given in Tables 20 and 21 in the Remote Update IP Core User Guide, only a handful have a meaning.

It’s important to realize that some registers are valid when the Remote Update IP core is in Factory mode, and others when it’s in Application mode. These two register sets are mutually exclusive (except for the CURRENT_STATE_MODE register). The test program shown further down this post demonstrates which registers are valid in each mode.

This is a list of things to keep in mind regarding these registers:

  • Reading from a Factory mode register in Application mode (and vice versa) returns meaningless (and rather confusing) data.
  • The way to make sense of the registers from the docs is to refer to tables 16 and 17 in the Remote Update IP Core User Guide to tell what you want to access in terms of which param and which read_source, and then find the address for them in table 21. Several registers in table 21 constitute combinations of param and read_source that aren’t listed in table 17, which probably renders them meaningless.
  • … except for RU_RESET_TIMER and RU_RECONFIG, which are interpreted in logic to generate a reset signal / reconfiguration signal respectively, and are therefore not listed in table 17.
  • To add more confusion, readbacks don’t work as one might expect. For example, the boot address for the next configuration is set at address offset 0x40, but reading back from the same address always yields the factory boot address. To get the boot address for the next configuration (i.e. the one written to 0x40), read it back at 0x4c.
  • More confusion: The translation from the param numbers to the Nios access register isn’t some arithmetic operation, but rather some lookup logic of the avl_controller_cycloneiii_iv module in Qsys_remote_update_0_remote_update_controller.sv, which is generated automatically by Qsys.
  • The registers listed in the BSP’s drivers/inc/altera_remote_update_regs.h are those of all Altera FPGAs except Cyclone IV. For example, the docs as well as the Qsys Verilog file place RU_WATCHDOG_TIMEOUT at address 0x08 (actually, addresses 0x08-0x0b), but the BSP’s altera_remote_update_regs.h places it elsewhere.
  • Note that compared with other FPGA families, Cyclone IV’s register interface is considerably more extensive, allowing the controller to query the status of two configuration cycles back in history. Seems like this feature was dropped on later FPGAs (due to lack of interest vs complication…?)


This register is interesting in particular, as it tells us what caused the FPGA to load the bitstream that is currently running:

IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0x64); // Register 0x19 in the guide

And to obtain the reason for the configuration before that:

IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0x68); // Register 0x1a in the guide

These read the remote config core’s param 3'b111 with read source 2'b01 and 2'b10 respectively. Note that the translation from the param number of 3'b111 to the Nios access register isn’t just a multiplication, but rather some lookup logic as mentioned (but not detailed) above.
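What is simple arithmetic, however, is going from the register number as it appears in the guide to the byte offset used with IORD_32DIRECT(). The sketch below shows the calculation that the test program in this post relies on. The function name is made up, and this says nothing about the param lookup itself: it only assumes that each register is 32 bits wide, and that the four read sources of a register occupy consecutive register numbers.

```c
#include <stdint.h>

/* base_reg: the register number of read source 0, as listed in the
 * guide (e.g. 0x18 for the reconfiguration trigger source).
 * read_source: 0 to 3.
 * Returns the byte offset for use with IORD_32DIRECT(). */
uint32_t ru_byte_offset(uint32_t base_reg, uint32_t read_source)
{
  return (base_reg + (read_source & 3)) * 4;
}
```

For example, ru_byte_offset(0x18, 1) gives 0x64 (register 0x19), and ru_byte_offset(0x10, 3) gives 0x4c, which is where the boot address written to 0x40 is read back.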

Running some tests of my own (with the test program below), I got the following values. There’s nothing surprising about these results; they are exactly as documented.

  • On cold configuration: 0
  • When not disabling the watchdog (not handling it after configuration): 2 (bit 1 set, User watchdog timer timeout)
  • After failed application configuration due to lack of image: 4 (bit 2 set, nSTATUS asserted by an external device as the result of an error)
  • After failed application configuration due to damaged image: 8 (bit 3 set, CRC error during application configuration)
  • On configuration from JTAG: 0x10 (bit 4 set, External configuration reset (nCONFIG) assertion)
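For debug printouts, the values listed above can be turned into text with something like the following. It’s a minimal sketch that covers only the values I’ve seen in my own tests; the function name and the strings are mine, and anything else (bit 0 included) is reported as unknown rather than guessed.

```c
/* Translate the reconfiguration trigger source (as read from offset
 * 0x64 or 0x68) into a short description. Only the values listed in
 * this post are covered. */
const char *ru_trigger_name(unsigned int reason)
{
  switch (reason) {
  case 0x00:
    return "Cold configuration (powerup)";
  case 0x02:
    return "User watchdog timer timeout";
  case 0x04:
    return "nSTATUS asserted as the result of an error";
  case 0x08:
    return "CRC error during application configuration";
  case 0x10:
    return "nCONFIG assertion or JTAG configuration";
  default:
    return "Unknown";
  }
}
```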

A test program

On my way to understanding how the whole thing works, I wrote a small test program that runs on the Nios II processor and dumps all registers that are relevant for each mode. As a bonus, it can be used as a register reference, as it lists all registers available for reading in Factory vs. Application mode in the respective structures.

#include <system.h>
#include <alt_types.h>
#include <io.h>
#include "sys/alt_stdio.h"
#include <unistd.h>

int main()
{
  int mode;

  // A readable register. Its byte offset is (param + read_source) * 4
  struct regitem {
    int read_source;
    int param;
    const char *desc;
  };

  const struct regitem factoryparams[] = {
    { 0, 0x00, "Current Machine State Mode" },
    { 0, 0x10, "Factory Boot Address" },
    { 1, 0x10, "Previous Boot Address" },
    { 1, 0x18, "Previous reconfiguration trigger source" },
    { 2, 0x10, "One before previous Boot Address" },
    { 2, 0x18, "One before previous reconfiguration trigger source" },
    { 3, 0x04, "Early confdone check bits" },
    { 3, 0x08, "Watchdog timeout value" },
    { 3, 0x0c, "Watchdog enable bit" },
    { 3, 0x10, "Boot address" },
    { 3, 0x14, "Force internal oscillator" },
    {} // End of list: desc is NULL
  };

  const struct regitem applicationparams[] = {
    { 0, 0x00, "Current Machine State Mode" },
    { 1, 0x08, "Watchdog timeout value" },
    { 1, 0x0c, "Watchdog enable bit" },
    { 2, 0x10, "Boot address" },
    {} // End of list: desc is NULL
  };

  const struct regitem unknownparams[] = { {} };

  // Indexed by the two LSBs of the register at offset 0
  const struct {
    const struct regitem *list;
    const char *desc;
  } modetab[4] = {
    { factoryparams, "Factory mode" },
    { applicationparams, "Application mode" },
    { applicationparams, "Application mode with watchdog enabled" },
    { unknownparams, "Unknown mode" },
  };

  const struct regitem *item;

  alt_putstr("\r\n----------------   BASE IMAGE   ---------------------\r\n\r\n");

  mode = IORD_32DIRECT(REMOTE_UPDATE_0_BASE, 0) & 3;

  alt_printf("Remote update register dump\r\nMode: %s\r\n",
             modetab[mode].desc);

  // Dump all registers that are valid in the current mode
  for (item = modetab[mode].list; item->desc; item++) {
    int addr = (item->param + item->read_source) * 4;
    alt_printf("%s (0x%x) = 0x%x\r\n", item->desc, addr,
               IORD_32DIRECT(REMOTE_UPDATE_0_BASE, addr));
  }

  if (mode == 0) { // Factory mode only
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x30, 0); // Turn off watchdog
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x40, 0x100000);
    IOWR_32DIRECT(REMOTE_UPDATE_0_BASE, 0x74, 1); // Trigger reconfiguration, as in do_remote_update()
  }

  /* Event loop never exits. */
  while (1);

  return 0;
}

Test results

The test program above was compiled and included in the bitstream that was loaded into flash address 0 (Factory image).

The first alt_putstr was then changed to say “Application Image”, and the compiled version of that was included in the bitstream loaded at address 0x100000 of the flash (Application Image).

Standard output was directed to a physical UART (instead of the JTAG UART) for the purpose of this test (Eclipse’s JTAG UART console didn’t like these games with configurations).

And then I powered on:

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0xc
Previous reconfiguration trigger source (0x64) = 0x0
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

----------------   APPLICATION IMAGE   ---------------------

Remote update register dump
Mode: Application mode

Current Machine State Mode (0x0) = 0x1
Watchdog timeout value (0x24) = 0x1ffe0008
Watchdog enable bit (0x34) = 0x0
Boot address (0x48) = 0x400000

Note that if the register writes in the example are done before showing the registers, the following two lines replace their respective outputs in the Base Image parameter list:

Watchdog enable bit (0x3c) = 0x0
Boot address (0x4c) = 0x100000

The same, with the application image wiped out (zeros):

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0xc
Previous reconfiguration trigger source (0x64) = 0x0
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x4
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x4
One before previous Boot Address (0x48) = 0x400000
One before previous reconfiguration trigger source (0x68) = 0x4
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

[ ... etc ... ]

The same, with the Application image loaded in place, but with a small error (changed a single bit):

(this caused a CRC error)

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0xc
Previous reconfiguration trigger source (0x64) = 0x0
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x8
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x8
One before previous Boot Address (0x48) = 0x400000
One before previous reconfiguration trigger source (0x68) = 0x8
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

[ ... etc ... ]

Loading with JTAG: I set up both flash images properly, powered up so the FPGA stayed on the Application Image. At that point, I loaded the SOF of the Factory bitstream into the FPGA through JTAG (with a USB Blaster). The JTAG operation yielded this:

----------------   BASE IMAGE   ---------------------

Remote update register dump
Mode: Factory mode

Current Machine State Mode (0x0) = 0x0
Factory Boot Address (0x40) = 0x0
Previous Boot Address (0x44) = 0x400000
Previous reconfiguration trigger source (0x64) = 0x10
One before previous Boot Address (0x48) = 0xc
One before previous reconfiguration trigger source (0x68) = 0x0
Early confdone check bits (0x1c) = 0x1
Watchdog timeout value (0x2c) = 0x0
Watchdog enable bit (0x3c) = 0x1
Boot address (0x4c) = 0x0
Force internal oscillator (0x5c) = 0x1

----------------   APPLICATION IMAGE   ---------------------

Remote update register dump
Mode: Application mode

Current Machine State Mode (0x0) = 0x1
Watchdog timeout value (0x24) = 0x1ffe0008
Watchdog enable bit (0x34) = 0x0
Boot address (0x48) = 0x400000

When loading the same bitstream through JTAG once again, the same result is obtained, only with “One before previous reconfiguration trigger source” set to 0x10 as well.

Quartus/Linux: Setting PATH and environment for command-line

The classic way:

$ export QUARTUS_ROOTDIR=/path/to/altera/15.1/quartus
$ . $QUARTUS_ROOTDIR/adm/qenv.sh

Or open a shell (will set path, but not a full environment):

$ /path/to/altera/15.1/nios2eds/nios2_command_shell.sh

This is good for compiling for NIOS etc.

VMplayer: Silencing excessive hard disk activity

For some unknown reason, possibly after a VMplayer upgrade, running any Windows virtual machine on my Linux machine with VMware Player caused non-stop heavy hard disk activity, even when the guest machine was effectively idle and had no I/O activity of its own.

Besides being surprisingly annoying, it also made the mouse pointer non-responsive, and the effect was adverse on the hosting machine as well.

So eventually I managed to get things back to normal by editing the virtual machine’s .vmx file as described below.

I have VMplayer 6.0.2 on Fedora 12 (I suppose both are considered quite old).

Following this post, add

isolation.tools.unity.disable = "TRUE"
unity.allowCompositingInGuest = "FALSE"
unity.enableLaunchMenu = "FALSE"
unity.showBadges = "FALSE"
unity.showBorders = "FALSE"
unity.wasCapable = "FALSE"

(unity.wasCapable was already in the file, so remove it first)

That appeared to help somewhat. But what really gave the punch was also adding

MemTrimRate = "0"
sched.mem.pshare.enable = "FALSE"
MemAllowAutoScaleDown = "FALSE"

Don’t ask me what it means. Your guess is as good as mine.