Implementing a Displayport source on FPGA: Random jots
As I implemented a Displayport source (Single Stream Transport, SST) from scratch on a Virtex-7 (based upon its GTH transceivers), I wrote down some impressions. Here they are, in no particular order.
As for MST (Multiple Stream Transport), which I didn’t touch — I really wonder if someone is going to use it. Or if that part just fills the standard with dead weight.
It’s actually DVI/HDMI over GTX
My personal opinion is that the standard is not so well engineered, and poorly written: Some of the mechanisms don't make much sense (for example, the clock regeneration method), and some are defined quite vaguely. Vaguely enough that an experienced engineer with good common sense will probably get it right, but also with room for misinterpretation.
Even so, I don't expect this to prevent Displayport from becoming a long-lived method for connecting TVs and monitors to computers and other devices. Odds are that monitors will ignore many of the possibly ambiguous features in the stream, and work properly no matter what is thrown at them. The real damage is mainly that it's harder than necessary to implement the standard on both sides.
The saddest thing about this standard is that it didn't lift off from the classic screen scanning methodology, even though the task is merely transmitting pixel data from one side to another. Moving to a GTX-based standard was a great opportunity to leave the old paradigms that date back to the invention of television behind, and adopt digital-age methods. But that didn't happen.
So instead of just transmitting the image line by line, each line as a packet (like MIPI), the source is required to maintain timing of an imaginary scan of the image, with an imaginary pixel clock, producing a standard VESA graphics mode in terms of active areas and blanking. This is the natural way for a DVI-to-Displayport converter, but when the pixel source is producing Displayport directly, it needs to emulate behavior that goes back to CRT screens.
There is a certain rationale behind this: There might be a Displayport-to-DVI converter on the monitor side, which needs to regenerate the DVI stream, with a pixel clock and horizontal/vertical timings. But this is a transient need. Not something to drag along for years to come.
This isn’t just about difficulty. The need to conform to the timing behavior of some VESA standard forces the source to create blanking periods that it could otherwise skip. The main link is hence forced to be idle a significant part of the time so that an imaginary electron beam of an imaginary CRT screen will have the time to return to the left side.
And there is more: The source is required to throttle the data rate, to make it look as if there’s a pixel source, running on a VESA standard pixel clock, that pushes the pixels. This is done by transmitting stuffing symbols (that is, zero data symbols before scrambling, enclosed by FS and FE control symbols).
To make things even less pleasant, the standard requires the source to transmit the data in frames of a constant length, Transfer Units (TUs). Each Transfer Unit begins with pixel data, and then switches to stuffing symbols until its end, so that the average pixel rate matches the imaginary pixel clock.
The source can choose any TU length between 32 and 64 symbol clocks. The sink may deduce this length from the positions of the FE control symbols, as this detail isn't conveyed in any other way. One may wonder why the sink would care about the TU's frame length. And then wonder why it has to be constant.
To make things even worse, the standard doesn't define how to throttle the data. It doesn't give any definitive rule on how far the data on the symbol stream may be ahead of or behind the imaginary pixel clock, which runs continuously. For example, is it OK to bombard the sink with 128 pixels with no throttling in the first TUs, and then slow down? Is it fine to send almost nothing in the first TUs, and then catch up?
It just says things like
…the number of valid data symbols per TU per lane (except for the last TU of a line which may be cut because of the end of active pixel) will be approximated with the following equation:
# of valid data symbols per lane = packed data rate/link symbol rate * TU size
“Approximated”? How much approximated?
and then there’s
The number of valid data symbols per TU will naturally alternate, and over time, the average number will come to the appropriate non-integer value calculated from the above equation
Naturally alternate? How many symbols is a natural alternation?
So even though the pixel throttling was made for the sake of a converter to DVI, that converter has no guarantee on how much the pixel stream will fluctuate with respect to the pixel clock. It seems like they meant this fluctuation to be ±1 symbol, but it's not said. Not a real problem today, given the low price of FIFO memory, but the whole point was to avoid the need to store an entire line.
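For what it's worth, if one adopts the apparently intended ±1 interpretation, generating the number of valid symbols per TU boils down to a fractional accumulator: the carry out of the fraction's addition picks n or n+1 valid symbols on each TU, so the average converges to the exact ratio. A minimal Verilog sketch (all names are mine, and the fixed-point width is arbitrary):

```verilog
// Sketch of a per-TU valid symbol counter, assuming the intended
// fluctuation is +/-1 symbol. The ratio TU_size * pixel_rate /
// symbol_rate is supplied as a fixed-point number: an integer part
// (valid_int) and a 16-bit fraction (valid_frac).
module tu_valid_count
  (
    input clk,
    input tu_start, // Asserted one clock before each TU begins
    input [6:0] valid_int,
    input [15:0] valid_frac,
    output reg [6:0] this_tu_valid // Valid symbols in the coming TU
  );

  reg [15:0] acc = 0;
  wire [16:0] sum = acc + valid_frac;

  always @(posedge clk)
    if (tu_start)
      begin
        acc <= sum[15:0];
        this_tu_valid <= valid_int + sum[16]; // n or n+1
      end
endmodule
```

This is just a Bresenham-style accumulator, not something the standard spells out; it merely seems to be what "naturally alternate" hints at.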
Main link implementation shopping list
I found it a bit difficult at first to define what needs to be implemented on the main link, so here’s the shopping list. These are the things that a Displayport source needs to be capable of regarding transmission on the GTX lanes (at a bare minimum, with no audio):
- Three training sequences: TPS1, TPS2 and TPS3 (the last can be omitted if Displayport 1.2 isn’t supported, but it’s not a good idea)
- Pixel data, organized in Transfer Units (TUs) of 32-64 symbols each. Each TU must contain stuffing symbols (zeros), enclosed by FS and FE control symbols, at its end, so that the overall transmission data rate emulates the display mode's pixel clock. The pixel data itself must be enclosed in Blank End and Blank Start markers, at times replaced by scrambler reset markers.
- Note that as of DisplayPort 1.2, there are six variants of the Blank Start sequence: There's the enhanced framing sequence (mandatory for Displayport 1.2) and the non-enhanced version, and each of these differs across the three lane configurations (1x, 2x and 4x).
- A scrambler must be implemented to scramble all non-control symbols.
- Once in each Vertical Blank period, the Main Stream Attribute (MSA) packet must be transmitted where pixel data would occur otherwise. This is a simple data structure of at most 39 symbols, including the double SS marker at the beginning and the SE at the end, containing the image's attributes. Among others, it carries the color coding in the MISC0 field (set to 0x20 or 0x21 for plain 24 bpp RGB, for asynchronous or synchronous clock respectively), so it's hard to imagine a monitor displaying anything without this information. Someone must have thought that 39 symbols once a frame is a huge waste of bandwidth, so there are (mandatory) shorter versions of this packet for the 2 and 4 lane configurations, to make better use of the link. Hurray.
- Per section 5 of the standard, a Displayport source must support both RGB and YCbCr colorimetry formats, with a few bits-per-pixel options at a bare minimum, just to cover the fallback modes. That may not sound like a big deal, but each bits-per-pixel format is packed differently into the symbol streams. On top of that, the source must adapt itself to one of the colorimetry formats that appear in the sink's EDID information. That doesn't make life simple, but for practical purposes, I'd like to see a Displayport monitor that doesn't support 24 bpp RGB. That's not good enough if you're a graphics card vendor, though.
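Regarding the scrambler on this list: it's a 16-bit LFSR with the polynomial x^16 + x^5 + x^4 + x^3 + 1, reset to 0xffff when the SR control symbol is transmitted, and advancing 8 bits per symbol. A sketch of my understanding follows; the signal names are mine, and the exact bit ordering should be verified against the standard:

```verilog
// Sketch of the main link scrambler: data symbols are XORed with
// the LFSR's output, control symbols pass through untouched. The
// LFSR is reset to 16'hffff on the SR symbol.
module scrambler
  (
    input clk,
    input [7:0] d_in,
    input is_control, // The respective charisk bit
    input is_sr, // d_in is the SR control symbol
    output reg [7:0] d_out
  );

  reg [15:0] lfsr = 16'hffff;
  reg [15:0] s;
  reg [7:0] scrambled;
  integer i;

  always @(posedge clk)
    begin
      s = lfsr;
      for (i = 0; i < 8; i = i + 1) // 8 LFSR shifts per symbol
        begin
          scrambled[i] = d_in[i] ^ s[15];
          s = { s[14:0], s[15] ^ s[4] ^ s[3] ^ s[2] }; // Taps per polynomial
        end

      lfsr <= is_sr ? 16'hffff : s;
      d_out <= is_control ? d_in : scrambled;
    end
endmodule
```

Note that the LFSR advances regardless of whether the symbol is scrambled; only the XOR is skipped for control symbols.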
The training sequence in short
The training procedure may look quite complicated at first glance, but it merely consists of two stages:
- Clock recovery: The source transmits all symbols as 0x4a data (D10.2) without scrambling, which “happens to be” 8b/10b encoded as a 0101010101 bit sequence. The receiver performs bit clock recovery on this sequence.
- Equalization, word boundary detection and inter-lane deskew: The source transmits a special sequence of symbols without scrambling (TPS2 or TPS3, preferably the latter). The receiver, which knows what to expect, applies its own equalizer (if present) and recovers where a 10-bit symbol starts. When several lanes are used, it also deskews the lanes as required.
One could argue that today’s GTXs don’t need this kind of training, as commonly used equalizers can make up for rather lousy cabling. And it’s not clear if the GTXs actually used are designed to equalize based upon a training sequence.
Anyhow, in each of these two stages, the source runs the following loop: It applies the training sequence on the link, and writes to dedicated AUX CH registers to inform the sink which sequence is sent, on which lanes, at what rate, and what gain, pre- and post-emphasis is applied on each lane. The source then waits for a known period of time (announced by the sink as TRAINING_AUX_RD_INTERVAL) and checks the sink's registers for the status. If the registers indicate success (certain bits set, depending on the phase), this stage of the training is done. If they don't, other registers request a new setting of gain, pre- and post-emphasis. The source applies those, and loops again.
The source gives up after four attempts with the same set of gain, pre- and post-emphasis. In other words, if the sink doesn't change its signal conditioning requests over four iterations, it's a sign that it has nothing more to attempt. The source should then try a reduced lane count or rate, and retrain.
When the clock recovery stage is done, the source should go on to equalization without changing the signal conditioning. When the equalization is done, it should exit training and start sending pixels. Needless to say, without changing the signal conditioning.
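For reference, these are the DPCD (AUX CH) register addresses involved in this loop, as I recall them from the standard (verify against the spec before relying on them):

```verilog
// DPCD addresses used during link training. The address space is
// 20 bits wide on the AUX channel.
localparam TRAINING_AUX_RD_INTERVAL  = 20'h0000e; // Wait period (see spec for units)
localparam LINK_BW_SET               = 20'h00100; // Link rate
localparam LANE_COUNT_SET            = 20'h00101; // Lane count + enhanced framing bit
localparam TRAINING_PATTERN_SET      = 20'h00102; // TPS selection, scrambling disable
localparam TRAINING_LANE0_SET        = 20'h00103; // Drive level / pre-emphasis
                                                  // (lanes 1-3 at 0x104-0x106)
localparam LANE0_1_STATUS            = 20'h00202; // CR_DONE / EQ_DONE / SYMBOL_LOCKED
localparam LANE2_3_STATUS            = 20'h00203;
localparam LANE_ALIGN_STATUS_UPDATED = 20'h00204; // Inter-lane alignment done
localparam ADJUST_REQUEST_LANE0_1    = 20'h00206; // Sink's requested signal conditioning
localparam ADJUST_REQUEST_LANE2_3    = 20'h00207;
```

The loop above is essentially: write TRAINING_PATTERN_SET and TRAINING_LANEx_SET, wait, read the status registers, and either declare the stage done or apply the ADJUST_REQUEST values and repeat.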
The control symbols
This is the assignment of the control symbols in Verilog. To turn these into control symbols rather than data, the respective charisk bit must be set in parallel with the codes given below:
assign BS = 8'hbc; // K28.5
assign BE = 8'hfb; // K27.7
assign BF = 8'h7c; // K28.3
assign SR = 8'h1c; // K28.0
assign FS = 8'hfe; // K30.7
assign FE = 8'hf7; // K23.7
assign SS = 8'h5c; // K28.2
assign SE = 8'hfd; // K29.7
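To illustrate how these reach the transceiver: with, say, a 16-bit TXDATA interface, each TXCHARISK bit marks its respective byte as a K-character. Something along these lines (a hypothetical fragment; the signal names follow common Xilinx wrapper conventions, but depend on your transceiver wizard's output):

```verilog
// Hypothetical example: insert a BS symbol into the outgoing stream.
// On the 7-series GTX TX interface, txdata[7:0] is transmitted first.
always @(posedge txusrclk)
  if (send_bs)
    begin
      txdata    <= { data_symbol, BS }; // BS goes out first
      txcharisk <= 2'b01; // Bit 0 set: txdata[7:0] is a K-character
    end
  else
    begin
      txdata    <= pixel_word; // Plain (scrambled) data symbols
      txcharisk <= 2'b00;
    end
```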
Some experiments with a real monitor
Testing my own logic with a Dell P2415Q monitor, I was curious to know what it really cares about in the main link data stream and what it ignores. So here are some insights, which are relevant to this monitor only.
- When lanes are trained properly, but the monitor is unhappy with the data stream, it enters Power Save mode immediately. The display goes out of this mode when the stream is OK.
- Writing 0x01 to DP AUX address 0x600 (attempting to wake the monitor up) should be done after the training is done and video data is present on the link. Not before the training, as this causes the monitor to remain sleeping.
Things that can be messed with and still get an image:
- Mvid[7:0] on each BS can be set to constant 0. The monitor didn’t care, even though MISC0 was set to asynchronous clocking.
- It had no problem with an arbitrarily large Transfer Unit (including shooting a 1024-pixel wide line in one go)
- Both non-enhanced and enhanced framing modes were accepted in any mode, even at 5.4 Gbps (which requires enhanced framing).
- There was no problem showing an image on the monitor with scrambling turned off (given that the monitor was informed about this by setting DP AUX register 0x102 to 0x20 instead of 0x00).
The monitor will not display an image if no MSA is transmitted at all. Also, sending an MSA but messing up the following fields prevents displaying the image:
- M (three fields of Mvid)
- The total number of pixels per line, and the total number of lines
- The image’s active area (number of active pixels and rows)
Things in the MSA that were fine to mess with (image still shown):
- Setting N=0 in async mode is OK, probably because it has to be 32768 anyhow
- All information about the syncs: Position and length
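To put the field lists above in context, these are the attributes the MSA packet conveys. The names below are mine; the exact packing into the (up to) 39 symbols is given in the standard's tables and not reproduced here:

```verilog
// The contents of the MSA packet, for reference.
wire [23:0] Mvid, Nvid;      // Timestamps for pixel clock regeneration
wire [15:0] Htotal, Vtotal;  // Total pixels per line / lines per frame
wire [15:0] Hstart, Vstart;  // Active area start, relative to sync
wire [15:0] Hwidth, Vheight; // Active area dimensions
wire        HSP, VSP;        // Sync polarities
wire [14:0] HSW, VSW;        // Sync widths
wire [7:0]  MISC0, MISC1;    // Color format, clock mode etc.
```

Per the experiments above, this particular monitor insisted on Mvid, Htotal/Vtotal and Hwidth/Vheight being correct, and ignored the sync-related fields.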
Reader Comments
Hi Eli,
I'm currently working my way through a DisplayPort implementation (for Artix-7).
Thanks for your notes, they are very useful!
Mike
Hi Eli,
Your blog about M & N values really helped me grasp the concept of asynchronous clock mode in DP. I have a doubt about the TU (transfer unit) concept: I'm able to find the number of active symbols per TU, but should it alternate in an n, n+1 fashion as suggested in the spec?
You say that you tried sending a large chunk of data at once and it was fine, but can it create any issues, like tearing of the image or problems in clock recovery?