Case study of synchronous FPGA signaling by adjusting the output timing
This is a case study of synchronous FPGA signaling that adjusts the t_co (clock-to-output) timing. The study uses Xilinx's UltraScale architecture (more precisely, the xcku040-ffva1156-2-i device); however, the methodology is general and can be applied to any FPGA family.
Today's protocols are mostly self-synchronous and do not need globally synchronous behavior. However, in some cases global synchronicity cannot be avoided. This study shows how it can be achieved with FPGAs, even in hard timing cases.
Let's assume that we want to build a DAQ (data acquisition) unit that requires precise trigger timing. All modules need to receive the trigger signal at the same time. (We need to assume that all modules use the same clock, with a given uncertainty.)
This repository contains two Vivado projects (more precisely, project creator Tcl files). The first project is located in the singlecycle directory. It demonstrates three simple attempts to meet the timing, but the requirements are too challenging for them, so all three output timings fail.
The second project is located in the multicycle directory. It demonstrates how to meet the timing using a multicycle path constraint on the output ports. The successful implementations in this project satisfy the timing analyzer's requirements.
To build the projects, open Vivado (a version that supports Kintex UltraScale devices) and enter the singlecycle or multicycle directory. Then source the project creator file:
source create_mc_project.tcl
Then just generate the bitstream.
To see timing details click Open Implemented Design.
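The same steps can also be run from the Vivado Tcl console instead of the GUI. The following is a minimal sketch, assuming the project creator file leaves Vivado's default run name (impl_1) in place; the run name and the job count are assumptions, so adjust them to your setup.

```tcl
# Assumption: the project was created by sourcing the project creator file
# and the implementation run has Vivado's default name, impl_1.
launch_runs impl_1 -to_step write_bitstream -jobs 4
wait_on_run impl_1

# Load the implemented design and list the output-port paths
# (both setup and hold) that the rest of this study discusses.
open_run impl_1
report_timing -to [get_ports o_*] -delay_type min_max -max_paths 20
```

The o_* pattern matches the output port names used throughout this study.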
The following sections walk you through from the very basic (but failing) implementations to three successful solutions.
This section is optional. You can skip to the next section; you only need to accept the minimum (odelay_m = 3.0) and the maximum (odelay_M = 8.0) output delays.
Altera has a quite good cookbook about timing issues, and Xilinx's UltraFast design methodology can also help to calculate the timing. The following picture is from the cookbook (Chip-to-Chip Design with Virtual Clocks as Input/Output Ports):
This study only deals with the B side, where the FPGA is the signal driver.
Here are the output timing constraints, with arbitrary example values for the delays. (*_m denotes the minimum and *_M the maximum values.)
# create a 100MHz clock
create_clock -period 10.000 [get_ports i_clk_p]
# create the associated virtual clock
create_clock -name clkB_virt -period 10
# the output delays below reference this virtual clock
# specify the minimum and maximum external clock delay from the global oscillator towards the FPGA
set CLK_fpga_m 3.5
set CLK_fpga_M 4
# specify the minimum and maximum external clock delay from the global oscillator towards the DAQ module
set CLK_daq_m 5
set CLK_daq_M 6.5
# specify the setup and hold time requirements of the DAQ module
set tSUb 2
set tHb 0.5
#Board delay from FPGA to DAQ module (on trigger)
set BD_trigger_m 6.5
set BD_trigger_M 7.0
# odelay_M = 4 + 2 + 7.0 - 5 = 8.0
# odelay_m = 3.5 - 0.5 + 6.5 - 6.5 = 3.0
set odelay_M [expr {$CLK_fpga_M + $tSUb + $BD_trigger_M - $CLK_daq_m}]
set odelay_m [expr {$CLK_fpga_m - $tHb + $BD_trigger_m - $CLK_daq_M}]
#create the output maximum delay for the data output from the
#FPGA that accounts for all delays specified (odelay_M = 8.0)
set_output_delay -clock clkB_virt -max $odelay_M [get_ports {<out_ports>}]
#create the output minimum delay for the data output from the
#FPGA that accounts for all delays specified (odelay_m = 3.0)
set_output_delay -clock clkB_virt -min $odelay_m [get_ports {<out_ports>}]
So the final numbers for this study are odelay_M = 8.0 and odelay_m = 3.0.
First, let's look at some simple approaches that don't need deep FPGA knowledge. We will see, however, that these implementations cannot fulfill these challenging timing requirements; the next chapter then uses a multicycle constraint.
In this chapter all outputs have the following output delay constraints (see the previous chapter for details):
#create the output maximum delay for the data output from the
#FPGA that accounts for all delays specified (odelay_M = 8.0)
set_output_delay -clock clkB_virt -max [expr $odelay_M] [get_ports {<out_ports>}]
#create the output minimum delay for the data output from the
#FPGA that accounts for all delays specified (odelay_m = 3.0)
set_output_delay -clock clkB_virt -min [expr $odelay_m] [get_ports {<out_ports>}]
The o_native_p (/n) ports of the singlecycle design demonstrate the simplest version. Simple means a native fabric flip-flop whose output is connected to the output buffer.
-- Native
inst_native_obufds : OBUFDS
generic map(
IOSTANDARD => "LVDS"
)
port map(
O => o_native_p,
OB => o_native_n,
I => q_native_d2
);
This implementation fails timing. The timing analyzer reports negative setup slack against the virtual clkB_virt clock:
Port name | setup slack | hold slack |
---|---|---|
o_native_p | -4.421 | 5.777 |
The negative setup-slack means our signal is too slow. Let's try to make it faster!
All FPGAs have a dedicated, fast output flip-flop placed next to the output buffer. The o_iob_p (/n) ports of the singlecycle project demonstrate this solution.
On Xilinx FPGAs, the IOB property tells the tools to place the given flip-flop in the dedicated, fast output register. The property can be set as follows:
set_property IOB TRUE [get_cells <register_name>]
Although this brings the slack a bit closer, it still fails the timing.
Port name | setup slack | hold slack |
---|---|---|
o_iob_p | -3.821 | 5.586 |
Modern FPGAs have another dedicated flip-flop in the IO: the DDR flip-flop. This approach is implemented by the o_ddr_p (/n) output ports. An ODDRE1 device primitive needs to be instantiated in order to drive DDR data:
ODDRE1_inst : ODDRE1
generic map (
IS_C_INVERTED => '0', -- Optional inversion for C
SRVAL => '0' -- Initializes the ODDRE1 Flip-Flops to the specified value ('0', '1')
)
port map (
Q => w_ddr, -- 1-bit output: Data output to IOB
C => w_clk, -- 1-bit input: High-speed clock input
D1 => q_ddr_d2, -- 1-bit input: Parallel data input 1
D2 => q_ddr_d2, -- 1-bit input: Parallel data input 2
SR => '0' -- 1-bit input: Active High Async Reset
);
Note that to get the same timing behavior we need to modify the output delay constraint. The maximum delay should be reduced by half the period of the system clock (i.e. 5 ns):
set_output_delay -clock clkB_virt -max [expr $odelay_M -5] [get_ports {o_ddr*}]
In spite of these efforts the timing fails; what's more, this method has the worst results:
Port name | setup slack | hold slack |
---|---|---|
o_ddr_p | -4.616 | 5.907 |
This FPGA is not fast enough to fulfill these timing requirements. The following tables show all the setup/hold timings:
The setup slacks:
The hold slacks:
To understand the root cause of the failed timings, we should look under the hood at the timing details. By default the timing analyzer expects all data at the next clock edge after the launch edge (single-cycle). The following waveform shows the required data valid window on the FPGA pad. The data must be valid throughout this window. (The signal may become valid earlier and may be held after the window, but within the window the data must be valid.)
(The destination clock uncertainty and any other delays must be added to or subtracted from odelay_M/m to get the accurate valid window, but they are negligible here.)
Let's look at one particular case. (There is no essential difference between the previously demonstrated failing implementations, so let's choose the iob type implementation.)
The default (single-cycle) mode requires faster behavior than this FPGA can deliver. However, the required valid window is shorter than the guaranteed, real valid data window.
The length of the required valid window is req_len = odelay_M - odelay_m = 8 - 3 = 5
The length of the real valid data window is req_len + setup_slack + hold_slack = 5 - 3.8 + 5.6 = 6.8
So if these windows can be shifted, the timing could be closed.
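The two window lengths can be reproduced with a few lines of plain Tcl. This is only a sketch of the arithmetic above, using the rounded iob slacks from the text; the variable names are chosen for illustration.

```tcl
# Values taken from the text above (all in ns); slacks rounded as in the text.
set odelay_M     8.0
set odelay_m     3.0
set setup_slack -3.8
set hold_slack   5.6

# Length of the window required by the constraints.
set req_len  [expr {$odelay_M - $odelay_m}]
# Length of the window the FPGA actually guarantees.
set real_len [expr {$req_len + $setup_slack + $hold_slack}]

puts [format "required: %.1f ns, real: %.1f ns" $req_len $real_len]
# Shifting can only help if the real window is at least as long as the required one.
puts [expr {$real_len >= $req_len ? "shifting can close timing" : "shifting cannot help"}]
```

Because the real window is longer than the required one, moving the required window relative to the real one is enough; the FPGA does not have to become faster.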
In most system-synchronous cases an additional fixed, known delay is acceptable. Let's shift the required data valid window by a whole clock cycle. A path with such a delay of one (or more) clock cycles is called a multicycle path.
In this case the FPGA doesn't need to be as fast as in the single-cycle mode, but it has to be relatively more accurate to cover the whole required valid window. What's more, the harder part is not to violate the hold time requirement, in other words, to hold the data until the end of the required data valid window. So we can say that the FPGA has to be "as slow as possible".
To set the multicycle path, only the following constraint is needed:
# Set multicycle path for all outputs
set_multicycle_path -to [get_ports o_*] 2
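To see what this constraint changes, the edges of the required data valid window can be computed relative to the launch clock edge. The sketch below is plain Tcl, not a Vivado command; the helper name required_window is hypothetical, and clock uncertainty and clock-path delays are ignored, as in the discussion above.

```tcl
# Required data valid window at the FPGA pad, relative to the launch clock edge.
# n is the multicycle value (1 = default single-cycle); all times are in ns.
# Simplification: clock uncertainty and clock-path delays are ignored.
proc required_window {n period odelay_M odelay_m} {
    set capture [expr {$n * $period}]
    set start   [expr {$capture - $odelay_M}]  ;# latest time the data may become valid
    set end     [expr {$capture - $odelay_m}]  ;# earliest time the data may change again
    return [list $start $end]
}

foreach n {1 2} {
    lassign [required_window $n 10.0 8.0 3.0] start end
    puts [format "multicycle %d: valid from %.1f ns to %.1f ns" $n $start $end]
}
```

With the numbers of this study the window keeps its 5 ns length and simply moves one clock period later when the multicycle value is 2.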
The following chapters show different implementations that can solve this issue. To see more details, open the project from the multicycle directory.
We have seen that the compiler cannot route as fast as required, but maybe it can solve this multicycle path problem. So let's just implement a simple register and connect it to the output port with the multicycle constraint. This idea is implemented by the o_native_mc_p (/n) ports.
After a longer compilation, the timing fails in this case too.
Port name | setup slack | hold slack |
---|---|---|
o_native_mc_p | -3.555 | 0.579 |
What happened? The compiler tried to use general routing resources to add delay to match the required data valid window. A huge routing delay can be seen in the FPGA device view. Turn on the Routing resources option and see the routing snake:
The detailed timing report of this failing path is also strange. Here is the setup report, with a routing delay of more than 9 ns!
But the same route in the hold report (which uses the fast model of the FPGA) takes less than 5 ns:
So the problem is that the FPGA's routing resources have greater uncertainty than the constraints allow. Note that with simpler timing requirements you can stop here, because the router will add a proper delay. But now we have to investigate further. Let's try to use dedicated delay elements, called ODELAY.
Let's try to replace the routing delays with dedicated output delays. This approach is implemented by the o_odelay_p (/n) ports of the multicycle project. We need to replace the routing delay of the previous (failed) solution. This was 9.4ns, with -2.4 setup slack, so we need to delay by ~7ns.
UltraScale's ODELAYE3 primitive can delay up to 1.25ns in fixed mode, so a cascaded delay structure is needed to delay ~7ns. Note also that cascading adds extra routing delay, so let's try three cascaded ODELAYE3 primitives. The cascade instantiation is described in UltraScale's SelectIO user guide.
Wow! This is a working solution. The timing meets the requirements:
Port name | setup slack | hold slack |
---|---|---|
o_odelay_p | 0.064 | 0.173 |
However, both the setup and hold slacks are tiny. What happened to our great valid window? Let's look at the detailed timing reports again (the data path delays only).
Slow model (for setup calculations):
Fast model (for hold calculations):
The same effect can be read from these numbers as from the first multicycle implementation: the FPGA's uncertainty narrows the real valid window. There is a big difference between the data delays of the slow (11.9) and fast (7.198) models. This time the unwanted effect isn't strong enough to break timing, so the timing could be closed, unlike in the native implementation.
There are two disadvantages of this technique:
- The cascaded delays have relatively great uncertainty, so they cannot fulfill more challenging constraints.
- The technique needs a large number of delay elements. The FPGA has a limited number of them, so an arbitrary number of outputs cannot be delayed this way.
The next two chapters show more sophisticated solutions.
The o_iob_shifted_clk_p (/n) ports of the multicycle project meet the timing by adjusting the clock of the last flip-flop.
This technique effectively adds extra delay to the clock path towards the FPGA (CLK_fpga_m (/M) in the constraint file). If the value of the clock_shift above equals the previously approximated ~7ns, the tco will be a simple output delay. The ~7ns clock_shift has to be converted to a phase for Xilinx's clocking wizard: 7ns/10ns*360deg = 252deg. The multicycle project uses 240deg (~6.7ns) as the phase, which gives better results.
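The conversion between a clock shift in nanoseconds and a phase in degrees is simple enough to script when several phase values have to be tried. A minimal sketch in plain Tcl; the helper names shift_to_phase and phase_to_shift are hypothetical and not part of the project.

```tcl
# Convert a clock shift in ns to a phase in degrees, and back.
proc shift_to_phase {shift_ns period_ns} {
    return [expr {double($shift_ns) / $period_ns * 360.0}]
}
proc phase_to_shift {phase_deg period_ns} {
    return [expr {double($phase_deg) / 360.0 * $period_ns}]
}

puts [format "%.1f deg" [shift_to_phase 7.0 10.0]]  ;# the ~7 ns estimate
puts [format "%.2f ns"  [phase_to_shift 240 10.0]]  ;# the phase used by the project
```

The period argument is the 10 ns system clock of this study; for another clock only that value changes.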
The timing constraints are met again, with better results than the odelay one:
Port name | setup slack | hold slack |
---|---|---|
o_iob_shifted_clk_p | 0.850 | 1.290 |
What great slacks! Both the setup and hold slacks are above half a nanosecond.
Two notes for this technique:
- The data has to be transferred from the system_clk to this new shifted_clock, which requires one flip-flop (or more, to help internal timing). The timing requirements of this internal path (from system_clk to shifted_clock) are generated automatically, because a clock generator is used.
- A couple of recompilations with adjusted phase values may be needed to get better output timings. At first we might think that if the setup slack is greater than the hold slack, more phase shift is needed, and vice versa. But this can be misleading, because the router can add extra internal delay (as in the native implementation), which can lead us the wrong way.
Although this technique can achieve the best timing results, the FPGA will run out of clocking resources if a great number of outputs have to be adjusted with different requirements.
The last presented method mixes the previous two techniques. For the implementation see the o_odelay_nclk_p (/n) ports of the multicycle project. Here both a clock phase shift and a delay element are used. The phase shift is special: the output flip-flop is driven by the inverted system clock. The clock inversion means a 50% (180deg) phase shift, which is 5ns in our case. As we have seen, ~7ns total delay is needed in the multicycle implementation (in this particular case). The clock inversion provides 5ns, so ~2ns of additional delay is needed, which is added using the ODELAYE3 device primitive. This technique also meets the timing requirements:
Port name | setup slack | hold slack |
---|---|---|
o_odelay_nclk_p | 0.687 | 1.212 |
In general, a huge number of synchronous output signals can be handled with one shifted clock and one delay element primitive per port. The clock should be shifted according to the port with the fastest requirements (the greatest odelay_M), while the fixed value of the delays can be adjusted port by port.
The clock inversion has another advantage: it does not require a PLL/MMCM module. The clock buffer itself can invert the clock.
We have seen three successful implementations for these challenging output requirements.
Port name | setup slack | hold slack |
---|---|---|
o_native_mc_p (fail) | -3.555 | 0.579 |
o_odelay_p | 0.064 | 0.173 |
o_iob_shifted_clk_p | 0.850 | 1.290 |
o_odelay_nclk_p | 0.687 | 1.212 |
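The slack pairs in this table can be read the same way as in the single-cycle analysis: the required window length plus both slacks gives the real valid window of each implementation. The plain Tcl sketch below applies that formula to the four rows; the short labels are chosen for illustration, and the first entry is the failing native multicycle row.

```tcl
# Slack pairs from the table above: {setup hold}, in ns.
set slacks {
    native_mc       {-3.555 0.579}
    odelay          { 0.064 0.173}
    iob_shifted_clk { 0.850 1.290}
    odelay_nclk     { 0.687 1.212}
}
set req_len 5.0  ;# odelay_M - odelay_m

dict for {name pair} $slacks {
    lassign $pair setup hold
    # Real window = required window + setup slack + hold slack.
    set real_len [expr {$req_len + $setup + $hold}]
    puts [format "%-16s real window %.3f ns, margin %+.3f ns" \
        $name $real_len [expr {$real_len - $req_len}]]
}
```

A negative margin means the real window is shorter than the required one, so no fixed shift could close timing for that implementation; the larger the positive margin, the more room there is to split between setup and hold.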
I hope that you won't encounter such challenging timings, but now you can see that there is life after death...
Clone this repository, set your target device, modify the constraint files according to your requirements, and try to close the timings.