As described in the first part of this series, Aurora is a good solution for high data rate transfers that require low latency. Transfers of this kind are commonly found in systems with a large number of data acquisition and data generation channels. Such systems are often used in fields of research like massive multiple-input/multiple-output (MIMO), medical imaging, and particle accelerators.

When the global system is complex and requires a low reaction time, the communication path latency between the various acquisition, processing and generation boards must be as small as possible. In this blog post, the latency of an Aurora transfer is benchmarked in order to see if its performance can fit your system’s needs. The tests are performed between two Perseus 601X boards connected together with a Pico backplane.

Test setup

To test the Aurora core latency, I used a PicoSDR 4×4. In this version, two single-ended GPIOs in each direction link the Perseus 601X boards together over an advanced mezzanine card (AMC) backplane. One output GPIO from the first Perseus is used as a trigger to tell the second Perseus that data transmission to the Aurora core has started.

The input trigger on the second Perseus will start a ChipScope acquisition based on the trigger’s rising edge event. By analyzing the acquired ChipScope waveform, it will be possible to know the delay, in number of design clock cycles, between the trigger event and the first data sample available on the Aurora user port of the second Perseus.
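The cycle counting described above can be sketched in a few lines of Python, assuming the captured waveform has been exported as two per-clock-cycle sample lists. The signal name `data_valid`, the function name, and the sample values are hypothetical illustrations, not data taken from the actual ChipScope capture.

```python
def latency_in_cycles(trigger, data_valid):
    """Count clock cycles from the trigger's first rising edge to the
    first sample where data_valid is asserted (at or after that edge)."""
    # Locate the first 0 -> 1 transition of the trigger signal.
    edge = next(
        (i for i in range(1, len(trigger)) if trigger[i - 1] == 0 and trigger[i] == 1),
        None,
    )
    if edge is None:
        raise ValueError("no rising edge found on trigger")

    # Locate the first cycle at or after the edge where data is valid.
    first_data = next(
        (i for i in range(edge, len(data_valid)) if data_valid[i] == 1),
        None,
    )
    if first_data is None:
        raise ValueError("no valid data after trigger")

    return first_data - edge


# Synthetic 8-sample capture (illustrative values only).
trigger_samples = [0, 0, 1, 1, 1, 1, 1, 1]
valid_samples = [0, 0, 0, 0, 0, 1, 1, 1]
cycles = latency_in_cycles(trigger_samples, valid_samples)
print(cycles)
```

With a real capture, the returned count would be the number of design clock cycles that is converted to a latency in the results below.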

For this test, the design uses a 100 MHz (10 ns period) clock derived from the onboard 200 MHz oscillator. Each Perseus has its own distinct clock. The phase difference between these clocks and the propagation delay of the trigger signal from one FPGA to the other should not change the measured delay by more than 2 clock cycles (±20 ns).

Results

Three tests were performed. The first one operated the GTX at a data rate of 3.125 Gbps and the Nutaq Aurora core was configured with a data width of 32 bits. Figure 1 shows the ChipScope waveform acquired at the trigger event.

Figure 1: Aurora latency, 3.125 Gbps, 32-bit

The delay between the rising edge of the trigger and the first data sample available to the user logic is 49 clock cycles. Since the ChipScope clock runs at 100 MHz, data from the user logic of the first Perseus reaches the user logic of the second Perseus 0.49 µs later.
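The conversion from clock cycles to time is simple arithmetic; a minimal sketch follows, assuming the 100 MHz design clock stated in the test setup. The function and constant names are illustrative only.

```python
DESIGN_CLOCK_HZ = 100_000_000  # 100 MHz design clock (10 ns period)


def cycles_to_ns(cycles, clock_hz=DESIGN_CLOCK_HZ):
    """Convert a number of clock cycles to a duration in nanoseconds."""
    return cycles * 1_000_000_000 / clock_hz


latency_ns = cycles_to_ns(49)
print(f"{latency_ns:.0f} ns = {latency_ns / 1000:.2f} µs")
```

The same conversion applies to the cycle counts read from the other waveforms in this post.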

By increasing the GTX data rate to 5 Gbps, it is possible to lower the latency since all the logic operates at a higher frequency.

Figure 2: Aurora latency, 5 Gbps, 32-bit

At an operating rate of 5 Gbps, the latency has decreased to 34 clock cycles. This corresponds to a 0.34 µs latency.

Since Nutaq’s implementation of the Aurora core uses a group of 4 GTX to create 128-bit wide transactions, when using a 32-bit wide user data port, 4 clock cycles are used to create the 128 bits required for the first transaction. Therefore, setting the user port width to 128 bits should reduce the overall latency.

Figure 3: Aurora latency, 5 Gbps, 128-bit

The measured latency is now 0.32 µs. Since 3 clock cycles should have been saved relative to the 0.34 µs measured with the 32-bit port, the expected latency is 0.31 µs. The 10 ns difference is within the ±20 ns uncertainty of the benchmark.
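The reasoning about the port width can be checked with a short calculation. This is only a sketch: it assumes one clock cycle per user-port word and uses the figures quoted in this post (128-bit transactions, 10 ns clock period, ±20 ns uncertainty). The helper name is hypothetical.

```python
TRANSACTION_BITS = 128   # width of one Aurora transaction (4 GTX lanes)
CLOCK_PERIOD_NS = 10     # 100 MHz design clock
UNCERTAINTY_NS = 20      # benchmark uncertainty (2 clock cycles)


def packing_cycles(user_port_bits, transaction_bits=TRANSACTION_BITS):
    """Clock cycles needed to gather one full transaction from the user port."""
    if transaction_bits % user_port_bits != 0:
        raise ValueError("port width must divide the transaction width")
    return transaction_bits // user_port_bits


# Cycles saved by moving from a 32-bit to a 128-bit user port.
saved_cycles = packing_cycles(32) - packing_cycles(128)

measured_32bit_ns = 340   # 5 Gbps, 32-bit port
measured_128bit_ns = 320  # 5 Gbps, 128-bit port

expected_128bit_ns = measured_32bit_ns - saved_cycles * CLOCK_PERIOD_NS
within_uncertainty = abs(measured_128bit_ns - expected_128bit_ns) <= UNCERTAINTY_NS

print(saved_cycles, expected_128bit_ns, within_uncertainty)
```

Working in integer nanoseconds keeps the comparison against the ±20 ns bound free of floating-point rounding.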

Conclusion

Using this simple benchmark setup for the Nutaq Aurora core, the latency between two Perseus 601X boards connected together with a Pico backplane has been identified in three different configurations:

Aurora parameters                          Latency (±20 ns)
3.125 Gbps data rate, 32-bit data width    0.49 µs
5 Gbps data rate, 32-bit data width        0.34 µs
5 Gbps data rate, 128-bit data width       0.32 µs

If the same test were executed on a system with a different configuration (two Perseus 601X boards connected together with miniSAS cables, for example), the same latency should be expected, plus the propagation delay of the signal over the physical link. This could add a few nanoseconds to the total latency.

I hope these Aurora core latency benchmark results are useful for your project and help you judge whether the core fits your system's needs.