

#### Uniboard Correlator Overview

- Data arrives in subbands
- Transpose: full L-R mesh
- Route per-subband to correct input chip
- Input chip: buffer, delay tracking, PFB, fringe rotation
  - 2x DDR3 (2x 52.6Gb/s)
  - 2x 8GB -> 3.2 s
- Correlator chip: cross- and auto-correlations, integration, all baselines per chip
  - QDR2+ memory: 64.8 Gb/s, 4MB
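The 3.2 s buffer depth can be reproduced with a quick calculation. This is only a sketch: it assumes decimal gigabytes and a 40 Gb/s fill rate, which is the chip input bandwidth quoted elsewhere in the talk. The slide does not say which rate was used.

```python
# Buffer depth of the input-chip DDR3 memory (sketch; the fill rate is an assumption).
MEMORY_BYTES = 2 * 8e9    # 2x 8 GB, decimal gigabytes assumed
FILL_RATE_BPS = 40e9      # assumed: 40 Gb/s stream into the buffer

buffer_seconds = MEMORY_BYTES * 8 / FILL_RATE_BPS
print(f"buffer depth: {buffer_seconds:.1f} s")
```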



#### Bandwidths and bottlenecks

- Data arrives in subbands
- Input chip forms frequency channels (2 kHz to 1 MHz)
- Frequency channels can be sent to 4 cor. chips
- $BW_{SUB} \leq 4 \times BW_{COR}$  (without backplane)
- Correlator chip has  $\leq$  40 Gb/s input BW
- This contains 32 stations, 2 polarisations: 64 streams
  - 625Mb/s per stream, complex data
- For 64MHz per correlator chip: max. 4 bits
- For 32MHz per correlator chip: max. 9 bits
- Multipliers are 18x18 (or 18x25), only useful >9 bits!
- Run the on-board interconnect at double rate?
- Goal is to get most GMACs out, not be IO limited
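The bit limits above follow from dividing the chip input bandwidth over the streams. A minimal sketch, assuming each stream carries complex samples at a rate equal to the bandwidth, with two components (real and imaginary) per sample. The function name `max_bits` is illustrative only.

```python
# Largest sample width that fits through the correlator chip input (sketch).
INPUT_BW_BPS = 40e9     # correlator chip input bandwidth
N_STREAMS = 32 * 2      # 32 stations x 2 polarisations

def max_bits(bw_cor_hz):
    """Whole bits per real/imaginary component at a given bandwidth per chip."""
    per_stream_bps = INPUT_BW_BPS / N_STREAMS   # 625 Mb/s per stream
    # Assumed: complex samples at a rate equal to the bandwidth, 2 components each.
    return int(per_stream_bps // (2 * bw_cor_hz))

for bw_hz in (64e6, 32e6):
    print(f"{bw_hz / 1e6:.0f} MHz per chip -> max. {max_bits(bw_hz)} bits")
```

With 18x18 multipliers only paying off above 9 bits, this is why the 40 Gb/s input is the limiting factor at 64MHz per chip.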

#### Bitgrowth in FFTs

- FFT wordsize grows by $N_{Stages} + 1$ bits
  - Noise-power diminishes with smaller bandwidth
  - But CW carriers (RFI) will keep the same amplitude
- RFI is local to telescope (we hope)
  - Use bitgrowth at top for RFI immunity
  - Use bitgrowth at bottom for increased precision
  - Ignore bitgrowth for more BW, sensitivity
- Example: 3 bit subbands of 32MHz, 4kHz bins:
  - bitgrowth is 15 bits, FFT output is 18 bits
- At lower frequencies (e.g. L-band) there is more RFI, but less spectrum to sample
- At higher frequencies, go for highest bandwidth
- Bitgrowth means TP growth: TP<sub>cor</sub> > TP<sub>inp</sub>
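The 32MHz example can be checked numerically. This sketch assumes a real-input FFT of length 2 x the number of bins, rounded up to a power of two. That assumption is what reproduces the 15-bit figure; the slides do not state the FFT length. The function name is hypothetical.

```python
import math

def fft_output_bits(input_bits, bw_hz, bin_hz):
    """Return (bitgrowth, output word size) for growth = N_stages + 1."""
    n_bins = 2 ** math.ceil(math.log2(bw_hz / bin_hz))   # round up to a power of two
    n_stages = int(math.log2(2 * n_bins))                # assumed: real FFT of length 2 * n_bins
    growth = n_stages + 1
    return growth, input_bits + growth

growth, out_bits = fft_output_bits(3, 32e6, 4e3)
print(f"bitgrowth {growth} bits, FFT output {out_bits} bits")
```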

#### Input chip limitations

- Must store data in memory
- Network jitter compensation
- Delay tracking
- Best bandwidth so far: Virtex-5 DDR3 400MHz
  - 64-bit DIMMs: 50Gb/s
- Need at least twice that, so 2 memory interfaces
- Still only enough bandwidth to go in & out once
- Combine jitter buffer and delay
  - Data hits memory bottleneck only once
- 1x 32-bit DDR3 interface: 1760 slices on V5
  - 5% of a V5 SX240T
- Reduce input data rate? (also allows for bitgrowth)
- Investigate performance on actual usage pattern!
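The memory budget above can be sketched as follows. It assumes peak (not sustained) DDR3 throughput and that every sample is written once and read once. Real access patterns will do worse, hence the last bullet.

```python
# Peak DDR3 bandwidth and the stream rate it can sustain (sketch, peak figures only).
def ddr_peak_bps(clock_hz, bus_bits):
    """DDR transfers data on both clock edges: 2 transfers per clock."""
    return clock_hz * 2 * bus_bits

per_interface_bps = ddr_peak_bps(400e6, 64)    # one 64-bit DIMM at 400 MHz
total_bps = 2 * per_interface_bps              # two memory interfaces

# Every sample goes in once and out once, so the stream rate is half the total.
max_stream_bps = total_bps / 2

print(f"per interface: {per_interface_bps / 1e9:.1f} Gb/s")
print(f"max stream rate with 2 interfaces: {max_stream_bps / 1e9:.1f} Gb/s")
```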

#### Care and feeding of multipliers

- Use built-in DSP resources (Multiplier/accumulator)
- Make sure these are always busy, always have data
- Use built-in accumulator for integration



- Input chip PFB creates time-ordered spectra
- Re-order to per-subband
- Second transpose needed
- QDR lacks bandwidth
  - Two QDR interfaces?
- Number of integrations depends on size of 2nd transpose
- 64 streams, 8 bit (complex) in 4MB: 65k datapoints
  - 32MHz, 64 bins (500kHz): 1024 integrations
  - 8192 bins (4kHz): 8 integrations
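The integration counts follow directly from the size of the transpose memory. A minimal sketch, assuming binary megabytes and one byte per 8-bit complex sample:

```python
# How many spectra fit in the second-transpose memory (sketch).
QDR_BYTES = 4 * 2**20      # 4 MB QDR memory, binary megabytes assumed
N_STREAMS = 64             # 32 stations x 2 polarisations
BYTES_PER_SAMPLE = 1       # 8-bit complex sample

datapoints = QDR_BYTES // (N_STREAMS * BYTES_PER_SAMPLE)   # time/frequency points per stream

def integrations(n_bins):
    """Number of spectra of n_bins channels that can be held per stream."""
    return datapoints // n_bins

for n_bins in (64, 8192):
    print(f"{n_bins} bins: {integrations(n_bins)} integrations")
```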

#### Correlator chip performance

- Process a single band, all stations per correlator unit
- 32 stations, all stokes => 2112 products
- Complex multiplication, 8192 MACs needed
  - Virtex 5 SX240T: 1056 DSP48 (R<sub>M</sub> = 7.7x)
  - Stratix 4 SGX230: 1288 18x18 (R<sub>M</sub> = 6.4x)
  - Virtex 6 SX475T: 2016 DSP48E (R<sub>M</sub> = 4.1x)
- Re-using multipliers means re-issuing data



- $BW_{cor} \times R_M = F_M$
- 400MHz design goal?
  - V5:  $BW_{cor} < 50MHz$
  - S4:  $BW_{cor} < 60MHz$
  - V6:  $BW_{cor} < 95MHz$
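The re-use factors and bandwidth limits can be tabulated with a few lines. This sketch reads $F_M$ as the multiplier clock (the 400MHz design goal) and takes the 8192 MACs from the slide as given. The slide values look rounded down from these results.

```python
# Multiplier re-use factor R_M and resulting bandwidth per correlator chip (sketch).
MACS_NEEDED = 8192         # from the slide
CLOCK_MHZ = 400            # F_M, the design goal

MULTIPLIERS = {
    "Virtex 5 SX240T": 1056,
    "Stratix 4 SGX230": 1288,
    "Virtex 6 SX475T": 2016,
}

def reuse_factor(n_mult):
    """R_M: how many products each physical multiplier must compute."""
    return MACS_NEEDED / n_mult

def max_bandwidth_mhz(n_mult):
    """BW_cor = F_M / R_M."""
    return CLOCK_MHZ / reuse_factor(n_mult)

for name, n_mult in MULTIPLIERS.items():
    print(f"{name}: R_M = {reuse_factor(n_mult):.2f}, "
          f"BW_cor < {max_bandwidth_mhz(n_mult):.1f} MHz")
```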

#### A 31 station correlator?

- Virtex 6 SX475T: 2016 DSP48E, R<sub>M31</sub>=3.8
- 1064x dual-ported BRAM, 36kbit
  - 18 bits complex data: each BRAM is 1024 points
- 32 telescopes x 2 pol = 64 datastreams
  - 16 x 1024 = 16k points probably too optimistic
- 2nd acc. on-chip (in LUTs) to limit output BW



#### A 31 station correlator?

- 31 stations would need:
  - 1922 DSP48 (out of 2016)
- Store data close to multipliers (after duplication):
  - Single BRAM must store both L and R
  - 961 BRAM (out of 1064)
- Store data only once:
  - 64 BRAM (or multiple), much more routing
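The resource counts above are consistent with 2 x N^2 DSP48 blocks and N^2 BRAMs for N stations. That scaling is an inference from the numbers, not something the slides state. Under that assumption, a sketch of why 31 stations fit and 32 do not:

```python
# Resource fit on a Virtex 6 SX475T (sketch; the 2*N^2 and N^2 scaling is inferred).
DSP_AVAILABLE = 2016
BRAM_AVAILABLE = 1064

def resources(n_stations):
    """Return (DSP48 blocks, BRAMs) under the assumed scaling."""
    return 2 * n_stations**2, n_stations**2

def fits(n_stations):
    dsp, bram = resources(n_stations)
    return dsp <= DSP_AVAILABLE and bram <= BRAM_AVAILABLE

for n in (31, 32):
    dsp, bram = resources(n)
    print(f"{n} stations: {dsp} DSP48, {bram} BRAM, fits: {fits(n)}")
```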



#### A closer look at DSP48E





Pros:

- No 2nd transpose
- 36 bit accumulators
- No CLB math
- Matches resources

Cons:

- 1024 accumulators
- 256 bins, 4 stokes
- Uses almost all BRAMs
- Pipeline for speed

#### Summary (so far)

- Only Virtex 6 can do 64MHz per chip
- But does not exist yet
- I have no Xilinx (or Altera) shares
- Either 2 or 0 QDR chips for correlator
- Correlator chip is bandwidth limited: 80 Gb/s please?
- On Virtex 6, 31 stations is much easier than 32
- 2nd transpose or multiple accumulators
- On-chip routing resources might be final bottleneck
- 32 (31) stations, 4 Stokes is worst (best) case
- These are all just upper limits
- A lot of design parameters to review
- Hence: The Correlator Construction Kit



#### The numbers game (II)



- TP<sub>INP</sub> (Gb/s)
- TP<sub>COR</sub> (Gb/s)
- BW<sub>COR</sub> (MHz)
- N<sub>bin</sub>
- BW<sub>bin</sub> (MHz)
- R<sub>bin</sub> (bits)

#### Some use cases

- Current EVN correlator
  - N = 16, BW = 128MHz,  $T_{int} = 0.25s$
- Next-gen correlator
  - N = 32, BW = 512MHz,  $T_{int} = 0.1s$
- EVN2015
  - N = 32, BW = 8GHz to 16GHz, $T_{int} \ll 0.01s$
- Spectral line research
  - N = 32, BW = 128MHz (1000m/s), BW<sub>bin</sub> < 10m/s
  - Full BW at lower spectral resolution
- Space
  - $BW_{sub} = 128MHz$  (VSOP2),  $N_{bin} = 10^6$, delay track

| Section | Parameter | Value | Unit | Description |
|---------|-----------|-------|------|-------------|
| Input side | N | 16 | | Number of stations |
| Input side | Nb | 120 | | N * (N - 1) / 2 |
| Input side | BW | 128 | MHz | Input bandwidth |
| Input side | Npol | 2 | | Number of polarizations |
| Input side | Radc | 2 | bits | Resolution of ADC |
| Input side | TP | 1.024 | Gb/s | Data rate per station, BW * 2 * Npol * Radc |
| Input side | ATP | 16.384 | Gb/s | All stations, N * TP. This would be the network TP without on-site PFB |
| Subbands | Nsub | 8 | | For one polarization |
| Subbands | BWsub | 16 | MHz | BW / Nsub |
| Subbands | Rsub | 2 | bits | Quantization after PFB |
| Subbands | TPsub | 2048 | Mb/s | Subband data rate, N * 2 * BWsub * Rsub * Npol. TP is the input rate for 1 correlator FPGA, < 40 Gb/s |
| Frequency resolution | Npoints | | | Frequency points per subband |
| Frequency resolution | dF | 1000 | kHz | BWsub / Npoints |
| Frequency resolution | Fobs | C band (4… | | Observing frequency/band |
| Frequency resolution | ΔV | …0.0 | km/s | Doppler resolution |
| Frequency resolution | Vmax | 1691 | km/s | Doppler range, Fobs - BW/2 to Fobs + BW/2 |
| Output | Cpol | X pol, cross + auto | | |
| Output | Nc | 544 | | Number of correlation products, Cpol * (Nb + N) |
| Output | Nbin | 139264 | | Number of bins, Nc * Npoints * Nsub |
| Output | Ro | 24 | bits | Output resolution |
| Output | TPo | 6.7 | Mb/s | Nbin * Ro * 2 / Tint (times 2 because complex) |
| Output | TPo | 3 | GB/hr | |
| FPGA resources | Fg | | | Frequency goal |
| FPGA resources | Nm | | | Number of multipliers or MACs |
| FPGA resources | G | 270 | GMAC | Estimated single FPGA performance |
| DBB (one at each telescope) | TPpfb | 512 | Mb/s | PFB input (per polarization), 2 * BW * Radc |
| DBB (one at each telescope) | Fadc | 256 | MHz | ADC clock rate, 2 * BW |
| DBB (one at each telescope) | Fadc/Fg | 1 | | ADC samples per FPGA clock (>1 requires DDR/striping) |
| DBB (one at each telescope) | | | | BW * 2 * log2(Nsub) * Npol / 1000 |
| DBB (one at each telescope) | TPtel | 1024 | Mb/s | Network throughput per telescope, 2 * BW * Rsub * Npol |
| Input chip (PFB), one for each subband | TPfft | 2048 | Mb/s | Input rate for FFT, 2 * BW / Nsub * Rsub * Npol * N |
| Input chip (PFB), one for each subband | Gfft | 4 | GMAC | BW / Nsub * 2 * log2(Npoints) * Npol * N / 1000 |
| Output chip (correlator), one for each subband | Gcor | 35 | GMAC | 4 * BW / Nsub * Nc / 1000 |
| Output chip (correlator), one for each subband | Mcor | 0 | MB | Storage for integration, Nc * Ro * Npoints * 2 |
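Several of the spreadsheet formulas can be checked against each other in a few lines. This is a sketch, not the Construction Kit itself: the function name and dictionary layout are hypothetical, and `Radc = 2` and `Cpol = 4` (the polarisation-product count) are inferred from the TP and Nc values rather than read from the sheet.

```python
# Re-computing a few Correlator Construction Kit values from their formulas (sketch).
def correlator_kit(N=16, BW=128, Npol=2, Radc=2, Nsub=8, Rsub=2, Cpol=4):
    """BW in MHz; Radc and Cpol defaults are inferred, not taken from the sheet."""
    Nb = N * (N - 1) // 2                      # N * (N - 1) / 2
    TP = BW * 2 * Npol * Radc / 1000           # Gb/s per station
    BWsub = BW / Nsub                          # MHz
    Nc = Cpol * (Nb + N)                       # correlation products
    return {
        "Nb": Nb,
        "TP": TP,
        "ATP": N * TP,                         # Gb/s, all stations
        "BWsub": BWsub,
        "TPsub": N * 2 * BWsub * Rsub * Npol,  # Mb/s into one correlator FPGA
        "TPtel": 2 * BW * Rsub * Npol,         # Mb/s per telescope
        "Nc": Nc,
        "Gcor": 4 * BWsub * Nc / 1000,         # GMAC per correlator chip
    }

kit = correlator_kit()
for name, value in kit.items():
    print(f"{name:6s} {value}")
```

Changing `N`, `BW` or `Nsub` in the call gives a quick feel for the use cases listed earlier, for example `correlator_kit(N=32, BW=512)`.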