238 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 2, JUNE 1998
Signal Processing at 250 MHz Using
High-Performance FPGA’s
Brian Von Herzen
Abstract—This paper describes an application in high-perform-
ance signal processing using reconfigurable computing engines:
a 250-MHz cross correlator for radio astronomy. Experimental
results indicate that complementary metal–oxide–semiconductor
(CMOS) field programmable gate arrays (FPGA’s) can per-
form useful computation at 250 MHz. The notion of an “event
horizon” for FPGA’s leads to clear design constraints for high-
speed application developers, and can be applied to a variety of
real-time signal processing algorithms. Recent estimates indicate
that higher performance FPGA’s available early in 1998 can
attain speeds of over 300 MHz using 20% fewer logic elements
than current designs. The results of this design work provide
important clues on how to improve FPGA architectures for signal
processing at hundreds of MHz. Direct routing channels between
logic elements can significantly increase performance. Routing
architectures with four-way symmetry allow for rotations and
reflections of subcircuits needed for optimal packing density.
Experimental results indicate that clock buffering often limits
the top speed of the FPGA. Wave pipelining of the clock distribution
network may improve FPGA performance.
Index Terms—Correlators, event horizon, field programmable
gate array, manual partitioning and placement, programmable
logic, real-time signal processing.
I. RECONFIGURABLE COMPUTERS
FIELD-PROGRAMMABLE gate arrays (FPGA’s) can provide a useful platform for high-performance computing
and real-time interactive signal processing [1]. These arrays
perform well on real-time computations because they pro-
vide the speed of dedicated circuitry while retaining the
flexibility of a programmable system. Reconfiguring allows
for incremental optimization and improvement of hardware
much like software development techniques. Because each
element in an FPGA performs a dedicated task, application
developers can readily design each circuit in the system for
a specified performance, ideal for real-time signal processing,
which requires that data flow through the system at a specified
rate.
This paper will describe a real-time application in radio
astronomy that makes use of the fastest FPGA’s available, and
performs a computation at speeds heretofore only achievable
in custom silicon [2]. The reconfigurable computer retains the
flexibility to implement different performance tradeoffs, allowing
for large arrays of simple computations or short arrays
of complex computations. Reconfiguring the computer on demand
provides a unique flexibility that has not existed before.
Manuscript received March 25, 1997; revised November 1, 1997. This
work was supported by The Caltech Submillimeter Observatory, and the Joint
Astronomy Center and Xilinx, Inc.
The author is with Rapid Prototypes, Inc., Carson City, NV 89701 USA.
Publisher Item Identifier S 1063-8210(98)02952-7.
II. CORRELATION SPECTROMETERS
Radio astronomers analyze spectra using high-performance
real-time computers [3]. High-frequency radio astronomy,
particularly millimeter-wave and submillimeter astronomy,
utilizes wide-band spectrometers with at least 1–2 GHz of
spectrometer bandwidth. Weak astronomical signals require
full-parallel integration on all of the channels to detect measur-
able signals. Therefore, scanning spectrometers do not work
for this application. Acoustooptic spectrometers can achieve
the broad bandwidth, but become problematic in large arrays
and for space-borne applications where adjustments of the
analog elements become very difficult.
Another alternative uses a parallel digital correlator that
computes the autocorrelation function of the incoming base-
band signal [2]. A digitizer quantizes the signal at a resolution
of one, two, or three bits [4]. A 1:16 time-division demulti-
plexer reduces the data rate from a Nyquist sampling rate of 4
Gs/s down to 250 Ms/s, producing 16 parallel data streams of
250 Ms/s. The streams pass into an array of cross correlators
that correlate every stream with every other 250 MHz stream.
Digital integrators accumulate the cross-correlation results for
periods from 10^6 to 10^12 samples, and the integration results
pass to a microprocessor for further integration. The processor
reassembles the autocorrelation of the 2 GHz input signal from
the array of cross-correlation results and computes the Fourier
transform of the correlation, producing the spectrum of the 2
GHz signal.
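The reassembly step can be illustrated with a minimal sketch, assuming an ideal demultiplexer and exact integer arithmetic. The function names are hypothetical and the test signal is random four-level data rather than astronomical data. The sketch shows that summing the right cross-correlations of the slow streams reproduces the autocorrelation of the fast signal; the Fourier transform step is omitted.

```python
import random


def autocorr_direct(x, max_lag):
    """Reference: R[k] = sum over n of x[n] * x[n + k], for all valid n."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def demultiplex(x, ways):
    """Split x into interleaved streams: stream i holds x[i], x[i + ways], ..."""
    return [x[i::ways] for i in range(ways)]


def cross_corr(a, b, lag):
    """Cross-correlation of two slow streams at a non-negative lag."""
    return sum(a[m] * b[m + lag] for m in range(len(b) - lag))


def autocorr_from_streams(streams, max_lag):
    """Reassemble the fast autocorrelation from slow cross-correlations."""
    ways = len(streams)
    result = []
    for k in range(max_lag + 1):
        total = 0
        for i in range(ways):
            # Fast sample i + k sits in stream j, q slow clocks later.
            q, j = divmod(i + k, ways)
            total += cross_corr(streams[i], streams[j], q)
        result.append(total)
    return result


# Illustrative input only: random four-level samples, 16-way demultiplexing.
rng = random.Random(0)
signal = [rng.choice((-3, -1, 1, 3)) for _ in range(16 * 8)]
streams = demultiplex(signal, 16)
print(autocorr_from_streams(streams, 40) == autocorr_direct(signal, 40))
```

Each fast lag draws on 16 stream pairs, which is why the hardware correlates every stream with every other stream.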
In this paper, we will focus on the 250 MHz cross-
correlation portion of the algorithm, the heart of the real-time
computation. This building block serves as the basic element in
a large array of spectrometers for the Caltech Submillimeter
Observatory [5] and the James Clerk Maxwell Telescopes,
Mauna Kea, HI [6].
A. Cross-Correlator Architecture
Fig. 1 shows the basic architecture of a single cross correlator.
Two 2-bit digital signals, called the prompt and delayed signals,
enter the correlator. Each lag of the correlator
delays the delayed signal by one more clock than the prompt
signal, hence the naming convention. A hardware multiplier
at each lag computes the product of the prompt data and the
delayed data, and offsets and rounds the result to 3 bits. An
accumulator integrates the rounded products and passes its
carry bit to a ripple counter acting as an accumulator. The
ripple counter accumulates for integration times approaching
10 billion samples for astronomical applications, whereupon
the results pass to a host computer, the counters reset, and the
integration process repeats. This computation occurs at each
lag in the correlator. The prompt and delayed signals emerge from
the right side of the correlator to permit daisy-chaining of the
correlator chips to form longer correlators.
Fig. 1. Block diagram of a cross correlator.
TABLE I
MODIFIED AND OFFSET MULTIPLICATION TABLE FOR TWO-BIT
CORRELATION INPUT SIGNALS. THE MULTIPLIER ROUNDS AND OFFSETS
THE PRODUCTS TO PRODUCE THREE PRODUCT BITS INSTEAD OF FOUR,
ALLOWING 3-BIT UNSIGNED ADDITION FURTHER DOWN THE PIPELINE
B. Design of an Individual Correlator Lag
Fig. 2 shows the architecture of an individual correlator
lag. Delayed and prompt data enter the lag on the west side
and pass through registers on the rising edge of the clock.
Combinational logic computes a three-bit rounded product
using the values listed in Table I. In the signed-magnitude
number representation, one bit carries the sign and the other
selects a magnitude of 1 or 3. Normally
this would produce products from −9 to +9. We offset these
products by +9 and divide by three to get the range from
zero to six. Then we round the four central entries in the table
to the value three to get integer values. The resulting entries
appear in Table I.
Table I produces 3-bit products that have LSB’s of the
inner products rounded, with all the products offset to produce
positive results. The offset allows the logic to have un-
signed adders, accumulators, and up-counters. Later a micro-
processor eliminates the offset by tracking the total number
of samples and subtracting the number of samples times the
fixed offset of three from the correlator results.
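The offset-and-round rule can be checked with a short sketch. It assumes signal levels of ±1 and ±3 and an arbitrary row and column ordering; the actual bit encoding and layout of Table I are not reproduced here, and the function names are hypothetical.

```python
LEVELS = (-3, -1, 1, 3)  # assumed signal levels for the four 2-bit codes


def offset_product(prompt, delayed):
    """Offset, scaled, and rounded product following the rule in the text."""
    product = prompt * delayed      # ranges over -9 .. +9
    if abs(product) == 1:           # the four central entries
        return 3                    # 10/3 and 8/3 both round to 3
    return (product + 9) // 3       # exact for products of -9, -3, 3, and 9


def remove_offset(accumulated, num_samples):
    """Host-side correction: subtract the fixed offset of three per sample."""
    return accumulated - 3 * num_samples


# Rows follow the prompt level, columns the delayed level (an assumed ordering).
table = [[offset_product(p, d) for d in LEVELS] for p in LEVELS]
for row in table:
    print(row)
```

Every entry lies between zero and six, so three unsigned bits suffice, and an input with equal counts of all 16 level pairs integrates to zero once the offset is removed.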
TABLE II
TIMING BUDGET AT 250 MHz FROM ONE SYNCHRONOUS REGISTER
TO ANOTHER. THE DI PIN BYPASSES THE LOOK-UP TABLE (LUT),
PROVIDING A LONGER MAXIMUM ALLOWED WIRE DELAY
The product goes to a 4-bit accumulator whose carry output
controls a ripple counter for integration. The correlator runs for
a predetermined number of cycles, after which the integration
results go to a microcontroller for low-bandwidth processing.
This basic algorithm repeats for each lag in the correlator
array. The next two sections describe the optimizations to the
architecture needed to obtain the performance objectives of
250 MHz real-time throughput with the correlator.
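The split between a narrow accumulator and a ripple counter can be sketched as follows. This is a behavioral model only, assuming the 4-bit width quoted above and at most one carry per sample; it ignores pipelining and the counter reset, and the function name is hypothetical.

```python
def integrate(products, acc_bits=4):
    """Accumulate in a narrow adder; each carry-out clocks a counter."""
    modulus = 1 << acc_bits
    acc = 0        # low-order accumulator register
    counter = 0    # ripple counter, incremented once per carry-out
    for p in products:
        acc += p
        if acc >= modulus:   # carry out of the accumulator
            acc -= modulus
            counter += 1
    return counter, acc


# Illustrative product stream using values that appear in Table I.
products = [6, 4, 3, 0, 2, 6, 6, 3] * 100
counter, acc = integrate(products)
print(counter * 16 + acc == sum(products))
```

Only the narrow adder has to run at the full sample rate; the counter sees at most one event per sample and can ripple.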
III. HIGH-SPEED FPGA’S
FPGA technology frequently uses lookup tables based on
static memory coupled with synchronizing registers. For this
application, we used the XC3195-09 chip from Xilinx [7].
This architecture utilizes two 4-bit lookup tables and two flip-
flop registers per configurable logic block (CLB). These chips
provide ample resources for small-scale pipelining, the key to
achieving high throughput in FPGA’s.
Two approaches to high-speed FPGA design include top-
down system design and bottom-up library element design.
Here we use a bottom-up methodology where we set the target
clock frequency in advance, and then design each element to
meet the performance objectives from the start. These library
elements connect using standard synchronous pipelining as
long as the connections between elements remain short enough
to stay within the synchronous timing window. We chose to
design at 250 MHz based on rough performance estimates
and the constraint that the clock frequency must equal the
4 GHz sampling rate divided by a power of two. In addition,
we had completed a full-custom design at 250 MHz using
1.2 μm CMOS [2], and wanted to compare the performance
of full-custom with FPGA technologies.
From the 4.0 ns cycle time we must subtract 0.1 ns for clock
skew on the FPGA, providing 3.9 ns to travel from register
to register on the chip. Using worst case timing tables for the
3195-09 device [8], it takes 1.3 ns to go from the rising edge
of the clock to the CLB output, with a setup time to the CLB
of either 1.5 ns for look-up table inputs or 1.0 ns for direct-
in (DI) data inputs. Table II shows the overall timing budget.
The X-Delay timing analyzer reports a maximum clock skew
of 0.1 ns between internal CLB’s for the 3100-09. A design
frequency of 250 MHz allows for 1.1 ns of allowable wiring
delay for normal CLB inputs and 1.6 ns of wiring delay for
DI inputs. The DI inputs do not pass through the lookup tables
before going to the flip-flop registers.
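The arithmetic behind this budget is simple enough to restate as a sketch. The function name is hypothetical; the delay figures are the ones quoted in the text.

```python
def wire_budget_ns(clock_mhz, skew_ns, clock_to_out_ns, setup_ns):
    """Wiring delay left in one clock cycle after the fixed CLB delays."""
    cycle_ns = 1000.0 / clock_mhz
    return cycle_ns - skew_ns - clock_to_out_ns - setup_ns


lut_budget = wire_budget_ns(250, skew_ns=0.1, clock_to_out_ns=1.3, setup_ns=1.5)
di_budget = wire_budget_ns(250, skew_ns=0.1, clock_to_out_ns=1.3, setup_ns=1.0)
print(f"LUT input: {lut_budget:.1f} ns, DI input: {di_budget:.1f} ns")
```

With the quoted figures this gives 1.1 ns for look-up table inputs and 1.6 ns for DI inputs.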
These wiring delays on the “-09” series allow nearest
neighbor communication using the direct data lines between
CLB’s within 1.1 ns, and occasionally diagonal data propagation
to normal data inputs. Figs. 3 and 4 show the wiring delay
from a central CLB to other nearby CLB’s. Direct interconnect
takes 0.3 ns, and general routing resources take significantly
longer. A 250-MHz system can use any routing channel of less
than 1.1 ns for general inputs and any routing channel of less
than 1.6 ns for direct data inputs (DI) to the CLB registers.
These constraints determine the event horizon for a 250 MHz
system.
Fig. 2. Basic schematic of a single correlator lag.
Utilizing this design methodology, the entire chip can
operate with worst-case cycle delays of under 4.0 ns. Future
FPGA design tools could show for any selected CLB the event
horizon for that CLB at a specified clock frequency, perhaps
with the entire neighborhood highlighted in green. This could
vividly show designers how far they could go with a signal
before resynchronization.
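A tool of this kind could compute the event horizon by filtering a table of wire delays against the budget. The sketch below is only an illustration: the function name is hypothetical, and the delay values are placeholders rather than measured data, except for the 0.3 ns direct-interconnect figure quoted above.

```python
def event_horizon(wire_delay_ns, budget_ns):
    """Return the CLB offsets whose wire delay is less than the budget."""
    return sorted(
        offset for offset, delay in wire_delay_ns.items() if delay < budget_ns
    )


# Hypothetical delays in ns, keyed by (columns east, rows south) of the source CLB.
# Only the 0.3 ns direct-interconnect value comes from the text.
example_delays = {
    (1, 0): 0.3,
    (2, 0): 0.9,
    (3, 0): 1.4,
    (4, 0): 1.9,
    (1, 1): 1.0,
    (-1, 0): 0.5,
}

lut_horizon = event_horizon(example_delays, 1.1)   # look-up table inputs
di_horizon = event_horizon(example_delays, 1.6)    # DI inputs
print(lut_horizon)
print(di_horizon)
```

A real implementation would take its delays from the timing analyzer for the selected device and speed grade.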
Note that signals travel faster to the east than to the west,
and slightly faster south than north. The standard orientation of
the Xilinx device in the EditLCA editor defines east and west,
north and south, i.e., pin 1 located in the northwest corner,
facing the surface of the die. These directional biases in the
3100 family constrain the orientation of the chip relative to
the desired data flow direction. When floorplanning a large
chip, designers must often rotate and reflect a subcircuit
for optimal packing density. Directional biases reduce the
maximal packing density and operating speed of FPGA’s, and
may stem from a habit of drawing schematics from left to
right, and translating these schematics into chip layouts fairly
literally. Newer FPGA’s such as the XC4000 propagate signals
at the same speed to the east and west, but the larger size of
the XC4000 CLB reduces pipeline performance relative to the
XC3100 in the same chip technology. The XC4000 family
lost all direct interconnects, significantly increasing the time
for nearest neighbor communications. The XC4000EX/XL
regained direct interconnects, but only in the east and south
directions, exacerbating the directionality problems of the
original XC3100A architecture. Symmetry makes rotations and
reflections of subcircuits possible, an important operation for
building larger systems. Hopefully, future FPGA architectural
designers will see the benefits of directional symmetry for
routing networks in FPGA’s.
A. Estimates for the XC4000XL Device Architecture
Recently, Xilinx has focused its high-speed efforts on the
XC4000XL family. Worst case timings for this family pro-
vide a performance comparison with the 3100A family. The
XC4000XL CLB has more symmetry and has internal RAM
capability, useful for high-density accumulators. Early esti-
mates indicate that a correlator lag may require 20% fewer
CLB’s in the XC4000XL than in the XC3100A design.
Xilinx produces the XC4000XL-1, the fastest speed grade
currently available. For this speed grade, the timing budget
includes 0.1 ns clock skew, 1.6 ns clock to out for the CLB, 1.2
ns of routing delay, and 0.9 ns to pass through the lookup table
and set up the next register. This gives a total of 3.8 ns for the
CLB delays. Assuming that the clock and I/O systems can keep
up, worst-case CLB timing indicates operation at 263 MHz.
Xilinx has indicated that the “-09” speed grade available early
in 1998 will increase the top speed of this family by 15%,
suggesting maximum operating speeds of over 300 MHz in
1998. This speedup illustrates one of the important advantages
of FPGA designs over ASIC designs: without any additional
engineering work, the design gets faster as new FPGA speed
grades become available.
Fig. 3. The maximum distance a signal can travel in a single cycle using
the DI input pin on the CLB. All data communications must lie within this
event horizon to meet the timing specifications.
Fig. 4. The maximum distance a signal can travel in a cycle passing through
a single look-up table (LUT) before synchronization. The event horizon
indicates the wire limits for a desired clock frequency.
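The XC4000XL estimate follows from the same kind of arithmetic. In the sketch below the function name is hypothetical, and the 15% improvement is assumed to scale the clock frequency directly.

```python
def max_clock_mhz(delays_ns):
    """Highest clock rate whose period covers the summed register-to-register delays."""
    return 1000.0 / sum(delays_ns)


# Clock skew, clock-to-out, routing, and LUT plus setup, in ns, as quoted above.
xl1_mhz = max_clock_mhz([0.1, 1.6, 1.2, 0.9])
projected_mhz = xl1_mhz * 1.15   # assumed: 15% applied to the clock frequency
print(f"{xl1_mhz:.0f} MHz now, {projected_mhz:.0f} MHz projected")
```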
IV. ON-CHIP CORRELATOR ARCHITECTURE
The schematic of Fig. 2 would probably not operate at 250
MHz by itself. The methodology described in the previous
section can transform this schematic into a highly pipelined
design operating at 250 MHz. For a synchronous 250 MHz
system, the global clock must register every CLB output. The
inputs for any look-up tables must come from nearby CLB’s.
Spanning distances of 3 CLB’s requires the DI pin. Each CLB
has only one DI pin, which limits the maximum density of the
design at times. The following Section IV-A shows how to
transform the schematic of Fig. 2 into a retimed architecture
meeting all of these design methodology constraints.
A. Retiming of the Lag Architecture
The lag architecture of Fig. 2 has a number of long prop-
agation paths that benefit from pipelining for maximum-
speed operation. Pipelining for a 250 MHz system requires
breaking down each combinational circuit into four-input look-
up tables (LUT’s) immediately followed by a register. Any
combinational blocks larger than four inputs must divide into
two or more blocks with a new register in the middle so as
to meet the single LUT rule between registers. Fortunately,
some of the logic in the multiplier can combine with some
logic in the 4-bit accumulator to shrink the total number of
CLB’s. For example, the LSB of the multiplier can combine
with the half adder to produce a running sum and carry bit for
the LSB of the accumulator in a single CLB. Both sum and
carry pass through a register. The carry propagates during the
following cycle to the next bit of the accumulator. In this way
the pipelined architecture processes one bit of carry per cycle,
and each carry needs to traverse only one lookup
table between registers.
Two four-input LUT’s in a single CLB can absorb all of the
logic in the middle and MSB bits of the product. The sign and
magnitude bits of the prompt and delayed data all affect these
product bits. If all four input registers lie adjacent to a single
CLB, then that single CLB can compute two bits of product
in each cycle. Registers latch these two product bits and pass
them to the accumulator for further processing.
A bit-pipelined accumulator provides maximum perfor-
mance for carry chains. By breaking the carry chain into
single-bit stages, carries only traverse a single LUT between
each register. Additional registers delay each input operand as
needed to align the inputs. The LSB accumulator computes one
result each cycle and passes it to the MID accumulator on the
next cycle, which passes its result to the MSB accumulator on
the following cycle. The retimed schematic appears in Fig. 5,
including all of the data registers required to properly retime
the schematic.
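The behavior of a bit-pipelined accumulator can be checked with a cycle-by-cycle model. The sketch below is not the circuit of Fig. 5: it assumes three bit stages (LSB, MID, MSB), a plain counter in place of the ripple counter, and operands that fit in the stated number of bits. The function name is hypothetical.

```python
def bit_pipelined_accumulate(operands, bits=3):
    """Model one adder bit per stage, with registers after every stage."""
    sums = [0] * bits        # sum register of each bit stage
    carries = [0] * bits     # registered carry-out of each bit stage
    # Bit i of every operand is delayed i cycles so it meets its carry-in.
    delays = [[0] * i for i in range(bits)]
    counter = 0              # stands in for the ripple counter

    # Extra zero operands flush the carries still in flight.
    for word in list(operands) + [0] * (bits + 1):
        counter += carries[-1]   # carry registered by the top stage last cycle
        new_sums, new_carries = [], []
        for i in range(bits):
            delays[i].append((word >> i) & 1)
            a = delays[i].pop(0)                   # operand bit, i cycles late
            carry_in = carries[i - 1] if i > 0 else 0
            total = sums[i] + a + carry_in         # three 1-bit inputs per stage
            new_sums.append(total & 1)
            new_carries.append(total >> 1)
        sums, carries = new_sums, new_carries      # all registers clock together
    value = sum(bit << i for i, bit in enumerate(sums))
    return counter * (1 << bits) + value


# Illustrative operand stream using values that appear in Table I.
operands = [6, 4, 3, 0, 2, 6, 6, 3] * 50
print(bit_pipelined_accumulate(operands) == sum(operands))
```

Each stage combines only its own sum bit, one operand bit, and one registered carry, so it fits within a four-input look-up table, and no carry crosses more than one stage per clock.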
B. Placement of the Retimed Schematic on the FPGA
Circuit placement on the FPGA started from the middle
of the circuit out using the Xilinx EditLCA design editor
(XACT version 6.0.1). As a general rule, designers should
start with the most constrained part of a design and work
out from there. The MID and MSB product bits have the
largest placement constraints. Those bits require four inputs
and generate two outputs. The four inputs must lie close to the
CLB computing the outputs, suggesting a cross pattern with
the MID/MSB CLB in the middle, with input data bits coming
from north, south, east and west neighbors. This structure
appears in Fig. 6, with the four data input CLB’s forming
the cross and the computation CLB in the middle. Note that
for these ultrahigh-speed designs, properly positioning the data
takes more CLB’s than performing the computation. This observation
suggests that wiring speed has more importance than
LUT speed for ultrahigh-performance designs. It also suggests
that FPGA architectures with enhanced direct interconnect
lines would improve high-performance computational density.
Fig. 5. Retimed correlator lag circuit that has only one four-input look-up table between registers. Each box with a large label represents a single CLB. Note
the high packing density for the MID and MSB bits of the multiplier. They fit in a single CLB because only four inputs feed that circuit.
Fig. 6. CLB layout on the FPGA showing cross topology of data inputs with
MID and MSB multiplier bits in the center. The four data inputs surround the
multiplier CLB and provide it with input data.
Since a signal can jump at most three CLB’s at 250
MHz using the DI input, any lag pitch greater than three
requires multiple registers for the prompt data. This obser-
vation suggests a pitch of three CLB’s to maximize the wires
and CLB’s available while maintaining a single-hop distance
between lags as shown in Fig. 7. Placing the LSB logic to the
northwest of the MID/MSB product permits nearest-neighbor
communication to all the product CLB’s, since the four data
inputs can lie in any permutation around the cross. The wiring
delay from the LSB product to the MID accumulator bit
requires two pipeline stages. Two pipeline stages permit the
LSB result to pass south by two rows, and the other input
data has to pass through an equal number of registers for
synchronization with the LSB accumulator results. Wiring
delay can often limit performance more than CLB delay.
V. BOARD ARCHITECTURE
The reconfigurable correlator board can process input sig-
nals at 250 MHz and can demonstrate chip-to-chip commu-
nications at that speed as well. Fig. 8 shows the lay