
Signal Processing at 250 MHz Using High-Performance FPGA’s

Brian Von Herzen

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 2, JUNE 1998, P. 238

Abstract: This paper describes an application in high-performance signal processing using reconfigurable computing engines: a 250-MHz cross correlator for radio astronomy. Experimental results indicate that complementary metal–oxide–semiconductor (CMOS) field programmable gate arrays (FPGA’s) can perform useful computation at 250 MHz. The notion of an “event horizon” for FPGA’s leads to clear design constraints for high-speed application developers, and can be applied to a variety of real-time signal processing algorithms. Recent estimates indicate that higher performance FPGA’s available early in 1998 can attain speeds of over 300 MHz using 20% fewer logic elements than current designs. The results of this design work provide important clues on how to improve FPGA architectures for signal processing at hundreds of MHz. Direct routing channels between logic elements can significantly increase performance. Routing architectures with four-way symmetry allow for rotations and reflections of subcircuits needed for optimal packing density. Experimental results indicate that clock buffering often limits the top speed of the FPGA. Wave pipelining of the clock distribution network may improve FPGA performance.

Index Terms: Correlators, event horizon, field programmable gate array, manual partitioning and placement, programmable logic, real-time signal processing.

I. RECONFIGURABLE COMPUTERS

FIELD-PROGRAMMABLE gate arrays (FPGA’s) can provide a useful platform for high-performance computing and real-time interactive signal processing [1]. These arrays perform well on real-time computations because they provide the speed of dedicated circuitry while retaining the flexibility of a programmable system.
Reconfiguring allows for incremental optimization and improvement of hardware, much like software development techniques. Because each element in an FPGA performs a dedicated task, application developers can readily design each circuit in the system for a specified performance, ideal for real-time signal processing, which requires that data flow through the system at a specified rate.

This paper will describe a real-time application in radio astronomy that makes use of the fastest FPGA’s available, and performs a computation at speeds heretofore only achievable in custom silicon [2]. The reconfigurable computer retains the flexibility to implement different performance tradeoffs, allowing for large arrays of simple computations or short arrays of complex computations. Reconfiguring the computer on demand provides a unique flexibility that has not existed before.

Manuscript received March 25, 1997; revised November 1, 1997. This work was supported by The Caltech Submillimeter Observatory, and the Joint Astronomy Center and Xilinx, Inc. The author is with Rapid Prototypes, Inc., Carson City, NV 89701 USA. Publisher Item Identifier S 1063-8210(98)02952-7.

II. CORRELATION SPECTROMETERS

Radio astronomers analyze spectra using high-performance real-time computers [3]. High-frequency radio astronomy, particularly millimeter-wave and submillimeter astronomy, utilizes wide-band spectrometers with at least 1–2 GHz of spectrometer bandwidth. Weak astronomical signals require full-parallel integration on all of the channels to detect measurable signals. Therefore, scanning spectrometers do not work for this application. Acoustooptic spectrometers can achieve the broad bandwidth, but become problematic in large arrays and for space-borne applications where adjustments of the analog elements become very difficult.

Another alternative uses a parallel digital correlator that computes the autocorrelation function of the incoming baseband signal [2].
A digitizer quantizes the signal at a resolution of one, two, or three bits [4]. A 1:16 time-division demultiplexer reduces the data rate from a Nyquist sampling rate of 4 Gs/s down to 250 Ms/s, producing 16 parallel data streams of 250 Ms/s. The streams pass into an array of cross correlators that correlate every stream with every other 250 MHz stream. Digital integrators accumulate the cross-correlation results for periods from 10^6 to 10^12 samples, and the integration results pass to a microprocessor for further integration. The processor reassembles the autocorrelation of the 2 GHz input signal from the array of cross-correlation results and computes the Fourier transform of the correlation, producing the spectrum of the 2 GHz signal.

In this paper, we will focus on the 250 MHz cross-correlation portion of the algorithm, the heart of the real-time computation. This building block serves as the basic element in a large array of spectrometers for the Caltech Submillimeter Observatory [5] and the James Clerk Maxwell Telescopes, Mauna Kea, HI [6].

A. Cross-Correlator Architecture

Fig. 1 shows the basic architecture of a single cross correlator. Two 2-bit digital signals enter the correlator, called the prompt and delayed signal. Each lag of the correlator delays the delayed signal by one more clock than the prompt signal, hence the naming convention. A hardware multiplier at each lag computes the product of the prompt data and the delayed data, and offsets and rounds the result to 3 bits. An accumulator integrates the rounded products and passes its carry bit to a ripple counter acting as an accumulator. The ripple counter accumulates for integration times approaching 10 billion samples for astronomical applications, whereupon the results pass to a host computer, the counters reset, and the integration process repeats. This computation occurs at each lag in the correlator. The prompt and delayed signals emerge from the right side of the correlator to permit daisy-chaining of the correlator chips to form longer correlators.

Fig. 1. Block diagram of a cross correlator.

TABLE I. MODIFIED AND OFFSET MULTIPLICATION TABLE FOR TWO-BIT CORRELATION INPUT SIGNALS. THE MULTIPLIER ROUNDS AND OFFSETS THE PRODUCTS TO PRODUCE THREE PRODUCT BITS INSTEAD OF FOUR, ALLOWING 3-BIT UNSIGNED ADDITION FURTHER DOWN THE PIPELINE.

1063–8210/98$10.00 © 1998 IEEE

B. Design of an Individual Correlator Lag

Fig. 2 shows the architecture of an individual correlator lag. Delayed and prompt data enter the lag on the west side and pass through registers on the rising edge of the clock. Combinational logic computes a three-bit rounded product using the values listed in Table I. In the signed-magnitude number representation, the data take the values ±1 and ±3. Normally this would produce products from −9 to 9. We offset these products by 9 and divide by three to get the range from zero to six. Then we round the four central entries in the table to the value three to get integer values. The resulting entries appear in Table I.

Table I produces 3-bit products that have LSB’s of the inner products rounded, with all the products offset to produce positive results. The offset allows the logic to have unsigned adders, accumulators, and up-counters. Later a microprocessor eliminates the offset by tracking the total number of samples and subtracting the number of samples times the fixed offset of three from the correlator results.

TABLE II. TIMING BUDGET AT 250 MHz FROM ONE SYNCHRONOUS REGISTER TO ANOTHER. THE DI PIN BYPASSES THE LOOK-UP TABLE (LUT), PROVIDING A LONGER MAXIMUM ALLOWED WIRE DELAY.

The product goes to a 4-bit accumulator whose carry output controls a ripple counter for integration.
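The offset-and-round rule for Table I can be checked numerically. The following Python sketch is an illustration under stated assumptions, not the paper's logic design: it assumes the four 2-bit input levels are −3, −1, +1, and +3 (inferred from the stated product range of −9 to 9), it does not model the bit encoding, and the names `table_entry` and `remove_offset` are hypothetical.

```python
# Assumed input levels for 2-bit signed-magnitude data. This set is inferred
# from the stated product range of -9..9; the bit encoding is not modeled.
LEVELS = (-3, -1, 1, 3)
OFFSET = 3  # fixed offset carried by every table entry


def table_entry(prompt: int, delayed: int) -> int:
    """Offset the raw product by 9, divide by three, round to an integer."""
    raw = prompt * delayed           # ranges over -9 .. 9
    return round((raw + 9) / 3)      # 0 .. 6; the four +/-1 products become 3


def remove_offset(accumulated: int, n_samples: int) -> int:
    """Host-side correction: subtract n_samples times the fixed offset."""
    return accumulated - OFFSET * n_samples


if __name__ == "__main__":
    # Print the 4x4 table, one row per prompt level.
    for p in LEVELS:
        print([table_entry(p, d) for d in LEVELS])
```

Every entry fits in three unsigned bits, which is what lets the downstream adders, accumulators, and counters stay unsigned; the host only needs the sample count to remove the bias afterwards.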
The correlator runs for a predetermined number of cycles, after which the integration results go to a microcontroller for low-bandwidth processing. This basic algorithm repeats for each lag in the correlator array. The next two sections describe the optimizations to the architecture needed to obtain the performance objectives of 250 MHz real-time throughput with the correlator.

III. HIGH-SPEED FPGA’S

FPGA technology frequently uses lookup tables based on static memory coupled with synchronizing registers. For this application, we used the XC3195-09 chip from Xilinx [7]. This architecture utilizes two 4-bit lookup tables and two flip-flop registers per configurable logic block (CLB). These chips provide ample resources for small-scale pipelining, the key to achieving high throughput in FPGA’s.

Two approaches to high-speed FPGA design include top-down system design and bottom-up library element design. Here we use a bottom-up methodology where we set the target clock frequency in advance, and then design each element to meet the performance objectives from the start. These library elements connect using standard synchronous pipelining as long as the connections between elements remain short enough to stay within the synchronous timing window.

We chose to design at 250 MHz based on rough performance estimates and the constraint that the clock frequency must divide the sampling rate of 4 GHz by a power of two. In addition, we had completed a full-custom design at 250 MHz using 1.2 µm CMOS [2], and wanted to compare the performance of full-custom with FPGA technologies.

From the 4.0 ns cycle time we must subtract 0.1 ns for clock skew on the FPGA, providing 3.9 ns to travel from register to register on the chip. Using worst case timing tables for the 3195-09 device [8], it takes 1.3 ns to go from the rising edge of the clock to the CLB output, with a setup time to the CLB of either 1.5 ns for look-up table inputs or 1.0 ns for direct-in (DI) data inputs.
Table II shows the overall timing budget. The X-Delay timing analyzer reports a maximum clock skew of 0.1 ns between internal CLB’s for the 3100-09. A design frequency of 250 MHz allows for 1.1 ns of allowable wiring delay for normal CLB inputs and 1.6 ns of wiring delay for DI inputs. The DI inputs do not pass through the lookup tables before going to the flip-flop registers. These wiring delays on the “-09” series allow nearest neighbor communication using the direct data lines between CLB’s within 1.1 ns, and occasionally diagonal data propagation to normal data inputs.

Fig. 2. Basic schematic of a single correlator lag.

Figs. 3 and 4 show the wiring delay from a central CLB to other nearby CLB’s. Direct interconnect takes 0.3 ns, and general routing resources take significantly longer. A 250-MHz system can use any routing channel of less than 1.1 ns for general inputs and any routing channel of less than 1.6 ns for direct data inputs (DI) to the CLB registers. These constraints determine the event horizon for a 250 MHz system. Utilizing this design methodology, the entire chip can operate with worst-case cycle delays of under 4.0 ns.

Future FPGA design tools could show for any selected CLB the event horizon for that CLB at a specified clock frequency, perhaps with the entire neighborhood highlighted in green. This could vividly show designers how far they could go with a signal before resynchronization.

Note that signals travel faster to the east than to the west, and slightly faster south than north. The standard orientation of the Xilinx device in the EditLCA editor defines east and west, north and south, i.e., pin 1 located in the northwest corner, facing the surface of the die. These directional biases in the 3100 family constrain the orientation of the chip relative to the desired data flow direction.
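The timing budget and the event-horizon rule described above reduce to a short subtraction and a comparison. The Python sketch below restates that arithmetic using only the figures quoted in the text (4.0 ns cycle, 0.1 ns skew, 1.3 ns clock-to-output, 1.5 ns or 1.0 ns setup); the function names are hypothetical, and delays are held in integer picoseconds to avoid floating-point rounding.

```python
# Worst-case figures quoted in the text for the XC3195-09, in picoseconds.
CYCLE_PS = 4000          # 250 MHz clock period
SKEW_PS = 100            # maximum clock skew between internal CLBs
CLK_TO_OUT_PS = 1300     # rising clock edge to CLB output
SETUP_PS = {"LUT": 1500, "DI": 1000}  # setup via look-up table vs. direct-in


def wire_budget_ps(input_kind: str) -> int:
    """Wiring delay left over after skew, clock-to-output, and setup."""
    return CYCLE_PS - SKEW_PS - CLK_TO_OUT_PS - SETUP_PS[input_kind]


def within_event_horizon(route_delay_ps: int, input_kind: str) -> bool:
    """True if a route of the given delay fits inside one 250 MHz cycle."""
    return route_delay_ps <= wire_budget_ps(input_kind)


if __name__ == "__main__":
    for kind in ("LUT", "DI"):
        print(kind, wire_budget_ps(kind), "ps of wiring delay allowed")
```

A 0.3 ns direct interconnect passes for either input type, while a hypothetical 1.2 ns general route would only be usable if it terminates on a DI pin.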
When floorplanning a large chip, designers must often rotate and reflect a subcircuit for optimal packing density. Directional biases reduce the maximal packing density and operating speed of FPGA’s, and may stem from a habit of drawing schematics from left to right, and translating these schematics into chip layouts fairly literally. Newer FPGA’s such as the XC4000 propagate signals at the same speed to the east and west, but the larger size of the XC4000 CLB reduces pipeline performance relative to the XC3100 in the same chip technology. The XC4000 family lost all direct interconnects, significantly increasing the time for nearest neighbor communications. The XC4000EX/XL regained direct interconnects, but only in the east and south directions, exacerbating the directionality problems of the original XC3100A architecture. Symmetry makes rotations and reflections of subcircuits possible, an important operation for building larger systems. Hopefully, future FPGA architectural designers will see the benefits of directional symmetry for routing networks in FPGA’s.

A. Estimates for the XC4000XL Device Architecture

Recently, Xilinx has focused its high-speed efforts on the XC4000XL family. Worst case timings for this family provide a performance comparison with the 3100A family. The XC4000XL CLB has more symmetry and has internal RAM capability, useful for high-density accumulators. Early estimates indicate that a correlator lag may require 20% fewer CLB’s in the XC4000XL than in the XC3100A design.

Xilinx produces the XC4000XL-1, the fastest speed grade currently available. For this speed grade, the timing budget includes 0.1 ns clock skew, 1.6 ns clock to out for the CLB, 1.2 ns of routing delay, and 0.9 ns to pass through the lookup table and set up the next register. This gives a total of 3.8 ns for the CLB delays. Assuming that the clock and I/O systems can keep up, worst-case CLB timing indicates operation at 263 MHz. Xilinx has indicated that the “-09” speed grade available early in 1998 will increase the top speed of this family by 15%, suggesting maximum operating speeds of over 300 MHz in 1998. This speedup illustrates one of the important advantages of FPGA designs over ASIC designs: without any additional engineering work, the design gets faster as new FPGA speed grades become available.

Fig. 3. The maximum distance a signal can travel in a single cycle using the DI input pin on the CLB. All data communications must lie within this event horizon to meet the timing specifications.

Fig. 4. The maximum distance a signal can travel in a cycle passing through a single look-up table (LUT) before synchronization. The event horizon indicates the wire limits for a desired clock frequency.

IV. ON-CHIP CORRELATOR ARCHITECTURE

The schematic of Fig. 2 would probably not operate at 250 MHz by itself. The methodology described in the previous section can transform this schematic into a highly pipelined design operating at 250 MHz. For a synchronous 250 MHz system, the global clock must register every CLB output. The inputs for any look-up tables must come from nearby CLB’s. Spanning distances of 3 CLB’s requires the DI pin. Each CLB has only one DI pin, which limits the maximum density of the design at times. Section IV-A shows how to transform the schematic of Fig. 2 into a retimed architecture meeting all of these design methodology constraints.

A. Retiming of the Lag Architecture

The lag architecture of Fig. 2 has a number of long propagation paths that benefit from pipelining for maximum-speed operation. Pipelining for a 250 MHz system requires breaking down each combinational circuit into four-input look-up tables (LUT’s) immediately followed by a register.
Any combinational blocks larger than four inputs must divide into two or more blocks with a new register in the middle so as to meet the single LUT rule between registers. Fortunately, some of the logic in the multiplier can combine with some logic in the 4-bit accumulator to shrink the total number of CLB’s. For example, the LSB of the multiplier can combine with the half adder to produce a running sum and carry bit for the LSB of the accumulator in a single CLB. Both sum and carry pass through a register. The carry propagates during the following cycle to the next bit of the accumulator. In this way the pipelined architecture processes one bit of carry per cycle and needs to traverse only one lookup table between registers.

Two four-input LUT’s in a single CLB can absorb all of the logic in the middle and MSB bits of the product. The sign and magnitude bits of the prompt and delayed data all affect these product bits. If all four input registers lie adjacent to a single CLB, then that single CLB can compute two bits of product in each cycle. Registers latch these two product bits and pass them to the accumulator for further processing.

A bit-pipelined accumulator provides maximum performance for carry chains. By breaking the carry chain into single-bit stages, carries only traverse a single LUT between each register. Additional registers delay each input operand as needed to align the inputs. The LSB accumulator computes one result each cycle and passes it to the MID accumulator on the next cycle, which passes its result to the MSB accumulator on the following cycle.

The retimed schematic appears in Fig. 5, including all of the data registers required to properly retime the schematic.

B. Placement of the Retimed Schematic on the FPGA

Circuit placement on the FPGA started from the middle of the circuit out using the Xilinx EditLCA design editor (XACT version 6.0.1).
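The bit-pipelined accumulator of Section IV-A can be modeled cycle by cycle to see why skewing the input bits keeps the result correct. The Python sketch below is a behavioral model under stated assumptions, not the CLB-level design: it assumes three single-bit stages (LSB, MID, MSB) fed by 3-bit products, treats the registered carry out of the MSB stage as the pulse that clocks the ripple counter, and uses the hypothetical name `bit_pipelined_accumulate`.

```python
from collections import deque


def bit_pipelined_accumulate(samples, n_bits=3):
    """Model an accumulator whose carry chain is cut by a register per bit.

    Stage k adds bit k of each sample, delayed by k cycles so that it lines
    up with the registered carry from stage k-1. Each stage is a full adder
    (sum bit, input bit, carry in), i.e. at most three LUT inputs.
    Returns (carry_pulses_out_of_msb, value_left_in_sum_registers).
    """
    sum_reg = [0] * n_bits      # one sum register per bit stage
    carry_reg = [0] * n_bits    # registered carry out of each stage
    skew = [deque([0] * k) for k in range(n_bits)]  # k-cycle input delays
    pulses = 0

    # Append n_bits zero samples so the last real sample drains through.
    for x in list(samples) + [0] * n_bits:
        next_sum, next_carry = sum_reg[:], carry_reg[:]
        for k in range(n_bits):
            skew[k].append((x >> k) & 1)
            bit = skew[k].popleft()                   # bit k from k cycles ago
            carry_in = carry_reg[k - 1] if k else 0   # previous cycle's carry
            total = sum_reg[k] + bit + carry_in
            next_sum[k] = total & 1
            next_carry[k] = total >> 1
        pulses += carry_reg[n_bits - 1]   # registered MSB carry clocks counter
        sum_reg, carry_reg = next_sum, next_carry

    value = sum(b << k for k, b in enumerate(sum_reg))
    return pulses, value


if __name__ == "__main__":
    products = [6, 3, 0, 4, 2, 6, 6, 3, 5, 1]
    pulses, value = bit_pipelined_accumulate(products)
    print(pulses, value, pulses * 8 + value == sum(products))
```

In this model no carry ever crosses more than one stage per clock, which corresponds to the single-LUT rule between registers; the price is the extra skew registers and a latency of one cycle per bit.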
As a general rule, designers should start with the most constrained part of a design and work out from there. The MID and MSB product bits have the largest placement constraints. Those bits require four inputs and generate two outputs. The four inputs must lie close to the CLB computing the outputs, suggesting a cross pattern with the MID/MSB CLB in the middle, with input data bits coming from north, south, east and west neighbors. This structure appears in Fig. 6, with the four data input CLB’s forming the cross and the computation CLB in the middle. Note that for these ultrahigh-speed designs, properly positioning the data takes more CLB’s than performing the computation. This observation suggests that wiring speed has more importance than LUT speed for ultrahigh-performance designs. It also suggests that FPGA architectures with enhanced direct interconnect lines would improve high-performance computational density.

Fig. 5. Retimed correlator lag circuit that has only one four-input look-up table between registers. Each box with a large label represents a single CLB. Note the high packing density for the MID and MSB bits of the multiplier. They fit in a single CLB because only four inputs feed that circuit.

Fig. 6. CLB layout on the FPGA showing cross topology of data inputs with MID and MSB multiplier bits in the center. The four data inputs surround the multiplier CLB and provide it with input data.

Since a signal can jump at most three CLB’s at 250 MHz using the DI input, any lag pitch greater than three requires multiple registers for the prompt data. This observation suggests a pitch of three CLB’s to maximize the wires and CLB’s available while maintaining a single-hop distance between lags as shown in Fig. 7.
Placing the LSB logic to the northwest of the MID/MSB product permits nearest-neighbor communication to all the product CLB’s, since the four data inputs can lie in any permutation around the cross. The wiring delay from the LSB product to the MID accumulator bit requires two pipeline stages. Two pipeline stages permit the LSB result to pass south by two rows, and the other input data has to pass through an equal number of registers for synchronization with the LSB accumulator results. Wiring delay can often limit performance more than CLB delay.

V. BOARD ARCHITECTURE

The reconfigurable correlator board can process input signals at 250 MHz and can demonstrate chip-to-chip communications at that speed as well. Fig. 8 shows the lay
