By Winthrop W. Smith

In the late 1990s, a trend emerged for implementing real-time, multi-processor digital signal processing (DSP) applications using reduced instruction set computer (RISC) chip families such as the PowerPC, R10000, and others. With the introduction of the PowerPC 7400 and its AltiVec hardware enhancements integrated on-chip, the trend appears to be firmly entrenched and focused on the PowerPC chip family.

But this is not the first time this type of technology shift has occurred. In the late 1980s, Intel introduced a RISC chip for graphics processing called the i860. Once DSP board vendors discovered how well it executed signal and image processing functions, they wrote high-speed libraries, and the high-performance real-time processing world beat a fast path to the doors of i860-based board vendors for several years. By the mid '90s, DSP chips like the Texas Instruments (TI) TMS320C40 and Analog Devices ADSP-21060 had caught up with the i860, and the trend began moving back toward DSP chips for real-time DSP applications because Intel decided not to provide next-generation versions of the i860.

Other similar shifts have also occurred over the 30-plus years I have been developing high-performance real-time processing applications. The first one I remember was the introduction of the Cooley-Tukey fast Fourier transform (FFT) at about the same time integrated circuits were taking off in the mid-to-late 1960s. These two events quickly spawned numerous array processor box companies, as array processors became the preferred way to perform real-time signal and image processing. With the introduction of the first DSP chip, the TI TMS32010, in the early '80s, the world took another turn and began focusing on signal-processing boards built around DSP chips, as chip vendors such as Motorola and Analog Devices followed suit and introduced DSP chip families of their own.

Each time the paradigm shift occurred, it stimulated new ways of thinking about solving the same old problems of target detection and tracking using sensor data from radar, sonar, and electro-optical/infrared (EO/IR) sensors. New thinking involved how to interconnect these processing chips into multiprocessor arrays, as well as new algorithmic approaches that made the best use of the strengths of the new chips and array architectures.

So, here we are again. We need to stand back and understand how to compare real-time processing strategies, this time for DSP and PowerPC 7400 AltiVec technologies. We must determine how to use the PowerPC efficiently and ensure it replaces DSP chip technology for good reasons, not just because it is the latest widget. Each new technology has its place, but none is all things to all people, despite how most are touted at introduction.

#### Let's get specific

Let's shift to a specific example for comparing DSP and AltiVec technologies: the FFT. It is one of the most commonly used computational building blocks in signal and image processing. It appears somewhere in a huge percentage of applications, not only military uses, but also commercial applications such as seismic exploration, medical imaging, and factory automation.

The FFT has a mixture of computational and data movement steps that make it a good bellwether for the performance of complete applications. In addition, the 1,024-point complex FFT has become the standard benchmark to gauge the capability of chip, board, and system architectures for real-time signal processing applications.

In the 1960s and 1970s, computer scientists measured the complexity of an FFT algorithm by the number of multiplies required because multiplier chips took up huge amounts of space and were power hogs. This was the drive behind the industry stir in 1976 when the Winograd FFT was introduced, because it had the theoretical minimum number of multiplies. It failed to sustain popularity because the increased data movement requirements of the Winograd FFT, compared with standard FFTs, overshadowed the improvement in the number of multiplications.

In the '80s, with the introduction of numerous DSP integrated circuits with built-in multipliers, the measure shifted to FFT algorithms with the minimum number of adds and multiplies. For an N-point power-of-two FFT, the standard measure was the 5*N*log(N) operations needed to compute FFTs using two-point building blocks. However, with the advent of one-clock-cycle multiply-accumulate (multiply and add) functions on DSP chips, the performance measure dropped to 3*N*log(N), the number of adds in an FFT algorithm. This was because the smaller number of multiplies (2*N*log(N)) could be computed at the same time as the adds, thus "hiding" the multiplies from an execution-time perspective.
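These operation-count measures are straightforward to tabulate. A minimal Python sketch (the function name is mine, purely for illustration; log is base-2 throughout) computes all three for any power-of-two length:

```python
import math

def fft_op_counts(n):
    """Classic operation-count measures for an n-point power-of-two FFT
    built from two-point (radix-2) building blocks."""
    stages = int(math.log2(n))          # log(N), base 2
    return {
        "5NlogN": 5 * n * stages,       # adds + multiplies, no hiding
        "3NlogN": 3 * n * stages,       # adds only: multiplies hidden by MAC
        "2NlogN": 2 * n * stages,       # butterfly add/sub + multiply per cycle
    }

for name, ops in fft_op_counts(1024).items():
    print(f"{name}: {ops:,} operations")
```

For the 1,024-point benchmark this yields 51,200, 30,720, and 20,480 operations under the three measures, matching the progression described above.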

Analog Devices went one step further in the 1990s with its ADSP-21060 SHARC chip that had the ability to perform a multiply, as well as a "butterfly" add/subtract in the same clock cycle. Since most FFT algorithms have this so-called "butterfly" computational building block, the performance measure for that chip was reduced from 3*N*log(N) to more like 2*N*log(N) operations and the ADSP-21060 became the performance leader for DSP applications. The 21060 also had six link ports that enabled systems designers to connect the chips easily into multiprocessor configurations, again providing a performance advantage at the multi-chip level.

So what makes the PowerPC 7400 the current chip of choice for DSP? The 1,024-point complex FFT affords a simple example for comparison because the data and multiplier constants required to execute it can be stored in on-chip RAM for both the PowerPC 7400 (L1 cache) and the ADSP-21060 (data RAM). The 40 MHz version of the ADSP-21060, first introduced by Analog Devices, took roughly 460 microseconds to compute the 1,024-point FFT. If its clock speed were simply scaled to that of a PowerPC 7400 running at 400 MHz, the ADSP-21060 would compute the 1,024-point FFT in 46 microseconds. However, Mercury Computer Systems and others report 25 microseconds for this algorithm using the 400 MHz PowerPC 7400.
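The clock normalization here is simple linear scaling, sketched below in Python (illustrative only; it assumes execution time is inversely proportional to clock rate and ignores memory-system effects):

```python
def scale_time_us(time_us, old_mhz, new_mhz):
    """Scale an execution time linearly with clock frequency."""
    return time_us * old_mhz / new_mhz

# ADSP-21060: 460 us at 40 MHz, hypothetically scaled to 400 MHz
sharc_scaled = scale_time_us(460.0, 40.0, 400.0)   # 46 us
ppc_measured = 25.0                                # reported PowerPC 7400 time

print(f"Scaled SHARC: {sharc_scaled:.0f} us")
print(f"Ratio: {sharc_scaled / ppc_measured:.2f} to 1")
```

Even after normalizing for clock speed, the PowerPC 7400 comes out ahead by a factor of roughly 1.8, the "nearly 2-to-1" gap examined in the next section.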

#### Altivec vs. SHARC

Why is there such a performance difference, even if the ADSP-21060 ran as fast as the PowerPC 7400, which it does not? Two differences in chip architecture account for it. The PowerPC 7400 can move four 32-bit words to or from memory in a clock cycle (a memory I/O advantage) while performing four adds and four multiplies (a computational advantage) during that same cycle. The ADSP-21060 can move two words in and store one word, and perform one multiply and two adds, per clock cycle. This accounts for the nearly 2-to-1 difference once the chips are normalized for clock speed.

The bottom line is that the PowerPC 7400 runs faster and has more functions on the chip. This is due to the much larger development investment in the PowerPC 7400, justified by its larger application potential. Another advantage of this wider application space is the greater emphasis on sophisticated software development tools and efficient computational libraries, which allow faster development as well as more efficient high-level language code. There is no reason to believe these economic factors will shift. The conclusion is that the PowerPC family will continue to outperform DSP chip families for the foreseeable future, and the PowerPC software development cycle will be shorter, hence more affordable. Finally, the increased use of high-level language programming, rather than the assembly code traditionally needed to squeeze performance out of DSP chips, makes high-performance PowerPC code more portable to the next generation of the same chip family. Hence, the shift to the PowerPC.

#### Rethinking processing strategies

So, how do processing strategies have to change to make the best use of the PowerPC? The answer lies in configuring and decomposing algorithms so that they fit into the PowerPC's L1 cache. There are two reasons for this. First, PowerPC memory accesses from L1 cache are at least twice as fast as from L2 cache, and often as much as four times as fast. Second, four multiply-accumulates (eight operations) can execute in parallel with every memory-access instruction. This means that most of an algorithm's arithmetic can execute in parallel with memory accesses ("hidden," as was the case mentioned above for the DSP chips' FFT multiplies), making its contribution to the total execution time minimal.

The bottom line: don't configure your algorithms on the PowerPC for the fewest adds and multiplies, as you have done for DSP chips for decades; configure them for the least data movement between memory and the chip's computational registers. Generally, this translates into breaking algorithms into chunks of data on the order of, but not larger than, the 32 PowerPC computational registers, and then doing as much work on each chunk as you can before returning it to memory, even memory as fast as L1 cache. It is then critical to configure data sets so that they fit into L1 cache as often as possible, because of its bandwidth advantages.
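The register-blocking pattern described above can be sketched abstractly (a Python illustration with invented names, not actual AltiVec code): load a register-sized chunk once, do all the work on it, and write it back once.

```python
def process_blocked(data, block_len=32, work=lambda chunk: chunk):
    """Illustrative sketch of register blocking: operate on chunks sized
    to the chip's computational registers, finishing all work on a chunk
    before writing it back, rather than touching memory per operation."""
    out = []
    for i in range(0, len(data), block_len):
        chunk = data[i:i + block_len]   # one load of a register-sized chunk
        out.extend(work(chunk))         # do as much work as possible in place
    return out                          # each chunk written back exactly once

# Example: square 100 samples in 32-element chunks
result = process_blocked(list(range(100)), 32, lambda c: [x * x for x in c])
```

The point of the sketch is the traversal order, not the arithmetic: each element makes one round trip between memory and registers, regardless of how much work `work` performs on it.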

A good example of this strategy is the FFT algorithms found in the Vector Signal and Image Processing Library (VSIPL) developed under supervision of the U.S. Defense Advanced Research Projects Agency (DARPA) to provide a common, portable library for system developers across many different signal processing board architectures.

For power-of-two FFTs, the library strategy defined for the Core library profile, available in source code form at www.vsipl.org, uses two-, four-, and eight-point building blocks. The strategy for decomposing the FFT length is to extract as many factors of eight as possible and then supplement them with a single factor of four or two to complete the computation. For our 1,024-point standard, this means using eight*eight*eight*two, which requires only four moves of the full 1,024-point data set between L1 cache and the computational registers. The more classical DSP approach to the 1,024-point FFT would use five stages of four-point transforms. But this would require five passes of data in and out of the computational registers, thus lengthening the execution time.
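The decomposition rule can be sketched in a few lines of Python (`fft_radix_plan` is an invented helper for illustration, not a VSIPL function): pull out factors of eight, then finish with a single factor of four or two.

```python
def fft_radix_plan(n):
    """Decompose a power-of-two FFT length into radix-8 stages plus at
    most one trailing radix-4 or radix-2 stage, per the strategy above.
    Each stage corresponds to one pass of the data set through registers."""
    assert n > 1 and (n & (n - 1)) == 0, "n must be a power of two"
    plan = []
    while n % 8 == 0:      # take all possible factors of eight
        plan.append(8)
        n //= 8
    if n > 1:              # leftover factor is four or two
        plan.append(n)
    return plan

print(fft_radix_plan(1024))   # [8, 8, 8, 2]: four passes over the data
```

For 1,024 points this gives four stages (eight*eight*eight*two), versus the five stages of a pure radix-4 decomposition (4^5 = 1,024); one fewer stage means one fewer trip of the full data set through the registers.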

The 1,024-point FFT illustrates another interesting difference to beware of between DSPs and the PowerPC. Almost all applications using the FFT also need a weighting-function multiply in front of the FFT to control the frequency sidelobes of the resulting frequency analysis. Using 3*N*log(N) as the performance measure for DSP chips, the 1,024-point FFT requires 30,720 operations. Putting the weighting function ahead of the 1,024-point FFT adds 2,048 operations (two per complex point), a small increase in processing time (2,048/30,720 = 6.7 percent).

With the PowerPC, the results are considerably different. Since PowerPC processing time is dominated by moving data in and out of memory, even when the data resides in L1 cache, the eight*eight*eight*two 1,024-point FFT implementation requires four moves of the entire data set between L1 cache and the computational registers. The weighting function, performed before the FFT algorithm call, requires the entire data set to move between L1 cache and registers one more time, for a total of five times. So the weighting function adds 25 percent to the execution time on the PowerPC.
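The two overhead figures can be verified with a few lines of arithmetic (a Python sketch; function names are mine):

```python
import math

def dsp_weighting_overhead(n):
    """Relative cost of an n-point weighting multiply on a DSP chip,
    measured against the 3*N*log(N) FFT operation count."""
    fft_ops = 3 * n * int(math.log2(n))   # 30,720 for n = 1024
    weight_ops = 2 * n                    # two operations per complex point
    return weight_ops / fft_ops

def powerpc_weighting_overhead(fft_passes):
    """Relative cost on a data-movement-bound PowerPC: the weighting
    step adds one more full trip of the data set through the registers."""
    return 1 / fft_passes

print(f"DSP:     {dsp_weighting_overhead(1024):.1%}")   # about 6.7%
print(f"PowerPC: {powerpc_weighting_overhead(4):.0%}")  # 25%
```

The same front-end step that is nearly free on an operation-counting DSP costs a quarter of the total on a movement-bound PowerPC, which is exactly the shift in thinking the article argues for.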

Dr. Winthrop W. Smith is chief technical officer at DNA Computing Solutions in Richardson, Texas. During his 33 years in the commercial and aerospace real-time signal/image processing marketplace, he has developed numerous DSP inventions, and holds 10 patents in digital signal processing and the design and development of sophisticated radar systems. He was named Inventor of the Year at Martin-Marietta Aerospace Corp. and Author of the Year at E-Systems, Inc. He led the development, manufacture, and marketing of the world's first two massively parallel processor integrated circuits (GAPP and DAP). He wrote the popular IEEE Press book, Handbook of Real-time Fast Fourier Transforms, and has published more than 60 technical articles.