By Richard Jaenicke
Sky Computers Inc.
When designers apply portability to embedded multiprocessor applications, they collide head-on with performance. In fact, making the most of performance within a given budget for power, cost, and space is the very reason for using several processors to solve an embedded problem. Traditional approaches to this conflict have sacrificed performance to achieve portability, often by discarding non-portable performance enhancements without looking for an alternative. Of note are three components of portability in multiprocessor systems: processor independence, topology independence, and scalability.
At the basic level, processor independence is the ability to apply a different processor to each part of the application that has different requirements. For example, a signal processing application may benefit from a heterogeneous processing environment, with digital signal processors (DSPs) performing front-end filtering on the incoming data and RISC processors performing back-end processing such as pattern recognition and decision analysis.
Over time the initial processor selection will likely change, and designers will need to port the application software to a new set of processors. Designers might change the processor to upgrade the application to the latest technology or to reuse the code in a new application. In either case, the goal is to avoid re-architecting the application software.
The primary method of achieving portability between processors is to write the application in a high-level language. This places the burden of processor independence on a sophisticated software tool: the compiler. Even when systems engineers use a high-level language, coding processor-specific parameters such as the size of the cache can compromise portability. The problem is that in many applications, hand-tuned code achieves significantly better performance than processor-independent code. The performance boost typically comes from detailed knowledge of the processor or from applying domain-specific knowledge, such as optimizing for large vector sizes. Even limiting the hand tuning to portions of the application identified as bottlenecks can result in hand coding considerable portions of the application. Good software engineering practices dictate encapsulating these processor-dependent portions in easily separated modules, but that only slightly eases the burden of rewriting and retuning the code.
The solution to this dilemma is to make the software tool smarter. Experts have used intelligent compilers with supercomputers for more than a decade, and they can apply that technology to other types of systems. A particularly useful technique for large data sets is automatic vectorization.
A vectorizing compiler recognizes loops that operate on sets of data as vectors and replaces those loops with function calls tuned for a particular processor. As part of this operation, the compiler must analyze the size of the vectors to check whether they will fit into the processor's cache. If they are too large, then new vector elements could overwrite previous elements in the cache before they are used, causing cache thrashing. To prevent this performance-degrading behavior, the compiler uses strip-mining, which divides the vector elements into groups that will fit into cache and processes each group separately. The results of these separate operations are then combined to produce the final result.
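The idea behind strip-mining can be sketched in plain C. The strip length below is an invented stand-in for a value a vectorizing compiler would derive from the target processor's cache size; the function names are illustrative only.

```c
#include <stddef.h>

/* Illustrative strip size; a vectorizing compiler would derive this
   from the cache size of the target processor. */
#define STRIP_LEN 64

/* Portable dot product written as one loop over the full vectors. */
double dot_naive(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* The same operation strip-mined: the vectors are processed in
   cache-sized groups, and the partial results are combined at the end. */
double dot_stripmined(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t start = 0; start < n; start += STRIP_LEN) {
        size_t end = (start + STRIP_LEN < n) ? start + STRIP_LEN : n;
        double partial = 0.0;            /* result for this strip only */
        for (size_t i = start; i < end; i++)
            partial += a[i] * b[i];
        sum += partial;                  /* combine the strip results  */
    }
    return sum;
}
```

In a real system, the inner loop over each strip would itself be a call to a processor-tuned vector function; the point here is only the division into cache-resident groups and the final combining step.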
Certainly an experienced programmer could perform the strip-mining manually. If he does, however, several problems will quickly surface. First, the programmer needs detailed knowledge of the processor at hand, such as the size of its cache. Second, the programmer must call low-level functions that take processor-specific parameters as arguments. When the designer changes the processor, this code must be rewritten completely. A vectorizing compiler automatically knows how to translate portable high-level function calls into low-level vector function calls and insert the processor-specific parameters. Having the compiler do the strip-mining saves development time, is less error prone, and is portable.
Topology independence enables designers to port an application they developed on one communications architecture, such as a mesh, to another architecture, such as a tree. More importantly, topology independence also enables the designer to change the shape of the mesh or tree as well as the position of the processing nodes within the mesh or tree. Ideally, this freedom to change the location of a resource extends to memory and I/O.
Designers usually accomplish portability among topologies with message passing between processors. Message passing forms a communication layer that insulates the application from the relative location of the resource - the path the communication took to get from the source to the destination. However, experts often reject the overhead of such a large software layer on performance grounds in favor of hand-coding the topology nuances directly in the application. For example, the designer might assign a particular task to a particular processor based on the designer's knowledge that the global memory bank will be architecturally nearby. If the designer changes or upgrades the system architecture, that assumption may no longer be valid. The result could be not only worse performance but also non-functional communications.
Systems designers can achieve topology independence plus performance by choosing a combined memory and communications architecture that supports distributed shared memory in hardware. They can implement distributed shared memory for any communications topology, including rings, meshes, and trees. In this architecture, the processor, memory, and I/O resources are physically distributed but logically connected as a single linear memory address space. The hardware automatically routes data transfers, so any processor can potentially access any resource in the system simply by knowing its address. Such logical addresses are absolute, so no knowledge of the communication path is needed. To the applications developer, the entire system appears as local memory. Thus portability is maintained by hiding the underlying topology without a significant performance penalty.
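A toy software model can make the single-address-space idea concrete. In the sketch below, each node owns a contiguous slice of one linear address space; the node count, slice size, and function names are all invented for illustration, and real distributed-shared-memory hardware performs this routing transparently rather than in software.

```c
#include <stdint.h>

/* Toy model of a distributed shared memory map: each node owns a
   contiguous slice of one linear address space.  Sizes are illustrative. */
#define NODES      4
#define NODE_WORDS 1024

static uint32_t node_mem[NODES][NODE_WORDS];  /* physically distributed */

/* A read needs only the global address -- not the owning node and not
   the communication path.  That is the essence of topology independence:
   the caller cannot tell whether the word is local or remote. */
uint32_t dsm_read(uint32_t addr) {
    return node_mem[addr / NODE_WORDS][addr % NODE_WORDS];
}

void dsm_write(uint32_t addr, uint32_t value) {
    node_mem[addr / NODE_WORDS][addr % NODE_WORDS] = value;
}
```

If the topology changes - say, a mesh becomes a tree - only the routing underneath the address map changes; code written against the flat address space is unaffected.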
The third and most complex component of portability is scalability. Being portable across various system sizes enables the system size to grow with the data set or shrink with the introduction of faster processors. From the hardware side, scalability implies changing the system size without introducing bottlenecks - that is, keeping the processor, memory, and communications performance balanced. From the software side, scalability implies being able to reassign tasks or data sets easily to processors as the number of processors changes.
Creating a scalable system is a difficult goal by itself, even before making performance a priority. Traditional approaches restrict scalability to a limited range and to a given architecture type, such as symmetric multiprocessing. Alternatively, developers often choose to emphasize performance over scalability. They achieve performance by keeping software layers to a minimum and by hard-coding a particular communication method that is not scalable. Hard-coding to particular processors likewise prevents both scaling the system and changing its topology. Implementing with the priority on performance may even entail directly programming direct memory access (DMA) engines to perform the data transfers.
One solution for scalable hardware is a communication network of switches that connects processors, memory, and I/O nodes. As the system grows, designers add more switches to scale the communication bandwidth proportionally. Even with extra switches, however, large systems can bog down unless engineers design the communication protocol for many contending requests. Older circuit-based protocols require a complete end-to-end path to be free before transmitting any data, and they can contribute to congestion in the network even when an attempt to establish a connection fails. Modern communication systems instead use packet-switched protocols with FIFOs between switches. This architecture permits packets of data to progress partway through a congested network and wait in a FIFO for any blocked network segment to become free.
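The role of an inter-switch FIFO can be sketched as a small ring buffer. Everything here - the depth, the struct layout, the function names - is an assumption for illustration, not any particular switch's design; the point is that a blocked packet waits in a buffer instead of holding an end-to-end circuit open.

```c
#include <stdint.h>

#define FIFO_DEPTH 8   /* illustrative depth */

typedef struct {
    uint64_t dest;      /* destination address from the packet header */
    uint32_t payload;
} packet_t;

/* One FIFO between two switches, modeled as a ring buffer. */
typedef struct {
    packet_t slots[FIFO_DEPTH];
    int head, tail, count;
} fifo_t;

/* Returns 0 on success, -1 if full (back-pressure on the upstream link). */
int fifo_push(fifo_t *f, packet_t p) {
    if (f->count == FIFO_DEPTH) return -1;
    f->slots[f->tail] = p;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 0;
}

/* Returns 0 and fills *out on success, -1 if the FIFO is empty. */
int fifo_pop(fifo_t *f, packet_t *out) {
    if (f->count == 0) return -1;
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 0;
}
```

A packet pushed into a FIFO has made forward progress through its segment of the network even if the next segment is busy, which is exactly what a circuit-based protocol cannot offer.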
A communication library provides a scalable software solution when designers implement it with knowledge of the underlying hardware communications architecture. A scalable communications library enables position-independent communications through virtual data paths connecting communicating tasks. These virtual data paths allow tasks to be redistributed, and potentially replicated, to fill the available processing nodes. If the communications architecture does not provide automatic routing, the communications library can provide that functionality, albeit at lower performance. Using its knowledge of the hardware communications architecture, the library can automatically select the most appropriate type of communication, such as shared-memory transfers, direct loads and stores, or transfers using DMA engines.
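Such a selection might look like the following sketch. The threshold, enum names, and function name are all invented for illustration; a real library's policy would depend on the actual hardware and measured transfer costs.

```c
#include <stddef.h>

/* Hypothetical transfer methods a communications library could choose
   among; names are illustrative, not any real library's API. */
typedef enum {
    XFER_LOCAL_COPY,   /* same node: plain memory copy            */
    XFER_SHARED_MEM,   /* small remote transfer: direct stores    */
    XFER_DMA           /* large remote transfer: DMA engine       */
} xfer_t;

/* Pick a transfer method from the endpoints and the transfer size.
   The 256-byte cutoff is an assumed break-even point between CPU
   stores and DMA setup overhead. */
xfer_t choose_transfer(int src_node, int dst_node, size_t bytes) {
    if (src_node == dst_node)
        return XFER_LOCAL_COPY;
    if (bytes < 256)
        return XFER_SHARED_MEM;
    return XFER_DMA;
}
```

Because this decision lives inside the library rather than the application, redistributing tasks across nodes changes which branch is taken but requires no application changes.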
Sky Computers Inc. has applied automatic vectorizing compilers to embedded real-time systems for more than five years. This technology builds on off-the-shelf compiler front-end parsers and back-end code generators by adding manipulation of the intermediate language to support automatic vectorization. The compiler transforms standard library calls into optimized, processor-specific calls and then vectorizes the application.
In one example, the base execution time is for code compiled by a good optimizing compiler. The first level of vectorization is the same code with the C-code loops replaced by vector function calls, compiled with the SKYvec vectorizing compiler using strip mining. The second level adds function chaining, and the third level removes temporary variables. Significant performance enhancements are possible using this intelligent compiler technology, even beyond calling individually tuned library functions. Performance was progressively increased in a portable manner, with very little help from the application developer, using only processor-independent C code.
Engineers can realize the topology independence of a distributed shared memory system in the ANSI standard SKYchannel (ANSI/VITA 10-1995) architecture. SKYchannel sends each data transfer in a packet between bi-directional FIFOs at 320 megabytes per second. Designers can connect FIFOs by either buses or crosspoint switches depending on the bandwidth requirements. Packets route automatically through the system based on the 44-bit destination address in the packet header, making the architecture easy to use as well as high performance.
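The 44-bit destination address is the only routing information a packet needs to carry. The sketch below shows one way such a header field could be packed and sliced for routing; apart from the 44-bit address width, which comes from the text above, every field, width, and function here is an assumption for illustration and not the actual ANSI/VITA 10-1995 header layout.

```c
#include <stdint.h>

/* Low 44 bits of a 64-bit header word hold the destination address;
   the remaining bits are left for other fields (layout assumed). */
#define DEST_MASK ((1ULL << 44) - 1)

uint64_t make_header(uint64_t dest_addr, uint32_t length) {
    return (dest_addr & DEST_MASK) | ((uint64_t)length << 44);
}

uint64_t header_dest(uint64_t header) {
    return header & DEST_MASK;
}

/* Each switching level routes on a slice of the destination address
   alone -- no path information travels with the packet.  The 4-bit
   slice per level is illustrative. */
unsigned route_port(uint64_t dest_addr, unsigned level) {
    return (unsigned)((dest_addr >> (4 * level)) & 0xF);
}
```

Because every switch derives its output port from the absolute address, adding switches or rearranging the network changes the routing tables in hardware, not the packets or the software that sends them.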
Current SKYchannel crosspoint switches have 10 ports and are capable of 5 simultaneous connections at the full 320 megabytes per second. These switches are used on a SKYchannel backplane, which attaches to the rear of the second row of connectors in a VME enclosure. The SKYchannel backplane uses 8 ports of the switch to connect to VME boards, leaving 2 ports available for connecting deeper into the network. Multiple SKYchannel backplanes in multiple VME chassis can be connected by cable links to a higher-level crosspoint switch called a SKYchannel Chassis Crossbar. Multiple levels of SKYchannel Chassis Crossbars enable designers to scale up the hardware communication network to 256 VME boards, all connected by a single SKYchannel system with all memory and I/O accessible by all the processors in a single memory-mapped address space.
Software scalability comes from SKYscl, the scalable communications library. By separately defining the software tasks with their data communication connections and the hardware processors with their physical communication connections, SKYscl facilitates scalability from one to hundreds of processors. Because SKYscl automatically adjusts to the location of the source and destination node of each communication, the library determines how the data should be moved for the highest performance as configurations are changed.
Portability and performance
By using techniques from other arenas, SKY Computers has achieved portability and performance for embedded computing. The key technologies applied can:
- automatically select low-level functions for performance given the high-level portable version;
- automatically fill in processor-specific parameters using existing knowledge of the target processor;
- automatically route communications regardless of topology;
- automatically choose a low-level communication method, given a high-level communication call;
- automatically keep a resource connected to a consumer when either moves between processors or across topologies; and
- automatically scale communication with number of processors while maintaining high performance.
When portability becomes as imperative as performance, choosing the right architectures will lead to achieving both goals.
Richard Jaenicke is director of marketing at Sky Computers Inc. of Chelmsford, Mass. He began his career at the General Electric Corporate Research and Development Center in Schenectady, N.Y., moved to the MIT Lincoln Laboratory, and later worked at Eastman Kodak. He received his bachelor's degree in computer science from Dartmouth College and his master's degree in computer engineering from Rensselaer Polytechnic Institute.