The industry-standard PCI bus and RACE architecture team up to provide the bandwidth and flexible configurability.
Medical imaging system development is driven by two demands in seeming conflict: one, the expectation of ever increasing resolution, which requires ever more computation to yield images, and two, the need to maximize patient throughput in order to make high-end systems affordable. Improvements in resolution and versatility in turn lead to greater demand for imaging resources. Imaging systems have entered diagnostic and interventional settings where they had been little used before.
Higher-performance embedded computer systems for image processing and image formation must be developed in response to these trends. Embedded systems must handle increasing volumes of sensor data in order to create images of greater quality. Simultaneously, they must also generate images more quickly in order to process more patients and provide near-real-time images for interventional applications and acute diagnostics.
Today's medical imaging applications require computers with far greater processing and data-handling bandwidth than ever before. Proprietary pixel buses that have a large yet restricted bandwidth must often be used to augment bus-oriented, board-based systems. The pipeline bus must be arranged in segments so that bandwidth does not become throttled as system size increases. But segmentation results in diminished flexibility: the data stream is more predetermined, and the connection between bus segments has to be designed carefully.
System flexibility and bandwidth can be improved, however, by recourse to a switched-fabric architecture. A switching fabric interconnects system resources by means of switches in multiple stages, which route transactions between an initiator and a target. Each stage of a switching fabric typically consists of an intelligent multiport crossbar switch. The switch device can recognize an identifying data-stream header in order to route the communication transaction dynamically through the appropriate port to the next stage of the network. The initiator and target of the transaction can be any combination of processor and I/O controller, which allows for multiprocessor architectures and flexible I/O configurations. Such a crossbar architecture enables system bandwidth to be increased as the size of the system increases.
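The header-driven routing described above can be illustrated with a toy model. This is only a sketch: the names Crossbar and route, and the idea of a header that lists one output port per stage, are assumptions made for illustration and do not represent the RACEway wire format.

```python
class Crossbar:
    """Toy model of one multiport crossbar stage (illustrative only)."""

    def __init__(self, name, num_ports):
        self.name = name
        self.links = [None] * num_ports  # what each port is wired to

    def connect(self, port, target):
        self.links[port] = target


def route(start, header):
    """Follow a header of output-port numbers, one per switch stage.

    Returns the endpoint reached and the list of (switch, port) hops.
    """
    node, hops = start, []
    for port in header:
        if not isinstance(node, Crossbar):
            raise ValueError("header has more hops than there are stages")
        hops.append((node.name, port))
        node = node.links[port]
        if node is None:
            raise ValueError("port %d is not connected" % port)
    if isinstance(node, Crossbar):
        raise ValueError("header ended inside the fabric")
    return node, hops


# Two stages: switch A reaches two compute nodes directly and switch B
# through port 2; switch B reaches a third compute node and a disk controller.
xbar_a = Crossbar("A", 6)
xbar_b = Crossbar("B", 6)
xbar_a.connect(0, "CN0")
xbar_a.connect(1, "CN1")
xbar_a.connect(2, xbar_b)
xbar_b.connect(0, "CN2")
xbar_b.connect(1, "SCSI")

target, hops = route(xbar_a, [2, 1])
print(target, hops)
```

Because each stage consumes only its own entry in the header, adding a stage lengthens the header without changing the switches already in place.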
Figure 1. The RACE-series VantageRT base module, which combines the industry-standard PCI bus and the RACEway Interlink standard.
Some embedded computing systems, such as Mercury Computer Systems' RACE architecture, embody this capability. Using the RACEway Interlink standard, they offer multiprocessor computing supported by multiple software-selectable data paths at full bandwidth (reference 1). The systems are available in both PCI (peripheral component interconnect) and VME form factors. They make possible field upgradability for entry-level to high-end medical imaging systems.
This article discusses the hardware and software components of the flexible, high-bandwidth systems, and examines their performance in medical imaging applications.
Embedded Processing Based on Combination Architecture
The VantageRT series combines the industry-standard PCI bus with the RACEway Interlink standard. This RACE-PCI combination has found broad acceptance for medical imaging applications. PCI-based systems may be less rugged than those based on the VME bus, but they provide a cost advantage and also access to a broader range of component choices.
Base Module. Each base module, or board, has a PCI edge connector on the bottom, two RACE ports on the top, and two six-way RACE crossbars. An SCSI (small-computer systems interface) connector allows communication to such devices as an SCSI disk array. Several base modules can be configured together, typically in an industrial PC chassis, with PCI Interlink modules resting on top of them for attachment to the RACE ports and providing flexible paths for communication between boards through another crossbar (Figure 1).
Processing is performed by compute nodes (CNs), of which there are two on each base module. The processors in a CN may be either SHARC digital signal processors or PowerPC microprocessors. Other CN components include SDRAM (synchronous dynamic random-access memory), level 2 cache (in the case of the PowerPC), and an application-specific integrated circuit (ASIC) that acts as both memory controller and network interface to the RACE switched-fabric interconnect.
Ports. Each connection provides a bandwidth of 160 Mbyte/sec, with multiple connections occurring at the same time; a six-way crossbar supports three simultaneous communication paths. The two RACE communication ports located on top of the base module provide 320 Mbyte/sec of peak data-transfer access for interconnecting boards. With a six-board configuration, the system has 960 Mbyte/sec of peak data-transfer bandwidth. Additional system bandwidth is available via PCI communications.
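The port figures above can be reproduced with a few lines of arithmetic. The text does not say how the 960-Mbyte/sec figure is derived. The sketch below assumes that every connection occupies two ports, so the 12 ports of six boards pair into six simultaneous paths. The function names are hypothetical.

```python
LINK_MBYTE_PER_SEC = 160  # per-connection bandwidth quoted in the text


def crossbar_paths(num_ports):
    """Each point-to-point path occupies two ports of a crossbar."""
    return num_ports // 2


def peak_bandwidth(num_connections, link=LINK_MBYTE_PER_SEC):
    """Peak aggregate bandwidth when all connections run at once."""
    return num_connections * link


six_way_paths = crossbar_paths(6)      # 3 simultaneous paths
board_ports_bw = peak_bandwidth(2)     # 320: two ports, one connection each

# Assumption: six boards x two ports = 12 ports, paired into 6 paths.
six_board_bw = peak_bandwidth(crossbar_paths(6 * 2))  # 960

print(six_way_paths, board_ports_bw, six_board_bw)
```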
High Bandwidth. The high-bandwidth capacity of this RACE-plus-PCI architecture stands in sharp contrast to the limited bandwidth achievable in a PCI-only system. The PCI standard was not designed to handle many processors or many interrelated I/O streams. The bandwidth of a PCI-bus segment is 133 Mbyte/sec; a typical segment can accept up to four plug-in boards, each of which may contain several processors, but there can be only one communication path along that segment at any particular time. Contention for use of the PCI-bus segment can induce long latencies that limit real-time operation.
Scalability. PCI system scalability can be increased with PCI-to-PCI bridges connecting multiple PCI-bus segments in a passive backplane. Unfortunately, such bridges exacerbate the contention problems of the bus. A single communication path from one side of a bridge to another can consume all of the communication resources for both the upstream and downstream PCI-bus segments. A medical imaging computer system that requires multiple processors and multiple I/O modules will face difficult contention issues if it has PCI-only architecture.
Figure 2. One base module and a display device can be used to construct a vascular review station. See text for the role played by each labeled component.
Although 64-bit PCI systems expand the bandwidth of the PCI segment to 266 Mbyte/sec, that greater bandwidth is still shared by all the resources in the system. Switched-fabric architecture offers higher bandwidth that scales with the number of nodes—up to 2.5 Gbyte/sec connecting 32 CNs. In the RACE++ architecture available in VME-based systems from Mercury—and soon in its RACE-PCI products as well—each connection provides a bandwidth of 266 Mbyte/sec, with an eight-way crossbar supporting four simultaneous communication paths. As many as 32 CNs can be connected with more than 4 Gbyte/sec of bandwidth.
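A short calculation makes the contrast concrete. This is a sketch under two stated assumptions: the PCI figures follow from the conventional 33-MHz PCI clock, and the fabric totals count one full-rate path for every pair of CNs. The second assumption is not given in the text, but it reproduces the quoted 2.5-Gbyte/sec and 4-Gbyte/sec figures.

```python
PCI_CLOCK_MHZ = 33.33  # conventional PCI clock rate


def pci_peak(bus_bits):
    """Peak Mbyte/sec of one PCI segment: bus width in bytes x clock."""
    return bus_bits / 8 * PCI_CLOCK_MHZ


def shared_bus_bandwidth(bus_mbyte_per_sec, num_nodes):
    """A bus carries one transfer at a time, so the total does not grow."""
    return bus_mbyte_per_sec


def fabric_bandwidth(link_mbyte_per_sec, num_nodes):
    """Assumes nodes pair off, each pair on its own full-rate path."""
    return (num_nodes // 2) * link_mbyte_per_sec


for nodes in (2, 8, 32):
    print(nodes,
          shared_bus_bandwidth(266, nodes),  # 64-bit PCI segment
          fabric_bandwidth(160, nodes),      # RACE links
          fabric_bandwidth(266, nodes))      # RACE++ links
```

With 32 CNs, the 160-Mbyte/sec links total 2560 Mbyte/sec and the 266-Mbyte/sec links total 4256 Mbyte/sec, while the shared segment stays at 266 Mbyte/sec however many nodes are attached.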
Software for Real-Time Systems
Real-time systems for medical imaging applications need deterministic, low-latency software components that can exploit the performance capabilities of high-bandwidth, multiprocessor hardware. Multiprocessor configurations require system-level software services at the intraprocessor and interprocessor levels.
The software core of the system under discussion is a run-time environment that has been designed to leverage the full potential of the RACE architecture. The operating system (OS) includes a nanokernel that is executed on each processor to provide intraprocessor services. The nanokernel is a real-time kernel that follows POSIX (the portable operating system interface for UNIX), offering single-processor application programming interfaces (APIs) for process and thread control, timers, interrupts, and device control. Also within the OS is an interprocessor communication system (ICS) for interprocessor services. The ICS supplies a uniform set of process-to-process communication facilities that operate between processes running anywhere within a network of processors. It makes possible remote process control and synchronization, shared-memory objects, and data transfers.
Also necessary to achieve higher productivity during application development are higher-level APIs and development tools. The VantageRT includes a multiprocessor communication API with a high-performance set of libraries in C language. The libraries constitute a complete programming environment for developing parallel applications in a distributed-memory multicomputer system while maintaining optimal hardware performance.
Integral to the run-time environment is a scientific algorithm library (SAL) consisting of more than 400 hand-coded assembly language routines. The SAL is optimized for each processor in the RACE architecture and is designed to promote code reusability. It encompasses a comprehensive group of functions, including vector processing, fast Fourier transform (FFT) algorithms, and data conversion, that are callable from high-level-language programs.
The technology just described can be used in vascular image processing. In the systems discussed below, a PowerPC microprocessor was used. Optional special-processing hardware, such as a convolver, or I/O interfaces may be added to the system at the RACE daughtercard locations.
Digital data can be brought into the system by means of a variety of standard or custom input devices, connected either directly to a RACEway module or by a PCI-to-RACEway interface. Data may be archived to or replayed from an array of high-performance SCSI disks interfaced directly to the real-time system. Image processing is executed on RACE CNs running compiled C programs and special library functions. A programmable RACE convolver node provides up to 9 X 9 spatial filtering at full frame rates.
One base module and a display device can be used to configure a vascular review station as diagrammed in Figure 2. To create such a system, frames acquired at 512 X 512-pixel resolution are archived to a disk drive attached to the board's SCSI interface (point [a] in the figure). During review, the images are retrieved from the disk at 30 frames per second (fps) and sent to a CN (b), where each image is subtracted from a mask image. Images are then edge-enhanced with a 5 X 5 filter and passed through a log look-up table (c). In the display card, the 512 X 512-pixel images are zoomed to 1024 X 1024 pixels (d).
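The four review-station steps can be sketched in plain Python on a small test image. This is an illustration only, not the system's software. The sharpening kernel, the mid-gray offset applied after subtraction, the border handling, and the pixel-replication zoom are all assumptions, since the text does not specify them.

```python
import math


def subtract_mask(frame, mask, offset=128, max_val=255):
    """Mask minus frame, shifted to mid-gray and clamped to 0..max_val."""
    return [[min(max_val, max(0, m - f + offset)) for f, m in zip(frow, mrow)]
            for frow, mrow in zip(frame, mask)]


def filter5x5(image, kernel, max_val=255):
    """5 x 5 spatial filter; border pixels reuse the nearest edge value."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(5):
                for kx in range(5):
                    yy = min(h - 1, max(0, y + ky - 2))
                    xx = min(w - 1, max(0, x + kx - 2))
                    acc += kernel[ky][kx] * image[yy][xx]
            out[y][x] = min(max_val, max(0, int(round(acc))))
    return out


def log_lut(max_val=255):
    """Look-up table mapping 0..max_val onto a logarithmic curve."""
    scale = max_val / math.log(1 + max_val)
    return [int(round(scale * math.log(1 + v))) for v in range(max_val + 1)]


def zoom2x(image):
    """Double both dimensions by pixel replication."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Hypothetical edge-enhancement kernel: weights sum to 1, center boosted.
SHARPEN = [[-1 / 25.0] * 5 for _ in range(5)]
SHARPEN[2][2] += 2.0


def review_frame(frame, mask):
    """Subtract, edge-enhance, apply the log table, then zoom."""
    diff = subtract_mask(frame, mask)
    sharp = filter5x5(diff, SHARPEN)
    lut = log_lut()
    mapped = [[lut[p] for p in row] for row in sharp]
    return zoom2x(mapped)


frame = [[100] * 8 for _ in range(8)]
mask = [[100] * 8 for _ in range(8)]
result = review_frame(frame, mask)
print(len(result), len(result[0]))
```

In the system described, these steps are spread across the disk interface, a CN, and the display card. The sketch runs them in sequence only to show the order of operations.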
A 30-fps cardiac interventional system can be built from three base modules connected by a RACEway Interlink module, along with a display device, as shown in Figure 3. The live frames, 1024 X 1024 resolution with 10-bit depth at 30 fps, are sent from the acquisition system to a RACEway input module (a). The frame lines are divided between two CNs (b), where a complex adaptive algorithm is applied and where the pixel depth is reduced to 8 bits. The frame lines are rejoined as they enter the convolution module (c), where a 9 X 9 sharpening filter is applied in real time and with eight lines (260 microseconds) of latency. In another CN (d), a region of interest of each frame is zoomed and formatted and then sent to the left half of the frame buffer in the display card (e).
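The divide-and-rejoin step at point (b) can be sketched as follows. The text does not describe the adaptive algorithm or how the lines are apportioned. The sketch assumes contiguous bands of lines and substitutes a simple bit shift for the per-node processing, and it runs the bands one after another rather than on separate processors.

```python
def split_lines(frame, num_nodes=2):
    """Divide the frame's lines into contiguous bands, one per node."""
    h = len(frame)
    bounds = [h * i // num_nodes for i in range(num_nodes + 1)]
    return [frame[bounds[i]:bounds[i + 1]] for i in range(num_nodes)]


def reduce_depth(band, in_bits=10, out_bits=8):
    """Placeholder for the per-node processing: drop the low-order bits."""
    shift = in_bits - out_bits
    return [[p >> shift for p in line] for line in band]


def rejoin(bands):
    """Concatenate the processed bands back into one frame."""
    return [line for band in bands for line in band]


def process_frame(frame, num_nodes=2):
    return rejoin([reduce_depth(b) for b in split_lines(frame, num_nodes)])


frame = [[1023, 512, 4, 0] for _ in range(6)]
out = process_frame(frame)
print(out[0])
```

An algorithm that looks at neighboring lines would need each band to carry a few overlapping lines from the adjacent band. The sketch omits this because the placeholder works on one pixel at a time.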
Synchronized with this live-image processing chain, a second imaging chain plays back a previously acquired series of 1024 X 1024 X 8-bit pre- and post-contrast frames. These frames have been archived on two SCSI disk arrays (SDAs) of four disks each (f); each SDA can be read at 30 fps (30 Mbyte/sec). These pairs of frames are sent to a CN (g) for subtraction, and then to a convolver module (h) for edge enhancement. A CN (i) performs formatting and delivers the frames to the right half of the display frame buffer (j).
Scalability. System performance can be increased by adding more base modules, thereby increasing the number of CNs. Standard interfaces facilitate upgrades and the adoption of new technology, and the software environment is common to every configuration. System bandwidth grows through the addition of PCI segments and RACEway paths, while I/O and processing capabilities increase.
Reconfigurability. The speed at which the operating mode of a system can change is governed by interrupt times. A node-to-node interrupt is measured at 20 microseconds. The time to change algorithms and look-up tables is then determined by memory bandwidth and is on the order of a few microseconds per kilobyte.
High Bandwidth. The SDA interface bandwidth is 40 Mbyte/sec peak, which may be achieved with three or four SCSI disks. This is equivalent to 19 fps for 1024 X 1024 X 10-bit frames. RACE 1.0 bandwidth is approximately 157 Mbyte/sec sustained, sufficient for simultaneous reading and writing of 2048 X 2048 X 10-bit frames at 19 fps.
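The frame rates quoted above follow from dividing bandwidth by frame size. The sketch below assumes that 10-bit pixels are stored in two bytes each and that 1 Mbyte means one million bytes. Neither assumption is stated in the text, but together they reproduce the 19-fps figure.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Bytes per frame, assuming pixels are stored in whole bytes."""
    bytes_per_pixel = (bits_per_pixel + 7) // 8
    return width * height * bytes_per_pixel


def frames_per_second(mbyte_per_sec, width, height, bits_per_pixel):
    """Frame rate that a given bandwidth can sustain for one stream."""
    return mbyte_per_sec * 1e6 / frame_bytes(width, height, bits_per_pixel)


sda_fps = frames_per_second(40, 1024, 1024, 10)    # about 19.1
race_fps = frames_per_second(157, 2048, 2048, 10)  # about 18.7

print(round(sda_fps, 1), round(race_fps, 1))
```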
Low Latency. At 30 fps for 1024 X 1024-pixel frames, one line of the image represents 32 microseconds of latency, and two-frame intervals are 67 milliseconds. The system's convolver module introduces 260 microseconds of latency for 8-bit frames, and one frame plus 260 microseconds for 10- or 12-bit images (two convolver passes are used for these pixel depths). RACEway transfers are most efficient for blocks larger than 2048 bytes; therefore, system operation must be tuned for the most time-critical processing steps in terms of fast transfers (more lines per block) and low latency (fewer lines per block). In addition, special algorithms may introduce more latency because of operating on multiple lines at one time.
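The latency figures can be checked with simple timing arithmetic. The sketch assumes the frame period is divided evenly among the 1024 lines with no blanking interval. The helper lines_per_block is hypothetical and is included only to show the trade-off between block size and latency.

```python
def line_time_us(fps, lines_per_frame):
    """Microseconds per image line at a given frame rate."""
    return 1e6 / (fps * lines_per_frame)


def frame_time_ms(fps):
    """Milliseconds per frame."""
    return 1e3 / fps


def lines_per_block(block_bytes, pixels_per_line, bytes_per_pixel):
    """Smallest number of whole lines that holds at least block_bytes."""
    line_bytes = pixels_per_line * bytes_per_pixel
    return -(-block_bytes // line_bytes)  # ceiling division


line_us = line_time_us(30, 1024)        # about 32.6
convolver_us = 8 * line_us              # about 260, eight lines
two_frames_ms = 2 * frame_time_ms(30)   # about 66.7

# Latency added by buffering enough 8-bit lines to fill a 2048-byte block.
block_lines = lines_per_block(2048, 1024, 1)
block_latency_us = block_lines * line_us

print(round(line_us, 1), round(convolver_us), round(two_frames_ms, 1),
      block_lines, round(block_latency_us, 1))
```

Sending more lines per block makes each transfer more efficient, but it also adds one line time of waiting for every extra line buffered.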
Figure 3. Diagram of a cardiac interventional system based on three base modules connected by a RACEway Interlink module and including a display device. See text for an explanation of the role played by each labeled component.
Computation power. A system convolver performs spatial filtering with 9 X 9 arbitrary kernels on 1024 X 1024 X 8-bit frames at full frame rate up to 30 fps. Systems based on standard PCI and RACEway architectures feature highly scalable performance. Basic configurations of system elements support 30 fps in acquisition and display of 1024 X 1024 frames, while larger configurations support larger frames, greater pixel depths, faster frame rates, and more functionality. Simpler configurations can sacrifice performance in trade for lower cost. All system configurations use the same scalable, real-time software base. These programmable systems can be reconfigured on the fly by means of software switching.
Other Medical Imaging
Real-time magnetic resonance (MR) scanning has historically been hampered by slowness of image acquisition, reconstruction, and display. The computer processing alone could take several seconds, and images could be analyzed only after an examination had been completed. Additionally, any movement within the body, such as blood flow, slowed image capture and made getting a clear image difficult. Images frequently had to be retaken. These limitations resulted in only a relatively small number of patients being processed by each imaging machine, and then only for a small set of diagnostic applications.
Technologists have been able to solve these problems by designing a real-time MR scanning system using the embedded processing technology described above. Data acquisition and image reconstruction can be performed at speeds that make possible interactive MR examinations of a beating heart. In less than 1 second, a complete 3-D image can be acquired, reconstructed, and displayed. As a result of these improvements, blood flow in deep tissues now can be recorded with high resolution.
Solutions to challenging imaging problems require the use of high-performance embedded computing systems and low-latency, real-time communication capabilities. Significant new trends in medical imaging, such as multislice computer tomography and 3-D imaging, will also require such technologies. The performance of the system described here suggests its potential utility in devices created to fill present and future needs.
1. ANSI/VITA 5-1994, "RACEway Interlink," VMEbus International Trade Association, Scottsdale, AZ.
Iain Goddard, ScD, is manager of Medical Technology Applications, Mercury Computer Systems Inc. (Chelmsford, MA).