Before choosing the specific OS, you must determine the type that’s best for your design.
Medical device manufacturers understand the importance of the operating system (OS). In fact, contrary to common practice in the world of embedded systems, they often select the OS even before they choose the board. According to VDC Research, for example, in 2010, 36.4% of medical device projects chose the OS first, compared to 20.8% of telecommunications projects, and just 9.3% of transportation projects.
1. Percent of projects selecting the OS first, by industry.
This anomaly underlines just how much medical devices depend on their OSs. It does not, however, help with OS selection, which is made more difficult thanks to constant innovation and development that combine to present a bewildering line-up of possibilities: Android, QNX Neutrino RTOS, myriad Linux flavors, Windows CE, and roll-your-own, to name just a few.
Of course, a serious engineer would frame OS selection not as “which OS?” but as “what does the project need from its OS?” The answers to this question lead to a shortlist of viable candidates.
Though these answers will be unique to every project, we can make a few assumptions. The OS must support the project’s business requirements; it must support the device’s regulatory compliance requirements; and it must possess whatever characteristics the device requires of it, starting with, in most cases, dependability.
The business needs driving OS selection for medical devices are like those for most other devices, and require little elaboration here: cost, quality, time-to-market, portability, support, ecosystem, and the vendor’s track record and long-term viability.
Before a medical device can go to market, the manufacturer must demonstrate that it complies with legislation in the jurisdictions where it will be sold, such as FDA 510(k) pre-market notification in the U.S., and the Medical Devices Directive (MDD) and myriad national standards in Europe.
Though agencies such as the FDA evaluate devices as a whole, how device components (including the OS) are developed, and how their characteristics, chiefly their functional safety claims, are validated can prove crucial to a device achieving compliance. Things to look for from a vendor include:
In its broadest definition, a medical device can be anything from a bathroom scale (which today almost certainly includes some electronics) to a dialysis system. For the purposes of this discussion, we exclude consumer-grade devices like a bathroom scale, whose failure implies nothing more than a minor inconvenience. For devices whose failure carries serious consequences, we can group key OS characteristics as follows:
To these we can add:
Each of these characteristics merits in-depth discussion. We will focus here on one that is arguably the most important, dependability, and some OS characteristics that support it.
2. A simple patient-monitoring system.
GPOS or RTOS
Dependability is a combination of two characteristics: availability—how often the system responds to requests in a timely manner, and reliability—how often these responses are correct. In other words, a dependable OS is one that responds when it’s required in the time required, and responds correctly.
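As a rough illustration of these two measures, the sketch below computes them from a log of responses. The function name, the log format, and the 10 ms deadline are assumptions made for this example, and reliability is read here as the share of timely responses that are also correct.

```python
def dependability(responses, deadline_ms):
    """Compute (availability, reliability) from a non-empty response log.

    responses: list of (latency_ms, is_correct) tuples, one per request.
    """
    # Availability: share of requests answered within the deadline.
    timely = [(latency, ok) for latency, ok in responses if latency <= deadline_ms]
    availability = len(timely) / len(responses)
    # Reliability: share of those timely responses that were correct.
    correct = sum(1 for _, ok in timely if ok)
    reliability = correct / len(timely) if timely else 0.0
    return availability, reliability


# Hypothetical log: five requests, one late and one incorrect.
log = [(2, True), (3, True), (12, True), (4, False), (1, True)]
availability, reliability = dependability(log, deadline_ms=10)
print(availability, reliability)
```

With this log, four of five responses meet the deadline and three of those four are correct. A late answer counts against availability even when its content is right.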
This fundamental requirement for a dependable OS precludes using a general-purpose operating system (GPOS), because GPOSs can only offer best-effort performance. They’re designed to do many things well—often extremely well—but they can’t offer the strict guarantees of availability and reliability required of a medical device. And they can’t guarantee that they’ll always perform as required.
In contrast, an RTOS is explicitly engineered to guarantee availability and reliability. A designer can count on the RTOS always being available when it’s expected, and always completing tasks as expected. Assuming then that the RTOS supports the other required functionality, we can conclude that most medical devices (other than low-end consumer disposables) require an RTOS.
In a recent MED article (“Android is the best operating system for many medical applications”), the author suggests that medical devices might be well served by two OSs, presumably an RTOS and a GPOS: “a high-reliability OS to perform critical functions and a processor with a heavyweight OS to support less-critical tasks.” This suggestion implies either that no RTOS can deliver the functionality required for the medical device, or that such an RTOS would be more costly to buy, maintain, and certify than two different OSs running on distinct processors. There are, in fact, reasonably priced RTOSs available that provide both dependability and a full feature set.
The question then becomes not “GPOS or RTOS,” but “which RTOS?” To help answer this question, we should consider first the RTOS’s architecture (no, not all RTOSs are created equal) and second, the key characteristics that support the RTOS’s claims to dependability.
Even a device as simple as an in-home medication dispenser can’t afford a failure. If a malfunction causes data loss or corruption, the dispenser might skip or double a medication, with dire consequences. Because an OS’s architecture has a profound effect on a system’s reliability and ability to recover from faults, it should, therefore, be the first item under scrutiny. The three most common RTOS architectures are real-time executive, monolithic, and microkernel.
The real-time executive model is 50 years old, yet still forms the basis of many RTOSs, particularly roll-your-own RTOSs. With this model, all software components—kernel, networking stacks, file systems, drivers, and applications—run together in one memory address space. Though it’s efficient, this architecture has two immediate drawbacks. First, a single pointer error in any module can corrupt memory used by the kernel or any other module, leading to unpredictable behavior or system-wide failure. Second, the system can crash without leaving diagnostic information to help pinpoint the bug.
Some RTOSs attempt to address the problem of a memory error provoking system-wide corruption by using a monolithic architecture in which user applications run as memory-protected processes. This architecture does protect the kernel from errant user code, but kernel components still share the same address space as file systems, protocol stacks, drivers, and other system services. Hence, a single programming error in any service can cause the entire system to fail.
In a microkernel RTOS, device drivers, file systems, networking stacks, and applications all reside outside the kernel in separate address spaces. They are thus isolated from both the kernel and each other. A fault in one component won’t bring down the entire system. Memory faults in a component can’t corrupt other processes or the kernel, and the OS can automatically restart any failed component without a system reboot.
3. A microkernel OS in a patient-monitoring system.
Key RTOS characteristics
Architecture is only one of many OS characteristics that must be evaluated. Other important characteristics include the OS’s ability to:
A preemptible kernel is essential to any system that must meet real-time commitments. For instance, an alarm triggered when a patient falls should be able to preempt processes drawing a display, as should processes required to send out the alarm. It doesn’t really matter how long it takes the system to display a meal reminder if the person being reminded is lying on the floor with a broken hip. The alarm and communications stack must get CPU cycles to summon help.
To ensure that high-priority processes always get the CPU cycles they need, RTOS kernel operations are preemptible. As in a GPOS, there are time windows during which preemption may not occur, though in a well-designed RTOS these windows are extremely brief, often on the order of hundreds of nanoseconds. Moreover, the RTOS imposes an upper bound on how long preemption is held off and interrupts are disabled.
To realize this goal of consistent timely completion of critical activities, the RTOS kernel must be as simple as possible, so that there is an upper bound on the longest non-preemptible code path through the kernel. This simplicity is achievable in an OS with a kernel that assigns work-intensive operations (such as process loading) to external processes or threads, and thus includes only services with a short execution path.
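The effect of preemption on the fall-alarm scenario can be sketched with a toy, tick-based simulation. This is not how any particular RTOS schedules threads; the task names, priorities, and durations are hypothetical, and the model assumes that a larger number means a more urgent task and that preemption is possible at every tick.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int   # larger number = more urgent (convention of this sketch)
    release: int    # tick at which the task becomes ready
    remaining: int  # ticks of CPU time still needed


def run_preemptive(tasks, ticks):
    """Return the name of the task that ran in each tick ('idle' if none)."""
    timeline = []
    for now in range(ticks):
        ready = [t for t in tasks if t.release <= now and t.remaining > 0]
        if not ready:
            timeline.append("idle")
            continue
        # The scheduler re-evaluates every tick, so a newly released
        # high-priority task displaces whatever was running.
        current = max(ready, key=lambda t: t.priority)
        current.remaining -= 1
        timeline.append(current.name)
    return timeline


tasks = [
    Task("display", priority=1, release=0, remaining=4),
    Task("alarm", priority=9, release=2, remaining=2),
]
timeline = run_preemptive(tasks, ticks=7)
print(timeline)
```

In this run the display task is interrupted at tick 2, the moment the alarm becomes ready, and resumes only after the alarm has finished.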
Protect against priority inversions
One of the more common (and notorious) errors in an OS is priority inversion. This problem, which infamously plagued the Mars Pathfinder project in July 1997, is a condition where a low-priority task prevents a higher-priority task from completing its work. For example, consider a patient-monitoring system where the alarm control, data logger, and data aggregator share a resource. If the data logger holds the resource, the higher-priority task (alarm control) must wait for the lower-priority task (data logger) to complete before it can continue. A third task (data aggregator) has a lower priority than the alarm control, but a higher priority than the data logger. The data aggregator preempts the data logger, effectively preempting the alarm control, which can no longer meet its real-time commitments.
Priority inheritance is a technique for preventing priority inversions by assigning the priority of a blocked higher-priority task to the lower-priority thread doing the blocking until the blocking task completes. For example, the data logger inherits the alarm control’s priority, and hence can’t be preempted by the data aggregator. When it completes, it reverts to its original priority, and the alarm control unblocks and continues, unaffected by the data aggregator.
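The difference can be seen in a small tick-based simulation of the three tasks. It is a sketch under stated assumptions rather than a model of any real kernel: the priorities, release times, and durations are invented, a larger number means a more urgent task, and a task that uses the shared resource is assumed to hold it for its entire run.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int            # larger number = more urgent (sketch convention)
    release: int             # tick at which the task becomes ready
    remaining: int           # ticks of CPU time still needed
    uses_lock: bool = False  # True if the task holds the shared resource while it runs
    finished_at: int = -1    # tick at which the task completed (-1 = not finished)


def simulate(inheritance, ticks=12):
    """Run the three-task scenario and return each task's completion tick."""
    tasks = [
        Task("data_logger", priority=1, release=0, remaining=3, uses_lock=True),
        Task("data_aggregator", priority=5, release=1, remaining=4),
        Task("alarm_control", priority=9, release=1, remaining=2, uses_lock=True),
    ]
    holder = None  # task currently holding the shared resource

    for now in range(ticks):
        ready = [t for t in tasks if t.release <= now and t.remaining > 0]
        # A task is blocked if it needs the resource while another task holds it.
        waiting = [t for t in ready
                   if t.uses_lock and holder is not None and t is not holder]
        runnable = [t for t in ready if not any(t is w for w in waiting)]
        if not runnable:
            continue

        def effective_priority(task):
            # With inheritance, the holder runs at the priority of its
            # most urgent waiter until it releases the resource.
            if inheritance and task is holder and waiting:
                return max(task.priority, max(w.priority for w in waiting))
            return task.priority

        current = max(runnable, key=effective_priority)
        if current.uses_lock:
            holder = current
        current.remaining -= 1
        if current.remaining == 0:
            current.finished_at = now + 1
            if holder is current:
                holder = None  # resource released, priority reverts

    return {t.name: t.finished_at for t in tasks}


without = simulate(inheritance=False)
with_inheritance = simulate(inheritance=True)
print("alarm_control done at tick", without["alarm_control"], "without inheritance")
print("alarm_control done at tick", with_inheritance["alarm_control"], "with inheritance")
```

Without inheritance, the alarm control finishes at tick 9, behind both lower-priority tasks. With inheritance it finishes at tick 5, and the data aggregator simply runs afterward. On POSIX systems, this behavior is typically requested per mutex through `pthread_mutexattr_setprotocol()` with `PTHREAD_PRIO_INHERIT`.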
4. Priority inheritance prevents priority inversion.
For many systems, guaranteeing resource availability is critical. If, for instance, a key subsystem is starved of CPU cycles, the services it provides become unavailable to other subsystems, with possibly dire consequences. For example, a heart monitor that loses connectivity may cause the central monitoring system to incorrectly assume an alarm condition and dispatch help, or, far worse, the patient may be in distress with no one alerted and no help forthcoming.
Process starvation can have a variety of causes, from denial-of-service (DoS) attacks to the addition of new software functionality. Historically, the solution to this problem was either to retrofit hardware or to redesign software, both undesirable alternatives. While it would be possible to push redesigned software out to connected medical devices, not only would the software redesign be costly, but it would likely invalidate the device’s certification. Hardware retrofits would amount to a product recall, with all the attendant damages to the manufacturer’s revenue and reputation.
Partitioning addresses resource starvation by enforcing CPU budgets and preventing processes or threads from monopolizing CPU cycles. Two types of partitioning are possible: fixed and adaptive. With fixed partitioning, the system designer divides tasks into partitions, allocating a portion of CPU time to each. No task in any partition may consume more than that partition's percentage of CPU time. If a partition is allocated 30% of the CPU, that’s all the CPU time the processes in that partition may consume. This limit allows processes in other partitions to maintain their availability, and ensures that all key processes are always available.
Unfortunately, with fixed partitioning a process can never use more CPU cycles than the allocated limit of its partition, even if cycles allocated to other partitions are unused. Fixed partitioning protects against resource starvation, but it squanders CPU cycles and reduces the system’s ability to handle peak demands.
Like fixed partitioning, adaptive partitioning protects against resource starvation. Unlike fixed partitioning, though, adaptive partitioning uses a dynamic scheduling algorithm, reassigning CPU cycles from partitions that aren’t using them to partitions that can benefit from extra processing time. When processes in more than one partition compete for cycles, the partitioning enforces resource budgets. Designers can thus count on resource guarantees, while not having to work around what is, in effect, the reduced CPU capacity imposed by fixed partitioning.
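The two schemes can be compared with a short calculation over one scheduling window. This is a minimal sketch, not any vendor's scheduling algorithm: the partition names, budgets, and demands are hypothetical, and the rule used here for handing out spare cycles (in proportion to each partition's budget) is an assumption chosen for simplicity.

```python
def allocate(budgets, demands, adaptive):
    """Return the CPU percentage granted to each partition in one window.

    budgets: partition -> guaranteed share of the CPU, in percent (sums to 100).
    demands: partition -> share of the CPU the partition wants in this window.
    """
    # Both schemes first grant each partition up to its budget.
    granted = {p: min(demands[p], budgets[p]) for p in budgets}
    if not adaptive:
        return granted  # fixed partitioning: unused budget is simply lost

    # Adaptive: redistribute cycles that other partitions left unused.
    spare = 100 - sum(granted.values())
    while spare > 1e-9:
        hungry = [p for p in budgets if demands[p] - granted[p] > 1e-9]
        weight = sum(budgets[p] for p in hungry)
        if not hungry or weight == 0:
            break
        handed_out = 0.0
        for p in hungry:
            share = spare * budgets[p] / weight
            extra = min(share, demands[p] - granted[p])
            granted[p] += extra
            handed_out += extra
        spare -= handed_out
    return granted


budgets = {"alarm": 30, "ui": 50, "logging": 20}
light_load = {"alarm": 5, "ui": 80, "logging": 20}     # alarm partition mostly idle
overload = {"alarm": 100, "ui": 100, "logging": 100}   # every partition saturated

fixed_light = allocate(budgets, light_load, adaptive=False)
adaptive_light = allocate(budgets, light_load, adaptive=True)
adaptive_overload = allocate(budgets, overload, adaptive=True)
print(fixed_light, adaptive_light, adaptive_overload)
```

Under the light load, fixed partitioning caps the UI partition at 50% and leaves a quarter of the CPU idle, while the adaptive scheme lets the UI use 75%. Under overload, the adaptive scheme falls back to the configured budgets, so the alarm partition still receives its guaranteed 30%.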
Monitor and stop or restart processes
Safeguards against process failures cascading through the system and self-healing capabilities are both crucial to a highly dependable OS. Devices requiring availability or safety guarantees may implement hardware-oriented high-availability solutions, as well as a software watchdog.
A watchdog is a process that monitors the system and performs multi-stage recoveries or clean shutdowns as required. Depending on the implementation, in the event of a failure it should do one of three things: abort then restart the process that failed without a system reboot; terminate the failed process and any related processes, initialize the hardware to a “safe” state, then restart the terminated processes in a coordinated manner; or if the failure is critical (and especially if the failure might compromise safety), perform a controlled shutdown or reset of the entire system, and sound an alarm to system operators.
In all cases, the watchdog must be self-monitoring and resilient to internal failures. If, for whatever reason, it’s stopped abnormally, it must immediately and completely reconstruct its own state by handing over to a mirror process.
Finally, a software watchdog can monitor for system events that are invisible to a conventional hardware watchdog. For example, a hardware watchdog can ensure that a driver is servicing the hardware, but may have difficulty detecting whether other programs are talking to that driver correctly. A software watchdog can bridge this gap and take action before the driver itself shows any problems.
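The three recovery options described earlier can be sketched as a simple decision policy. The example below only records the actions a watchdog would take; it does not manage real processes or hardware. The class, the severity levels, and the process names are all hypothetical, introduced for illustration.

```python
from enum import Enum


class Severity(Enum):
    ISOLATED = 1   # only the failed process is affected
    RELATED = 2    # related processes or hardware may be in a bad state
    CRITICAL = 3   # the failure might compromise safety


class Watchdog:
    def __init__(self, related):
        # related: process name -> names of processes that depend on it
        self.related = related

    def on_failure(self, name, severity):
        """Return the ordered list of recovery actions for one failure."""
        if severity is Severity.ISOLATED:
            # Option 1: restart just the failed process, no system reboot.
            return [f"restart {name}"]
        if severity is Severity.RELATED:
            # Option 2: coordinated recovery of the failed process and its group.
            group = [name] + self.related.get(name, [])
            actions = [f"terminate {proc}" for proc in group]
            actions.append("set hardware to safe state")
            actions += [f"restart {proc}" for proc in group]
            return actions
        # Option 3: critical failure, so alert operators and stop safely.
        return ["sound alarm", "controlled shutdown"]


watchdog = Watchdog(related={"pump_driver": ["dose_controller"]})
isolated = watchdog.on_failure("display", Severity.ISOLATED)
coordinated = watchdog.on_failure("pump_driver", Severity.RELATED)
critical = watchdog.on_failure("pump_driver", Severity.CRITICAL)
print(isolated, coordinated, critical, sep="\n")
```

A production watchdog would also need the self-monitoring and mirror-process handover described above, which this sketch leaves out.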
Justin Moon is currently QNX Software Systems’ product manager for the medical market. Since joining the company ten years ago, he has worked on the Custom Engineering Team, specializing in BSP and driver development, and on the Automotive Team. Moon studied computer engineering at St. Lawrence College.