

Modern high-end computing design calls for large numbers of processors and/or nodes to be combined into an image of a single machine. Such computing environments are prone to periodic hardware failures, and they must be able to overcome these failures and still perform useful work. To minimize downtime, we are developing new techniques and algorithms that provide fault-tolerant and highly efficient task scheduling in multiprocessor computing systems.

One often-neglected measure of robustness is the timeliness of system operation. In addition to our fault-tolerance and scheduling work, we have developed some useful approaches for designing and reasoning about systems with timeliness requirements, i.e., real-time systems.

As another area where robustness is a factor, we are also looking into building clusters that minimize the "total cost of ownership / performance" ratio. Whereas the traditional "price / performance" ratio addresses the question "What is the highest performance supercomputer I can afford to buy?", it does not address the more important question "What is the highest performance supercomputer that I can afford to own?" In addition to the cost of acquisition, owning a supercomputer requires one to consider the facility and utility costs (including space, power and cooling), as well as the maintenance and system administration costs. The latter costs are directly impacted by robustness (or the lack thereof). We call this project Supercomputing in Small Spaces.
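The difference between the two ratios can be illustrated with a minimal sketch. It assumes that total cost of ownership is simply the sum of the cost categories named above over some fixed ownership period, and that performance is a single number in whatever unit one prefers. The class, the function names, and all figures are hypothetical placeholders chosen for illustration, not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass
class OwnershipCosts:
    """Costs over a fixed ownership period, in arbitrary currency units.

    The field names mirror the cost categories in the text; the breakdown
    itself is an assumption made for this sketch.
    """
    acquisition: float
    space: float
    power: float
    cooling: float
    maintenance: float
    administration: float

    def total(self) -> float:
        # Total cost of ownership: acquisition plus every recurring cost.
        return (self.acquisition + self.space + self.power
                + self.cooling + self.maintenance + self.administration)


def price_performance(costs: OwnershipCosts, performance: float) -> float:
    """Traditional ratio: acquisition cost per unit of performance."""
    if performance <= 0:
        raise ValueError("performance must be positive")
    return costs.acquisition / performance


def tco_performance(costs: OwnershipCosts, performance: float) -> float:
    """Ownership ratio: total cost of ownership per unit of performance."""
    if performance <= 0:
        raise ValueError("performance must be positive")
    return costs.total() / performance


# Hypothetical placeholder figures, not real data.
# Cluster A is cheaper to buy but costly to house, power, and maintain.
cluster_a = OwnershipCosts(acquisition=100.0, space=10.0, power=40.0,
                           cooling=30.0, maintenance=50.0, administration=70.0)
# Cluster B costs more up front but is more robust and more efficient.
cluster_b = OwnershipCosts(acquisition=150.0, space=5.0, power=10.0,
                           cooling=5.0, maintenance=10.0, administration=20.0)
performance = 10.0  # same performance assumed for both clusters

for name, costs in (("A", cluster_a), ("B", cluster_b)):
    print(name,
          "price/perf:", price_performance(costs, performance),
          "TCO/perf:", tco_performance(costs, performance))
```

With these placeholder figures, cluster A looks better on price / performance while cluster B looks better on total cost of ownership / performance, which is the situation the second question is meant to expose. Lower is better for both ratios.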