LA-MPI is no longer in active development, but is being maintained for use on production systems at LANL, and we welcome other users.
Our future development is focused on the Open MPI project, a new component-based, extensible implementation of MPI-2.
LA-MPI is an implementation of the Message Passing Interface (MPI) motivated by a growing need for fault tolerance at the software level in large high-performance computing (HPC) systems.
This need is caused by the vast number of components present in modern HPC systems, particularly clusters. The individual components -- processors, memory modules, network interface cards (NICs), etc. -- are typically manufactured to tolerances adequate for small or desktop systems. When they are aggregated into a large HPC system, however, the system-wide error rate may be too great for a long application run to complete successfully. For example, a network device may have an error rate that is perfectly acceptable for a desktop system but not for a cluster of thousands of nodes, which must run error-free for many hours or even days to complete a scientific calculation.
LA-MPI has two primary goals: network fault tolerance and high performance.
Network fault tolerance is achieved by implementing a highly efficient checksum/retransmission protocol. The integrity of delivered data is (optionally) verified at the user level using a checksum or CRC. Data that is corrupt (or never delivered) is retransmitted.
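The checksum/retransmission idea can be illustrated with a minimal sketch. This is not LA-MPI's code: the frame layout, the `FlakyLink` test double, and the names `make_frame`, `frame_ok`, and `reliable_send` are all hypothetical, and CRC-32 from Python's `zlib` stands in for whatever checksum or CRC is configured.

```python
import zlib


def make_frame(seq, payload):
    """Attach a sequence number and a CRC-32 to a payload (hypothetical layout)."""
    return (seq, payload, zlib.crc32(payload))


def frame_ok(frame):
    """Receiver-side check: recompute the CRC and compare with the one sent."""
    _, payload, crc = frame
    return zlib.crc32(payload) == crc


class FlakyLink:
    """Hypothetical link that misbehaves according to a fixed script.

    Each script entry is 'ok', 'corrupt', or 'drop' and is consumed by one send.
    Once the script is exhausted, every send succeeds.
    """

    def __init__(self, script):
        self.script = list(script)

    def send(self, frame):
        action = self.script.pop(0) if self.script else "ok"
        if action == "drop":
            return None  # the frame never arrives
        if action == "corrupt":
            seq, payload, crc = frame
            # Flip every bit of the first byte (assumes a non-empty payload).
            damaged = bytes([payload[0] ^ 0xFF]) + payload[1:]
            return (seq, damaged, crc)
        return frame


def reliable_send(link, seq, payload, max_tries=5):
    """Resend until an intact frame arrives; return (payload, attempts used)."""
    for attempt in range(1, max_tries + 1):
        received = link.send(make_frame(seq, payload))
        if received is not None and frame_ok(received):
            return received[1], attempt
        # Dropped or corrupt: loop around and retransmit.
    raise RuntimeError("fragment %d undeliverable after %d tries" % (seq, max_tries))


link = FlakyLink(["corrupt", "drop", "ok"])
data, attempts = reliable_send(link, 0, b"hello")
```

In this scripted run the first copy is corrupted, the second is dropped, and the third arrives intact, so `attempts` is 3. A real implementation would detect loss through acknowledgements and timeouts rather than a return value.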
As for high performance, LA-MPI's lightweight checksum/retransmission protocol allows us to achieve low-latency messaging. Furthermore, the flexible approach taken to the use of redundant data paths in a network-device-rich system leads to high network bandwidth, since different messages and/or message fragments can be sent in parallel along different paths. Also, since LA-MPI is developed for use on the large systems at Los Alamos National Laboratory, we have verified that LA-MPI is scalable to over 3,500 processes.
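Sending message fragments along different paths can be sketched as follows. The source does not describe how LA-MPI schedules fragments, so the round-robin policy here is an assumption, and the function names `fragment`, `stripe`, and `reassemble` are hypothetical.

```python
def fragment(message, size):
    """Split a message into (sequence number, chunk) pairs of at most `size` bytes."""
    return [
        (seq, message[offset:offset + size])
        for seq, offset in enumerate(range(0, len(message), size))
    ]


def stripe(fragments, n_paths):
    """Assign fragments to paths round-robin (assumed policy, not LA-MPI's)."""
    paths = [[] for _ in range(n_paths)]
    for seq, chunk in fragments:
        paths[seq % n_paths].append((seq, chunk))
    return paths


def reassemble(paths):
    """Merge fragments from all paths and restore order by sequence number."""
    arrived = [frag for path in paths for frag in path]
    return b"".join(chunk for _, chunk in sorted(arrived))


message = b"abcdefghij"
paths = stripe(fragment(message, 3), n_paths=2)
restored = reassemble(paths)
```

Because each fragment carries its sequence number, the receiver can rebuild the message regardless of which path delivered which fragment or in what order the fragments arrived.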
An alternative solution to the network fault tolerance problem is to use the TCP/IP protocol. We believe, however, that this protocol -- developed to handle unreliable, inhomogeneous and oversubscribed networks -- performs poorly and is overly complex for HPC system messaging, and that LA-MPI's lightweight checksum/retransmission protocol is a more appropriate choice.
The current release of LA-MPI is
Some earlier releases are also available:
LA-MPI is installed in the usual way:

    configure [OPTIONS]
    make
    make install
where configure options include

    --enable-debug    enable debugging
    --enable-lsf      use LSF
    --enable-rms      use RMS
    --enable-bproc    use BPROC
    --enable-udp      enable UDP path
    --enable-tcp      enable TCP path
    --enable-qsnet    enable QSNET path
    --enable-gm       enable Myrinet GM path
    --enable-ib       enable InfiniBand path
    --with-romio      include MPI-IO support
LA-MPI is developed by the Application Communications and Performance Research Team of the Advanced Computing Laboratory at LANL. We are actively investigating other aspects of fault tolerance and performance optimization. Topics of current interest include
Many of these ideas will be explored as part of the Open MPI project.
Also see our Open MPI papers.
LA-MPI is developed at the Advanced Computing Laboratory of Los Alamos National Laboratory. For more information contact email@example.com
© 2002-2004 University of California