This device is experimental, and its version of mpirun is a little different from that of the other devices. This section describes how the mpd system of daemons works and how to run MPI programs using it. To use this system, mpich must have been configured with the ch_p4mpd device, and the daemons must have been started on the machines where you will be running.
The goal of the multipurpose daemon (mpd, and the associated ch_p4mpd device) is to make mpirun behave like a single program even as it starts multiple processes to execute an MPI job. We will distinguish between the mpirun process and the MPI processes it starts. Such behavior includes:
fast, scalable startup of MPI (and even non-MPI) processes. For those accustomed to using the ch_p4 device on TCP networks, this will be the most immediately noticeable change. Job startup is now much faster.
collection of stdout and stderr from the MPI processes to the stdout and stderr of the mpirun process.
delivery of mpirun's stdin to the stdin of MPI process 0.
delivery of signals from the mpirun process to the MPI processes. This means that it is easy to kill, suspend, and resume your parallel job just as if it were a single process, with cntl-C, cntl-Z, and the bg and fg commands (see the example session after this list).
delivery of command-line arguments to all MPI processes.
copying of the PATH environment from the environment in which mpirun is executed to the environments in which the MPI processes are executed.
use of an optional argument to provide other environment variables.
use of a further optional argument to specify where the MPI processes will run (see below).
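For example, because signals delivered to mpirun are forwarded to the MPI processes, an entire parallel job can be managed with ordinary shell job control. The following session is only a sketch; the job-number and status lines your shell prints will differ:

    fire% mpirun -np 8 cpi
    ^Z                       (cntl-Z suspends mpirun and all MPI processes)
    [1]+  Stopped            mpirun -np 8 cpi
    fire% bg                 (the whole job resumes in the background)
    fire% fg                 (bring the job back to the foreground)
    ^C                       (cntl-C kills mpirun and all MPI processes)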
The ch_p4 device relies by default on rsh for process startup on remote machines. The need for authentication at job startup time, combined with the sequential process by which contact information is collected from each remote machine and broadcast back to all machines, makes job startup unscalably slow, especially for large numbers of processes.
With version 1.2.0 of mpich, we introduced a new method of process startup based on daemons. This mechanism, which requires configuration with a new device, has not yet been widely enough tested to become the default for clusters, but we anticipate that it eventually will become so. In the current version of mpich, it has been significantly enhanced, and will now be installed when mpich is installed with make install. On systems with gdb, it supports a simple parallel debugger we call mpigdb.
The basic idea is to establish, ahead of job-startup time, a network of daemons on the machines where MPI processes will run, and also on the machine on which mpirun will be executed. Then job startup commands (and other commands) will contact the local daemon and use the pre-existing daemons to start processes. Much of the initial synchronization done by the ch_p4 device is eliminated, since the daemons can be used at run time to aid in establishing communication between processes.
To use the new startup mechanism, you must
configure with the new device:

    configure --with-device=ch_p4mpd

make as usual:

    make

go to the mpich/mpid/mpd directory, where the daemon code is located and the daemons are built, or else put this directory in your PATH.
start the daemons:
The daemons can be started by hand on the remote machines using the port numbers advertised by the daemons as they come up:
On fire:
    fire% mpd &
    [2] 23792
    [fire_55681]: MPD starting
    fire%

On soot:
    soot% mpd -h fire -p 55681 &
    [1] 6629
    [soot_35836]: MPD starting
    soot%
The mpd's are identified by a host and port number. If the daemons do not advertise themselves, one can find the host and port by using the mpdtrace command:
On fire:
    fire% mpd &
    fire% mpdtrace
    mpdtrace: fire_55681: lhs=fire_55681 rhs=fire_55681 rhs2=fire_55681
    fire%

On soot:
    soot% mpd -h fire -p 55681 &
    soot% mpdtrace
    mpdtrace: fire_55681: lhs=soot_33239 rhs=soot_33239 rhs2=fire_55681
    mpdtrace: soot_33239: lhs=fire_55681 rhs=fire_55681 rhs2=soot_33239
    soot%

What mpdtrace is showing is the ring of mpd's, by the hostname and port that can be used to introduce another mpd into the ring. The left and right neighbors of each mpd in the ring are shown as lhs and rhs respectively; rhs2 shows the daemon two steps away to the right (which in this case is the daemon itself).

You can also use mpd -b to start the daemons as real daemons, disconnected from any terminal. This has advantages and disadvantages.
There is also a pair of scripts in the mpich/mpid/mpd directory that can help:
    localmpds <number>

will start <number> mpd's on the local machine. This is only really useful for testing. Usually you would do
    mpd &

to start one mpd on the local machine. Then other mpd's can be started on remote machines via rsh, if that is available:
    remotempds <hostfile>

where <hostfile> contains the names of the other machines to start the mpd's on. It is a simple list of hostnames only, unlike the format of the MACHINES files used by the ch_p4 device, which can contain comments and other symbols.

See also the startdaemons script, which will be installed when mpich is installed.
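For illustration, such a hostfile might look like the following; the file name and hostnames here are hypothetical:

    fire% cat hosts.txt
    soot
    ember
    ash
    fire% remotempds hosts.txt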
Finally, start jobs with the mpirun command as usual:
    mpirun -np 4 a.out
Here are a few examples of using the mpirun that is built when mpich is configured and built with the ch_p4mpd device.
Run the cpi example.
    mpirun -np 16 cpi

You can get line labels on stdout and stderr from your program by including the -l option. Output lines will be labeled by process rank.
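A labeled run might then look like the following sketch; the exact lines printed by cpi will differ:

    fire% mpirun -np 2 -l cpi
    0: Process 0 on fire
    1: Process 1 on soot
    0: pi is approximately 3.1416...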
Run the fpi program, which prompts for a number of intervals to use.
    mpirun -np 32 fpi

The streams stdin, stdout, and stderr will be mapped back to your mpirun process, even if the MPI process with rank 0 is executed on a remote machine.
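An interactive session might therefore look like this sketch, with the interval count typed at the terminal where mpirun was started (the prompt text shown is illustrative):

    fire% mpirun -np 32 fpi
    Enter the number of intervals: (0 quits) 10000
    pi is approximately 3.1416...
    Enter the number of intervals: (0 quits) 0
    fire%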
Use arguments and environment variables.
    mpirun -np 32 myprog arg1 arg2 -MPDENV- MPE_LOG_FORMAT=SLOG \
        GLOBMEMSIZE=16000000

The argument -MPDENV- is a fence. All arguments after it are handled by mpirun rather than the application program.
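A quick way to confirm that such variables reach the environments of the MPI processes is to run a standard utility through mpirun; this sketch assumes printenv is in your path:

    fire% mpirun -np 1 printenv MPE_LOG_FORMAT -MPDENV- MPE_LOG_FORMAT=SLOG
    SLOG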
Specify where the first process is to run. By default, MPI processes are spawned by consecutive mpd's in the ring, starting with the one after the local one (the one running on the same machine as the mpirun process). Thus if you are logged into dion, and there are mpd's running on dion and on belmont1, belmont2, ..., belmont64, and you type
    mpirun -np 32 cpi

your processes will run on belmont1, belmont2, ..., belmont32. You can force your MPI processes to start elsewhere by giving mpirun optional location arguments. If you type
    mpirun -np 32 cpi -MPDLOC- belmont33 belmont34 ... belmont64

then your job will run on belmont33, belmont34, ..., belmont64. In general, processes will be run only on machines in the list after -MPDLOC-.

This provides an extremely preliminary and crude way for mpirun to choose locations for MPI processes. In the long run we intend to use the mpd project as an environment for exploring the interfaces among job schedulers, process managers, parallel application programs (particularly in the dynamic environment of MPI-2), and user commands.
Find out what hosts your mpd's are running on:
    mpirun -np 32 hostname | sort | uniq

This will run 32 instances of hostname (assuming /bin is in your path), regardless of how many mpd's there are; if there are fewer mpd's than processes, the processes are wrapped around the ring of mpd's.
Once the daemons are started, they are connected in a ring. When mpirun is invoked, it connects to the local mpd as a console process and requests that a number of application processes be started. The mpd's fork that number of manager processes (the executable is called mpdman and is located in the mpich/mpid/mpd directory). The managers are forked consecutively by the mpd's around the ring, wrapping around if necessary.
The managers form themselves into a ring, and fork the application processes, called clients.
The console disconnects from the mpd and reconnects to the first manager. stdin from mpirun is delivered to the client of manager 0.
The managers intercept standard I/O from the clients, and deliver command-line arguments and the environment variables that were specified on the mpirun command. The sockets carrying stdout and stderr form a tree with manager 0 at the root.
When the clients need to contact each other, they use the managers to find the appropriate process on the destination host. The mpirun process can be suspended, in which case it and the clients are suspended, but the mpd's and managers remain executing, so that they can unsuspend the clients when mpirun is unsuspended. Killing the mpirun process kills the clients and managers.
The same ring of mpd's can be used to run multiple jobs from multiple consoles at the same time. Under ordinary circumstances, there still needs to be a separate ring of mpd's for each user. For security purposes, each user needs to have a .mpdpasswd file in the user's home directory, readable only by the user, containing a password. This file is read when the mpd is started. Only mpd's that know this password can enter a ring of existing mpd's.
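Creating such a password file might look like this; the password itself is of course a placeholder:

    fire% echo mysecretword > ~/.mpdpasswd
    fire% chmod 600 ~/.mpdpasswd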
A new feature is the ability to configure the mpd system so that the daemons can be run as root. To do this, after configuring mpich you need to reconfigure in the mpid/mpd directory with --enable-root and remake. Then mpirun should be installed as a setuid program. Multiple users can use the same set of mpd's, which are run as root, although their mpirun, managers, and clients will be run as the user who invoked mpirun.
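A minimal sketch of this procedure follows, assuming configure is invoked directly in the mpid/mpd directory; the installation path for mpirun is illustrative and depends on how mpich was installed:

    # cd mpich/mpid/mpd
    # ./configure --enable-root
    # make
    # chown root /usr/local/mpich/bin/mpirun
    # chmod u+s /usr/local/mpich/bin/mpirun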
Because the MPD daemons are already in communication with one another before the job starts, job startup is much faster than with the ch_p4 device. The mpirun command for the ch_p4mpd device has a number of special command-line arguments. If you type mpirun with no arguments, they are displayed:
    % mpirun
    Usage: mpirun <args> executable <args_to_executable>
    Arguments are:
      -np num_processes_to_run   (required as first two args)
      [-s]                (close stdin; can run in bkgd w/o tty input problems)
      [-g group_size]     (start group_size processes per mpd)
      [-m machine_file]   (filename for allowed machines)
      [-l]                (line labels; unique id for each process' output)
      [-1]                (do NOT start first process locally)
      [-y]                (run as Myrinet job)

The -1 option allows you, for example, to run mpirun on a ``login'' or ``development'' node on your cluster but to start all the application processes on ``computation'' nodes.
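For example, a job launched from a login node that keeps all application processes off that node might be started as follows; the host name is illustrative, and -np with its value comes first as the usage message requires:

    dion% mpirun -np 16 -1 cpi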
The program mpirun runs in a separate (non-MPI) process that starts the MPI processes running the specified executable. It serves as a single-process representative of the parallel MPI processes in that signals sent to it, such as ^Z and ^C, are conveyed by the MPD system to all the processes. The output streams stdout and stderr from the MPI processes are routed back to the stdout and stderr of mpirun. As in most MPI implementations, mpirun's stdin is routed to the stdin of the MPI process with rank 0.