% icc -c cfunc.c
% ifort -o myprog myprog.for cfunc.o
% icc -c cmain.c
% ifort -nofor_main cmain.o fsub.f90
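As a rough illustration of what a file like cfunc.c might contain, here is a minimal sketch of a C routine meant to be called from Fortran. The routine name, its arguments and what it computes are hypothetical. It assumes the usual Linux default for ifort and gfortran, where the external name is lowercase with a trailing underscore and every argument is passed by reference; check your compiler documentation before relying on this.

```c
/* cfunc.c: hypothetical C routine intended to be called from Fortran as
 *     call cfunc(n, x, total)
 * Assumption: the Fortran compiler lowercases the name and appends an
 * underscore, and passes every argument by reference. */
void cfunc_(const int *n, const double *x, double *total)
{
    double sum = 0.0;
    int i;

    /* Fortran passes the array as a pointer to its first element. */
    for (i = 0; i < *n; i++) {
        sum += x[i];
    }
    *total = sum;
}
```

On the Fortran side the matching call would be something like `call cfunc(n, x, total)` with `n` a default integer and `x`, `total` double precision.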
% module avail openmpi
% module show openmpi
% ifort myprog.f -o myprog.exe $OMPI_FLIBS
% mpif77 myprog.f -o myprog.exe
% mpif90 myprog.f90 -o myprog.exe
The environment variable $OMPI_FLIBS has been set up to insert the correct libraries for linking.
These are the same libraries as those used by the wrapper commands mpif77 and mpif90.
If you are using the Fortran90 bindings for MPI (unlikely), then you need $OPENMPI_F90LIBS.
For C and C++, compile with one of:
% icc -pthread myprog.c -o myprog.exe $OMPI_CLIBS
% mpicc myprog.c -o myprog.exe
% icpc -pthread myprog.C -o myprog.exe $OMPI_CXXLIBS
% mpiCC myprog.C -o myprog.exe
Note that $OMPI_CXXLIBS is only relevant if you are actually using the C++ bindings for MPI.
Most C++ MPI applications use the C bindings so linking with $OMPI_CLIBS is sufficient.
As mentioned above, do not use the -fast option: it sets the -static option, which conflicts with the MPI libraries because they are shared libraries. Use -O3 -ipo instead (which is equivalent to -fast without -static).
If you do not have an Intel compiler module loaded, the MPI compiler wrappers will use the GNU compilers by default. In that case, the following pairs of commands are equivalent:
% mpif90 myprog.F
% gfortran myprog.F $OMPI_FLIBS
% mpicc myprog.c
% gcc -pthread myprog.c $OMPI_CLIBS
% mpiCC myprog.C
% g++ -pthread myprog.C $OMPI_CXXLIBS
Note that the appropriate include paths are placed in the CPATH and FPATH environment
variables when you load the openmpi module.
MPI programs are executed using the mpirun command. To run a small test with 4 processes (or tasks) where the MPI executable is called a.out, enter any of the following equivalent commands:
% mpirun -n 4 ./a.out
% mpirun -np 4 ./a.out
The argument to -n or -np is the number of a.out
processes that will be run.
For larger jobs and production use, submit a job to the PBS batch system with a command like
% qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
module load openmpi
mpirun ./a.out
^D
%
If the -n option is not given to mpirun in the batch job,
mpirun will start as many MPI processes as there were cpus
requested with qsub. It is also possible to specify the number of
processes on the batch job mpirun command, as
mpirun -n 4 a.out, or more generally mpirun -n $PBS_NCPUS a.out
On the VU you may find that you get better performance by using
% mpirun -mca mpi_affinity_alone 1 -n 4 ./a.out
% ifort -openmp myprog.f -o myprog.exe
% gfortran -fopenmp myprog.f -o myprog.exe
C code with OpenMP directives is compiled as:
% icc -openmp myprog.c -o myprog.exe
% gcc -fopenmp myprog.c -o myprog.exe
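For readers who have not seen OpenMP directives in C, the following is a minimal sketch of the kind of loop these compile lines apply to. The function name and the computation are hypothetical and chosen only for illustration.

```c
/* Sum of squares 0^2 + 1^2 + ... + (n-1)^2.
 * With -openmp (icc) or -fopenmp (gcc) the loop iterations are shared
 * among the threads and the partial sums are combined by the reduction
 * clause. Without the flag the pragma is ignored and the loop runs
 * serially, giving the same result. */
double sum_of_squares(int n)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += (double)i * (double)i;
    }
    return sum;
}
```

To build an executable, the function needs a main program that calls it. The number of threads used at run time is taken from OMP_NUM_THREADS.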
% setenv OMP_NUM_THREADS 4
% ./a.out
For larger jobs and production use, submit a job to the PBS batch system
with something like
% qsub -q express -lnodes=1:ppn=4,walltime=30:00,vmem=400mb -wd
#!/bin/csh
setenv OMP_NUM_THREADS $PBS_NCPUS
./a.out
^D
%
OpenMP is a shared memory parallelism model, so only one host (node) can be used
to execute an OpenMP application. The clusters have nodes with 8 cpu cores, so it makes
no sense to try to run an OpenMP application with more than 8 threads. Note that in
the above qsub example, the request specifies 1 node and the number of "processors
per node" (ppn) required.
You should time your OpenMP code on a single processor then on increasing numbers of CPUs to find the optimal number of processors for running it. Keep in mind that your job is charged ncpus*walltime.
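One way to compare timing runs is to convert each measurement into a charge and a parallel efficiency. The sketch below assumes only the charging rule quoted above (ncpus*walltime). The function names are hypothetical, and the wall times are whatever you measure yourself.

```c
/* Charge for a job, following the rule ncpus * walltime. */
double job_charge(int ncpus, double walltime)
{
    return ncpus * walltime;
}

/* Parallel efficiency: the single-cpu wall time divided by the charge of
 * the parallel run. 1.0 means perfect scaling. Values well below 1.0 mean
 * the extra cpus are being paid for but are contributing little. */
double parallel_efficiency(double serial_walltime, int ncpus, double walltime)
{
    return serial_walltime / job_charge(ncpus, walltime);
}
```

For example, a code that hypothetically takes 100 seconds on 1 cpu and 25 seconds on 8 cpus has an efficiency of 0.5: the job finishes four times sooner but is charged twice as much.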
          4 cpus      8 cpus
time      ----        ----
 |        startup     startup
 |        ----
 V                    ----
          work        work
                      ____
          ____        cleanup
          cleanup     ----
          ----
Bottom line: the amount of work in a parallel loop (or section) has to be
large compared with the startup
time. The startup cost is tens of microseconds, or the
equivalent time for doing thousands of floating point ops. Add another
order of magnitude because the work is split over O(10) threads, and
at least another order of magnitude because the work should dominate the
startup cost, and very quickly you need O(million) ops in a parallelised
loop to make it scale OK.
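The estimate above is a product of three factors, which can be written down as a small sketch. The function and parameter names are hypothetical. The inputs are rough orders of magnitude, not measured values.

```c
/* Rough minimum number of floating point ops a parallelised loop needs:
 *   startup cost (expressed as equivalent ops)
 *   x number of threads the work is split over
 *   x how many times larger than the startup cost the work should be */
double min_loop_ops(double startup_ops, int nthreads, double dominance)
{
    return startup_ops * nthreads * dominance;
}
```

With a startup cost equivalent to a few thousand ops, about 10 threads and a factor of 10 for the work to dominate, the result is a few hundred thousand ops, which is how the text arrives at O(million).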
% cc -g prog.c
% idb ./a.out
(idb) list
(idb) stop at 10
(idb) run
(idb) print var
(idb) quit
% module load totalview
% ifort -g -O0 prog.f -lmpi
% totalview mpirun -a -np 4 ./a.out
Note that to ensure that Totalview can obtain information on all variables, you should compile with no optimisation. This is the default if -g is used with no specific optimisation level.
Totalview shows source code for mpirun when it first starts an MPI job. Click on GO and all the processes will start up and you will be asked if you want to stop the parallel job. At this point click YES if you want to insert breakpoints. The source code will be shown and you can click on any lines where you wish to stop.
If your source code is in a different directory from where you fired up Totalview you may need to add the path to Search Path under the File Menu. Right clicking on any subroutine name will "dive" into the source code for that routine and break points can be set there.
The procedure for viewing variables in an MPI job is a little complicated. If your code has stopped at a breakpoint, right click on the variable or array of interest in the stack frame window. If the variable cannot be displayed then choose "Add to expression list" and the variable will appear listed in a new window. If it is marked "Invalid compilation scope" in this new window, right click again on the variable name in this window and choose "Compilation scope". Change this to "Floating" and the value of the variable or array should appear. Right clicking on it again and choosing "Dive" will give you values for arrays. In this window you can choose "Laminate" then "Process" under the View menu to see the values on different processors.
Under the Tools option on the top toolbar of the window displaying the variable values you can choose Visualize to display the data graphically which can be useful for large arrays.
It is also possible to use Totalview for memory debugging, showing graphical representations of outstanding messages in MPI or setting action points based on evaluation of a user defined expression. See the Totalview User Guide for more information.
For more information on memory debugging see here.
% ifort -p -o prog.exe prog.f
% ./prog.exe
% gprof ./prog.exe gmon.out
For the GNU compilers do
% gfortran -pg -o prog.exe prog.f
% ./prog.exe
% gprof ./prog.exe gmon.out
gprof is not useful for parallel code. More information on using
gprof is available
here.