ifort prog.f90 prog_cpp.o -cxxlib-gcc -lstdc++
INCLUDE 'mpif.h'
and compile with a command like:
% ifort myprog.f -o myprog.exe -lmpi
For C, use this include directive:
#include <mpi.h>
and compile with a command similar to:
% icc myprog.c -o myprog.exe -lmpi
As mentioned above, do not use the -fast option, as it sets the -static
option, which conflicts with the MPI libraries (these are shared libraries).
Alternatively, use -O3 -ipo (which is equivalent to -fast without -static).
For the GNU compilers on C or C++ code use one of these:
% gcc myprog.c -lmpi
% g++ myprog.C -lmpi++ -lmpi
% mpirun -n 4 ./a.out
% mpirun -np 4 ./a.out
% prun -n 4 ./a.out
The argument to -n or -np is the number of a.out
processes to start.
For larger jobs and production use, submit a job to the PBS batch system with a command like
% qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
mpirun ./a.out
^D
%
If the -n option is not specified on the batch job mpirun command,
mpirun will start as many MPI processes as there were cpus
requested with qsub. It is also possible to specify the number of
processes on the batch job mpirun command, as
mpirun -n 4 a.out, or more generally mpirun -n $PBS_NCPUS a.out
The mpirun and prun commands are equivalent and are both just links to the locally written anumpirun command which provides:
% ifort -fast -openmp myprog.f -o myprog.exe
C code with OpenMP directives is compiled as:
% icc -fast -openmp myprog.c -o myprog.exe
% setenv OMP_NUM_THREADS 4
% ./a.out
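A quick way to confirm that the OMP_NUM_THREADS setting is being picked up is to have the program report its thread count. The following is a minimal sketch in C, not taken from the original examples: the function names max_threads and report_threads are hypothetical, and the _OPENMP guard is only there so that the same source also builds without the OpenMP flag.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of threads a parallel region may use.
   Returns 1 when the code is compiled without OpenMP support. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Print the thread count so it can be compared with OMP_NUM_THREADS. */
void report_threads(void)
{
    printf("running with up to %d OpenMP threads\n", max_threads());
}
```

Calling report_threads() at the start of the program gives a simple check that the value set with setenv matches what the OpenMP runtime will use.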
For larger jobs and production use, submit a job to the PBS batch system
with something like
% qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
#!/bin/csh
setenv OMP_NUM_THREADS 4
./a.out
^D
%
The AC is configured with 32 CPUs per partition, made up of 2-CPU "NUMA" nodes. Because
of the NUMA (non-uniform memory access) architecture of the AC, not all OpenMP
codes will scale well to 32 processors. You should time your OpenMP code on a single
processor, then on increasing numbers of CPUs, to find the optimal number of
processors for running it. Keep in mind that your job is charged ncpus*walltime.
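The timing comparison and the charging rule can be put into a few lines of arithmetic. This is only a sketch in C: the function names speedup, efficiency and charge are hypothetical, and the timings you feed in are whatever you measure for your own code.

```c
/* Speedup of an n-cpu run relative to the single-cpu run. */
double speedup(double t1, double tn)
{
    return t1 / tn;
}

/* Parallel efficiency: 1.0 means perfect scaling, lower values mean
   a larger share of the charged time is overhead. */
double efficiency(double t1, double tn, int ncpus)
{
    return t1 / (tn * ncpus);
}

/* Charge as described in the text: ncpus * walltime. */
double charge(int ncpus, double walltime_hours)
{
    return ncpus * walltime_hours;
}
```

For instance, a hypothetical code that takes 100 seconds on one cpu and 25 seconds on 8 cpus has a speedup of 4 but an efficiency of only 0.5, so running it on fewer cpus may be the better use of your allocation.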
Currently, the setup of the PBS batch system means that if you are using more than 8 CPUs, you should use -lncpus=N:N (for N up to 24) to ensure that the CPUs allocated are on a single partition (host).
To ensure data locality of the OpenMP program, you should initialise in parallel any arrays that are going to be used in parallel later in the program. Use the same number of threads and the same scheduling scheme for both initialisation and later calculation.
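The pattern described above might look like the following in C. This is a minimal sketch rather than a recommended implementation: the function names init_array and sum_scaled are hypothetical, and schedule(static) is simply one choice of scheduling scheme that is easy to keep identical in both loops.

```c
#include <stddef.h>

/* Initialise in parallel: each thread writes the part of the array
   that it will later work on. */
void init_array(double *a, size_t n)
{
    long i;
#pragma omp parallel for schedule(static)
    for (i = 0; i < (long)n; i++)
        a[i] = 1.0;
}

/* Later calculation: same static schedule, so with the same number
   of threads each thread revisits the elements it initialised. */
double sum_scaled(const double *a, size_t n, double factor)
{
    double total = 0.0;
    long i;
#pragma omp parallel for schedule(static) reduction(+:total)
    for (i = 0; i < (long)n; i++)
        total += factor * a[i];
    return total;
}
```

The point of the sketch is that both loops share the same bounds and the same schedule clause. Initialising the array in a serial loop instead would have a single thread write every element first.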
To ensure that threads are distributed across the cpus when they do initialize data, you should use the dplace command. The dplace documentation suggests using
dplace -x2 ./a.out
however, this has been found to give poor thread distribution in some cases. In
particular, you can find two threads locked on to the same cpu while another
cpu is idle. We recommend using the "exact" placement options, i.e.
dplace -e -c0,x,1-3 ./a.out
for 4 cpu jobs,
dplace -e -c0,x,1-7 ./a.out
for 8 cpu jobs etc.
          4 cpus     8 cpus
time      ----       ----
 |        startup    startup
 |        ----
 V                   ----
          work       work
                     ____
          ____       cleanup
          cleanup    ----
          ----
Bottom line: the amount of work in a parallel loop (or section) has to be
large compared with the startup
time. You're looking at tens of microseconds of startup cost, or the
equivalent time for doing thousands of floating point ops. Add another
order of magnitude because you're splitting the work over O(10) threads, and
at least another order of magnitude because you want the work to dominate over
the startup cost, and very quickly you need O(million) ops in a parallelised
loop to make it scale OK.
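The estimate above is just a product of three factors. A sketch of that arithmetic in C follows; the function name min_loop_ops is hypothetical, and any values passed in (for example a startup cost worth 10,000 ops, 10 threads and a dominance factor of 10) are illustrative assumptions, not measurements from the AC.

```c
/* Rough minimum number of floating point operations a parallel loop
   needs: startup cost (expressed in ops) x number of threads x how
   strongly the work should dominate the startup cost. */
double min_loop_ops(double startup_ops, int nthreads, double dominance)
{
    return startup_ops * nthreads * dominance;
}
```

With the illustrative values mentioned, min_loop_ops(10000.0, 10, 10.0) gives one million ops, which is the order of magnitude quoted in the text.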
Each NUMA node has 2 cpus and 4 or 8 GB of memory. The memory access overhead for referencing memory not connected to the cpu where the thread is running is quite high. Application slowdowns of 20-50% are possible due solely to variations in memory access. Ideally all threads should make the vast majority of memory requests to local memory. So at an algorithmic level, "locality" means being able to conceptually divide the memory of the job and nominate which thread "owns" that memory, i.e. which is the thread most commonly referencing that part of the memory. For a grid-based problem, this is often just a geometric division of the grid. Locality does require that the parallel DO loops consistently access most arrays, i.e. the way loop indices and hence array elements are divided amongst threads is the same for most parallel loops.
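To make the idea of a thread "owning" part of the memory concrete, here is a small sketch in C of a geometric division of a one-dimensional array into contiguous blocks. The helper names block_start and block_owner are hypothetical, and the division shown is only an illustration: the OpenMP runtime's own static schedule may split the iterations slightly differently.

```c
#include <stddef.h>

/* Hypothetical helper: first index of the contiguous block owned by
   thread t when n elements are divided among nthreads threads.
   The block for thread t is [block_start(t), block_start(t + 1)). */
size_t block_start(size_t n, int nthreads, int t)
{
    size_t base = n / (size_t)nthreads;   /* minimum block size */
    size_t extra = n % (size_t)nthreads;  /* first 'extra' blocks get one more */
    size_t tt = (size_t)t;
    return tt * base + (tt < extra ? tt : extra);
}

/* Owner of element i under the same division. */
int block_owner(size_t n, int nthreads, size_t i)
{
    int t = 0;
    while (t + 1 < nthreads && block_start(n, nthreads, t + 1) <= i)
        t++;
    return t;
}
```

If every parallel loop in a code divides its indices this way, each thread keeps returning to the same block of each array, which is the consistency the paragraph above asks for.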
If there is locality at the algorithmic level, then that needs to be transferred to locality at the system memory and NUMA node level in terms of where the job's threads and memory live. Physical memory is allocated to your job (in 16KB "pages") at the time that memory is first referenced by your program. The placement of each page (which NUMA node it is allocated on) is generally on a "first touch" basis. By default it will be allocated in memory local to the cpu where the thread that first referenced (initialized) that memory ran.
One of the major issues is that threads can migrate from cpu to cpu but (currently) memory pages do not migrate from NUMA node to NUMA node. So to achieve good locality we require:
Segmentation violations can also be caused by the thread stack size
being too small. Change this by setting the environment variable
KMP_STACKSIZE to a larger value than the default of 4 MB, for example,
setenv KMP_STACKSIZE 5m
% cc -g prog.c
% idb ./a.out
(idb) list
(idb) stop at 10
(idb) run
(idb) print var
(idb) quit
% module load totalview
% ifort -g -O0 prog.f -lmpi
% totalview mpirun -a -np 4 ./a.out
Note that to ensure that Totalview can obtain information on all variables, you should compile with no optimisation. This is the default if -g is used with no specific optimisation level.
Totalview shows source code for mpirun when it first starts an MPI job. Click on GO and all the processes will start up and you will be asked if you want to stop the parallel job. At this point click YES if you want to insert breakpoints. The source code will be shown and you can click on any lines where you wish to stop.
If your source code is in a different directory from where you fired up Totalview you may need to add the path to Search Path under the File Menu. Right clicking on any subroutine name will "dive" into the source code for that routine and break points can be set there.
The procedure for viewing variables in an MPI job is a little complicated. If your code has stopped at a breakpoint, right click on the variable or array of interest in the stack frame window. If the variable cannot be displayed, choose "Add to expression list" and the variable will appear listed in a new window. If it is marked "Invalid compilation scope" in this new window, right click again on the variable name in this window and choose "Compilation scope". Change this to "Floating" and the value of the variable or array should appear. Right clicking on it again and choosing "Dive" will give you values for arrays. In this window you can choose "Laminate" then "Process" under the View menu to see the values on different processors.
Under the Tools option on the top toolbar of the window displaying the variable values you can choose Visualize to display the data graphically which can be useful for large arrays.
It is also possible to use Totalview for memory debugging, showing graphical representations of outstanding messages in MPI or setting action points based on evaluation of a user defined expression. See the Totalview User Guide for more information.
For more information on memory debugging see here.
% ifort -p -o prog.exe prog.f
% ./prog.exe
% gprof ./prog.exe gmon.out
For the GNU compilers do
% g77 -pg -o prog.exe prog.f
% gprof ./prog.exe gmon.out
gprof is not useful for parallel code. More information on using
gprof is available
here.
% module load histx
% ifort -g -O3 prog.f
% histx -l ./a.out
% iprep histx.a.out.***
Type histx to see the various options available, for example,
% histx -s callstack10 a.out
% histx -e pm:CPU_CYCLES@1000 a.out
% histx -e pm:L3_REFERENCES@1000 a.out
More details are on the AC in /opt/histx/histx-1.3b/doc/histx.txt
% module load histx
% ifort -O3 mpiprog.f -lmpi
% prun -n 4 histx ./a.out
% iprep histx.a.out.*
For this example there will be five histx.a.out.* files produced, one for
each process and one for the MPI shepherd process. Look at the iprep output
for the four processes. The percentage of time spent in MPI calls
is reported in
the counts for different calls from libmpi.so. Well balanced efficient MPI
code will display low percentages for the MPI calls and spend the
major part of its time in the computational a.out.
Histx can also be used with OpenMP codes, in the same way as for MPI codes.
% module load histx
% ifort -openmp -g -O2 omp_prog.f90
% histx -l a.out
% iprep histx.a.out.*****
% module load jumpshot
% ifort -O3 prog.f -L/opt/jumpshot/mpe/lib -lmpe_f2cmpi -llmpe -lmpe -lmpi
% mpirun -n 4 ./a.out
The trace information is stored in the file Unknown.clog. Then start up
jumpshot as
% jumpshot Unknown.clog
Once jumpshot is started the first step is to convert the .clog file
to .slog2 format. Details on using Jumpshot are given on the
software pages.