To link Fortran code with C++ object files compiled with the GNU compilers, use a command like:
% ifort prog.f90 prog_cpp.o -cxxlib-gcc -lstdc++
For Fortran, use the include line:
INCLUDE 'mpif.h'
and compile with a command like:
% ifort myprog.f -o myprog.exe -lmpi
For C, use this include directive:
#include <mpi.h>
and compile with a command similar to:
% icc myprog.c -o myprog.exe -lmpi
As mentioned above, do not use the -fast option, as it sets the -static option, which conflicts with the MPI libraries, which are shared libraries. Alternatively, use -O3 -ipo (which is equivalent to -fast without -static).
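For reference, here is a minimal sketch of an MPI program in C (the file name mpi_hello.c and its contents are illustrative only), showing where the include directive and the basic MPI calls sit:

/* mpi_hello.c - minimal illustrative MPI program */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start the MPI environment      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of MPI processes  */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                          /* shut down MPI before exiting   */
    return 0;
}

It would be compiled and linked as above, e.g. % icc mpi_hello.c -o mpi_hello.exe -lmpi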
For the GNU compilers on C or C++ code use one of these:
% gcc myprog.c -lmpi
% g++ myprog.C -lmpi++ -lmpi
To run the program interactively, use one of:
% mpirun -n 4 ./a.out
% mpirun -np 4 ./a.out
% prun -n 4 ./a.out
The argument to -n or -np is the number of a.out processes required to run.
For larger jobs and production use, submit a job to the PBS batch system with a command like
% qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
mpirun ./a.out
^D
%
By not specifying the -n option with the batch job mpirun, mpirun will start as many MPI processes as there are cpus requested with qsub. It is possible to specify the number of processes on the batch job mpirun command, as mpirun -n 4 a.out, or more generally mpirun -n $PBS_NCPUS a.out
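The same job can also be kept in a job script and submitted with a command like qsub jobscript. A minimal sketch, assuming the qsub options above (including the local -wd option) are also accepted in #PBS directive form:

#!/bin/csh
#PBS -q express
#PBS -l ncpus=4,walltime=30:00,vmem=400mb
#PBS -wd
mpirun ./a.out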
The mpirun and prun commands are equivalent and are both just links to the locally written anumpirun command which provides:
Fortran code with OpenMP directives is compiled as:
% ifort -fast -openmp myprog.f -o myprog.exe
C code with OpenMP directives is compiled as:
% icc -fast -openmp myprog.c -o myprog.exe
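For reference, a minimal sketch of what the OpenMP directives look like in a C source file (illustrative only; a Fortran version would use !$OMP directives instead):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* the block below is executed by a team of threads; the team size
       is taken from OMP_NUM_THREADS at run time */
#pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}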
To run interactively, set the number of threads and run the executable:
% setenv OMP_NUM_THREADS 4
% ./a.out
For larger jobs and production use, submit a job to the PBS batch system with something like
% qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
#!/bin/csh
setenv OMP_NUM_THREADS 4
./a.out
^D
%
The AC is configured with 32 CPUs per partition, made up of 2-CPU "NUMA" nodes. Because of the NUMA (non-uniform memory access) architecture of the AC, not all OpenMP codes will scale well to 32 processors. You should time your OpenMP code on a single processor, then on increasing numbers of CPUs, to find the optimal number of processors for running it. Keep in mind that your job is charged ncpus*walltime.
Currently, the setup of the PBS batch system means that if you are using more than 8 CPUs, you should use -lncpus=N:N (for N up to 24) to ensure that the CPUs allocated are on a single partition (host).
To ensure data locality of the OpenMP program, you should initialise in parallel any arrays that are going to be used in parallel later in the program. Use the same number of threads and scheduling scheme for both initialisation and later calculation, as in the sketch below.
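A minimal sketch of this pattern in C (the array size and loop bodies are arbitrary): both loops use the same static schedule, so each thread first touches, and therefore keeps local, the pages it will later compute on.

#include <stdlib.h>

#define N 10000000

int main(void)
{
    double *a = malloc(N * sizeof(double));
    int i;

    /* initialise in parallel: with a static schedule each thread first
       touches the part of the array it will work on later, so those
       pages are allocated on its local NUMA node */
#pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = 0.0;

    /* the later computation uses the same schedule, so each thread
       mostly references memory local to the cpu it is running on */
#pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = a[i] + 1.0;

    free(a);
    return 0;
}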
To ensure that threads are distributed across the cpus when they do initialize data, you should use the dplace command. The dplace documentation suggests using
dplace -x2 ./a.out
however this has been found to give poor thread distribution in some cases. In particular, you can find two threads locked on to the same cpu while another cpu is idle. We recommend using the "exact" placement options, i.e.
dplace -e -c0,x,1-3 ./a.out
for 4 cpu jobs,
dplace -e -c0,x,1-7 ./a.out
for 8 cpu jobs, etc.
[Diagram: timelines for a 4 cpu run and an 8 cpu run, each made up of a startup phase, a work phase and a cleanup phase, with time running down the page.]
Bottom line: the amount of work in a parallel loop (or section) has to be large compared with the startup time. You are looking at tens of microseconds of startup cost, or the equivalent time for doing thousands of floating point operations. Allow another order of magnitude because you are splitting the work over O(10) threads, and at least another order of magnitude because you want the work to dominate the startup cost, and very quickly you need O(million) operations in a parallelised loop to make it scale well.
Each NUMA node has 2 cpus and 4 or 8 GB of memory. The memory access overhead for referencing memory not connected to the cpu where the thread is running is quite high. Application slowdowns of 20-50% are possible due solely to variations in memory access. Ideally all threads should make the vast majority of memory requests to local memory. So at an algorithmic level, "locality" means being able to conceptually divide the memory of the job and nominate which thread "owns" that memory, i.e. which is the thread most commonly referencing that part of the memory. For a grid-based problem, this is often just a geometric division of the grid. Locality does require that the parallel DO loops consistently access most arrays, i.e. the way loop indices and hence array elements are divided amongst threads is the same for most parallel loops.
If there is locality at the algorithmic level, then that needs to be transferred to locality at the system memory and NUMA node level, in terms of where the job's threads and memory live. Physical memory is allocated to your job (in 16KB "pages") at the time that memory is first referenced by your program. The placement of each page (which NUMA node it is allocated on) is generally on a "first touch" basis. By default it will be allocated in memory local to the cpu where the thread that first referenced (initialised) that memory ran.
One of the major issues is that threads can migrate from cpu to cpu but (currently) memory pages do not migrate from NUMA node to NUMA node. So to achieve good locality we require:
Segmentation violations can also be caused by the thread stack size being too small. Change this by setting the environment variable KMP_STACK_SIZE to a larger value than the default of 4MB, for example,
% setenv KMP_STACK_SIZE 5m
% cc -g prog.c
% idb ./a.out
(idb) list
(idb) stop at 10
(idb) run
(idb) print var
(idb) quit
% module load totalview
% ifort -g -O0 prog.f -lmpi
% totalview mpirun -a -np 4 ./a.out
Note that, to ensure Totalview can obtain information on all variables, you should compile with no optimisation. This is the default if -g is used with no specific optimisation level.
Totalview shows source code for mpirun when it first starts an MPI job. Click on GO and all the processes will start up and you will be asked if you want to stop the parallel job. At this point click YES if you want to insert breakpoints. The source code will be shown and you can click on any lines where you wish to stop.
If your source code is in a different directory from where you fired up Totalview you may need to add the path to Search Path under the File Menu. Right clicking on any subroutine name will "dive" into the source code for that routine and break points can be set there.
The procedure for viewing variables in an MPI job is a little complicated. If your code has stopped at a breakpoint, right click on the variable or array of interest in the stack frame window. If the variable cannot be displayed, choose "Add to expression list" and the variable will appear listed in a new window. If it is marked "Invalid compilation scope" in this new window, right click again on the variable name in this window and choose "Compilation scope". Change this to "Floating" and the value of the variable or array should appear. Right clicking on it again and choosing "Dive" will give you the values for arrays. In this window you can choose "Laminate" then "Process" under the View menu to see the values on different processors.
Under the Tools option on the top toolbar of the window displaying the variable values, you can choose Visualize to display the data graphically, which can be useful for large arrays.
It is also possible to use Totalview for memory debugging, showing graphical representations of outstanding messages in MPI or setting action points based on evaluation of a user defined expression. See the Totalview User Guide for more information.
For more information on memory debugging see here.
% ifort -p -o prog.exe prog.f
% ./prog.exe
% gprof ./prog.exe gmon.out
For the GNU compilers do
% g77 -pg -o prog.exe prog.f
% ./prog.exe
% gprof ./prog.exe gmon.out
gprof is not useful for parallel code. More information on using gprof is available here.
% module load histx
% ifort -g -O3 prog.f
% histx -l ./a.out
% iprep histx.a.out.***
Type histx to see the various options available, for example,
% histx -s callstack10 a.out
% histx -e pm:CPU_CYCLES@1000 a.out
% histx -e pm:L3_REFERENCES@1000 a.out
More details are on the AC in /opt/histx/histx-1.3b/doc/histx.txt
% module load histx
% ifort -O3 mpiprog.f -lmpi
% prun -n 4 histx ./a.out
% iprep histx.a.out.*
For this example there will be five histx.a.out.* files produced, one for each process and one for the MPI shepherd process. Look at the iprep output for the four processes. The percentage of time spent in MPI calls is reported in the counts for different calls from libmpi.so. Well balanced, efficient MPI code will display low percentages for the MPI calls and spend the major part of its time in the computational a.out.
For OpenMP codes, histx can be used in the same way as for MPI codes.
% module load histx
% ifort -openmp -g -O2 omp_prog.f90
% histx -l a.out
% iprep histx.a.out.*****
% module load jumpshot
% ifort -O3 prog.f -L/opt/jumpshot/mpe/lib -lmpe_f2cmpi -llmpe -lmpe -lmpi
% mpirun -n 4 ./a.out
The trace information is stored in the file Unknown.clog. Then start up jumpshot as
% jumpshot Unknown.clog
Once jumpshot is started the first step is to convert the .clog file to .slog2 format. Details on using Jumpshot are given on the software pages.