National Computational Infrastructure
NCI National Facility

Using vayu.nci.org.au (Sun Constellation VAYU Cluster) and xe.nci.org.au (SGI XE Cluster)

Compiling

Compilers and Options

  1. Several versions of the Intel compilers are installed. These are icc for the C compiler, ifort for the Fortran compiler and icpc for the C++ compiler. To access these compilers you will need to load the relevant module. Type module avail to see what versions of the Intel compiler are available and which is the default. Note that the C and Fortran compilers are loaded as separate modules, for example, module load intel-cc/10.1.018 and module load intel-fc/10.1.018 to load a version of the Intel 10.1 compiler.

  2. User Guides and other documentation are available online for the Intel Fortran and C/C++ compilers. Local versions of the compiler documentation can be accessed via the software web pages for Fortran and C/C++ and choosing a particular version.

  3. If required, the GNU compilers gfortran, gcc and g++ are available. We recommend the Intel compiler for best performance for Fortran code.

  4. Mixed language programming hints.
    • If your application contains both C and Fortran codes you should link as follows:
      icc -c cfunc.c
      ifort -o myprog myprog.for cfunc.o
    • Use the -cxxlib compiler option to tell the compiler to link using the C++ run-time libraries provided by gcc. By default, C++ libraries are not linked with Fortran applications.
    • Use the -fexceptions compiler option to enable C++ exception handling table generation so C++ programs can handle C++ exceptions when there are calls to Fortran routines on the call stack. This option causes additional information to be added to the object file that is required during C++ exception handling. By default, mixed Fortran/C++ applications abort in the Fortran code if a C++ exception is thrown.
    • Use the -nofor_main compiler option if your C/C++ program calls an Intel Fortran subprogram, as shown:
      icc -c cmain.c
      ifort -nofor_main cmain.o fsub.f90
  5. The handling of Fortran90 modules by ifort and gfortran is incompatible - the resultant .mod and object files are not interoperable. Otherwise gfortran and ifort are generally compatible.
    The Intel C and C++ compilers are highly compatible and interoperable with GCC.

  6. A full list of compiler options can be obtained from the man page for each command. Some pointers to useful options for the Intel compiler are as follows:
    • We recommend that Intel Fortran users start with the options -O2 -ip -fpe0.
    • The default -fpe setting for ifort is -fpe3, which means that all floating point exceptions produce exceptional values and execution continues. To be sure that you are not getting floating point exceptions use -fpe0. This means that floating point underflows are set to zero and all other exceptions cause the code to abort. If you are certain that these errors can be ignored then you can recompile with the -fpe3 option.
    • -fast sets the options -O3 -ipo -static -no-prec-div and maximises speed across the entire program.
    • However, -fast cannot be used for MPI programs as the MPI libraries are shared, not static. For MPI programs use -O3 -ipo instead.
    • The -ipo option provides interprocedural optimization but should be used with care as it does not produce the standard .o files. Do not use this if you are linking to libraries.
    • -O0, -O1, -O2, -O3 give increasing levels of optimization, from no optimization to aggressive optimization. The option -O is equivalent to -O2. Note that if -g is specified then the default optimization level is -O0.
    • -parallel tells the auto-parallelizer to generate multithreaded code for loops that can safely be executed in parallel. This option requires that you also specify -O2 or -O3. Before using this option for production work make sure that it is resulting in a worthwhile increase in speed by timing the code on a single processor then multiple processors. This option rarely gives appreciable parallel speedup.

  7. Environment variables can be used to affect the behaviour of various programs (particularly compilers and build systems), for example $PATH, $FC, $CFLAGS, $LD_LIBRARY_PATH and many others. Our Canonical User Environment Variables webpage has a detailed list of these variables, including information on which programs use what variables, how they are used, common misconceptions/gotchas, and so on.

  8. Handling floating exceptions.
    • The Intel ifort default is -fpe3, so all floating-point exceptions are disabled and produce exceptional values instead of stopping the program. Floating-point underflow is gradual unless you explicitly specify a compiler option that enables flush-to-zero (also see -ftz). This default provides full IEEE support. The option -fpe0 will lead to the code aborting at errors such as division by zero.
    • For C++ code using icpc the default behaviour is to replace arithmetic exceptions with NaNs and continue the program. If you rely on seeing the arithmetic exceptions and the code aborting you will need to include the fenv.h header and raise signals by using feenableexcept. See man fenv for further details.

  9. The Intel compiler provides an optimised math library, libimf, which is linked before the standard libm by default. If the -lm link option is used then this behaviour changes and libm is linked before libimf.
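
For the mixed-language hints in item 4, the following is a minimal sketch of what cfunc.c might contain so that a Fortran main program can call it. It assumes the default ifort and gfortran conventions on Linux: external names are lower-cased with a trailing underscore appended, and arguments are passed by reference. The file name cfunc.c comes from the commands above, but the argument list shown here is hypothetical.

```c
/* cfunc.c: hypothetical C routine meant to be called from Fortran as
 *     call cfunc(n, x, total)
 * Assumes the default ifort/gfortran naming on Linux (lower-case name
 * plus a trailing underscore) and Fortran pass-by-reference arguments. */
#include <assert.h>
#include <stddef.h>

void cfunc_(const int *n, const double *x, double *total)
{
    double sum = 0.0;
    int i;

    assert(n != NULL && x != NULL && total != NULL);

    /* C index i corresponds to Fortran element x(i+1). */
    for (i = 0; i < *n; i++) {
        sum += x[i];
    }
    *total = sum;   /* result is returned through the third argument */
}
```

On the Fortran side, n would be a default integer and x and total double precision, and the program would be built with the icc and ifort commands shown in item 4.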



Using MPI

MPI is a parallel program interface for explicitly passing messages between parallel processes - you must have added message passing constructs to your program. Then to enable your programs to use MPI, you must include the MPI header file in your source and link to the MPI libraries when you compile.
 

Compiling and linking

The preferred MPI library is OpenMPI. To see what versions are available type
module avail openmpi

Loading the openmpi module sets a variety of environment variables which you can see from
module show openmpi

For Fortran, compile with one of the following commands:
    % ifort myprog.f -o myprog.exe $OMPI_FLIBS
    % mpif77 myprog.f -o myprog.exe
    % mpif90 myprog.f90 -o myprog.exe

The environment variable $OMPI_FLIBS has been set up to insert the correct libraries for linking. These are the same libraries as are used by the wrapper commands mpif77 and mpif90.

For C and C++, compile with one of:

    % icc myprog.c -o myprog.exe $OMPI_CLIBS
    % mpicc myprog.c -o myprog.exe
    % icpc myprog.C -o myprog.exe $OMPI_CXXLIBS
    % mpiCC myprog.C -o myprog.exe
As mentioned above, do not use the -fast option as this sets the -static option, which conflicts with the MPI libraries because they are shared libraries. Instead, use -O3 -ipo (which is equivalent to -fast without -static).

If you do not have an Intel compiler module loaded, the MPI compiler wrappers will use the GNU compilers by default. In that case, the following pairs of commands are equivalent:

    % mpif90 myprog.F
    % gfortran myprog.F $OMPI_FLIBS
    % mpicc myprog.c
    % gcc myprog.c $OMPI_CLIBS
    % mpiCC myprog.C
    % g++ myprog.C $OMPI_CXXLIBS
Note that the appropriate include paths are placed in the CPATH and FPATH environment variables when you load the openmpi module.

Running MPI jobs

To run an MPI application, you need to have an MPI module loaded in your environment. The modules of software packages requiring MPI will generally load the appropriate MPI module for you.

MPI programs are executed using the mpirun command. To run a small test with 4 processes (or tasks) where the MPI executable is called a.out, enter any of the following equivalent commands:

    % mpirun -n 4 ./a.out
    % mpirun -np 4 ./a.out
The argument to -n or -np is the number of a.out processes that will be run.

For larger jobs and production use, submit a job to the PBS batch system with a command like

    % qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
    module load openmpi
    mpirun ./a.out
    ^D
    %
If the -n option is not specified on the batch job mpirun command, mpirun will start as many MPI processes as there were cpus requested with qsub. It is also possible to specify the number of processes on the batch job mpirun command, as mpirun -n 4 ./a.out, or more generally mpirun -n $PBS_NCPUS ./a.out



Using OpenMP

OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelize loops and specify certain properties of variables. (Note that OpenMP and OpenMPI are unrelated.)
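
As an illustration of what such directives look like, here is a minimal sketch of a parallelized loop in C. The function names sum_of_squares and max_threads are hypothetical. The reduction clause is the "property of a variable" in this case: it tells the compiler that each thread keeps its own partial total and that the partial totals are combined at the end of the loop.

```c
/* Sketch of an OpenMP work-sharing loop. Without the OpenMP compiler
 * flag the pragma is ignored and the loop simply runs serially. */
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1^2 + 2^2 + ... + n^2. */
double sum_of_squares(int n)
{
    double total = 0.0;
    int i;

    assert(n >= 0);

    /* The loop index i is private to each thread automatically;
     * reduction(+:total) combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:total)
    for (i = 1; i <= n; i++) {
        total += (double)i * (double)i;
    }
    return total;
}

/* Number of threads a parallel region would use (1 without OpenMP). */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

When built with OpenMP enabled, max_threads() reflects the OMP_NUM_THREADS setting, which is a quick way to confirm that the environment variable has been picked up.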
 

Compiling and linking

Fortran with OpenMP directives is compiled as:
    % ifort -openmp myprog.f -o myprog.exe
    % gfortran -fopenmp myprog.f -o myprog.exe

C code with OpenMP directives is compiled as:

    % icc -openmp myprog.c -o myprog.exe
    % gcc -fopenmp myprog.c -o myprog.exe

Running OpenMP jobs

To run the OpenMP job interactively, first set the OMP_NUM_THREADS environment variable then run the executable:
    % setenv OMP_NUM_THREADS 4
    % ./a.out
For larger jobs and production use, submit a job to the PBS batch system with something like
    % qsub -q express -lnodes=1:ppn=4,walltime=30:00,vmem=400mb -wd
    #!/bin/csh
    setenv OMP_NUM_THREADS $PBS_NCPUS
    ./a.out
    ^D
    %
OpenMP is a shared memory parallelism model - only one host (node) can be used to execute an OpenMP application. The clusters have nodes with 8 cpu cores, so it makes no sense to try to run an OpenMP application with more than 8 threads. Note that in the above qsub example, the request specifies 1 node and the number of "processors per node" (ppn) required.

You should time your OpenMP code on a single processor then on increasing numbers of CPUs to find the optimal number of processors for running it. Keep in mind that your job is charged ncpus*walltime.
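
The timing comparison can be reduced to a few lines of arithmetic. The sketch below uses hypothetical helper names (speedup, efficiency, charge) and assumes the walltimes have been measured separately for each thread count.

```c
/* Sketch: compare timed runs to choose a thread count. */
#include <assert.h>

/* Speedup of an n-cpu run relative to the single-cpu run. */
double speedup(double t1, double tn)
{
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

/* Parallel efficiency: speedup per cpu, where 1.0 is ideal scaling. */
double efficiency(double t1, double tn, int ncpus)
{
    assert(ncpus > 0);
    return speedup(t1, tn) / (double)ncpus;
}

/* Charge for a run, ncpus*walltime, in the units of walltime. */
double charge(int ncpus, double walltime)
{
    assert(ncpus > 0 && walltime >= 0.0);
    return (double)ncpus * walltime;
}
```

As a made-up example, suppose a code takes 100 s on 1 cpu, 30 s on 4 cpus and 25 s on 8 cpus. The 8-cpu run is only slightly faster but is charged 8*25 = 200 cpu-seconds against 4*30 = 120 for the 4-cpu run, so 4 cpus would be the better choice for that code.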

OpenMP Performance

Parallel loop overheads

There is an overhead in starting and ending any parallel work distribution construct - an empty parallel loop takes a lot longer than an empty serial loop. That overhead grows with the number of threads used. Meanwhile the time to do the real work has (hopefully) decreased by using more threads. So you can end up with timelines like the following for a parallel work distribution region:
              4 cpus             8 cpus

       time    ----               ----
        |     startup            startup
        |      ----
        V                         ----
               work               work

                                  ____
               ____              cleanup
              cleanup             ----
               ----
Bottom line: the amount of work in a parallel loop (or section) has to be large compared with the startup time. The startup cost is tens of microseconds, equivalent to the time for thousands of floating point operations. Add another order of magnitude because the work is split over O(10) threads, and at least one more because the work should dominate the startup cost, and you quickly need O(million) operations in a parallelised loop for it to scale well.
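
The order-of-magnitude estimate above can be written out as a small calculation. This is only a literal reading of that estimate, not a measured model: the function name min_loop_ops and its parameters are hypothetical, and the input values are the rough figures quoted in the text.

```c
/* Sketch: rough lower bound on the work a parallel loop needs. */
#include <assert.h>

/* startup_ops: startup cost expressed as an equivalent number of
 *              floating point operations
 * nthreads:    number of threads the work is split over
 * dominance:   how many times larger than the startup cost the
 *              per-thread work should be
 * Returns the total number of operations the loop should contain. */
double min_loop_ops(double startup_ops, int nthreads, double dominance)
{
    assert(startup_ops > 0.0 && nthreads > 0 && dominance > 0.0);
    return startup_ops * (double)nthreads * dominance;
}
```

With a startup cost at the upper end of "thousands" of operations (10000), about 10 threads and a factor of 10 for dominance, min_loop_ops(1.0e4, 10, 10.0) gives one million operations, which matches the O(million) figure.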

Common problems

  • One of the most common problems encountered after parallelizing a code is the generation of floating point exceptions or segmentation violations that were not occurring before. This is usually due to uninitialized variables - check your code very carefully.

  • Segmentation violations can also be caused by the thread stack size being too small. Change this by setting the environment variable OMP_STACKSIZE, for example,
    setenv OMP_STACKSIZE 5m


Code Development

Debugging

The Intel idb debugger for C, C++ and Fortran as well as GNU C/C++ can be used in either DBX or GDB mode. It supports the debugging of simple programs, core files and code with multiple threads. The GNU debugger gdb is also available. Read man idb for further information.
  1. To use idb, first compile and link your program with the -g switch, e.g.
    	% cc -g prog.c
    	
  2. Start the debugger
    	% idb ./a.out
    	
  3. Enter commands such as
        (idb) list
        (idb) stop at 10
        (idb) run
        (idb) print var
        (idb) quit
    	
  4. By starting idb up with the option -gui you get a useful graphical user interface.

Debugging Parallel programs

Totalview can be used to debug parallel MPI or OpenMP programs. Introductory information and userguides on using Totalview are available from this site.
  1. First add the following line to your .cshrc file.
    	module load totalview
    	
  2. Compile code with the -g option. For example, for an MPI program,
        % ifort -g -O0 prog.f -lmpi
    	
  3. Start Totalview. For example, to debug an MPI program using 4 processors,
    	% totalview mpirun -a -np 4 ./a.out
    	

Note that to ensure Totalview can obtain information on all variables, you should compile with no optimisation. This is the default if -g is used with no specific optimisation level.

Totalview shows source code for mpirun when it first starts an MPI job. Click on GO and all the processes will start up and you will be asked if you want to stop the parallel job. At this point click YES if you want to insert breakpoints. The source code will be shown and you can click on any lines where you wish to stop.

If your source code is in a different directory from where you fired up Totalview you may need to add the path to Search Path under the File Menu. Right clicking on any subroutine name will "dive" into the source code for that routine and break points can be set there.

The procedure for viewing variables in an MPI job is a little complicated. If your code has stopped at a breakpoint, right click on the variable or array of interest in the stack frame window. If the variable cannot be displayed then choose "Add to expression list" and the variable will appear listed in a new window. If it is marked "Invalid compilation scope" in this new window, right click again on the variable name in this window and choose "Compilation scope". Change this to "Floating" and the value of the variable or array should appear. Right clicking on it again and choosing "Dive" will give you values for arrays. In this window you can choose "Laminate" then "Process" under the View menu to see the values on different processors.

Under the Tools option on the top toolbar of the window displaying the variable values you can choose Visualize to display the data graphically which can be useful for large arrays.

It is also possible to use Totalview for memory debugging, showing graphical representations of outstanding messages in MPI or setting action points based on evaluation of a user defined expression. See the Totalview User Guide for more information.

For more information on memory debugging see here.

Profiling

The gprof profiling tool is available for sequential codes.
  1. The gprof profiler provides information on the most time-consuming subprograms in your code. Profiling the executable prog.exe will lead to profiling data being stored in gmon.out which can then be interpreted by gprof as follows:
    	% ifort -p -o prog.exe prog.f
    	% ./prog.exe
    	% gprof ./prog.exe gmon.out 
    	
    For the GNU compilers do
        % gfortran -pg -o prog.exe prog.f
        % gprof ./prog.exe gmon.out
             
    gprof is not useful for parallel code. More information on using gprof is available here.

Graphical profiling of MPI Code

  1. COMING!!
