National Computational Infrastructure
NCI National Facility

Using ac.nci.org.au (SGI Altix cluster)

AC: Compiling

Compilers and Options

  1. Several versions of the Intel compilers are installed. For Intel 8.1 these are icc for the C compiler, ifort for the Fortran compiler and icpc for the C++ compiler. To access these compilers you will need to load the relevant module. Type module avail to see what versions of the Intel compiler are available and which is the default. Note that the C and Fortran compilers are loaded as separate modules, for example, module load intel-cc/8.1.030 and module load intel-fc/8.1.026 to load a version of the Intel8 compiler.
  2. User Guides and other documentation are available online for the Intel Fortran and C/C++ compilers. Local copies of the compiler documentation can be accessed via the software web pages for Fortran and C/C++ by choosing a particular version.
  3. If required, the GNU compilers gfortran, gcc and g++ are available.
  4. The Intel C and C++ version 8.1 compilers are compatible with GCC 3.2, 3.3 and 3.4. The GCC version used is that of the gcc found on your path, but it can be changed with the -gcc-version option.
  5. The mechanism for linking Fortran and C++ applications has changed between the Intel8 and Intel9 compilers. If the Intel Fortran driver (ifort) is used to link mixed C++ and Fortran applications, the appropriate C++ runtime libraries must be added explicitly to the command line. If the Intel C++ driver (icc) is used to link mixed C++ and Fortran applications, the appropriate Fortran runtime libraries must be added to the command line. For example, the link line could be
    ifort prog.f90 prog_cpp.o -cxxlib-gcc -lstdc++
  6. Mixing object files compiled with different Fortran compilers, g77 and ifort, is not recommended.
  7. A full list of compiler options can be obtained from the man page for each command. Some pointers to useful options for the Intel compiler are as follows:
    • We recommend that Intel Fortran users start with the options -O2 -ip -fpe0.
    • The default -fpe setting for ifort is -fpe3, which means that all floating point exceptions produce exceptional values and execution continues. To be sure that you are not getting floating point exceptions use -fpe0. This means that floating point underflows are set to zero and all other exceptions cause the code to abort. Versions of the Intel Fortran compiler on the ac now have -fpe0 overriding the default, so code will abort if it encounters a floating point exception such as divide-by-zero or overflow. If you are certain that these exceptions can be ignored then you can recompile with the -fpe3 option.
    • -fast sets the options -O3 -ipo -static and maximises speed across the entire program.
    • However -fast cannot be used for MPI programs as the MPI libraries are shared, not static. To use with MPI programs use -O3 -ipo.
    • -fast should also not be used when compiling OpenMP code as it causes an older threads library to be statically linked.
    • The -ipo option provides interprocedural optimization but should be used with care as it does not produce the standard .o files. Do not use this if you are linking to libraries.
    • -O0, -O1, -O2, -O3 give increasing levels of optimisation, from no optimisation to aggressive optimisation. The option -O is equivalent to -O2.
    • -parallel tells the auto-parallelizer to generate multithreaded code for loops that can safely be executed in parallel. This option requires that you also specify -O2 or -O3. Before using this option for production work make sure that it is resulting in a worthwhile increase in speed by timing the code on a single processor then multiple processors. This option rarely gives appreciable parallel speedup.
  8. Handling floating exceptions.
    • The standard Intel ifort option is -fpe3 by default. However this may lead to large system time overheads as the floating point exceptions are handled in software. The option -fpe0 will lead to the code aborting at errors such as divide-by-zero and is now the locally set default for ifort on the ac.
    • For C++ code using icpc the default behaviour is to replace arithmetic exceptions with NaNs and continue the program. If you rely on seeing the arithmetic exceptions and the code aborting, you will need to include the fenv.h header and raise signals by using feenableexcept (see the sketch after this list). See man fenv for further details.
  9. The Intel compiler provides an optimised math library, libimf, which is linked before the standard libm by default. If the -lm link option is used then this behaviour changes and libm is linked before libimf.
  10. Be prepared for long compilation times.
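As an illustration of the feenableexcept mechanism mentioned in point 8, here is a minimal, generic C sketch (not specific to the AC) that makes divide-by-zero, overflow and invalid operations abort the program instead of continuing with NaNs or Infs:

    #define _GNU_SOURCE
    #include <fenv.h>
    #include <stdio.h>

    int main(void)
    {
        /* feenableexcept() is a GNU extension declared in fenv.h; it
           unmasks the listed floating point exceptions so that they
           raise SIGFPE and abort the program. */
        feenableexcept(FE_DIVBYZERO | FE_OVERFLOW | FE_INVALID);

        volatile double y = 0.0;   /* volatile stops the compiler folding the division */
        double z = 1.0 / y;        /* now raises SIGFPE instead of producing Inf */
        printf("z = %f\n", z);     /* not reached */
        return 0;
    }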


Using MPI

MPI is a message passing interface for parallel programs: you must add explicit message passing constructs to your program. To enable your program to use MPI, include the MPI header file in your source and link against the MPI libraries when you compile. A minimal example is sketched below.
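For reference, a minimal MPI program in C looks like the following (a generic sketch, not taken from the AC documentation):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                          /* shut down MPI */
        return 0;
    }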
 

Compiling and linking

For Fortran, use this include directive in the source (in either fixed- or free-form) of any program unit using MPI:
    INCLUDE 'mpif.h'
and compile with a command like:
    % ifort myprog.f -o myprog.exe -lmpi

For C, use this include directive:
    #include <mpi.h>
and compile with a command similar to:
    % icc myprog.c -o myprog.exe -lmpi 
As mentioned above, do not use the -fast option as this sets the -static option, which conflicts with the MPI libraries being shared libraries. Instead, use -O3 -ipo (which is equivalent to -fast without -static).

For the GNU compilers, compile C or C++ code with one of these:

    % gcc myprog.c -lmpi
    % g++ myprog.C -lmpi++ -lmpi

Running MPI jobs

MPI programs are executed using the mpirun or prun commands. To run a small test with 4 processes (or tasks) where the MPI executable is called a.out, enter any of the following equivalent commands:
    % mpirun -n 4 ./a.out
    % mpirun -np 4 ./a.out
    % prun -n 4 ./a.out
The argument to -n or -np is the number of a.out processes required to run.

For larger jobs and production use, submit a job to the PBS batch system with a command like

    % qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
    mpirun ./a.out
    ^D
    %
If the -n option is not specified with the batch job mpirun, mpirun will start as many MPI processes as there were cpus requested with qsub. It is also possible to specify the number of processes explicitly on the batch job mpirun command, as mpirun -n 4 a.out, or more generally mpirun -n $PBS_NCPUS a.out. An equivalent job script is sketched below.
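The same job can also be submitted as a script. A minimal sketch follows; the script name and resource values are illustrative, and it is assumed that the -wd option shown above is also accepted as a #PBS directive:

    % cat runjob.sh
    #!/bin/csh
    #PBS -q express
    #PBS -l ncpus=4,walltime=30:00,vmem=400mb
    #PBS -wd
    mpirun -n $PBS_NCPUS ./a.out
    % qsub runjob.sh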

The mpirun and prun commands are equivalent and are both just links to the locally written anumpirun command which provides:

  • compatibility with both the SGI mpirun command and the AlphaServer SC (Quadrics) prun command, including most options to both
  • integration with PBS on the AC
  • better use and control of the NUMA features of the AC
See the combined anumpirun/mpirun/prun man page for full details. The underlying MPI is provided by SGI's MPT (Message Passing Toolkit).

Further mpirun details

There are a number of options to mpirun and prun such as:
-s,--stats
Provides cumulative statistics on the MPI traffic for that mpirun job
-t
Prepends lines of standard output with the rank of the process that generated that output.
-w
Forces the program to be treated as an MPI program.
-O
Oversubscribe the cpus, i.e. request that more MPI processes be spawned than cpus requested. Generally this will require the MPI_NAP environment variable to be set.
-c num
Specifies the number of CPUs required per process (default 1). Only really necessary for hybrid MPI/OpenMP jobs.
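For example, to report cumulative MPI traffic statistics for a small 4-process run (a sketch using the -s option above; check the man page for exact option placement):
    % mpirun -s -n 4 ./a.out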
For full details read the man page for mpirun.

Controlling MPI execution

A number of environment variables are available to control the behaviour of MPI jobs. Read man mpi for a full listing. Some useful environment variables are as follows:
MPI_BUFFER_MAX
This specifies a minimum message size, in bytes, for which the message will be considered a candidate for single-copy transfer.

MPI_COMM_MAX
Sets the maximum number of communicators that can be used in an MPI program. Use this variable to increase internal default limits. (Might be required by standard-compliant programs.) MPI generates an error message if this limit (or the default, if not set) is exceeded. The default value is 256.
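For example, in a csh script you might raise the communicator limit before invoking mpirun (the value 512 is illustrative):
    % setenv MPI_COMM_MAX 512
    % mpirun -n 4 ./a.out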

Common problems and restrictions

  • The startup mechanism for SGI MPI jobs is quite complex and depends on the type of executable you are running. If you encounter errors such as
    ctrl_connect/connect: Connection refused
    or
    mpirun: MPT error (MPI_RM_sethosts): err=-1: could not run executable (case #3)
    contact us for an explanation.
  • MPI-2 dynamic process spawning is not currently supported.


Using OpenMP

OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelize loops and specify certain properties of variables. A small example is given below.
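As a minimal, generic illustration (not taken from the AC documentation), a C loop parallelized with an OpenMP directive looks like this; the reduction clause tells the compiler how to combine the per-thread partial sums:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        int i;
        double sum = 0.0;

        /* Distribute the loop iterations across the OpenMP threads.
           Each thread accumulates a private partial sum which is
           combined at the end of the loop. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++) {
            a[i] = 0.5 * i;
            sum += a[i];
        }

        printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }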
 

Compiling and linking

Fortran with OpenMP directives is compiled as:
    % ifort -O3 -openmp myprog.f -o myprog.exe

C code with OpenMP directives is compiled as:

    % icc -O3 -openmp myprog.c -o myprog.exe
(As noted above, do not use -fast when compiling OpenMP code.)

Running OpenMP jobs

To run the OpenMP job interactively, first set the OMP_NUM_THREADS environment variable then run the executable:
    % setenv OMP_NUM_THREADS 4
    % ./a.out
For larger jobs and production use, submit a job to the PBS batch system with something like
    % qsub -q express -lncpus=4,walltime=30:00,vmem=400mb -wd
    #!/bin/csh
    setenv OMP_NUM_THREADS 4
    ./a.out
    ^D
    %
The AC is configured with 32 CPUs per partition, made up of 2-CPU "NUMA" nodes. Because of the NUMA (non-uniform memory access) architecture of the AC, not all OpenMP codes will scale well to 32 processors. You should time your OpenMP code on a single processor and then on increasing numbers of CPUs to find the optimal number of processors for running it. Keep in mind that your job is charged ncpus*walltime.

Currently, the setup of the PBS batch system means that if you are using more than 8 CPUs, you should use -lncpus=N:N (for N up to 24) to ensure that the CPUs allocated are on a single partition (host).
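For example, a 16-CPU job could be submitted with (queue name, memory request and script name are illustrative):
    % qsub -q express -lncpus=16:16,walltime=30:00,vmem=8gb -wd jobscript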

To ensure data locality in the OpenMP program you should initialise, in parallel, any arrays that are going to be used in parallel later in the program. Use the same number of threads and scheduling scheme for both the initialisation and the later calculation (see the sketch below).
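A minimal, generic sketch of this "first touch" initialisation pattern in C with OpenMP (array names and sizes are illustrative):

    #define N 1000000

    static double a[N], b[N];

    /* The same static schedule is used for the initialisation and the
       computation loops, so each thread first touches (and therefore
       gets local pages for) the same parts of the arrays that it later
       works on. */
    void init_and_compute(void)
    {
        int i;

        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++) {      /* first touch: pages placed locally */
            a[i] = 0.0;
            b[i] = 1.0;
        }

        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++) {      /* same distribution: accesses stay local */
            a[i] = a[i] + 2.0 * b[i];
        }
    }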

To ensure that threads are distributed across the cpus when they do initialize data, you should use the dplace command. The dplace documentation suggests using

    dplace -x2 ./a.out
however this has been found to give poor thread distribution in some cases. In particular, you can find two threads locked onto the same cpu while another cpu is idle. We recommend using the "exact" placement options, i.e.
    dplace -e -c0,x,1-3 ./a.out
for 4 cpu jobs,
    dplace -e -c0,x,1-7 ./a.out
for 8 cpu jobs etc.
(Note that the dplace man page suggests using the option -x6 for OpenMP code - this is now incorrect.)

OpenMP Performance

Parallel loop overheads

There is an overhead in starting and ending any parallel work distribution construct - an empty parallel loop takes a lot longer than an empty serial loop. That overhead in wasted time grows with the number of threads used, while the time to do the real work has (hopefully) decreased by using more threads. So you can end up with timelines like the following for a parallel work distribution region:
              4 cpus             8 cpus

       time    ----               ----
        |     startup            startup
        |      ----
        V                         ----
               work               work

                                  ____
               ____              cleanup
              cleanup             ----
               ----
Bottom line: the amount of work in a parallel loop (or section) has to be large compared with the startup time. The startup cost is of the order of tens of microseconds, or the time to do thousands of floating point operations. Allow another order of magnitude because you are splitting the work over O(10) threads, and at least another order of magnitude because you want the work to dominate the startup cost, and you quickly need O(million) operations in a parallelised loop for it to scale acceptably.

NUMA Issues

The Altix is a fairly extreme NUMA system in which concepts of "locality" are important for achieving best performance.

Each NUMA node has 2 cpus and 4 or 8 GB of memory. The memory access overhead for referencing memory not connected to the cpu where the thread is running is quite high. Application slowdowns of 20-50% are possible due solely to variations in memory access. Ideally all threads should make the vast majority of memory requests to local memory. So at an algorithmic level, "locality" means being able to conceptually divide the memory of the job and nominate which thread "owns" that memory, i.e. which is the thread most commonly referencing that part of the memory. For a grid-based problem, this is often just a geometric division of the grid. Locality does require that the parallel DO loops consistently access most arrays, i.e. the way loop indices and hence array elements are divided amongst threads is the same for most parallel loops.

If there is locality at the algorithmic level, then that needs to be transferred to locality at the system memory and NUMA node level in terms of where the job's threads and memory live. Physical memory is allocated to your job (in 16KB "pages") at the time that memory is first referenced by your program. The placement of each page (which NUMA node it is allocated on) is generally on a "first touch" basis. By default it will be allocated in memory local to the cpu where the thread that first referenced (initialised) that memory ran.

One of the major issues is that threads can migrate from cpu to cpu but (currently) memory pages do not migrate from NUMA node to NUMA node. So to achieve good locality we require:

  • the threads to be distributed over the cpus in a consistent manner. Using dplace will ensure that threads are tied to cpus.
  • the parallel loop distribution of the array initialization loops matches that of the later calculation loops as closely as possible.

Common problems

One of the most common problems encountered after parallelizing a code is the generation of floating point exceptions or segmentation violations that were not occurring before. This is usually due to uninitialized variables - check your code very carefully.

Segmentation violations can also be caused by the thread stack size being too small. Change this by setting the environment variable KMP_STACK_SIZE to a value larger than the default of 4MB, for example:
    % setenv KMP_STACK_SIZE 5m



Code Development

Debugging

The Intel idb debugger can be used with C, C++ and Fortran code from the Intel compilers, as well as GNU C/C++, in either DBX or GDB mode. It supports the debugging of simple programs, core files and code with multiple threads. The GNU debugger gdb is also available. Read man idb for further information.
  1. To use it, first compile and link your program using the -g switch, e.g.
    	% cc -g prog.c
    	
  2. Start the debugger
    	% idb ./a.out
    	
  3. Enter commands such as
            (idb) list
    	(idb) stop at 10
    	(idb) run
    	(idb) print var
    	(idb) quit
    	
  4. By starting idb up with the option -gui you get a useful graphical user interface.

Debugging Parallel programs

Totalview can be used to debug parallel MPI or OpenMP programs. Introductory information and user guides on using Totalview are available from this site.
  1. First add the following line to your .cshrc file.
    	module load totalview
    	
  2. Compile code with the -g option. For example, for an MPI program,
    	% ifort -g -O0 prog.f -lmpi
    	
  3. Start Totalview. For example, to debug an MPI program using 4 processors,
    	% totalview mpirun -a -np 4 ./a.out
    	

Note that to ensure that Totalview can obtain information on all variables compile with no optimisation. This is the default if -g is used with no specific optimisation level.

Totalview shows source code for mpirun when it first starts an MPI job. Click on GO and all the processes will start up and you will be asked if you want to stop the parallel job. At this point click YES if you want to insert breakpoints. The source code will be shown and you can click on any lines where you wish to stop.

If your source code is in a different directory from where you fired up Totalview you may need to add the path to Search Path under the File Menu. Right clicking on any subroutine name will "dive" into the source code for that routine and break points can be set there.

The procedure for viewing variables in an MPI job is a little complicated. If your code has stopped at a breakpoint, right click on the variable or array of interest in the stack frame window. If the variable cannot be displayed, choose "Add to expression list" and the variable will appear listed in a new window. If it is marked "Invalid compilation scope" in this new window, right click again on the variable name in this window and choose "Compilation scope". Change this to "Floating" and the value of the variable or array should appear. Right clicking on it again and choosing "Dive" will give you values for arrays. In this window you can choose "Laminate" then "Process" under the View menu to see the values on different processors.

Under the Tools option on the top toolbar of the window displaying the variable values you can choose Visualize to display the data graphically which can be useful for large arrays.

It is also possible to use Totalview for memory debugging, showing graphical representations of outstanding messages in MPI or setting action points based on evaluation of a user defined expression. See the Totalview User Guide for more information.

For more information on memory debugging see here.

Profiling

The gprof and histx profiling tools are available on the AC for sequential codes and histx can be used to investigate performance issues for parallel code.
  1. The gprof profiler provides information on the most time-consuming subprograms in your code. Profiling the executable prog.exe will lead to profiling data being stored in gmon.out which can then be interpreted by gprof as follows:
    	% ifort -p -o prog.exe prog.f
    	% ./prog.exe
    	% gprof ./prog.exe gmon.out 
    	
    For the GNU compilers do
            % g77 -pg -o prog.exe prog.f
            % gprof ./prog.exe gmon.out
             
    gprof is not useful for parallel code. More information on using gprof is available here.

  2. Profiling information to line level for sequential code can be obtained using histx. First load the histx module, then compile with the -g option, run the executable under histx and view the profiling output with iprep as follows:
            % module load histx
            % ifort -g -O3 prog.f
            % histx -l ./a.out
            % iprep histx.a.out.***
      
    Type histx to see the various options available, for example,
             % histx -s callstack10 a.out
             % histx -e pm:CPU_CYCLES@1000 a.out
             % histx -e pm:L3_REFERENCES@1000 a.out
      
    More details are on the AC in /opt/histx/histx-1.3b/doc/histx.txt

  3. For MPI codes, histx can be used to assess how much time the code is spending in MPI calls, in particular whether some or all of the processes spend a significant amount of time spinning, that is, waiting for MPI calls to complete. To use histx, after compiling your MPI code run histx through mpirun as follows:
            % module load histx
            % ifort -O3 mpiprog.f -lmpi
            % prun -n 4 histx ./a.out
            % iprep histx.a.out.*
         
    For this example there will be five histx.a.out.* files produced, one for each process and one for the MPI shepherd process. Look at the iprep output for the four compute processes. The percentage of time spent in MPI calls is reported in the counts for the different calls from libmpi.so. Well balanced, efficient MPI code will show low percentages for the MPI calls and spend the major part of its time in the computational code in a.out.
    Using histx -l will give profiling information to line level.

  4. To profile OpenMP code use the -openmp_profile option to the Intel compilers. Then, after the code has run, a file guide.gvs is created and this contains details of time spent in different parallel regions of the code.

    Histx can also be used in the same way as for MPI codes.

              % module load histx
              % ifort -openmp -g -O2 omp_prog.f90
              % histx -l a.out
              % iprep histx.a.out.*****
    

Graphical profiling of MPI Code

  1. The Intel Trace Collector and Analyzer have been installed. These provide considerable detail about the messages being passed.
    To access them, load both modules:
    module load intel-itc
    module load intel-ita
    More details are given on the software web page.

  2. Jumpshot-4 is also available for investigating issues such as load balancing in MPI codes, provided the run does not generate too large a log file. To use it for a Fortran program do
          % module load jumpshot
          % ifort -O3 prog.f -L/opt/jumpshot/mpe/lib -lmpe_f2cmpi -llmpe -lmpe -lmpi
          % mpirun -n 4 ./a.out
     
    The trace information is stored in the file Unknown.clog. Then start up jumpshot as
           % jumpshot Unknown.clog
      
    Once jumpshot is started the first step is to convert the .clog file to .slog2 format. Details on using Jumpshot are given on the software pages.
