What is the National Facility?

NCRIS 5.16 - Platforms for Collaboration

  • NCRIS - National Collaborative Research Infrastructure Strategy
  • NCRIS 5.16 Platforms for Collaboration http://www.pfc.org.au
    • Capability Computing (NCI)
    • Data Commons (ANDS)
    • Research Collaboration Services, Grid (ARCS)
    • Research Connectivity (AARNET)


National Facility

URL   http://nf.nci.org.au/

Email   help@nf.nci.org.au

Detailed usage information in the User Guide


National Facility Resources


The VU (vayu) at the National Facility

Image of vayu cluster goes here


The XE Architecture

Schematic Diagram of the XE
Note
  • VU is the same basic architecture
  • Multiple filesystems, and the filesystem locality
  • Multiple nodes, most only accessible through PBS
  • Distributed vs shared memory


Sun Constellation Cluster - 'vayu'

Hardware:

  • 746 X6275 blades (1492 nodes) in 16 C48 racks
  • 4 M9 quad-data-rate (40Gb/s) InfiniBand switches
  • 40 auxiliary servers
  • Approx 900TB of Lustre storage with 26 Object Servers
  • Compute nodes contain:
    • no disks!
    • Flash DIMM for swap
    • 2 NUMA nodes


Applying for accounts on VU

Online application forms at http://nf.nci.org.au/accounts/

  • National Merit Allocation Scheme (MAS)

  • Partner allocations

  • Startup allocation

Distribution of time across VU and XE:

  • MAS 42%
  • ANU 24%
  • CSIRO 24%
  • INTERSECT 4%
  • Monash E-Research .5%
  • iVec .4%
  • QCIF .3%


Accessing the VU

Example 1

Logging on to the vu - for example for course account aaa777.

The project code is c23

        ssh vu.nci.org.au -l aaa777
        cd INTRO_COURSE
        ls -l
If you already have an account on the VU please use a course account for today to save any complications with the exercises and connection to the Mass Data Storage system. At the end of the course you can always make a copy of the course material to your own account if you wish.

Remember to read the Message of the Day (MOTD) as you log in.

Commands to try:

     # to see the node you are logged into
     uname -a


     # see how many users have batch jobs running on the vu.
     # q to quit from the less command
     nqstat -a | less


Edit Your Environment Settings

Various modules are loaded into your environment at login to provide a workable environment. To modify this for your own needs add more modules in the .cshrc, .profile or .login files.

  • Check the modules that are loaded on login by typing
       module list
       
  • Try the command emacs.
  • See what is on your path by typing
       echo $PATH
       

Type module avail to see which software packages are installed and accessible in this way.


Summary - Logging in and Getting Information


Editors on the XE

Several editors are available.

  • vi
  • emacs
  • nano
If you are not familiar with any of these you will find that nano has a simple interface. Just type nano.


Xterm under Windows or Macintosh

VNC freeware (Virtual Network Computing) http://nf.nci.org.au/facilities/software

  • vncviewer
    to start VNC, then enter the
    machine:port_number
Be sure to log out of xterm sessions, and quit the Window Manager before leaving the system.
More information on using VNC is available from the VNC software web page.

There is also information on using ssh and sftp.


Compiling, Optimising, using Libraries


Compiling and Optimising

We recommend that you use the Intel compilers. Note that the Intel C/C++ compiler is compatible with gcc/g++.

module list
module avail intel-fc
module avail intel-cc
  • Compiling and Linking
           ifort -o matmulf matmul.f
           matmulf
           icc -o matmulc matmul.c 
           matmulc
           icpc -o matmulC matmul.C
           matmulC
           
  • Read the reference pages for the compilers:
           man ifort
           man icc
           man icpc
           
See the User Guide.


Using Compiler Options

Example

  • Optimisation Compiler Options
    • Default optimisation level is -O2
    • No optimisation, -O0, is very, very slow.
    • Debug option, -g, uses -O0
    • Highest optimisation is -O3, use with care.
  • Compiler Options, C++:
    • Link C++ code with icpc, not icc
The default optimisation level for the Intel compilers is -O2.
The default optimisation level for the GNU compilers is -O0.
READ THE MAN PAGES.


Compiler optimisation

Example

ifort -O3 -o matmulf matmul.f
time matmulf
ifort -O0 -o matmuls matmul.f
time matmuls
icc -O3 -o matmulf matmul.c
time matmulf
icc -O0 -o matmuls matmul.c
time matmuls


Libraries

Example

Using the Intel MKL BLAS library.

  • Look at the code in blas_ex.f or blas_ex.c and blas_ex.C
  • These call the BLAS complex dot product (Fortran) routine zdotc.
  • We recommend using the Intel-MKL library
    Check out the software web page.
    You have to load the relevant module to access these libraries,
           module avail intel-mkl
           module load intel-mkl
    
  • Note that there are significant differences between the linking instructions for versions 9 and 10. The XE has version 9 as default and the VU has version 10 as default.
  • These libraries can be run in parallel using OpenMP.


Libraries

Example Continued

  • Fortran programmers
           ifort -o blas_exf blas_ex.f -L$MKL/lib/em64t -lmkl_em64t -lguide -lpthread
           blas_exf
            

  • C programmers
           icc -o blas_exc blas_ex.c -L$MKL/lib/em64t -lmkl_em64t -lguide -lpthread
           blas_exc
     	
  • C++ programmers
           icpc -o blas_exC blas_ex.C -L$MKL/lib/em64t -lmkl_em64t -lguide -lpthread
           blas_exC
    	
      
See MKL software web page.


Batch Queueing System


Batch Queueing System

  • Most work is done as batch jobs (interactive process limits are small).

  • Queueing system:
    • distributes work evenly over the system
    • ensures that jobs cannot impact each other (e.g. exhaust memory or other resources)
    • provides equitable access to the system

  • APAC-NF uses a modified version of OpenPBS


Using the Queueing System

  • Read the "PBS Batch Use" and "Queues and Scheduling" sections of the User Guide

  • Request resources for your job (using qsub).
    See man pbs_resources:
    • walltime
    • (v)memory
    • disk (jobfs)
    • number of cpus
    • software

  • PBS will then
    • schedule the job when the resources become available
    • prevent other jobs from infringing on the allocated resources
    • if necessary, delay starting the job until a software licence is available
    • display progress of the jobs (nqstat)
    • terminate the job when it exceeds its requested resources
    • return stdout and stderr in batch output files


Scheduling and Job Suspension

  • Jobs won't be started until sufficient resources are free
  • Resources allocated to a job are unavailable to other jobs
  • Running jobs can be suspended to let parallel jobs run, but the fraction of time a job spends suspended is limited (it depends on how many jobs you have running, number of cpus, etc.)

Only ask for the resources your job really needs!
  • Avoids your job being delayed in the queue or suspended unnecessarily
  • Avoids other users' jobs being delayed unnecessarily by wasted resources
  • Experiment in express and look at the bottom of the PBS stdout file to see what resources were used by jobs


Batch queue structure

normal

  • Default queue designed for production use
  • Charging rate of 1 SU per processor-hour (walltime)
  • Largest allowed resources
  • If your grant is exhausted you still get access at a lower priority

express

  • High priority for testing, debugging etc.
  • Charging rate of 3 SUs per processor-hour (walltime)
  • Smaller limits to discourage "production use" by projects with too much grant left

copyq

  • Used for file manipulation - e.g. copying files to MDSS
  • Only queue to run on the file server node for /short

Job charging is based on wall clock time used, number of cpus requested, queue choice and machine choice. One hour of time on the XE is worth half of one hour on the VU.

See:   nf_limits


Long running jobs?

  • Many (most?) users run jobs that last longer than the queue limits

  • Use checkpoint-restart - periodically (every hour or so) write out the state of your job to a file and give your program the ability to restart from these checkpoints

  • Allows long jobs AND protects against hardware/system failure


Submit a job to the batch queue

Example 6

First look at the man pages for qsub, nf_limits, nqstat and qdel, and type nf_limits to see what restrictions there are on size and number of jobs that you can submit to the PBS queuing system.
    man qsub
    man nf_limits
    man nqstat
    man qdel
    nf_limits

Look at the job script runjob, then submit it using qsub

    less runjob
    qsub runjob
    nqstat
    qps jobid
View the output in the file runjob.o**** and any error messages in runjob.e**** after the job completes.

  • Write a job script called "sleepjob" which requests 5 minutes walltime limit, and executes the command "sleep 300".
  • Submit this job, note the returned job id number, then use qdel to delete it from the queue.


runjob script


#!/bin/csh
#PBS -wd 
#PBS -q express
#PBS -l walltime=00:02:00,vmem=50MB
time matmulfs
time matmulff


Filesystems


Filesystems Overview

The Filesystems section of the User Guide has this table in greater detail:

Filesystem  Size Limit               Backup    Location                                 Time Limit
/home       ~800MB                   Yes       Global                                   No
/short      20GB - 40GB per project  No        Global                                   60 days
/jobfs      50GB                     No        Local to node                            Duration of job
MDSS        20GB                     2 copies  External, access using special commands  No

Note that these limits can be changed on request if necessary.


Global File Systems

/home
Used for irreproducible data - e.g. source code, scripts, etc.

/short
Used for "active" data files - the input and output data of batch jobs. Allocated per project.


Writing to /short

Example 7a

Look at your project's /short area. Anyone from your project can create their own directories and files here. Create a directory of your own under your project area.
      cd /short/$PROJECT
      ls -ld .
      mkdir $USER
Remember that files in /short are not backed up and are deleted after the "expiry" number of days if not accessed.


Input/Output Warning

  • Lots of small IO to /short (or /home) can be very slow and can severely impact other jobs on the system.

  • Avoid "dribbly" IO, e.g. writing 2 numbers from your inner loop.
    Writing to /short every second is far too often!
  • Use buffered IO. For Fortran, to get 32KB buffering use
    open(unit=...,file=...,buffered="yes",buffercount=4)
  • Use /jobfs instead of /short for jobs that do lots of file manipulation


Using the MDSS

Accessing and Using the Mass Data Storage System (MDSS)

  • MDSS is used for long term storage of large datasets.

  • If you have numerous small files to archive - bundle into a tar file FIRST.

  • Every project has a directory on the MDSS at
    /massdata/$PROJECT
    All members of the project group have read and write access to the top project directory.

  • The mdss command can be used to "get" and "put" data between the interactive nodes of the vu or xe and the MDSS, as well as to list files and directories on the MDSS.

  • netcp and netmv can be used from within batch jobs to
    • Generate a batch script for copying/moving files to the MDSS
    • Submit the generated batch script to the special copyq which runs copy/move job on an interactive node.
  • netcp and netmv can also be used interactively to save you work creating tar files and generating mdss commands.


Using the MDSS - Example

Example 7b

To see these commands in action do
     cd /short/$PROJECT/$USER
     mdss get Data/data.tar
     ls -l
     tar xvf data.tar
     ls
     rm data.tar
     mdss mkdir $USER
     netmv -t $USER.tar DATA $USER
     nqstat
     more DATA.o*
     mdss sls $USER
     mdss rm $USER/$USER.tar


Using /jobfs

  • Fast IO local to the compute node
    (not shared beyond the 64 cpus of the partition)

  • Only available through queueing system:
    Request like -ljobfs=1GB
    Access via PBS_JOBFS environment variable
  • All files are deleted at end of job. Copy what you need to /short or other global filesystem in job script.

  • Cannot use mdss or netcp commands for files on /jobfs.


Managing Files between /short, /jobfs and MDSS

Example 7c

Submit a batch job with a /jobfs request, where the job:
  • Copies an input file from /short to /jobfs
  • Runs a code to use the input file and generate some output
  • Saves the output data back to the /short area
  • Uses the netcp command to archive the data to the MDSS
First compile the code which reads an input file and generates a single output file:
       cd ~/INTRO_COURSE
       ifort -O3 -o jacobs jacob_serial.f
Read the runjobfs script then submit it to the queueing system, monitor the job with nqstat, and examine the batch job output files:
       qsub runjobfs
       nqstat
       cat runjobfs.e*
       cat runjobfs.o*
Check out the output file that this job created on /short and the copy on the MDSS
       cd /short/$PROJECT/$USER
       ls -ltr
       less save_data.o*
       mdss sls $USER
       mdss rm -r $USER


Example job script using /jobfs

#!/bin/bash
#PBS -q express
#PBS -l walltime=2:00
#PBS -l jobfs=10mb
#PBS -l vmem=30mb

cd $PBS_JOBFS

echo "Moving files from home directory to the local directory"
cp $HOME/INTRO_COURSE/input.1 .
cp $HOME/INTRO_COURSE/jacobs .

# Run program and write an output file to the local disk.
time ./jacobs < input.1 > output$PBS_JOBID 2>&1

# Move output data to /short space.
mv output$PBS_JOBID /short/$PROJECT/$USER
echo "The output files are now on my /short space."

# Archive to MDSS using netcp
cd /short/$PROJECT/$USER
netcp -N save_data output$PBS_JOBID $USER/output$PBS_JOBID



Parallelisation

  • Writing parallel code (OpenMP, MPI)
  • Automatic parallelisation with Intel compiler
  • Running OpenMP code
  • Running MPI code
  • Debugging


Parallelism

Shared vs Distributed Memory Schematic Diagram


An Accountant's View of the Parallel Economy

Example:

12% serial code (not parallelisable)
88% parallel code
No. of CPUs   Walltime      Walltime        Total      Total Cost
              Serial Part   Parallel Part   Walltime   (= Walltime * ncpus)
1             12            88              100        100
2             12            44              56         112
4             12            22              34         136
8             12            11              23         184
16            12            5.5             17.5       280


Writing parallel code

Popular models of parallel computation:

  • OpenMP
  • MPI (Message Passing Interface)
  • HPF (not discussed today)
  • Posix threads (not discussed today)


OpenMP Advantages

  • Correct single-thread programs can be parallelised incrementally.
      => Sequence of increasingly more efficient parallel programs that are always correct.

  • Only one version of the source code to maintain for both single-thread and multi-thread operation.


OpenMP Disadvantages

  • OpenMP operation is restricted to shared-memory systems.

  • OpenMP directives are usually restricted to the parallelisation of loops or systems of loops. Other forms of OpenMP parallelisation usually break the single source-code advantage above.

Note that shared memory parallelism may be relevant for multicore architectures.


Running OpenMP code

Example 8

For Fortran users look at matmul_omp.f
           ifort -O3 -openmp matmul_omp.f -o matmul_omp
           setenv OMP_NUM_THREADS 2
           time matmul_omp
           setenv OMP_NUM_THREADS 4
           time matmul_omp
  • These timings may not scale as expected because of conflicts with other users on the interactive nodes.
  • Do the timings by submitting a batch job.
    1. Edit the script runompjob
    2. Submit it to the queuing system.
    3. The timings will appear in runompjob.e**** and other output in runompjob.o****.

For C programmers, compile with

           icc -openmp matmul_omp.c -lm -o matmul_omp
then as above.


Automatic parallelisation

Example 9

The code jacob_serial.f is a serial program which solves the Helmholtz equation using the Jacobi method. To see how this code runs do
          ifort -O3 -o jacobs jacob_serial.f 
          time jacobs <  input.1

Now see what the automatic parallelisation can achieve. Do

          ifort -o jacobp -O3 -parallel jacob_serial.f

          setenv OMP_NUM_THREADS 2
          time jacobp < input.1
          setenv OMP_NUM_THREADS 4
          time jacobp < input.1
C programmers can look at matmul.c. Do
          icc -o matmulp -parallel  matmul.c -lm
          setenv OMP_NUM_THREADS 2
          time matmulp
          setenv OMP_NUM_THREADS 4
          time matmulp


MPI

An MPI application can be viewed as several copies of the same program running on individual but interconnected computers.

Each copy knows its own process number (often referred to as its id or rank), so the instruction flow can differ from one copy to the next.

ADVANTAGES

  • MPI programs can be run on any computer equipment
  • The programming model is simple and the parallelism is explicit (no "gotchas")
  • Virtually any application can be parallelised in some way with MPI
  • Performance/scaling is usually better

DISADVANTAGES

  • Often requires complete recoding/redesign of programs for data decomposition - no incremental parallelism
  • Low level functionality - can be tedious for some problems (can be alleviated with good programming)


Running MPI code

Example 10

There are four MPI example codes in the INTRO_COURSE directory.

To see a simple MPI code run:

          mpif90 mpiexample1.f -o mpiexample.exe 
          mpirun -np 4 mpiexample.exe
or for a more complicated example:
          mpif90 mpiexample2.f -o mpiexample.exe 
          mpirun -s -n 4  mpiexample.exe

mpirun is the usual command to start an MPI program. See

          man mpirun 
for further details on usage. A simple example in C:
          mpicc mpiexample3.c -o mpiexample.exe 
          mpirun -np 4  mpiexample.exe
and for a more complicated code:
          mpicc mpiexample4.c -o mpiexample.exe 
          mpirun -n 4 mpiexample.exe
The job script runmpijob can be used to submit an MPI job to the batch queues.


MPI vs OpenMP - which should I use?

OpenMP:

  • Shared data (good and bad)
  • Supposedly "simpler" and easier to use
  • Incremental parallelism (use existing codes)
  • Scalability?
  • Only on shared memory SMPs or multicore
  • Proprietary compilers are necessary (but OpenMP support will be in gcc 4.2)
MPI:
  • Shared nothing!
  • Programming model clearer
  • Can be tedious for some distributed data
  • "All or nothing" (start from scratch)
  • Portable - "runs on anything"
  • Download free library or use vendor specific library


Debugging


idb

Example 11

First look at the Intel debugger on a sequential program.
The following shows some introductory aspects of idb in action.

C:

       > icc -g matmul.c -lm
       > idb a.out
         Linux Application Debugger for Itanium(R)-based applications, Version 8.1-10, Build 20050429
        ------------------ 
        object file name: a.out 
        Reading symbolic information from /home/900/mhk900/INTRO_COURSE/a.out...done
         (idb) help
         (idb) stop in main
         (idb) run
         (idb) list 5:10
         (idb) print i
         (idb) print j
         (idb) step
         (idb) print i
         (idb) print j
         (idb) step
         (idb) print i
         (idb) print j
         (idb) where
         (idb) cont
         (idb) quit
       

Fortran:

       > ifort -g matmul.f
       > idb a.out
       Linux Application Debugger for Itanium(R)-based applications, Version 8.1-10, Build 20050429
       ------------------ 
       object file name: a.out 
       Reading symbolic information from /home/900/mhk900/INTRO_COURSE/a.out...done
         (idb) list
         (idb) stop at 8
         (idb) run
         (idb) print i
         (idb) print j
         (idb) where
         (idb) cont
         (idb) quit
       

There is a graphical interface for idb which can be invoked by idb -gui a.out


Totalview

Example 12

  • Totalview can be used to debug sequential or parallel programs.

  • Follow the instructions on the software web page.

  • Use one of the MPI examples and insert break points and look at current values of variables.