|
The VU (vayu) at the National Facility
|
|
The XE Architecture
Note
- VU is the same basic architecture
- Multiple filesystems, and the filesystem locality
- Multiple nodes, most only accessible through PBS
- Distributed vs shared memory
|
|
|
Sun Constellation Cluster - 'vayu'
Hardware:
- 746 X6275 blades (1492 nodes) in 16 C48 racks
- 4 M9 Quad-data-rate (40Gb/s) Infiniband switches
- 40 auxiliary servers
- Approx 900TB of Lustre storage with 26 Object Servers
- Compute nodes contain:
  - no disks!
  - Flash DIMM for swap
  - 2 NUMA nodes
|
|
|
Applying for accounts on VU
Online application forms at http://nf.nci.org.au/accounts/
- National Merit Allocation Scheme (MAS)
- Partner allocations
- Startup allocation
Distribution of time across VU and XE:
- MAS 42%
- ANU 24%
- CSIRO 24%
- INTERSECT 4%
- Monash E-Research 0.5%
- iVec 0.4%
- QCIF 0.3%
|
|
|
Accessing the VU
Example 1
Logging on to the VU, for example with course account aaa777.
The project code is c23.
ssh vu.nci.org.au -l aaa777
cd INTRO_COURSE
ls -l
If you already have an account on the VU please use a course account for
today to save any complications with the exercises and connection to
the Mass Data Storage system. At the end of the course you can always
make a copy of the course material to your own account if you wish.
Remember to read the Message of the Day (MOTD) as you login.
Commands to try:
# to see the node you are logged into
uname -a
# see how many users have batch jobs running on the system
# (press q to quit from the less command)
nqstat -a | less
|
|
|
Edit Your Environment Settings
Various modules are loaded into your environment at login to provide a
workable environment. To modify this for your own needs, add commands to
load more modules to your .cshrc, .profile or .login file.
- Check the modules that are loaded on login by typing
module list
- Try the command emacs.
- See what is on your path by typing
echo $PATH
Type module avail to see which software
packages are installed and accessible in this way.
|
|
|
Summary - Logging in and Getting Information
|
|
|
Editors on the XE
Several editors are available.
If you are not familiar with any of these you will find that nano has a simple interface. Just type nano.
|
|
|
Xterm under Windows or Macintosh
VNC freeware (Virtual Network Computing)
http://nf.nci.org.au/facilities/software
Run vncviewer to start VNC, then enter the machine:port_number.
Be sure to log out of xterm sessions, and quit the Window Manager,
before leaving the system.
More information on using VNC is available from the VNC software web page.
There is also information on using ssh and sftp.
|
|
|
Compiling, Optimising, using Libraries
|
|
|
Compiling and Optimising
We recommend that you use the Intel compilers. Note that the Intel C/C++ compiler
is compatible with gcc/g++.
module list
module avail intel-fc
module avail intel-cc
See the User Guide.
|
|
|
Using Compiler Options
Example
- Optimisation Compiler Options
  - Default optimisation level is -O2
  - No optimisation, -O0, is very, very slow.
  - Debug option, -g, uses -O0
  - Highest optimisation is -O3, use with care.
- Compiler Options, C++:
  - Link C++ code with icpc, not icc
The default optimisation level for the Intel compilers is -O2.
The default optimisation level for the GNU compilers is -O0.
READ THE MAN PAGES.
|
|
|
Compiler optimisation
Example
ifort -O3 -o matmulf matmul.f
time matmulf
ifort -O0 -o matmuls matmul.f
time matmuls
icc -O3 -o matmulf matmul.c
time matmulf
icc -O0 -o matmuls matmul.c
time matmuls
|
|
|
Libraries
Example
Using the Intel MKL BLAS library.
|
|
|
Libraries
Example Continued
- Fortran programmers
ifort -o blas_exf blas_ex.f -L$MKL/lib/em64t -lmkl_em64t -lguide -lpthread
blas_exf
- C programmers
icc -o blas_exc blas_ex.c -L$MKL/lib/em64t -lmkl_em64t -lguide -lpthread
blas_exc
- C++ programmers
icpc -o blas_exC blas_ex.C -L$MKL/lib/em64t -lmkl_em64t -lguide -lpthread
blas_exC
See MKL software web page.
|
|
|
Batch Queueing System
- Most work done as batch jobs (interactive process limits are small).
- Queueing system:
  - distributes work evenly over the system
  - ensures that jobs cannot impact each other (e.g. exhaust
    memory or other resources)
  - provides equitable access to the system
- APAC-NF uses a modified version of OpenPBS
|
|
|
Using the Queueing System
- Read the "PBS Batch Use" and "Queues and Scheduling" sections of
the Userguide
- Request resources for your job (using qsub).
  See man pbs_resources:
  - walltime
  - (v)memory
  - disk (jobfs)
  - number of cpus
  - software
- PBS will then
  - schedule the job when the resources become available
  - prevent other jobs from infringing on the allocated resources
  - if necessary, delay starting the job until a software licence is available
  - display progress of the jobs (nqstat)
  - terminate the job when it exceeds its requested resources
  - return stdout and stderr in batch output files
|
|
|
Scheduling and Job Suspension
- Jobs won't be started until sufficient resources are free
- Resources allocated to a job are unavailable to other jobs
- Jobs can be suspended to let parallel jobs run, but the fraction of time
  a job spends suspended is limited (it depends on how many jobs you have
  running, the number of cpus, etc.)
Only ask for the resources your job really needs!
- Avoids your job being delayed in the queue or suspended unnecessarily
- Avoids other users' jobs being delayed unnecessarily by wasted
  resources
- Experiment in the express queue and look at the bottom of the PBS stdout
  file to see what resources each job used
|
|
|
Batch queue structure
normal
- Default queue designed for production use
- Charging rate of 1 SU per processor-hour (walltime)
- Largest allowed resources
- If your grant is exhausted you still get access at a lower priority
express
- High priority for testing, debugging etc.
- Charging rate of 3 SUs per processor-hour (walltime)
- Smaller limits to discourage "production use" by projects with
too much grant left
copyq
- Used for file manipulation - e.g. copying files to MDSS
- Only queue to run on the file server node for /short
Job charging is based on wall clock time used, number of cpus requested,
queue choice and machine choice.
One hour of time on the XE is worth half of one hour on the VU.
See: nf_limits
|
|
|
Long running jobs?
- Many (most?) users run jobs that last longer than the queue
limits
- Use checkpoint-restart - periodically (every hour or so) write out the
state of your job to a file and give your program the ability
to restart from these checkpoints
- Allows long jobs AND protects against hardware/system failure
|
|
|
Submit a job to the batch queue
Example 6
First look at the man pages for qsub, nf_limits, nqstat and qdel,
and type nf_limits to see what restrictions there are on size and number of
jobs that you can submit to the PBS queuing system.
man qsub
man nf_limits
man nqstat
man qdel
nf_limits
Look at the job script runjob, then submit it using qsub
less runjob
qsub runjob
nqstat
qps jobid
View the output in the file runjob.o**** and any error messages in
runjob.e**** after the job completes.
- Write a job script called "sleepjob" which requests 5 minutes
walltime limit, and executes the command "sleep 300".
- Submit this job, note the returned job id number, then use
qdel to delete it from the queue.
|
|
|
runjob script
#!/bin/csh
#PBS -wd
#PBS -q express
#PBS -l walltime=00:02:00,vmem=50MB
time matmulfs
time matmulff
|
|
|
Filesystems Overview
The Filesystems section of the userguide has this table in greater detail:
| Filesystem | Size Limit | Backup | Location | Time Limit |
| /home | ~800MB | Yes | Global | No |
| /short | 20GB - 40GB per project | No | Global | 60 days |
| /jobfs | 50GB | No | Local to node | Duration of job |
| MDSS | 20GB | 2 copies | External, access using special commands | No |
Note that these limits can be changed on request if necessary.
|
|
|
Global File Systems
- /home
  - Used for irreproducible data - e.g. source code, scripts, etc.
- /short
  - Used for "active" data files - the input and output data
    of batch jobs. Allocated per project.
|
|
|
Writing to /short
Example 7a
Look at your project's /short area. Anyone from your project
can create their own directories and files here.
Create a directory of your own under your project area.
cd /short/$PROJECT
ls -ld .
mkdir $USER
Remember that files in /short are not backed up and are deleted
after the "expiry" number of days if not accessed.
|
|
|
Input/Output Warning
- Lots of small IO to /short (or /home) can be very slow
and can severely impact other jobs on the system.
- Avoid "dribbly" IO, e.g. writing 2 numbers from
your inner loop.
Writing to /short every second is far too often!
- Use buffered IO. For Fortran, to get 32KB buffering use
open(unit=...,file=...,buffered="yes",buffercount=4)
- Use /jobfs instead of /short for jobs that do lots of file manipulation
|
|
|
Using the MDSS
Accessing and Using the Mass Data Storage System (MDSS)
- MDSS is used for long term storage of large datasets.
- If you have numerous small files to archive, bundle them into a tar file
  FIRST.
- Every project has a directory on the MDSS at
  /massdata/$PROJECT
  All members of the project group have read and write access to
  the top project directory.
- The mdss command can be used to "get" and "put" data
  between the interactive nodes of the vu or xe and the MDSS,
  as well as to list files and directories on the MDSS.
- netcp and netmv can be used from within batch jobs to
  - generate a batch script for copying/moving files to the MDSS
  - submit the generated batch script to the special copyq,
    which runs the copy/move job on an interactive node.
- netcp and netmv can also be used interactively
  to save you the work of creating tar files and generating mdss commands.
|
|
|
Using the MDSS - Example
Example 7b
To see these commands in action do
cd /short/$PROJECT/$USER
mdss get Data/data.tar
ls -l
tar xvf data.tar
ls
rm data.tar
mdss mkdir $USER
netmv -t $USER.tar DATA $USER
nqstat
more DATA.o*
mdss sls $USER
mdss rm $USER/$USER.tar
|
|
|
Using /jobfs
- Fast IO local to the compute node
(not shared beyond the 64 cpus of the partition)
- Only available through the queueing system:
  - Request with, for example, -l jobfs=1GB
  - Access via the PBS_JOBFS environment variable
- All files are deleted at end of job. Copy what you
need to /short or other global filesystem in job script.
- Cannot use mdss or netcp commands for
files on /jobfs.
|
|
|
Managing Files between /short, /jobfs and MDSS
Example 7c
Submit a batch job with a /jobfs request, where the job:
- Copies an input file from /short to /jobfs
- Runs a code to use the input file and generate some output
- Saves the output data back to the /short area
- Uses the netcp command to archive the data to the MDSS
First compile the code which reads an input file and generates
a single output file:
cd ~/INTRO_COURSE
ifort -O3 -o jacobs jacob_serial.f
Read the runjobfs script then submit it to the queueing system,
monitor the job with nqstat, and examine the batch job output files:
qsub runjobfs
nqstat
cat runjobfs.e*
cat runjobfs.o*
Check out the output file that this job created on /short and the copy
on the MDSS
cd /short/$PROJECT/$USER
ls -ltr
less save_data.o*
mdss sls $USER
mdss rm -r $USER
|
|
|
Example job script using /jobfs
#!/bin/bash
#PBS -q express
#PBS -l walltime=2:00
#PBS -l jobfs=10mb
#PBS -l vmem=30mb
cd $PBS_JOBFS
echo "Copying files from home directory to the local directory"
cp $HOME/INTRO_COURSE/input.1 .
cp $HOME/INTRO_COURSE/jacobs .
# Run program and write an output file to the local disk.
time ./jacobs < input.1 > output$PBS_JOBID 2>&1
# Move output data to /short space.
mv output$PBS_JOBID /short/$PROJECT/$USER
echo "The output files are now on my /short space."
# Archive to MDSS using netcp
cd /short/$PROJECT/$USER
netcp -N save_data output$PBS_JOBID $USER/output$PBS_JOBID
|
|
|
Parallelisation
- Writing parallel code (OpenMP, MPI)
- Automatic parallelisation with Intel compiler
- Running OpenMP code
- Running MPI code
- Debugging
|
|
|
Parallelism
|
|
|
An Accountant's View of the Parallel Economy
Example:
12% serial code (not parallelisable)
88% parallel code
| No. of CPUs | Walltime Serial Part | Walltime Parallel Part | Total Walltime | Total Cost = Walltime * ncpus |
| 1 | 12 | 88 | 100 | 100 |
| 2 | 12 | 44 | 56 | 112 |
| 4 | 12 | 22 | 34 | 136 |
| 8 | 12 | 11 | 23 | 184 |
| 16 | 12 | 5.5 | 17.5 | 280 |
|
|
|
Writing parallel code
Popular models of parallel computation:
- OpenMP
- MPI (Message Passing Interface)
- HPF (not discussed today)
- POSIX threads (not discussed today)
|
|
|
OpenMP Advantages
- Correct single-thread programs can be parallelised incrementally.
=> Sequence of increasingly more efficient parallel
programs that are always correct.
- Only one version of the source code to maintain for both single-thread
and multi-thread operation.
|
|
|
OpenMP Disadvantages
- OpenMP operation is restricted to shared-memory systems.
- OpenMP directives are usually restricted to the parallelisation of
loops or systems of loops. Other forms of OpenMP parallelisation
usually break the single source-code advantage above.
Note that shared memory parallelism may be relevant for multicore architectures.
|
|
|
Running OpenMP code
Example 8
For Fortran users look at matmul_omp.f
ifort -O3 -openmp matmul_omp.f -o matmul_omp
setenv OMP_NUM_THREADS 2
time matmul_omp
setenv OMP_NUM_THREADS 4
time matmul_omp
- These timings may not scale as expected because of conflicts with
  other users on the interactive nodes.
- Do the timings by submitting a batch job.
  - Edit the script runompjob
  - Submit it to the queuing system.
  - The timings will appear in runompjob.e**** and
    other output in runompjob.o****.
For C programmers, compile with
icc -openmp matmul_omp.c -lm -o matmul_omp
then as above.
|
|
|
Automatic parallelisation
Example 9
The code jacob_serial.f is a serial program which solves the Helmholtz
equation using the Jacobi method. To see how this code runs do
ifort -O3 -o jacobs jacob_serial.f
time jacobs < input.1
Now see what the automatic parallelisation can achieve. Do
ifort -o jacobp -O3 -parallel jacob_serial.f
setenv OMP_NUM_THREADS 2
time jacobp < input.1
setenv OMP_NUM_THREADS 4
time jacobp < input.1
C programmers can look at matmul.c. Do
icc -o matmulp -parallel matmul.c -lm
setenv OMP_NUM_THREADS 2
time matmulp
setenv OMP_NUM_THREADS 4
time matmulp
|
|
|
MPI
An MPI application can be viewed as several copies of the same program running
on individual but interconnected computers.
Each copy knows its own process number (often referred to
as id or rank), so the instruction flow can be modified accordingly.
ADVANTAGES
- MPI programs can be run on any computer equipment
- The programming model is simple and the parallelism is
explicit (no "gotchas")
- Virtually any application can be parallelised in some way
with MPI
- Performance/scaling is usually better
DISADVANTAGES
- Often requires complete recoding/redesign of programs
for data decomposition - no incremental parallelism
- Low level functionality - can be tedious for some problems (can
be alleviated with good programming)
|
|
|
Running MPI code
Example 10
There are four MPI example codes in the INTRO_COURSE directory.
To see a simple MPI code run:
mpif90 mpiexample1.f -o mpiexample.exe
mpirun -np 4 mpiexample.exe
or for a more complicated example:
mpif90 mpiexample2.f -o mpiexample.exe
mpirun -s -n 4 mpiexample.exe
mpirun is the usual command to start an MPI program. See
man mpirun
for further details on usage.
C code simple example:
mpicc mpiexample3.c -o mpiexample.exe
mpirun -np 4 mpiexample.exe
and for a more complicated code:
mpicc mpiexample4.c -o mpiexample.exe
mpirun -n 4 mpiexample.exe
The job script runmpijob can be used to submit an MPI job
to the batch queues.
|
|
|
MPI vs OpenMP - which should I use?
OpenMP:
- Shared data (good and bad)
- Supposedly "simpler" and easier to use
- Incremental parallelism (use existing codes)
- Scalability?
- Only on shared memory SMPs or multicore
- Needs a compiler with OpenMP support (proprietary compilers, or gcc from version 4.2)
MPI:
- Shared nothing!
- Programming model clearer
- Can be tedious for some distributed data
- "All or nothing" (start from scratch)
- Portable - "runs on anything"
- Download free library or use vendor specific library
|
|
|
idb
Example 11
First look at the Intel debugger on a sequential program.
The following shows some introductory aspects of idb in action.
C:
> icc -g matmul.c -lm
> idb a.out
Linux Application Debugger for Itanium(R)-based applications, Version 8.1-10, Build 20050429
------------------
object file name: a.out
Reading symbolic information from /home/900/mhk900/INTRO_COURSE/a.out...done
(idb) help
(idb) stop in main
(idb) run
(idb) list 5:10
(idb) print i
(idb) print j
(idb) step
(idb) print i
(idb) print j
(idb) step
(idb) print i
(idb) print j
(idb) where
(idb) cont
(idb) quit
Fortran:
> ifort -g matmul.f
> idb a.out
Linux Application Debugger for Itanium(R)-based applications, Version 8.1-10, Build 20050429
------------------
object file name: a.out
Reading symbolic information from /home/900/mhk900/INTRO_COURSE/a.out...done
(idb) list
(idb) stop at 8
(idb) run
(idb) print i
(idb) print j
(idb) where
(idb) cont
(idb) quit
There is a graphical interface for idb which can be invoked by idb -gui a.out
|
|
|
Totalview
Example 12
- Totalview can be used to debug sequential or parallel programs.
- Follow the instructions on the software web page.
- Use one of the MPI examples, insert breakpoints and look at the
  current values of variables.
|
|
|