Once you have obtained an account via one of the mechanisms on the
Accounts page, you will be sent an initial email informing you of your
login name and project code.
The respective hostnames of the Sun Constellation VAYU Cluster and SGI XE Cluster are:
vayu.nci.org.au
xe.nci.org.au
You can use secure shell (ssh) to connect to the
Sun Constellation VAYU Cluster and SGI XE Cluster.
See the Software Page for details of other network access
software available.
If you are connecting for the first time, please change your initial
password to one of your own choice with the passwd command,
which will prompt you as shown below. (Note that % is the command
prompt supplied by the interactive "shell", as in all examples in
this document; it is not something you type.)
% passwd
Old password:
New password:
Re-enter new password:
Changing your password on either machine will also change it on the
other machine.
The operating system on all systems is Unix.
A basic guide to Unix operating system commands
is available HERE.
When you log in you are placed in the Resource Accounting SHell
(RASH), a local shell that imposes interactive limits
and accounts for the time used in each interactive session.
Your account will be set up with an initial environment via a default
.login file and an equivalent .profile
file, as well as a .rashrc file. The .rashrc
file can be edited to change the default project (see Project Accounting)
and the command interface shell that RASH starts as you log in.
Your initial command interface shell will be tcsh.
You can change this to bash by changing the
line in .rashrc from
setenv SHELL /bin/tcsh
to be
setenv SHELL /bin/bash
instead. Other shells, including ksh, are available but may
not provide the same support for modules
as tcsh and bash.
If you try to use a shell that is not registered with RASH
for the particular machine, you will default to tcsh.
Each interactive process you run has a time limit and a memory use
limit imposed on it. To see what these limits are, enter the command
nf_limits. This shows the memory and time limits not only for
interactive processes but for batch jobs as well. The limits are not
published here as they are liable to change, and they can also be
varied as needed by project or user.
All use on the compute systems is accounted against projects. Each
project has a single grant of time per 3 month quarter, which can be
used on either or both of the compute systems. (The grant is NOT
per machine; it may be used wherever you choose.)
If your username is connected to more than one project, you are prompted
as you log in for the project to charge each session to.
A default project should be available so that you can avoid typing it.
Batch job usage will also be charged to the project chosen at login,
unless you specify a different project on the
qsub command line or within the batch job script file.
To change or set the default project, edit your
.rashrc file in your home directory, and change the
PROJECT variable as desired. The correct syntax is
setenv PROJECT x99
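The edit can also be scripted. The following is a minimal sketch, assuming .rashrc holds one "setenv NAME value" line per variable as in the examples above. set_rashrc_var is a hypothetical helper name, not a command provided on the systems, and it is written for bash even though .rashrc itself uses csh syntax.

```shell
# Hypothetical helper: rewrite an existing "setenv NAME value" line.
# Usage: set_rashrc_var FILE NAME VALUE
set_rashrc_var() {
    local file=$1 name=$2 value=$3
    # Do nothing (and report failure) if the variable is not already set there.
    grep -q "^setenv ${name} " "$file" || return 1
    # Keep a backup in FILE.bak, then replace the whole line.
    sed -i.bak "s|^setenv ${name} .*|setenv ${name} ${value}|" "$file"
}
```

For example, set_rashrc_var ~/.rashrc PROJECT x99 changes the default project, and set_rashrc_var ~/.rashrc SHELL /bin/bash changes the login shell started by RASH.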
For projects allocated time under the Merit Allocation or Partner
shares, it is possible to keep submitting jobs to the queues after
the project grant is exhausted. The jobs will run at a lower priority.
nf_limits displays imposed limits and charging rates
relevant to the machine it is run on.
quotasu -P project -h displays the usage of the project
in the current quarter, as well as some recent history of
the project if available. Total usage is shown across
both machines, but it is also possible to see the usage
per queue on each machine.
quota -v displays your disk usage and quota in your home
directory and the project usages in both the
/short/<proj>/ directories and on the Massdata
Storage System for the projects which you are connected to.
See quota -h for details of other options for the command.
More details on managing group ownership and project quotas are
available via the FAQ.
nqstat displays status of all running and queued batch
jobs.
Environment Modules are available on vayu and XE
to allow easy customisation of your shell environment to the requirements
of whatever software you wish to use. The module command syntax
is the same no matter which command shell you are using.
module avail will show you a list of the software environments
which can be loaded via a module load package command.
module help package should give you a little information
about what the module load package will achieve for you.
Alternatively, module show package will detail the
commands in the module file.
Most jobs will require greater resources than are available to interactive
processes. Larger jobs must be scheduled by the batch job system
(which does, however, also allow an interactive mode).
The batch system software in use on both machines is a locally modified
version of Portable Batch System (PBS)(1), a queueing system similar to NQS.
You submit jobs to PBS
specifying the number of CPUs, the amount of memory, and the length of
time needed (and, possibly, other resources). PBS runs the job when the
resources are available, subject to constraints on maximum resource usage.
1. This product includes software
developed by NASA Ames Research Center, Lawrence Livermore National
Laboratory, and Veridian Information Solutions, Inc. Visit the
OpenPBS site for
OpenPBS software support, products, and information.
The basic PBS commands are the same on both systems.
qstat
Standard queue status command supplied by PBS. See
man qstat for details of options. (But see the local
nqstat command below.)
nqstat
Local version of qstat. The queue header of nqstat
gives the limit on wall clock time and memory for
you and your project. The fields in the job lines are fairly
straightforward.
qdel jobid
Delete your unwanted jobs from the queues. The jobid is
returned by qsub at job submission time, and is also
displayed in the nqstat output.
qsub
Submit jobs to the queues. The simplest use of the qsub
command is typified by the following example (note that there is
a carriage-return after -wd and ./a.out):
% qsub -P a99 -q normal -l walltime=20:00:00,vmem=300MB -wd
./a.out
^D (that is control-D)
or
% qsub -P a99 -q normal -l walltime=20:00,vmem=300MB -wd jobscript
where jobscript is an ASCII file containing the shell
script that runs your commands (not the compiled executable, which
is a binary file). More conveniently, the qsub options can be
placed within the script to avoid typing them for each job:
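As a sketch only, the options from the command-line example above could be written as #PBS directives as follows. The here-document is just a convenient way to create the file from a bash prompt, and the #!/bin/bash line is an assumption (a tcsh script is laid out the same way).

```shell
# Create a file named jobscript holding the qsub options as #PBS directives.
cat > jobscript <<'EOF'
#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=20:00:00,vmem=300MB
#PBS -wd
./a.out
EOF
```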
You submit this script for execution by PBS using the command:
% qsub jobscript
Your program may need input data that you are used to entering interactively
when prompted. There are two ways of supplying this in batch jobs. Suppose,
for example, that the program requires the numbers 1000 then 50 to be entered when prompted.
You can either create a file called, say, input containing these values
% cat input
1000
50
then run the program as
./a.out < input
or the data can be included in the batch job script as follows:
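A minimal sketch of such a script, assuming the same project, queue and limits as the qsub example above. The inner "<< EOF ... EOF" here-document supplies the two values on the program's standard input; the outer here-document only creates the file, and the #!/bin/bash line is an assumption.

```shell
# Create a job script that feeds 1000 and 50 to the program on standard input.
cat > jobscript <<'END_OF_JOBSCRIPT'
#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=20:00:00,vmem=300MB
#PBS -wd
./a.out << EOF
1000
50
EOF
END_OF_JOBSCRIPT
```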
Notice that the PBS directives are all at the start of the script,
that there are no blank lines between them, and there are
no other non-PBS commands until after all the PBS directives.
qsub options of note:
-P project
The project which you want to charge the job's resource
usage to. The default project is specified by the
PROJECT environment variable.
-q queue
Select the queue to run the job in. The queues
you can use are listed by running nqstat.
-l walltime=??:??:??
The wall clock time limit for the job. Time is
expressed in seconds as an integer, or in the form:
[[hours:]minutes:]seconds[.milliseconds]
-l vmem=???MB
The total (virtual) memory limit for the job. It can be
specified with units of "MB" or "GB", but only integer values
can be given. There is a small default value. Your job will
only run if there is sufficient free memory, so making
a sensible memory request will allow your jobs to run
sooner. A little trial and error may be required to find
how much memory your jobs are using; nqstat
lists each job's actual usage.
-l ncpus=?
The number of cpus required for the job to run on.
The default is 1.
-lncpus=N - If the number of cpus requested,
N, is small (currently 8 or less on NF systems) the job will
run within a single shared
memory node. If the number of cpus specified is greater,
the job will (probably) be distributed over multiple nodes.
Currently on NF systems, these larger requests are restricted to multiples
of 8 cpus.
-lncpus=N:M - This form requests a total of N cpus
with (a multiple of) M cpus per node. Typically, this is used
to run shared memory jobs where M=N and N is currently limited
to 8 on NF systems.
-l jobfs=???GB
The requested job scratch space. This will reserve disk
space, making it unavailable for other jobs, so please do
not overestimate your needs. Any files created in the
$PBS_JOBFS directory are automatically removed at
the end of the job. Ensure that you use integers, and units
of mb, MB, gb, or GB.
-l software=???
Specifies licensed software the job requires to run. See
the software page for the string to
use for specific software. The string should be a colon
separated list (no spaces) if more than one software
product is used.
If your job uses licensed software and you do not
specify this option (or mis-spell the software), you
will probably receive an automatically generated
email from the license shadowing daemon (see
man lsd), and the job may be
terminated. You can check the lsd status and find out
more by looking at the URL mentioned in man lsd.
-l other=???
Specifies other requirements or attributes of the job. The
string should be a colon
separated list (no spaces) if more than one attribute
is required.
Generally supported attributes are:
iobound - the job should not share a node
with other IO bound jobs
mdss - the job requires access to the MDSS
(usually via the mdss command).
If MDSS is down, the job will not be started.
pernodejobfs - the job's jobfs resource request
should be treated as a per node request. Normally the
jobfs request is for total jobfs summed over all nodes
allocated to the job (like vmem). Only relevant to distributed
parallel jobs using jobfs.
You may be asked to specify other options at times to support
particular needs or circumstances.
-r y
Specifies that your job is restartable: if the job is
executing on a node when that node crashes, the job will
be requeued. Both the resources used by the original job and
its resource limits carry over to the requeued job.
Hence a restartable job must checkpoint often enough
that it can still complete within the remaining
walltime should it suffer a node crash.
The default is that jobs are assumed to not be restartable.
Note that regardless of the restartable status of a job,
time used by jobs on crashed nodes is charged against
the project they are running under, since the onus is on
users to ensure minimum waste of resources via a
checkpointing mechanism which they must build into any
particularly long running codes.
-wd
Start the job in the directory from which it was
submitted. Normally jobs are started in the user's
home directory.
Look at the qsub and pbs_resources man pages
for complete details of all options. Note that -l options
may be combined as a comma separated
list with no spaces, e.g. -lvmem=500mb,walltime=20:00.
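The helpers below are a sketch for sanity-checking a request before it is submitted. They implement the walltime format and the ncpus rule described in the options above. walltime_to_seconds and round_ncpus are hypothetical names, not part of PBS, and the node size of 8 is the value quoted above, which may change.

```shell
# Convert a walltime of the form [[hours:]minutes:]seconds[.milliseconds]
# to whole seconds. Any fractional part is dropped.
walltime_to_seconds() {
    local spec=${1%%.*}          # strip the optional .milliseconds
    local h=0 m=0 s=0
    local IFS=:
    set -- $spec                 # split the fields on ':'
    case $# in
        1) s=$1 ;;
        2) m=$1 s=$2 ;;
        3) h=$1 m=$2 s=$3 ;;
        *) return 1 ;;
    esac
    # 10# forces base 10 so that fields such as "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Round an ncpus request up to a value the rule above allows:
# 8 or fewer is kept as is, anything larger becomes a multiple of 8.
round_ncpus() {
    local n=$1
    if [ "$n" -le 8 ]; then
        echo "$n"
    else
        echo $(( (n + 7) / 8 * 8 ))
    fi
}
```

Note that walltime_to_seconds 20:00 prints 1200 (20 minutes), whereas walltime_to_seconds 20:00:00 prints 72000 (20 hours), so the two forms are easy to confuse. round_ncpus 12 prints 16.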
qps jobid
show the processes of a running job
qls jobid
list the files in a job's jobfs directory
qcat jobid
show a running job's stdout, stderr or script
qcp jobid
copy a file from a running job's jobfs directory
The man pages for these commands on the system detail the various options
you will probably need to use.
The qsub -I option will result in an interactive
shell being started on the batch cpu(s) once your job starts.
A submission script cannot be used in this mode; you must provide
all qsub options on the command line.
Your job is subject to all the same constraints and management
as any other job in the same queue.
In particular, it will be charged on the basis of walltime,
the same as any other batch job, since you will have dedicated
access to the cpus reserved for your request. Don't forget to exit
your interactive batch session, both to
avoid leaving cpus idle on the machine and to avoid being
charged for idle time!
Interactive batch jobs are likely to be used for debugging large
or parallel programs etc. Since you want interactive response, it
may be necessary to use the express queue to run immediately and
avoid your session being suspended. However the express
queue attracts a higher charging rate so don't leave the session
idle.
To use an X display in an interactive batch job, use ssh to log in to
vayu or XE (do not change the DISPLAY variable ssh
provides)
and then submit your job with at least the following options:
The systems have a simple queue structure with two main levels of
priority; the queue names reflect their priority. There is no longer
a separate queue for the lowest priority "bonus jobs" as these
are to be submitted to the other queues, and PBS lowers their priority
within the queues.
express:
high priority queue for testing, debugging or quick turnaround
charging rate of 3 SUs per processor-hour (walltime)
small limits particularly on time and number of cpus
normal:
the default queue designed for all production use
charging rate of 1 SU per processor-hour (walltime)
allows the largest resource requests
copyq:
specifically for IO work, in particular, mdss commands for
copying data to the mass-data system.
where relevant copyq jobs run on the /short (and /fast)
server nodes.
runs on nodes with external network interface(s) and so can
be used for remote data transfers (you may need to configure
passwordless ssh).
tars, compresses and other manipulation of /short files
can be done in copyq.
purely compute jobs will be deleted whenever detected.
Apart from copyq jobs, job charging is based on the product of the wall
clock time used and the number of cpus requested. copyq jobs are charged
based on the cputime used by the job.
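As a rough worked example of that product, the sketch below estimates the charge for a job using the rates quoted in the queue descriptions above (3 SUs per processor-hour for express, 1 for normal). job_charge is a hypothetical helper name. Rates can change, so treat the output of nf_limits as authoritative.

```shell
# Estimate the charge in SUs for an express or normal queue job:
# rate x cpus requested x walltime in hours.
# Usage: job_charge QUEUE NCPUS WALLTIME_HOURS
job_charge() {
    local queue=$1 ncpus=$2 hours=$3 rate
    case $queue in
        express) rate=3 ;;
        normal)  rate=1 ;;
        # copyq is charged on cputime, so it has no per-walltime rate here.
        *) echo "no walltime rate for queue '$queue'" >&2; return 1 ;;
    esac
    awk -v r="$rate" -v n="$ncpus" -v h="$hours" \
        'BEGIN { printf "%.1f\n", r * n * h }'
}
```

For example, job_charge normal 8 20 prints 160.0, and a half-hour debugging session on 8 cpus in the express queue, job_charge express 8 0.5, prints 12.0.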
bonus time
Most projects can continue to submit jobs when their
account is exhausted - such jobs are called "bonus jobs"
but are in fact submitted to either of the express
or normal queues.
bonus jobs:
queue at a lower priority than other jobs
and will generally only run if there are no non-bonus jobs
are more suspendable than non-bonus jobs
make use of otherwise idle cycles while
minimally hindering other jobs
may be terminated if they are impeding normal
jobs or for system management reasons (usually jobs are
just suspended)
The version of PBS used on NF systems has been modified to include
customisable per-user/per-project limits:
All limits can be (and are intended to be) varied on a
per-user or
per-project basis - reasonable variation requests will be granted
where possible.
Resources on the system are strictly allocated with the intent
that if a job does not exceed its resource (time, memory, disk)
requests, it should not be unduly affected by other jobs on the
system.
The converse of this is that if a job does try to exceed
its resource requests, it will be terminated.
The queue configuration and default limits are subject to change
as we need to respond to the demand on the system and try to
deliver the fairest system scheduling at the same time as allowing
as many jobs to be queued per project as possible. The limits on
the queues also vary from system to system. The command
nf_limits -P project is available on EACH of the systems
to allow users to see what limits apply to their username and project
combination, on the particular machine. If
used without -P project, the value of the PROJECT
environment variable is assumed.
The nf_limits command returns the limits for maximum
number of CPUs queued, maximum number of CPUs per job, and the
maximum memory and maximum walltime for each PBS queue.
As memory and walltime limits
depend on the number of CPUs of the job, it is necessary to use
nf_limits -n ncpus to determine the limits of a job requesting
ncpus to run.
The maximum number of CPUs queued shown is the number if all jobs are
single cpu jobs. If all jobs are parallel jobs using an even number
of cpus, they may queue up double that number of CPUs. See the notes
which form part of the nf_limits output, and also
man nf_limits.
An example of the queues available and an indication of the limits
which may apply on vayu is available
HERE.
The scheduling algorithm used on NCI-NF is somewhat complicated but
its aims are to:
promote large scale parallel use of the Facility
allow equal access to resources for all users independent
of their "share" or grant
provide good turnaround for all users
minimize the impact of jobs on one another
Some of the features of the scheduler designed to achieve these aims are:
resources are strictly allocated so jobs will not start unless
there is sufficient free memory and jobfs (as well as cpus).
queued jobs are shuffled so that jobs from different users and
projects are "interleaved". This means your first job should
appear near the top of the queue even if there are many jobs
in the queue as reported by nqstat.
running jobs can be suspended to allow express and parallel jobs
to run. Long jobs and jobs belonging to users/projects with lots
of other running jobs are most "suspendable" but any job can be
suspended. The fraction of time a job can be suspended is heavily
limited.
From a user's perspective, it is very important that you minimize your
requests for resources (i.e. walltime, memory and jobfs). Otherwise
your job may be queued or suspended longer than necessary. Of course,
make sure you ask for sufficient resources - a little experimentation
in the express queue might help.
Further details on the scheduling policy and algorithm are
available.
Don't hesitate to contact us if you have queries,
comments or suggestions about the queues and scheduling.
A number of file systems are available, each with a different
purpose - the appropriate file system should be used whenever possible.
As well as the generally available filesystems listed below, there may be
high performance filesystems, utilities or techniques available to improve
the IO performance of your workload. Please contact us
if you think this may be relevant to you.
The file systems currently generally available, listed in order
of most permanent and backed up to most transient and NOT backed up, are:
/home
Intended to be used for source code, executables and irreproducible
data (input files etc.), NOT large data sets. Note that /home
on vayu and /home on the XE are quite separate file systems.
Globally accessible from all nodes within a system.
Backed up on a regular basis.
Quotas apply - use quota -v on each machine to see your
disk quota and usage,
and see the Disk Quota Policy
document for details of the ramifications of exceeding the quotas.
Mass Data Storage System (MDSS)
Intended to be used for archiving large data files,
particularly those created or used by batch jobs. (It is a
misuse of the system to try to store large numbers of small
files; please do NOT do this. See the netcp -t command
option below.)
Each project has a directory on the Mass Data Storage
System (MDSS) with pathname /massdata/projectid
on that system. This path CANNOT be directly accessed
from either of the compute systems.
Remote access to your massdata directory is by the mdss
utility or the netcp and netmv commands (see
man mdss/netcp/netmv for full details.) The mdss
commands operate on files in that remote directory.
mdss:
put - copy files to the MDSS
get - copy files from the MDSS
mk/rmdir - create/delete directories on the MDSS
sls - list directories on the MDSS
netcp/netmv:
netmv and netcp generate a script, then submit a batch
request to PBS to copy files (and directories) from
vayu or the XE to the MDSS. In the case of netmv, the
files are removed from vayu or the XE if the copy has succeeded.
-t create a tarfile to transfer
-z/-Z gzip/compress the file to be transferred
Please use at least the -t option if you wish
to archive a directory structure of numerous small files.
Users connected to the project have
rwx permissions in that directory and so may create their
own files in those areas.
NOT to be used as an extension of home directories
(files changed/removed on the massdata area are not
in general recoverable, as there are no back-ups of previous
revisions.)
Currently batch jobs (other than copyq jobs) cannot use the
mdss utilities.
Quotas apply - use quota -v on
the compute machines to see your MDSS quota and usage.
See the Disk Quota Policy
document for details of the ramifications of exceeding the quotas.
The mdss access is intended for relatively modest
mass data storage needs.
Users with larger capacity storage or more
sophisticated access needs should contact us
to get an account on the data cluster.
Consult the MDS System User Guide for detailed information.
In particular attention is drawn to the fact that there are
currently no off-site copies of MDSS data and the potential
for data loss in the event of a catastrophe.
/short
Intended to be used for job data that must live beyond the lifetime
of the job. Note that /short
on vayu and /short on the XE are quite separate file systems.
Each project has a directory with pathname
/short/projectid on each compute system.
Users connected to the project have rwx permissions in
that directory and so may create their own files in those areas.
Globally accessible from all nodes within a system.
NOT backed up - users should save to MDS system as necessary.
Quotas apply on a per project basis - use quota -v on
each machine to see your disk quota and usage. See the
Disk Quota Policy
document for details of the ramifications of exceeding the quotas.
Also see the faq for suggestions
on managing quotas when you are connected to several projects.
Note that there are also limits on the number of files (actually
inodes) that can be owned by a group (project) on /short. This
limit and current usage can be seen using quota -v -s.
An excessive number of inodes causes a number of filesystem
problems, hence the limit.
Files not accessed for 60 days are automatically deleted
without warning.
Requests for an increase
in either the disk quota or the file time limit will be considered.
Warning:
Lots of small IO to /short (or /home) can be very slow and impact
other jobs on the system.
Avoid "dribbly" IO, eg writing 2 numbers from your inner
loop. Writing to /short every second is too often.
Avoid frequent opening and closing of files
(or other file operations).
Use /jobfs (see below) instead of /short
for jobs that do lots of file manipulation.
To achieve good IO performance, try to read or write binary files
in large chunks (of around 1MB or greater). For more
details on how best to tune your IO, contact us.
/jobfs
Intended for IO intensive jobs, providing scratch space only for
the lifetime of the job. (Available to jobs on both compute systems.)
Allocated by using the -ljobfs=?? option to qsub,
eg. -ljobfs=5GB requests 5 Gbytes.
Use integers and units of mb, MB, gb or GB.
Your batch job can access its jobfs via the environment variable
PBS_JOBFS.
The actual path will usually be /jobfs/jobid
but avoid using this directly.
Only accessible on the execution node.
NOT backed up at all
Limited in size only by partition size and other job usage
jobfs directories are associated with currently running
jobs and are automatically deleted at the job's completion.
Jobs spanning multiple nodes with local JOBFS space on each node
should use the /opt/pbs/bin/pbsdsh -N ... command in the
batch script to act on all JOBFS directories, e.g.
/opt/pbs/bin/pbsdsh -N ls $PBS_JOBFS
For example, if you want local copies of files generated before the
current batch run, you can make them available
in each node's jobfs area.
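A sketch of such a staging step, assuming that pbsdsh -N runs the given command once on every node of the job, as in the ls example above. stage_to_jobfs is a hypothetical helper, the /short path in the usage line is a placeholder, and the PBSDSH variable exists only so that the function can be tried outside a batch job.

```shell
# Hypothetical helper: copy each named file into the jobfs directory on
# every node allocated to the job.
stage_to_jobfs() {
    local pbsdsh=${PBSDSH:-/opt/pbs/bin/pbsdsh}
    local f
    for f in "$@"; do
        # Stop at the first copy that fails.
        "$pbsdsh" -N cp "$f" "$PBS_JOBFS" || return 1
    done
}
```

Inside a batch script this might be called as stage_to_jobfs /short/a99/restart.dat before the main program starts.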
Users who are dealing with large files in large chunks (i.e.
> 1 MB reads and writes) have a number of options available to them
to improve their IO performance.
Contact us for assistance in choosing the best options.
stdout and stderr of your batch jobs are temporarily
stored in files in /var on the executing node.
PBS will enforce a limit of 10MB on these files by terminating your
job if it exceeds this limit.
The message is
To stay under this limit you can:
keep your stdout/err to a reasonable size
redirect stdout/err to a file
direct IO explicitly to files
Doing either of the last two has the added advantage of letting you
see your job output while it is running. (Actually the qcat
command allows this even for stdout and stderr files in /var.)
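Redirecting stdout and stderr to a file can be sketched as below. run_logged is a hypothetical wrapper name, and the log file name in the usage line is a placeholder.

```shell
# Run a command with both stdout and stderr sent to a log file,
# so that neither stream accumulates in /var on the executing node.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    local log=$1
    shift
    "$@" > "$log" 2>&1
}
```

In a job script this would be used as run_logged a.out.log ./a.out, which is equivalent to writing ./a.out > a.out.log 2>&1 directly.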
Traditionally the TMPDIR environment variable is set to /tmp. TMPDIR
is used by various commands and programs, perhaps without users
being aware of it; for example, the intermediate files created during
compilation are saved to TMPDIR. As the /tmp area is not very large,
TMPDIR is set to /short/tmp for interactive use. Batch jobs that
need to write scratch files to $TMPDIR MUST request jobfs space,
as TMPDIR is then set to $PBS_JOBFS. If jobfs space is not requested,
TMPDIR
is set to a meaningless path and an error will be generated if the job
attempts to use $TMPDIR.
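A batch script can check for this situation before any tool touches $TMPDIR. The sketch below assumes that the "meaningless path" is simply one that does not exist as a writable directory. tmpdir_usable is a hypothetical name.

```shell
# Succeed only if TMPDIR is set and names a writable directory.
tmpdir_usable() {
    [ -n "${TMPDIR:-}" ] && [ -d "$TMPDIR" ] && [ -w "$TMPDIR" ]
}
```

Near the top of a job script, a line such as tmpdir_usable || { echo "TMPDIR unusable: request jobfs with -l jobfs" >&2; exit 1; } stops the job early with a clear message instead of failing later inside a compiler or library.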
Each user belongs to at least two Unix groups: unigrp - determined by their host institution, and projectid(s) - one for each project they are attached to.
These limits can be increased on a per user or per project basis as
necessary.
Users request allocation of /jobfs as part of their job
submission. The actual disk quota for a particular job is given by
the jobfs request.
For compiling and other details, please see the contents listing.