National Computational Infrastructure
NCI National Facility

Contents

  • Introduction to the Sun Constellation VAYU Cluster and SGI XE Cluster
  • Software Development at NCI NF

Introduction to the Sun Constellation VAYU Cluster and SGI XE Cluster

Getting Started

Connecting to the Sun Constellation VAYU Cluster and SGI XE Cluster

Once you have obtained an account via one of the mechanisms on the Accounts page, you will be sent an initial email informing you of your login name and project code.

The respective hostnames of the Sun Constellation VAYU Cluster and SGI XE Cluster are

vayu.nci.org.au
xe.nci.org.au

You can use secure shell (ssh) to connect to the Sun Constellation VAYU Cluster and SGI XE Cluster.

See the Software Page for the details of other Network Access software available.

If you are connecting for the first time, please change your initial password to one of your own choosing via the passwd command, which will prompt you as below. (Note that the % is the command prompt supplied by the interactive "shell", as in all examples in this document - it is not something you type in.)

     % passwd
     Old password:
     New password:
     Re-enter new  password:
Changing your password on either machine will also change it on the other machine.

If you have any problems with starting up new windows or using editors please see the Frequently Asked Questions.

Interactive Use and Basic Unix

The operating system on all systems is Unix. A basic guide to Unix operating system commands is available HERE.

When you log in you will come in under the Resource Accounting SHell (referred to as RASH), a local shell used to impose interactive limits and account for the time used in each interactive session.

Your account will be set up with an initial environment via a default .login file, and an equivalent .profile file, as well as a .rashrc file. The .rashrc file can be edited to change the default project (see Project Accounting) and the command interface shell to be started by RASH as you login. Your initial command interface shell will be the tcsh. You can change this to bash by changing the line in .rashrc from

     setenv SHELL /bin/tcsh
to be
     setenv SHELL /bin/bash
instead. Other shells, including ksh, are available but may not provide the same support for modules as tcsh and bash. A local modification has been made for ksh; details are here. If you try to use a shell not registered with RASH on the particular machine, you will default to tcsh.

Each interactive process you run has imposed on it a time limit and a memory use limit. To see what these limits are enter the command nf_limits. This shows not only the details of the memory limits and time limits for interactive processes, but for batch jobs as well. The limits are not published here as they are liable to change, and it is also possible to vary these limits on an 'as needs' basis by project or user.

Project Accounting

All use of the compute systems is accounted against projects. Each project has a single grant of time per 3 month quarter, which can be used on one or the other, or both, of the compute systems. (The grant is NOT per machine, but rather may be used wherever you choose.)

If your username is connected to more than one project, you are prompted at login for the project to charge each session to. A default project can be set to avoid this typing. Batch job usage will also be charged to the project chosen at login, unless you specify a different project on the qsub command line or within the batch job script file.

To change or set the default project, edit your .rashrc file in your home directory, and change the PROJECT variable as desired. The correct syntax is

    setenv PROJECT x99
    
To avoid being prompted for a project at login time, add to .rashrc the line
    setenv PROMPT_FOR_PROJECT 0
    
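Putting these settings together, a .rashrc that selects bash, sets a default project and suppresses the login prompt might look like the following (x99 is a placeholder project code):

     setenv SHELL /bin/bash
     setenv PROJECT x99
     setenv PROMPT_FOR_PROJECT 0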

Details on setting up a non-interactive login with a selected project are given here.

For projects allocated time under the Merit Allocation or Partner shares, it is possible to keep submitting jobs to the queues after the project grant is exhausted. The jobs will run at a lower priority.

Monitoring Resource Usage

  • nf_limits displays imposed limits and charging rates relevant to the machine it is run on.

  • quotasu -P project -h displays the usage of the project in the current quarter, as well as some recent history of the project if available. Total usage is shown across both machines, but it is also possible to see the usage per queue on each machine.

  • quota -v -P project displays your disk usage and quota in your home directory and the project usages in both the /short/<proj>/ directories and on the Massdata Storage System for the projects which you are connected to. See quota -h for details of other options for the command. More details on managing group ownership and project quotas are available via the faq.

  • quotamd -h -P project displays the quarterly history of MDSS requests, grants and usage. The usage is the maximum recorded in any quarter. It is also possible to see the storage history for other storage media by specifying -m media. At the time of writing, the valid media are "disk" for dc:/projects, "gdata" for the Global Lustre filesystem, and "database" for database storage. Leave out the "-h" to see the most recent recorded request, grant and usage data for the project and media specified.

  • nqstat displays status of all running and queued batch jobs.
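As an illustration, a quick check of a project's standing might combine these commands (x99 is a placeholder project code; the output is machine-specific):

     % nf_limits
     % quotasu -P x99 -h
     % quota -v -P x99
     % nqstat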

Software Environments

Environment Modules are available on vayu and XE to allow easy customisation of your shell environment to the requirements of whatever software you wish to use. The module command syntax is the same no matter which command shell you are using.
module avail will show you a list of the software environments which can be loaded via a module load package command. module help package should give you a little information about what the module load package will achieve for you. Alternatively module show package will detail the commands in the module file.
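For example, a typical module session might look like the following, where package stands for any name listed by module avail:

     % module avail
     % module show package
     % module load package
     % module list
     % module unload package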



PBS Batch Use

Most jobs will require greater resources than are available to interactive processes. Larger jobs must be scheduled by the batch job system (which however does allow an interactive mode). The batch system software in use on both machines is a locally modified version of Portable Batch System (PBS)(1), a queueing system similar to NQS. You submit jobs to PBS specifying the number of CPUs, the amount of memory, and the length of time needed (and, possibly, other resources). PBS runs the job when the resources are available, subject to constraints on maximum resource usage.

1. This product includes software developed by NASA Ames Research Center, Lawrence Livermore National Laboratory, and Veridian Information Solutions, Inc. Visit the OpenPBS site for OpenPBS software support, products, and information.
 

Basic commands

The basic PBS commands are the same on both systems.
qstat
Standard queue status command supplied by PBS. See man qstat for details of options. (But see the local nqstat command below.)
nqstat
Local version of qstat. The queue header of nqstat gives the limit on wall clock time and memory for you and your project. The fields in the job lines are fairly straightforward.
qdel jobid
Delete your unwanted jobs from the queues. The jobid is returned by qsub at job submission time, and is also displayed in the nqstat output.
qsub
Submit jobs to the queues. The simplest use of the qsub command is typified by the following example (note that there is a carriage-return after -wd and ./a.out):

   % qsub -P a99 -q normal -l walltime=20:00:00,vmem=300MB -wd
   ./a.out
   ^D     (that is control-D)
or
   % qsub -P a99 -q normal -l walltime=20:00,vmem=300MB -wd jobscript
where jobscript is an ascii file containing the shell script to run your commands (not the compiled executable which is a binary file). More conveniently, the qsub options can be placed within the script to avoid typing them for each job:
   #!/bin/csh
   #PBS -P a99 
   #PBS -q normal 
   #PBS -l walltime=20:00:00,vmem=300MB 
   #PBS -wd
   ./a.out
You submit this script for execution by PBS using the command:
   % qsub jobscript

You may need to enter data to the program, and may be used to doing this interactively when prompted by the program. There are two ways of doing this in batch jobs. Suppose, for example, the program requires the numbers 1000 then 50 to be entered when prompted. You can either create a file called, say, input containing these values

   % cat input
   1000
   50
then run the program as
   ./a.out < input
or the data can be included in the batch job script as follows:
   #!/bin/csh
   #PBS -P a99 
   #PBS -q normal 
   #PBS -l walltime=20:00:00,vmem=300MB 
   #PBS -wd
   ./a.out << EOF 
   1000
   50
   EOF

Notice that the PBS directives are all at the start of the script, that there are no blank lines between them, and there are no other non-PBS commands until after all the PBS directives.

qsub options of note:

-P project
The project which you want to charge the jobs resource usage to. The default project is specified by the PROJECT environment variable.
-q queue
Select the queue to run the job in. The queues you can use are listed by running nqstat.
-l walltime=??:??:??
The wall clock time limit for the job. Time is expressed in seconds as an integer, or in the form:
[[hours:]minutes:]seconds[.milliseconds]
System scheduling decisions depend heavily on the walltime request - it is always best to make as accurate a request as possible.
-l vmem=???MB
The total (virtual) memory limit for the job - can be specified with units of "MB" or "GB" but only integer values can be given. There is a small default value.
Your job will only run if there is sufficient free memory, so making a sensible memory request will allow your jobs to run sooner. A little trial and error may be required to find how much memory your jobs are using - nqstat lists jobs' actual usage.
-l ncpus=?
The number of cpus required for the job to run on. The default is 1.
-lncpus=N - If the number of cpus requested, N, is small (currently 8 or less on NF systems) the job will run within a single shared memory node. If the number of cpus specified is greater, the job will (probably) be distributed over multiple nodes. Currently on NF systems, these larger requests are restricted to multiples of 8 cpus.
-lncpus=N:M - This form requests a total of N cpus with (a multiple of) M cpus per node. Typically, this is used to run shared memory jobs where M=N and N is currently limited to 8 on NF systems.
-l jobfs=???GB
The requested job scratch space. This will reserve disk space, making it unavailable to other jobs, so please do not overestimate your needs. Any files created in the $PBS_JOBFS directory are automatically removed at the end of the job. Ensure that you use integers, and units of mb, MB, gb, or GB.
-l software=???
Specifies licensed software the job requires to run. See the software page for the string to use for specific software. The string should be a colon-separated list (no spaces) if more than one software product is used.

If your job uses licensed software and you do not specify this option (or mis-spell the software), you will probably receive an automatically generated email from the license shadowing daemon (see man lsd), and the job may be terminated. You can check the lsd status and find out more by looking at the URL mentioned in man lsd.

-l other=???
Specifies other requirements or attributes of the job. The string should be a colon separated list (no spaces) if more than one attribute is required. Generally supported attributes are:
  • iobound - the job should not share a node with other IO bound jobs
  • mdss - the job requires access to the MDSS (usually via the mdss command). If MDSS is down, the job will not be started.
  • pernodejobfs - the job's jobfs resource request should be treated as a per node request. Normally the jobfs request is for total jobfs summed over all nodes allocated to the job (like vmem). Only relevant to distributed parallel jobs using jobfs.
You may be asked to specify other options at times to support particular needs or circumstances.

-r y
Specifies that your job is restartable: if the job is executing on a node when it crashes, the job will be requeued. Both the resources used by and the resource limits set for the original job will carry over to the requeued job. Hence a restartable job must checkpoint itself so that it can still complete in the remaining walltime should it suffer a node crash.

The default is that jobs are assumed to not be restartable. Note that regardless of the restartable status of a job, time used by jobs on crashed nodes is charged against the project they are running under, since the onus is on users to ensure minimum waste of resources via a checkpointing mechanism which they must build into any particularly long running codes.

-wd
Start the job in the directory from which it was submitted. Normally jobs are started in the user's home directory.

Look at the qsub and pbs_resources man pages for complete details of all options. Note that -l options may be combined as a comma-separated list with no spaces, eg. -lvmem=500mb,walltime=20:00.
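Combining several of the options above, a fuller submission might look like the following sketch (a99, the resource values and the abc software string are placeholders - check the software pages for the correct string for your application):

     % qsub -P a99 -q normal -l walltime=10:00:00,vmem=2GB,ncpus=8,software=abc,other=mdss -wd jobscript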

qps jobid
show the processes of a running job
qls jobid
list the files in a job's jobfs directory
qcat jobid
show a running job's stdout, stderr or script
qcp jobid
copy a file from a running job's jobfs directory

The man pages for these commands on the system detail the various options you will probably need to use.
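For example, a session keeping an eye on a running job might look like this (12345 is a placeholder jobid, as returned by qsub):

     % nqstat
     % qps 12345
     % qcat 12345
     % qls 12345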

Interactive PBS Jobs

The qsub -I option will result in an interactive shell being started out on the batch cpu[s] once your job starts. A submission script cannot be used in this mode - you must provide all qsub options on the command line.

Your job is subject to all the same constraints and management as any other job in the same queue. In particular, it will be charged on the basis of walltime, the same as any other batch job, since you will have dedicated access to the cpus reserved for your request. Don't forget to exit your interactive batch session, both to avoid leaving cpus idle on the machine and to avoid being charged for idle time!

Interactive batch jobs are likely to be used for debugging large or parallel programs etc. Since you want interactive response, it may be necessary to use the express queue to run immediately and avoid your session being suspended. However the express queue attracts a higher charging rate so don't leave the session idle.

To use an X display in an interactive batch job, use ssh to login to the vayu or XE (do not change the DISPLAY variable ssh provides) and then submit your job with at least the following options:

    % qsub -I -q express -v DISPLAY

Common Problems

See the faq for the resolution of common problems on the systems.


Queues and Scheduling

Queue Structure

The systems have a simple queue structure with two main levels of priority; the queue names reflect their priority. There is no longer a separate queue for the lowest priority "bonus jobs": these are submitted to the other queues, and PBS lowers their priority within the queues.
express:
  • high priority queue for testing, debugging or quick turnaround
  • charging rate of 3 SUs per processor-hour (walltime)
  • small limits particularly on time and number of cpus

normal:
  • the default queue designed for all production use
  • charging rate of 1 SU per processor-hour (walltime)
  • allows the largest resource requests

copyq:
  • specifically for IO work, in particular, mdss commands for copying data to the mass-data system.
    • Note: always use -l other=mdss when using mdss commands in copyq, so that jobs only run when the MDSS system is available.
  • runs on nodes with external network interface(s) and so can be used for remote data transfers (you may need to configure passwordless ssh).
  • tarring, compressing and other manipulation of /short files can be done in copyq.
  • purely compute jobs will be deleted whenever detected.
Apart from copyq jobs, job charging is based on the product of the wall clock time used and the number of cpus requested. copyq jobs are charged based on the cputime used by the job.

bonus time
Most projects can continue to submit jobs when their account is exhausted - such jobs are called "bonus jobs". These can be submitted to the normal queue, but will not run in the express queue.

bonus jobs:
  • queue at a lower priority than other jobs and will generally only run if there are no non-bonus jobs
  • are more suspendable than non-bonus jobs
  • make use of otherwise idle cycles while minimally hindering other jobs
  • may be terminated if they are impeding normal jobs or for system management reasons (usually jobs are just suspended)

Queue Limits

The version of PBS used on NF systems has been modified to include customisable per-user/per-project limits:
  • All limits can be (and are intended to be) varied on a per-user or per-project basis - reasonable variation requests will be granted where possible.

  • Resources on the system are strictly allocated with the intent that if a job does not exceed its resource (time, memory, disk) requests, it should not be unduly affected by other jobs on the system. The converse of this is that if a job does try to exceed its resource requests, it will be terminated.

The queue configuration and default limits are subject to change, as we need to respond to the demand on the system and try to deliver the fairest scheduling while allowing as many jobs to be queued per project as possible. The limits on the queues also vary from system to system. The command nf_limits -P project is available on EACH of the systems to allow users to see what limits apply to their username and project combination on the particular machine. If -P project is not specified, the environment variable PROJECT is assumed.

The nf_limits command returns the limits for maximum number of CPUs queued, maximum number of CPUs per job, and the maximum memory and maximum walltime for each PBS queue. As memory and walltime limits depend on the number of CPUs of the job, it is necessary to use nf_limits -n ncpus to determine the limits of a job requesting ncpus to run.

The maximum number of CPUs queued shown is the number if all jobs are single cpu jobs. If all jobs are parallel jobs using an even number of cpus, they may queue up double that number of CPUs. See the notes which form part of the nf_limits output, and also man nf_limits.

An example of the queues available and an indication of the limits which may apply on vayu is available HERE.

Scheduling issues

The scheduling algorithm used on NCI-NF is somewhat complicated but its aims are to:
  • promote large scale parallel use of the Facility
  • allow equal access to resources for all users independent of their "share" or grant
  • provide good turnaround for all users
  • minimize the impact of jobs on one another
Some of the features of the scheduler designed to achieve these aims are:
  • resources are strictly allocated so jobs will not start unless there is sufficient free memory and jobfs (as well as cpus).
  • queued jobs are shuffled so that jobs from different users and projects are "interleaved". This means your first job should appear near the top of the queue even if there are many jobs in the queue as reported by nqstat.
  • running jobs can be suspended to allow express and parallel jobs to run. Long jobs and jobs belonging to users/projects with lots of other running jobs are most "suspendable" but any job can be suspended. The fraction of time a job can be suspended is heavily limited.
From a user's perspective, it is very important that you minimize your requests for resources (i.e. walltime, memory and jobfs). Otherwise your job may be queued or suspended longer than necessary. Of course, make sure you ask for sufficient resources - a little experimentation in the express queue might help.

Further details on the scheduling policy and algorithm are available. Don't hesitate to contact us if you wish to query, or have comments or suggestions about, the queues and scheduling.



File Systems

A number of file systems are available, each with a different purpose - the appropriate file system should be used whenever possible.

As well as the generally available filesystems listed below, there may be high performance filesystems, utilities or techniques available to improve the IO performance of your workload. Please contact us if you think this may be relevant to you.

The file systems currently generally available, listed in order of most permanent and backed up to most transient and NOT backed up, are:

home directories

  • Intended to be used for source code, executables and irreproducible data (input files etc), NOT large data sets. Note that /home on vayu and the XE are quite separate systems.
  • Globally accessible from all nodes within a system.
  • Backed up on a regular basis.
  • Quotas apply - use quota -v on each machine to see your disk quota and usage, and see the Disk Quota Policy document for details of the ramifications of exceeding the quotas.
  • Requests for an increase in your quota will be considered.

massdata

  • Intended to be used for archiving large data files particularly those created or used by batch jobs. (It is a misuse of the system to try to store large numbers of small files - please do NOT do this. See the netcp -t command option below.)
  • Each project has a directory on the Mass Data Storage System (MDSS) with pathname /massdata/projectid on that system. This path CANNOT be directly accessed from either of the compute systems.
  • Remote access to your massdata directory is by the mdss utility or the netcp and netmv commands (see man mdss/netcp/netmv for full details.) The mdss commands operate on files in that remote directory.
    mdss:
    put - copy files to the MDSS
    get - copy files from the MDSS
    mk/rmdir - create/delete directories on the MDSS
    ls - list directories
    netcp/netmv:
    netmv and netcp generate a script, then submit a batch request to PBS to copy files (and directories) from the vayu or XE to the MDSS. netmv additionally removes the files from the vayu or XE if the copy has succeeded.
    -t create a tarfile to transfer
    -z/-Z gzip/compress the file to be transferred
    Please use at least the -t option if you wish to archive a directory structure of numerous small files.
  • Users connected to the project have rwx permissions in that directory and so may create their own files in those areas.
  • NOT to be used as an extension of home directories (files changed/removed on the massdata area are not in general recoverable, as there are no back-ups of previous revisions.)
  • Currently batch jobs (other than copyq jobs) cannot use the mdss utilities.
    • Note: always use -l other=mdss when using mdss commands in copyq, so that jobs only run when the MDSS system is available.
  • Quotas apply - use quota -v on the compute machines to see your MDSS quota and usage. See the Disk Quota Policy document for details of the ramifications of exceeding the quotas.
  • The mdss access is intended for relatively modest mass data storage needs. Users with larger capacity storage or more sophisticated access needs should contact us about getting an account on the data cluster.
  • Users on dc/dcc should consult the MDSS User Guide for more detailed information.
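As a sketch, archiving a large output file from an interactive session might look like the following (big_output.dat and the results directory are placeholders; see man mdss for the exact syntax of each subcommand):

     % mdss mkdir results
     % mdss put big_output.dat results
     % mdss ls results
     % mdss get results/big_output.dat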

/short

  • Intended to be used for job data that must live beyond the lifetime of the job. Note that /short on the vayu and the XE are quite separate file systems.
  • Each project has a directory with pathname /short/projectid on each compute system. Users connected to the project have rwx permissions in that directory and so may create their own files in those areas.
  • Globally accessible from all nodes within a system.
  • NOT backed up - users should save to MDS system as necessary.
  • Quotas apply on a per project basis - use quota -v on each machine to see your disk quota and usage. See the Disk Quota Policy document for details of the ramifications of exceeding the quotas. Also see the faq for suggestions on managing quotas when you are connected to several projects.
  • Note that there are also limits on the number of files (actually inodes) that can be owned by a group (project) on /short. This limit and current usage can be seen using quota -v -s. An excessive number of inodes causes a number of filesystem problems, hence the limit.
  • Files not accessed for 365 days on vayu or the XE are automatically deleted. It is possible to vary this default expiry period on a per project basis. The list of files due for expiry within n days of the current time can be obtained by running
    short_files_report.py -P projdir -E n
    (The value of n should be < 8, as only the next 7 days worth of expiring files is recorded.)
  • Requests for an increase in either the disk quota or the file time limit will be considered.
  • Warning:
    Lots of small IO to /short (or /home) can be very slow and impact other jobs on the system.
    • Avoid "dribbly" IO, eg writing 2 numbers from your inner loop. Writing to /short every second is too often.
    • Avoid frequent opening and closing of files (or other file operations).
    • Use /jobfs (see below) instead of /short for jobs that do lots of file manipulation.
    To achieve good IO performance, try to read or write binary files in large chunks (of around 1MB or greater). Contact us to find out more details of how best to tune your IO.
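The difference is easy to demonstrate with standard Unix tools (this sketch is not NF-specific; dd and wc are standard utilities). Writing an 8 MB file in 1 MB blocks means eight bulk transfers, where the equivalent byte-at-a-time loop would be millions of tiny writes:

```shell
# Write 8 MiB in 1 MiB blocks - a few large transfers,
# not "dribbly" IO of a couple of numbers at a time.
tmpfile=$(mktemp)
dd if=/dev/zero of="$tmpfile" bs=1048576 count=8 2>/dev/null
wc -c < "$tmpfile"
rm -f "$tmpfile"
```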

/jobfs

  • Intended for IO intensive jobs providing scratch space only for the lifetime of the job. (Available to jobs on both compute systems)
  • Allocated by using the -ljobfs=?? option to qsub, eg. -ljobfs=5GB requests 5 Gbytes.
    Use integers and units of mb, MB, gb or GB.
  • Your batch job can access its jobfs via the environment variable PBS_JOBFS. The actual path will usually be /jobfs/jobid but avoid using this directly.
  • Only accessible on the execution node.
  • NOT backed up at all
  • Limited in size only by partition size and other job usage
  • jobfs directories are associated with a currently running job and are automatically deleted at the job's completion.
  • Jobs spanning multiple nodes with local JOBFS space on each node should use the /opt/pbs/bin/pbsdsh -N ... command in the batch script to act on all JOBFS directories, e.g.
              /opt/pbs/bin/pbsdsh -N ls $PBS_JOBFS
    For example, if you want local copies of files generated before the current batch run, you can do the following to make them available in each node's jobfs area.
              /opt/pbs/bin/pbsdsh -N cp original_file $PBS_JOBFS
    Note: don't put any quotes around the command issued under pbsdsh.
  • NOTE: It is not possible to use the netmv command to save data which exists on a /jobfs filesystem - files must be copied to /short first.

fast IO

Users who are dealing with large files in large chunks (i.e. > 1 MB reads and writes) have a number of options available to improve their IO performance. Contact us for assistance in choosing the best options.

/var

  • stdout and stderr of your batch jobs are temporarily stored in files in /var on the executing node.
  • PBS will enforce a limit of 10MB on these files by terminating your job if it exceeds this limit.
  • The message is:
    • keep your stdout/err to a reasonable size
    • redirect stdout/err to a file
    • direct IO explicitly to files
    Doing either of the last two has the added advantage of letting you see your job output while it is running. (Actually the qcat command allows this even for stdout and stderr files in /var.)

/tmp

Traditionally the TMPDIR environment variable is set to /tmp. TMPDIR is used by various commands and programs, perhaps without the user being aware of it; for example, the intermediate files created during compilation are saved to TMPDIR. As the /tmp area is not very large, for interactive use TMPDIR is set to /short/tmp. Batch jobs which need to write scratch files to $TMPDIR MUST request jobfs space, as TMPDIR is then set to $PBS_JOBFS. If jobfs space is not requested, TMPDIR is set to a meaningless path and an error will be generated if the job attempts to use $TMPDIR.
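A minimal job script that uses $TMPDIR for scratch files would therefore request jobfs, for example (a99 and the resource values are placeholders):

     #!/bin/csh
     #PBS -P a99
     #PBS -q normal
     #PBS -l walltime=1:00:00,vmem=300MB,jobfs=2GB
     #PBS -wd
     # with jobfs requested, TMPDIR is set to $PBS_JOBFS for this job
     ./a.out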

Summary

Name(1)             Purpose                               Availability                               Quota(2)    Timelimit
/home/unigrp/user   Irreproducible data eg. source code   Global                                     800MB+      none
massdata            Archiving large data files            External - access using the mdss command   20GB        none
/short/projectid    Large data IO, data maintained        Global                                     80GB        365 days
                    beyond one job
$PBS_JOBFS          IO intensive data, job lifetime       Local to node                              50GB+(3)    Duration of job
/var                PBS spool area for job output         no direct access                           10MB/file   -
/tmp                Avoid - use jobfs or /short instead   -                                          -           -
  1. Each user belongs to at least two Unix groups:
              unigrp - determined by their host institution, and
              projectid(s) - one for each project they are attached to.
  2. These limits can be increased on a per user or per project basis as necessary.
  3. Users request allocation of /jobfs as part of their job submission. The actual disk quota for a particular job is given by the jobfs request.


Using the GPUs

We now provide some GPUs connected to xe.nci.org.au for users with suitable applications.
Documentation is available here.

For compiling and other details, please see the contents listing.

