Accessing Raijin, vayu, xe and dcc
- Where do I go to apply for an account?
- How do I log on to the compute machines?
- How do I change my password?
- I have forgotten my password. How do I have it reset?
- What are the SU units quoted in my time grants?
- How do I open new windows on VU and XE for graphics or an editor?
- Where can I learn about Linux?
- Is there a simple editor on VU and XE?
Problems with batch jobs
- Where do I look for my output?
- Where do I look for error messages if the job doesn't work?
- What does it mean if the error message says /var/spool/pbs/mom_priv/jobs/***.ac-pb.SC: Command not found or mentions Ctrl M?
- The PBS resources I requested are being ignored.
- My batch job is accepted but takes no time and produces no output.
- Why won't my batch job start?
- Why does my batch job get suspended?
- Can I keep submitting jobs when my grant runs out?
- What's wrong with my Gaussian resource requests?
- I'm using job dependencies and my jobs keep getting stuck in Held state. Why?
- My jobs need a much longer runtime than the queue limit - what do I do?
|Where do I go to apply for an account?||Details on applying for accounts are given at the Accounts web page.|
|How do I log on to VU and XE?||You need to use ssh (and, in particular, ssh2) or slogin to access the NCI National Facility computers. Details of how to log on are given in the User Guide. Windows users will need an ssh client such as putty or mobaxterm.|
|How do I change my password?||Use the "passwd" command on any one of the NCI NF systems to change it on all. The instructions for this are in the User Guide.|
|I have forgotten my password. How do I have it reset?||To reset your password, phone us on (02) 6125 3437 [4541 or 5986] during AEST business hours, or email us your mobile phone number along with some extra identifying information.|
|What are the SU units quoted in my time grants?||Compute time allocations are made in Service Units (SU), which reflect both the time used and the priority at which it is used. For normal use, a Service Unit (SU) corresponds to one hour of elapsed time on one processor of the system. A kSU is simply 1000 SU. Note that an 8 CPU job which runs for an elapsed walltime of 1 hour is accounted as using 8 SU.|
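As a quick worked example of the accounting rule above (a POSIX shell sketch; the job size and walltime are illustrative):

```shell
# SU accounting at normal priority:
#   SU charged = (number of CPUs) x (elapsed walltime in hours)
ncpus=8
walltime_hours=1
su=$((ncpus * walltime_hours))
echo "${su} SU"    # an 8-CPU, 1-hour job is charged 8 SU
```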
|How do I open new windows on VU and XE?||
You need to have set up X11 forwarding; how you do this depends on what sort of desktop machine you are using to access VU or XE. Some suggestions for Windows, Linux and Mac are given on the ssh software web page and on the vnc software web page. VNC is generally faster than some of the Windows alternatives such as Cygwin. There is a good tutorial on installing and using Cygwin for X forwarding here. Some Windows users prefer Xming. A good alternative is mobaxterm.
If X forwarding is not enabled you will get errors such as:
xterm Xt error: Can't open display
xterm: DISPLAY is not set
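As a minimal sketch (the helper name and messages are illustrative, not part of the NF environment), you can check from the shell whether a display is available before launching a graphical program:

```shell
# Hypothetical helper: warn when X forwarding is missing before starting a GUI app.
check_display() {
    if [ -z "${1:-}" ]; then
        echo "DISPLAY is not set - reconnect with 'ssh -Y' to enable X forwarding"
    else
        echo "using DISPLAY=$1"
    fi
}

# Pass in the current DISPLAY value (empty if X forwarding is not enabled).
check_display "${DISPLAY:-}"
```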
|Where can I learn about Linux?||We have a rather old user guide here, or you may like to look at the Linux on-line course. The Beginner's Level Course section on fundamental Linux knowledge is a good start. Alternatively, look at the iVEC Linux refresher course here.|
|Is there a simple editor on VU and XE?||There are several editors on VU and XE, such as emacs, vi and nano. Of these, nano is very straightforward, and you do not need to load a module to use it. VPAC provides a good tutorial on editors here.|
|Where do I look for my output?||If your batch job is named, for example, runjob.sh and your output is not redirected in the batch script then your job output will appear in runjob.sh.o**** where the final digits are the job number. The final entries in the .o file give you the details on walltime and virtual memory used by the job.|
|Where do I look for error messages if the job doesn't work?||If your batch script was called runjob.sh then this will be in runjob.sh.e****. There is a limit to the length of this filename so, if you have a particularly long batch script name, it may be truncated in the resulting error and output file names.|
|What does it mean if the error message says ...-pb.SC: Command not found?||
This can happen if:
|The PBS resources I requested are being ignored.||
Your job script should have all the resource request options set at the top, before any shell commands. When specifying a list of resources with the -l option, there must also be no spaces within the comma-separated list. For example, the start of the script should be something like the following (it is case sensitive):
#!/bin/sh
#PBS -Pa99
#PBS -lwalltime=20:00:00,vmem=300MB
#PBS -ljobfs=1GB
#PBS -wd
|My batch job is accepted but takes no time and produces no output.||
This could be due to a couple of problems:
|Why won't my batch job start?||
A batch job requires different resources before it can start running
such as adequate free memory, dedicated processors and licenses
for any licensed software being used. The queue scheduler will not start
your job until it can get all the resources it needs. The priority
for starting up jobs also depends on jobs in the queue submitted by
other members of your project and how much of your grant has been used.
Your job may be held up because another user from your project has queued a job which cannot start and it is above yours in the queue.
Usually the reason your job is not running is shown by qstat -s jobid. If you use licensed software, check the software license web page to see if there are sufficient software licenses for your job to start.
More information on how the scheduler works is available here.
|Why does my batch job get suspended?||
If your job is suspended then, usually, either a higher priority job (e.g. an express queue job) or a
larger parallel job is running on the CPUs your job had been running on. This is
only temporary and is a normal part of scheduling the system - there is nothing
wrong with your job.
Say a large parallel job requires N CPUs to be available before it can start running. Most sites would hold CPUs idle, as running jobs complete, until N CPUs become available. This can leave a lot of CPUs idle for quite a while.
On NF systems, smaller queued jobs are able to "jump" the queued N-CPU parallel job and start early, keeping as many CPUs as possible busy all the time. The scheduler uses a sophisticated algorithm such that when around N CPUs' worth of jobs have jumped ahead and started, some subset of those (and possibly other) running jobs are selected to be suspended to allow the queued parallel job to start. The scheduler tries to limit job suspension to maintain fairness to all jobs and to projects.
In order to optimize CPU utilization on the system, queued jobs are started as soon as possible. This results in small jobs (which can "fill" small "holes" in the available CPUs more easily) generally spending much less time in the queued state. The trade-off is that they potentially spend some time in the suspended state. The NF estimates that job preemption (suspension/resumption) is giving around 25-30% better system utilization than more traditionally scheduled systems. That means everyone's grants are effectively 25-30% larger - we hope you will accept occasional longer suspensions for this win.
Important: suspend time is roughly proportional to the job's requested walltime. It is in your best interest to make your walltime (and other resource) requests as accurate as possible.
There is more explanation in the scheduling policy document.
|Can I keep submitting jobs when my grant runs out?||Yes, you can continue to submit jobs to the batch queues when you have used up all your quarterly grant. These jobs will run at a lower priority than jobs of projects with remaining grant. As a result they may take longer to start and may be suspended more often. However, as time goes by, more and more users are in the same situation.|
|What's wrong with my Gaussian resource requests?||
When submitting Gaussian jobs to PBS, you have to specify resource
requirements in two places. What you specify as qsub options (-lvmem etc.)
is what is reserved for your job's use. Gaussian knows nothing about these
reservations; e.g. specifying qsub -lncpus=4 will not cause
Gaussian to use 4 CPUs. You tell Gaussian what resources to actually use in your
Gaussian input file, with directives like %mem and %nproc.
Naturally, whatever you tell Gaussian to use should fit inside what you have
requested PBS to reserve for your job.
What you specify with -lncpus should be reflected in the Gaussian %nproc line.
For memory requests, the %mem line in your Gaussian input file specifies the work area that Gaussian uses. Gaussian also needs memory for the program, other data and shells etc. To safely allow for this other memory, the %mem value should be about 500-600MB less than the PBS -lvmem value (possibly more in some cases). You can tune this difference with experience by comparing the actual vmem used by your job (see the job stdout file after the job completes) and the %mem request.
You can also specify a maxdisk option to Gaussian - this should reflect your -ljobfs request.
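Putting the answer above together, a matched pair of requests might look like this. This is a sketch only: the project code, module name, file names and memory figures are illustrative, not prescribed values.

```shell
#!/bin/sh
# Hypothetical PBS script for a 4-CPU Gaussian job.
#PBS -Pa99
#PBS -lncpus=4
#PBS -lvmem=4GB,jobfs=10GB,walltime=20:00:00
#PBS -wd
# The matching Gaussian input file would then begin with:
#   %nproc=4      <- matches -lncpus=4
#   %mem=3400MB   <- roughly 600MB below the -lvmem=4GB reservation
# and include maxdisk=10GB in the route section to match -ljobfs.
module load gaussian
g09 < job.com > job.log
```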
|I'm using job dependencies and my jobs keep getting stuck in Held state. Why?||
Job dependencies are only tested at the time of the relevant event
that the dependency is waiting on and at no other time. That event
will cause the dependent job to be released from the held state so
it can be scheduled. In particular, if you submit a pair of jobs
jobid1=`qsub job1`
qsub -Wdepend=after:$jobid1 job2
then the release of job2 happens only when job1 starts. The problem comes if job1 starts before job2 is even submitted - the event to trigger the release of job2 never happens in job2's lifetime.
There are various ways to avoid this race condition. One sure way is to hold job1 until job2 is submitted:
jobid1=`qsub -h job1`
qsub -Wdepend=after:$jobid1 job2
qrls $jobid1
Or you could take a gamble by just delaying the start of job1:
jobid1=`qsub -a now+120 job1`
qsub -Wdepend=after:$jobid1 job2
If there is no need for a dependency, it is best not to force one.
The same problem can arise with other forms of job dependencies.
|My jobs need a much longer runtime than the queue limit - what do I do?||
The vast majority of problems solved on NF systems require runtimes much longer than the
queue walltime limits. But in order to:
a. protect against wasting considerable compute time in the case of a hardware failure or system crash, and
b. provide all users with more continuous progress,
these applications generally employ checkpointing: the process of regularly saving the state of the job to disk. Follow-on jobs can then resume from this checkpointed state. Critically, it is possible to automate the submission of these follow-on jobs so that you effectively submit one long job. Most sophisticated applications support checkpoint-restarts.
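As a sketch of what such automation might look like (the application name, restart flag, file names and resource figures are all hypothetical; how checkpoint files are written depends entirely on your application):

```shell
#!/bin/sh
# Hypothetical self-resubmitting job script, saved as jobscript.sh.
#PBS -Pa99
#PBS -lwalltime=10:00:00,vmem=2GB
#PBS -wd
# Assumed behaviour: my_app periodically saves its state to checkpoint.dat,
# resumes from it on startup, and creates finished.flag when fully done.
./my_app -restart checkpoint.dat
if [ ! -f finished.flag ]; then
    qsub jobscript.sh    # submit a follow-on job to resume from the checkpoint
fi
```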
|How do I acknowledge the National Facility in publications?||
It is a condition of use that use of the NCI National Facility be acknowledged in publications. An acceptable form of words would be:
|What is the in-kind value of an SU (for reporting to ARC, etc)?||This information is available from the 'Cost & ARC Resource Valuation' section of the Accounts page.|
|Why don't I get a core dump file?||
The default coredumpsize is set to zero. If you need to see your core file you can remove that limit by:
unlimit coredumpsize (in tcsh)
ulimit -c unlimited (in bash)
Note that if you are using the Intel fortran compiler you will also need:
|Batch Jobs to Automate Work Flow||To facilitate continued throughput of work on the compute systems, the batch queueing system may be used to great advantage. For detailed instructions and examples follow the link on the left.|