National Computational Infrastructure
NCI National Facility
Newsletter July 2010


Vayu is complete

The final stages of the installation of Vayu, the Oracle/Sun Constellation cluster, were carried out without any real interruption to the production service, so some users may not be aware that Vayu was completed on April 6th. Since then it has been running with the full complement of over 11900 cpus (cores) in 1488 nodes, all connected by a fully non-blocking Infiniband switch. Full details of the system hardware appear here.

Help arrangements

The NF help email address receives many emails a day. You can help us to help you by supplying as much information as possible in your initial email. In particular we need to know your userid, which project you are using, and whether you are on vu or xe. If the problem arises from a batch job we need to know where your batch script is and where the error and output files are. It is easier and better for us to read these files directly than to receive them as email attachments.

Officially we are available to answer help emails in working hours in Canberra - see the getting help page for the official position on National Facility support. In fact you may well get a reply outside of these hours since many of us read emails out of hours (but please do not depend on getting a response at 11 pm on a Saturday night if you want the walltime of a job extended and it is about to run out in 10 minutes!)

Please also note that the range of questions sent to the help address is very large and some will need to be answered by the NF specialist in that particular area. We will let you know if that person is not available and if there is going to be a delay in replying to your email in full.

Finally, even if you have been in a long correspondence with one particular NF staff member, it is best to mail the help address if you have a new question. This means that you will be answered even if that staff member is on leave, and it also adds to the general knowledge base of user problems.

/home quotas

New users to the National Facility may have come from other facilities where their /home quota is larger than the default quota given to all new accounts on vu and xe. Note that you also have access to a much larger file space under /short/PROJECT (along with all other members of the project). You do need to put some thought into what goes where. All files on /home are backed up every evening and if you accidentally remove a file it can be retrieved from backup. This is not the case for /short, so any important files there that you cannot easily recreate should be copied to the Mass Data Store.

The initial quota can be increased if, for example, you have source code which needs to be stored and recompiled. Just email us with an explanation and let us know how much space you need. Sometimes several users in the same or related projects are using common source code. Rather than each user having their own copy of this, we can set up a special /projects area which is readable by that project alone and which is not counted against the /home quota of any of the users in the group.

Remember that all limits, whether it be the size of your /home and /short directories or the resource limits for batch jobs, can be modified within the physical limitations of the systems. Please do not hesitate to ask if you need any quotas or limits changed so that you can run your jobs and store the output safely.

Managing archiving from /short

Imagine a scenario where a random 1% of all your files on the vayu /short filesystem were deleted right now. How useful would the other 99% of your files really be? How much work would you have lost? How long would it take to get back to where you are now?

Between 8pm and midnight on the evening of Thursday June 10th we were one tiny disk read failure (*) away from facing that scenario. All /short data is protected with two parity disks but we actually lost both in the one RAID set. With such large disks, double disk failures will be common and the probability of a triple disk failure in a RAID set during the life of vayu is very high. The /short filesystem is constructed of 104 of these double parity disk protected (RAID6) RAID sets, hence the very real possibility of losing a random 1% of your files (more if you use file striping). Remember that there are no backups of /short.
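The 1% figure follows from the number of RAID sets. As a rough sketch of the arithmetic, assuming (our simplification, not a description of how /short actually places data) that a file is striped over a number of distinct RAID sets chosen uniformly at random and that exactly one set is lost:

```python
from fractions import Fraction

RAID_SETS = 104  # number of RAID6 sets making up /short, as stated above


def fraction_of_files_hit(stripe_count: int = 1, raid_sets: int = RAID_SETS) -> Fraction:
    """Chance that a file touches the single RAID set that was lost.

    Assumption for illustration only: each file lives on `stripe_count`
    distinct RAID sets picked uniformly at random.
    """
    if not 1 <= stripe_count <= raid_sets:
        raise ValueError("stripe_count must be between 1 and raid_sets")
    return Fraction(stripe_count, raid_sets)


if __name__ == "__main__":
    for stripes in (1, 4):
        share = fraction_of_files_hit(stripes)
        print(f"stripe count {stripes}: about {float(share):.1%} of files affected")
```

Under these assumptions an unstriped file has a 1 in 104 chance of being on the lost set, and the chance grows in proportion to the stripe count.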

Because of these uncertainties we strongly recommend that you copy off all important /short data as soon as it makes sense to do so. If possible, build a data copying job (running mdss, scp, rsync, gridftp, etc. in the copyq) into your workflow by submitting it automatically from the compute job that generated the data. Data may be copied to the NF's MDSS, your home institution's resources or simply to a USB disk on your desk. Note that archiving data does not necessarily mean that you have to remove it immediately from /short.
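One way to build the copy step into a workflow is to generate the copy job script from the compute job. The following is a minimal sketch only: the function name, paths, destination host and walltime are hypothetical, and rsync is used simply because it is one of the tools listed above.

```python
import shlex


def build_copy_job(source_dir: str, destination: str, walltime: str = "01:00:00") -> str:
    """Return the text of a PBS job script that copies source_dir with rsync.

    The queue name copyq comes from the newsletter; everything else
    (walltime, rsync options, paths) is an illustrative assumption.
    """
    lines = [
        "#!/bin/bash",
        "#PBS -q copyq",
        f"#PBS -l walltime={walltime}",
        # shlex.quote protects paths that contain spaces or shell characters
        f"rsync -a {shlex.quote(source_dir)} {shlex.quote(destination)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical source directory and destination host
    print(build_copy_job("/short/PROJECT/results", "backup.example.org:archive/"))
```

The compute job could write this text to a file as its last step and submit that file with qsub, so that the copy runs only after the data exists.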

When you do copy the data, think carefully about the size of the data chunks (files) and the way you group the data you are saving. For storage efficiency the files should not be too small, ideally on the order of megabytes. But equally importantly, for ease of future access and updating, manageability of network transfers, and data reuse using specialised data services, they should not be too large. Generally speaking, you may find files of many gigabytes unwieldy to manage and unpack when you need to use the data again. For example, try to avoid burying useful data in a huge archive of garbage data: filter out useless information early.
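As an illustration of grouping, the sketch below packs many small files into batches of a chosen target size, each of which could then become one archive. The function name and the greedy strategy are our own choices for this example, and the target size is something you would pick for your own data.

```python
def group_into_batches(sizes: dict[str, int], target_bytes: int) -> list[list[str]]:
    """Greedily pack files (name -> size in bytes) into batches.

    Files are taken in name order. A batch is closed when adding the next
    file would exceed target_bytes; a file larger than the target ends up
    in a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name in sorted(sizes):
        size = sizes[name]
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical file names and sizes in bytes
    example = {"run1.out": 40, "run2.out": 40, "run3.out": 40, "big.dat": 200}
    print(group_into_batches(example, target_bytes=100))
```

Taking files in name order keeps related output together, which tends to matter more for later retrieval than packing every batch as tightly as possible.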

(*) disk read failures occur many times per day across the system although the average per disk rate is probably more like a small number per year.

File expiry on /short

Although there is an advertised expiry time of 60 days for unused files in /short, this has not yet been put into operation. Over the next few days all files that have not been accessed in 6 months will be deleted. The period of grace before deletion will then gradually move back to 90 days. Note that files that are being regularly used will not be deleted, as the time is measured from the last access, not from creation of the file.
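To see which of your files have not been accessed recently, you can compare each file's last access time with a cutoff. This is a sketch only: the function name is hypothetical, and it assumes that the access time reported by the filesystem is the timestamp the expiry process looks at.

```python
import os
import time
from typing import Optional


def files_not_accessed_since(root: str, days: float, now: Optional[float] = None) -> list[str]:
    """List files under root whose last access is more than `days` days ago."""
    reference = time.time() if now is None else now
    cutoff = reference - days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                # lstat reads metadata only, so it does not update the access time
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical path; replace PROJECT with your own project
    for path in files_not_accessed_since("/short/PROJECT", days=90):
        print(path)
```

Files this reports are worth archiving or deliberately abandoning before the expiry period catches up with them.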

File ownership in /short

A common problem experienced by users who are part of multiple projects is to have files with group ownership by one project residing in the /short directory of another project. This can lead to confusion when the result of the command quota -v does not correspond to the user's view of the /short directory.

Keeping track of group ownership of files is ultimately the user's responsibility but we have some suggestions of ways of minimising this problem here.
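A quick way to check a directory for files owned by the wrong group is to walk it and compare group IDs. The sketch below is illustrative only: the function name is hypothetical and it is not one of the suggestions referred to above.

```python
import os


def files_with_other_group(root: str, expected_gid: int) -> list[str]:
    """List files and directories under root whose group is not expected_gid."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                gid = os.lstat(path).st_gid
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if gid != expected_gid:
                mismatched.append(path)
    return sorted(mismatched)


if __name__ == "__main__":
    import grp

    # Hypothetical project name; assumes the project is a Unix group of the same name
    project_gid = grp.getgrnam("PROJECT").gr_gid
    for path in files_with_other_group("/short/PROJECT", project_gid):
        print(path)
```

Any path this prints is charged to a different group from the one that owns the directory, which is the usual reason quota -v and the directory contents appear to disagree.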

Non-interactive project selection

If you are a member of more than one project and you use scp/rsync (or other non-interactive ssh commands), then you may have had problems because such commands always use your default project (defined in your .rashrc file). If so, you may be interested in recent changes that help to alleviate this problem.
Email problems, suggestions, questions to