National Computational Infrastructure
NCI National Facility
Mass Data DC Downtime Information

23 April 2012 - ?? April 2012

Monday April 23 10 am: /projects on dcc requires a full file system check so dcc is going down immediately. We hope to restore services tomorrow, Tuesday April 24. We apologise for this disruption to the dcc service. Some web sites hosted at the NCI NF will also be affected by this downtime.

Tuesday April 24: The file system check of /projects is still running. In order to get services back, the /projects file system will be restored from the most recent dump to a smaller cache.

Thursday April 26: It is expected that the /projects restore will be completed by late this afternoon.

2 March 2012
The DCC will be down for the Phase II system upgrade this Friday, March 2 2012.
18 Jan 2012, 10am-4pm
A downtime for the MDSS file system is scheduled for Wednesday this week (18th of January 2012) to move the MDSS (Mass Data Storage System) HSM (Hierarchical Storage Management) file system from our old Oracle SAMQFS infrastructure to a new SGI based CXFS/DMF HSM system.

The new SGI based MDSS file system is significantly more robust, with two large tape silos in separate machine rooms on the opposite sides of the ANU campus enabling automatic offsite copies for all data. It also has around 3 times the current disk cache size at 350 TBytes.

The downtime will start at 10am and is expected to take around 4 hours to dump and restore the MDSS file system. Copyq jobs with -lother=mdss can still be queued, but they will not be run while the MDSS is not available. (Note the importance of including the -lother=mdss qsub option in this circumstance.)
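As an illustration only (the script name, file name and walltime below are placeholders, not part of this announcement), a minimal copyq job that requests MDSS access might look like the following:

    #!/bin/bash
    # Minimal sketch of a copyq job - submit with: qsub mdss_copy.sh
    #PBS -q copyq
    #PBS -lother=mdss
    #PBS -lwalltime=00:30:00

    # Run from the directory the job was submitted from.
    cd $PBS_O_WORKDIR

    # Copy a (placeholder) results file into the mass data store.
    mdss put results.tar.gz

Jobs submitted this way will simply wait in the copyq until the MDSS is available again, rather than failing during the downtime.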

The change to the filesystems will be mostly transparent to users of mdss get and put, or /massdata pathnames. If you currently use specific SAMFS HSM commands such as stage and sls, these will be replaced by similar DMF HSM commands.
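As a rough guide only (the exact replacements should be confirmed against the NF documentation once the new system is in service), the commonly used SAM-FS commands map to DMF commands along these lines; the file name data.tar is a placeholder:

    sls -D  data.tar     ->  dmls -l  data.tar     # show whether the file is on disk or off-line on tape
    stage   data.tar     ->  dmget    data.tar     # recall an off-line file from tape into the disk cache
    release data.tar     ->  dmput -r data.tar     # migrate the file and release its disk space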

5-8 Nov 2011
Following persistent system crashes on both the SAMFS metadata server and clients, the massdata file system was taken offline for a file system check. This check was expected to take two days due to the large amount of online data cached in readiness for transfer to the new massdata system. Unfortunately the fsck failed, and the massdata file system had to be rebuilt from the last known good snapshot taken at 6am on the 3rd of November.
18-24 Oct 2011
Due to security violations from unknown hackers, the MDSS was taken out of service whilst investigations were undertaken, and security upgraded on all NCI systems.
17 Sept 2011
There will be an electrical outage at the ANU Huxley data centre on Saturday 17th September. As a result there will be a downtime from 17th Sept (12am) to 18th Sept (12am). A new UPS is to be installed in the data centre and power to all equipment except the Vayu and the Xe supercomputers will need to be turned off to install the electrical feed for this UPS. All Vayu and Xe jobs that need to access the mdss will be unable to run as the mdss system will be unavailable.
07 Apr 2011
VMs gpu-polit and esgnode1 down for 10 minutes while being moved from a faulty VM host to a working one.
05 Apr 2011
Network outage to upgrade switches. Affected all services.
21 Dec 2010
The production gridftp service for the DC cluster is down at present. It went down at approx 10:00. We are currently working to get an alternate machine up and running. Users can still get to their data via the host dc.nci.org.au. Please email help@nf.nci.org.au if you have any queries.
13 Dec 2010
Unscheduled downtime lasting approximately from 11:15 to 14:30. Services hosted in our Infrastructure as a Service (IaaS) environment may have appeared to be unavailable. Not all services were affected.
12 Oct 2010
Scheduled downtime to merge all of the file systems into one, and to increase the disk cache size. The expected benefit is a larger and more usable disk cache, as it will no longer be divided amongst 5 file systems. The limited number of tape drives will also be used more efficiently for the same reason, so we should expect better response to puts and gets through the copyq queue than we are currently experiencing.

The downtime is expected to take all day to dump, restore and merge the file systems. Copyq jobs with -lother=mdss can still be queued, but they will not be run while the MDSS is not available. (Note the importance of including the -lother=mdss qsub option in this circumstance.)

The change to the filesystems will be transparent to users of mdss get and put, or /massdata pathnames.
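For reference, typical mdss usage from the compute systems looks something like the following; the directory and file names are placeholders, and sub-commands other than put and get (e.g. ls and mkdir, assumed here) should be checked against the NF user guide:

    mdss mkdir backups                   # create a directory in your project's mass data area
    mdss put results.tar.gz backups      # copy a file from local disk into the MDSS
    mdss ls backups                      # check that the file has arrived
    mdss get backups/results.tar.gz .    # retrieve it again later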

11-12 May 2010
A major component of the DC storage subsystem had an unexpected outage between 11th May (2am) until 12th May (8am). A detailed analysis revealed a known bug in the underlying ZFS storage target. A patch was applied and all filesystems were checked for data integrity. The system was returned to service by 9:30am, 12th May.
5 April 2010
The assda.anu.edu.au website was unavailable from 5:00 PM to 7:30 PM on Monday 5th April 2010 due to an operator error.
30 January 2010
On Saturday 30th January 2010, the NCI National Facility received a major upgrade to its core routing and switching network equipment. The outage windows for the NCI National Facility services were:
- 09:00-13:00 Loss of network connectivity to all National Facility services, including Vayu and XE cluster head nodes.
- 08:00-15:00 Expected outage of DC services.
- 08:00-17:00 Possible outage of DC services.
- 17:00-20:00 Fixing link to remote silo (new C3 stack had a port-span problem).
20 - 21 December 2009
The SAM-QFS cluster filesystem which supports all DC data services was down between 2000hrs (20th) to 0800hrs (21st). The following services were affected as a result: the dc2 webserver, all MDSS data transfers, the dc0 login node, the DC gridftp server.
30 October 2009
There was a downtime between 1300hrs and 1500hrs which affected the VMWare ESX cluster owing to firmware problems on the Sun 6140 fibre channel array.
21 July 2009
On Tuesday 21st July the NCI National Facility will be down whilst the network and data cloud are moved to a new location in the machine room.

Although most NCI servers will remain up and running, all NCI and ANUSF services, including web sites, compute systems, and the data cloud will be uncontactable during the move, starting 8am.

On the same day, many of the components of the data cloud and mass data storage system (dc,mdss) will also be moved into new racks. This move involves a lot of hardware and is expected to take most of the day, starting from 7:30am.

The data cloud, and any service dependent upon it will be down for most of the day. The network is expected to be up at around 10am.

Any further queries regarding this downtime should be directed to help@nf.nci.org.au.

Update @ 9:30pm: The equipment is now all in place, and the software services are being tested.

22 March 2009, 10am -- 24 March 2009
The SAM-QFS cluster filesystem, which provides HSM storage for the DC cluster, was unavailable. Two independent hardware RAID controllers on two different 6140 arrays failed. Solaris clients were patched to the latest level, which included multipathing support fixes.
2 January 2009 -- 7 January 2009
System availability was limited due to problems with client fibrechannel failover support in the new firmware. Full service was finally restored after careful checking of the filesystem and with Sun support resolving the issues.
30 December 2008-1 January
The DC cluster was scheduled for a downtime to move to another area of the machine room. During this downtime the RAID firmware on the 6140 will also be upgraded.
20 November 2008, 11:00am-2:30pm
The DC cluster suffered some problems due to an NFS server issue. Nodes with NFS mounts were affected while nodes with QFS mounts or ESX volumes continued without problems.
25 August 2008, 10:00am-12:00pm
The DC cluster will be unavailable from 10am-12pm while underlying services associated with the data cluster (dc) are migrated onto newer hardware. This will in turn affect MDSS and NFS shares. Web services (Apache, Ruby on Rails, etc.) originating from dc2.apac.edu.au will also be affected, as a faulty hardware component in this host needs replacement.

Affected DC file systems are home, projects, short, vizlab, and all the massdata directories. Users on the compute systems, i.e. AC and LC, will not be able to use the mdss command to access the MDSS.

25 June 2008, 9:00am-11:00am
Mass Data Storage System (dc cluster and store) will be down from 9am-11am to install 3 new 16Tbyte disk trays, and update the firmware on all of the existing disk trays.
17 April 2008
The partition tables on the home, opt and vizlab QFS file systems disappeared, and those file systems were offline for around 5 hours whilst we debugged and fixed the ensuing problems. The probable cause was either an ESX server install, or the reinstall of Solaris on dcmds0 during that time.
1 April 2008
One of the metadata servers crashed with a kernel panic when it was the current metadata server. It rebooted automatically after the crash. There was a reboot the following day to fix some lingering problems.
28 March 2008, 10:00am-10:30am
Store will be rebooted this morning (Friday 28th March) at 10am to apply the latest Daylight Savings changes. Daylight savings in NSW and ACT now ends the first Sunday in April, rather than the last Sunday in March.
17 March 2008, 1:30pm-4:30pm
The tape drives have been offline for the afternoon due to a hardware failure in the ACSLS tape management system. The system has been replaced with a new system, and is currently auditing the tape library. Tape movements will gradually increase in responsiveness once the audit is completed.
17 March 2008, 11:30am-1:30pm
On Monday morning the DC Data Storage cluster was down from 11:30am-1:30pm to replace a broken CPU in the metadata server. Unfortunately, the CPU replacement did not fix the problem. There will probably be another downtime in the near future to replace the motherboard.
1 May 2007
On Tuesday morning the Mass Data Storage System (store) will be down from 11:30am-1:30pm to install a new T10000 tape drive. The T10000 tape drive has a larger capacity (500GB per tape) and faster throughput (120MB/s), compared to our current 9940B tape drives (200GB, and 70MB/s).
27 Feb 2007
At 11pm a severe hail storm caused the power to fail in the machine room. Recovery commenced 8am 28/02/2007. File system integrity checking took until around 10pm to complete.
11 Dec 2006
On Monday morning the Mass Data Storage System (store) will be down from 10-12am for system maintenance. During this downtime the disk and tape fibre fabric is being reorganised for the new data cluster system.
20 May 2006
APAC-NF Downtime: Saturday May 20th, 8am-6pm. The ANU machine room will be powered off at 8am May 20th to allow the replacement of the UPS system. All APAC-NF services (AC, LC, MDSS, web services, etc) will be unavailable for the day. The work is scheduled to be completed by 6pm but may be earlier. Jobs that will not complete before the downtime are not being started.
19 May 2006
All MDSS services will be unavailable on Friday 19th May from 9:30am to 12:30pm for preventative maintenance on the Fibre Channel switches that connect the MDSS to its disk array and tape drives. The actual period of the downtime may be shorter if there are no unforeseen problems.
21 April 2006
The MDSS was rebooted at 4:25 this afternoon (Friday 21 April 2006) to fix up a hardware problem with one of the tape drives. It was down for around 10 minutes.
12 April 2006
Starting at 10:00am the MDSS will be down to update the firmware on the fibre channel disk arrays. We expect it to be down for around three hours. All access to the MDSS will be closed during this time.
Returned to service at 15:35.
8-10 March 2006
8 March, 7:30pm - Both of the tape robot's hands went off line, causing all tape mounts to fail. Disk operations are working normally.
9 March, pm - It is expected that one of the hands will be replaced by 5pm this afternoon, enabling tape mounts to resume.
10 March, am - The second hand will be replaced in the morning, disabling tape mounts for approximately 1 hour.
10 March, 5pm - The robot is on-line again and is now allowing staging from tape.
17 November 2005
1600 - Loss of power to the machine room caused all of lc, ac and the MDSS to go offline. The systems are being tested before full return to service.
14 November 2005
Starting at 10:00am the MDSS will be down to update the firmware on all fibre channel devices (tape drives, switches and disk arrays). A new tape robot control station will also be installed. We expect it to be down for around four hours. The copyqs on the ac and lc will be closed during this time.
7 May 2005, early am to around 1pm
Store will be down to allow work on machine room power supplies.
21 March 2005, 10:30-11:30pm
Store is down due to a robot software failure.
16 March 2005, 11am
Store is down for an emergency hardware robot repair. The data resident in the disk cache is accessible, but offline data is not available until the mechanism has been repaired.
7 March 2005, 6pm
Store was down for an emergency hardware robot repair. The data resident in the disk cache was accessible, but offline data was not available until the mechanism had been repaired.
28 January 2005
Store was rebooted to clear some rogue system processes.
15 December 2004
Starting at 11am, the store was down for approximately 5 hours. The persistent binding implementation on the Qlogic cards caused some grief, so the initial planned downtime of 2 hours was extended to resolve the issues.
7 December 2004
Starting at 2:15pm the MDSS will be down to install a new fibre driver and to configure some fibre switches. We expect the machine to be down for just over an hour. The copyqs on the sc and lc will be closed during this time.
2 December 2004
Starting at 6:30am the machine room went down for a power upgrade. During the downtime we also upgraded SAM-FS to version 4.2.
16 April 2004
Store will be down from 3pm to install several operating system patches. We expect that the machine will be down for 1 hour.
25 February 2004
Store will be down from 10 to 11 a.m. Monday the 1st of March 2004. The downtime is to replace some hardware associated with the problem from the 29th of January (the centerplane).
16 February 2004
Store will be down from 2 to 3 p.m. Tuesday the 17th of February 2004. The downtime is for rearrangements to maximise performance.
4 February 2004
Store has been down several times over the last few days due to crashes and attempts to fix the problem causing the crashes. We have implemented a work-around suggested by Sun and hope that Store is now stable.
29 January 2004
Today we were seeing some intermittent hardware problems with the new Sun server. It was investigated by a hardware engineer and we are monitoring for further problems. We may close the sc and lc copyqs if reboots are necessary.
19 - 21 January 2004, from 8am
Scheduled downtime to upgrade system hardware and reorganize the SAMFS disk cache. Store is being upgraded from its current 6x336MHz cpu 6GB Sun E6500 to a new 4x900MHz cpu 8GB Sun Fire V480R. An extra 1.5 terabytes of disk cache is also being added into SAMFS. Back in service at 8pm on the 21st.
7 October 2003, from 8am
Scheduled downtime to relocate network switches and fibre terminations.
16 June 2003, 1-1:40pm
The store was down to update some references to the tape drives.
Planned downtime: Thursday 3 April 2003, 9am - 3pm
Update: The downtime was extended to 10pm to allow time for final testing and to improve performance.

The Mass Data Storage System will have a downtime on Thursday from 9am to 3pm (coinciding with the SC downtime). During this downtime the new StorageTek D280 3 Terabyte disk array will be installed. This will replace the old Sun A3500 300 GB disk array.

This new disk array will become the disk cache for all the SAMFS file systems on the Mass Data Storage System. All the existing SAMFS file systems will be copied onto the new disk cache, as well as being reorganized to more effectively use the resulting extra space.

10/3/2003, 11am
A problem with the network connection to store.anu.edu.au has been resolved after some teething problems with network routes. People contacting store.apac.edu.au, or connecting from the SC, were not affected.
28/2/2003, 3pm
The copyq on the sc was started again this morning after yesterday's issues with store were resolved. The problem was linked to a system monitoring program called BigBrother that was recently installed on the system. The sam05 filesystem used by the macho and stromlo projects will continue to be busy while we continue migration onto the 9940B tape drives.
27/2/2003, 3pm
Store is being rebooted to attach a replacement tape drive.
25/2/2003, 11:00am
Store was rebooted to clear a problem with a hung tape drive.
19/2/2003, 10:00am
The Mass Data Store crashed last night with disk errors on the RAID array. We are awaiting parts from Sun and expect them to arrive early tomorrow morning (20/2/2003). Some data loss may have occurred, but the extent is unknown at this time. Apologies for the interruption to the service.
14/2/2003, 10:20am
The Mass Data Store crashed last night and early this morning. Apologies for the interruption to the service.
11/2/2003, 9am-9:30am
The Mass Data Store will be down for around half an hour to install a modified version of the SAMFS file system software.
16/1/2003
An error in the tape system caused the tape drives to be off-line last night.
6/1/2003
A permissions problem with the macho data catalog has been resolved.
23/12/2002
Store is back in production service. The problems were finally resolved after we diagnosed a conflict between 1Gbit and 2Gbit JNI fibre cards driving the 9940B tape drives. The 1Gbit cards have been replaced with another 2Gbit card.
20/12/2002
Store is down again to investigate a new instance of problems on the filesystems. We are diagnosing the problems, but expect the system will not be back in service before the start of the week.
19/12/2002
Store has returned to service for APAC-NF users (12:30pm). We apologise for this unplanned and long downtime.
18/12/2002
Store is down. We are working on bringing it back to service as soon as possible.
17/12/2002
There are some hardware problems on store at the moment. Work is in progress, although we don't yet have an estimated end time.
15/12/2002 7:00am - ~5pm
Unfortunately the power to the sc and store will be cut for most of Sunday December 15 to allow electrical work to be done in a neighbouring building. There will therefore have to be a downtime from around 7am. We expect that the power will be returned to the building in the late afternoon/evening and the system will then be brought back into service.

The NF web server and networking will be available using the UPS battery system and generator. Any updates to the downtime will be posted on this web site.
15/10/2002 10:00am - 12:30pm
This downtime was to allow operating system patches to come into effect, a network card to be installed, and changes to be made to the network topology.
1/10/2002 ~11:00am - 10:40pm
Two disks failed in the Mass Data Store's disk array. They were replaced and the filesystems restored from backup. We also took the opportunity to reconfigure the tape devices for increased reliability.
12/6/2002 - 8:00pm - 8:30pm
There will be a network outage due to an operating system upgrade of the ACT Regional Network Organisation router.
5/6/2002 - 12noon-3pm
The Mass Data Store will be down to add 3 new high-capacity linear tape drives.
15/5/2002
The Mass Data Store will be down Wednesday 15 May from 9am to noon to prepare the system for more tape drives. This will affect /massdata mounts on the APAC-NF SC.
Update
We got underway a little late this morning. The system will be up at ~2:30pm.
16/1/2002 1pm-2:30pm
The mass storage system (/massdata) will be offline to apply a SAM-FS patch.
13/12/2001 12:30am-4:30pm
The Store HSM disk caches (/massdata) were unavailable due to a hardware failure on the controllers of the RAID5 disk array. The service was returned at 4:30pm.
Store Downtime 30/11/2001 9am-10:30am
The Mass Storage System will be taken down to upgrade all the SCSI cards in the system. This will hopefully fix the hardware problems we have been having over the past few weeks.
Store Downtime 17/11/2001 1:30pm
The Mass Storage System is experiencing a hardware problem.
Store Downtime 2/11/2001 6pm
The Mass Storage System had a hardware failure last night. This problem was isolated on Saturday morning (3/11), and the service was returned to production.
Store Downtime 22/10/2001 9am
The Mass Storage System is being rebooted to clear a system failure that occurred overnight.
Store Downtime 16/10/2001 11am
The Mass Storage System is being rebooted to clear a system failure that occurred overnight.
Store Downtime 05/10/2001 9:30am-5pm
The Mass Data Store will be down for most of the day to have its operating system upgraded to Solaris 8.
Store crash 10/8/2001 5pm
The Sun server went down at 5pm this afternoon. We are investigating the cause.
Leonard Huxley Room planned power out 16/03/2001 6:30am - 10am
The store will be powered down again as part of ongoing work to put all storage system components onto the new power grid. Unfortunately the number and type of circuits required to power the Mass Storage System will mean another downtime next week as well.
Leonard Huxley Room planned power out 14/03/2001 6:30am - 3pm
The Leonard Huxley Machine Room was powered down to work on the power grid. A second downtime will be required to complete this work, either Thursday or Friday.
Store downtime 25/7/2000 11am - 12noon
The Mass Data Store will be down for approximately 1 hour to fix up alignment problems with the robot arm. At the same time two new gigabit ethernet cards will be installed in preparation for the new APAC installation.
Store downtime 5/7/2000 2pm - 3pm
The Mass Data Store will be down for approximately 1 hour to upgrade the storage and archive management software (SAMFS) to the latest version.
A configuration error meant that the catalog of tapes was not online until 8:30pm.
Store AMENDED downtime 7/02/2000 11am - 2pm
**** Due to the late arrival of the STK engineers, the downtime will now start at 11am. ****

The Mass Data Store will be down for approximately 3 hours to remove the two borrowed Redwood tape drives and the last of the old Timberline linear tape drives.
Store downtime 7/02/2000 9am - 12 noon
The Mass Data Store will be down for approximately 3 hours to remove the two borrowed Redwood tape drives and the last of the old Timberline linear tape drives.
Store downtime 31/12/1999 10am - 2/1/2000 noon
All ANUSF machinery was powered off in compliance with ITS Y2K policy.
20/12/99 2pm
The Mass Storage System was rebooted to fix a problem with exported filesystems.
3/12/99 10:20-11:20am
The Mass Storage System is down to replace a poorly performing SCSI card.
10/11/99 4-7:30pm
The Mass Storage System is down to fix a robot hand and to rebuild the arrays with better tuning parameters.
A slight problem was experienced during the reconfiguration. The downtime has been extended to 10pm.
2/11/99 1-2pm
The Mass Storage System was rebooted to re-install an FDDI card and attach it properly to the network.
26/10/99 4:30pm
The Mass Storage System was returned to service. Some new problems appeared in the communication path between the data server (store) and robot server (acsls). The problems were mainly due to a communication change in the new acsls software that was needed to run the new 9840 tape drives.
25/10/99 8am-8pm
The Mass Storage System will be down all day to install the second phase of the planned upgrades. In this downtime the robotics will be upgraded to an STK Powderhorn, and 8 new 9840 drives will be installed. The Silverton drives will also be removed. Data on the older linear technology will be transparently migrated to the new tapes.
6/9/99 10am-5pm
The Sun SPARCcenter 2000 data server will be replaced by a new Sun E6500 server with an increased disk cache. The installation of the new server and disk cache is part of a two-stage upgrade to increase the performance and capacity of the Mass Data Storage System.
1/9/99 10-11:30am
Maintenance was performed on the robotics and tape drives. The system was unavailable during this period.
17/8/99 1-7am
The FDDI interface died last night at 1am. The system was rebooted to restore network connectivity at 7am.
31/7/99 7am-noon
Store is down for air-conditioning work in the machine room.
27/7/99 8-9:30am
A small fix was applied to SAM-FS. Some routine maintenance was also performed.

Entries before this date are no longer kept on-line.
