23 April 2012 - ?? April 2012
Monday April 23 10 am: /projects on dcc requires a full file system check so dcc is going down immediately. We hope to restore services tomorrow, Tuesday April 24. We apologise for this disruption to the dcc service. Some web sites hosted at the NCI NF will also be affected by this downtime.
Tuesday April 24: The file system check of /projects is still going. In order to get services back the /projects file system will be restored from the most recent dump to a smaller cache.
Thursday April 26: It is expected that the /projects restore will be completed by late this afternoon.
The DCC will be down for the Phase II system upgrade this Friday, March 2 2012.
A downtime for the MDSS file system is scheduled for Wednesday this week (18th of January 2012) to move the MDSS (Mass Data Storage System) HSM (Hierarchical Storage Management) file system from our old Oracle SAMQFS infrastructure to a new SGI based CXFS/DMF HSM system.
The new SGI based MDSS file system is significantly more robust, with two large tape silos in separate machine rooms on the opposite sides of the ANU campus enabling automatic offsite copies for all data. It also has around 3 times the current disk cache size at 350 TBytes.
The downtime will start at 10am and is expected to take around 4 hours to dump and restore the MDSS file system. Copyq jobs with -lother=mdss can still be queued, but they will not be run while the MDSS is not available. (Note the importance of including the -lother=mdss qsub option in this circumstance.)
The change to the filesystems will be mostly transparent to users of mdss get and put, or /massdata pathnames. If you currently use specific SAMFS HSM commands such as stage and sls, these will be replaced by similar DMF HSM commands.
Following persistent system crashes on both the SAMFS metadata server and clients, the massdata file system was taken offline for a file system check. This check was expected to take two days due to the large amount of online data cached in readiness for transfer to the new massdata system. Unfortunately the fsck failed, and the massdata file system had to be rebuilt from the last known good snapshot taken at 6am on the 3rd of November.
Due to security violations from unknown hackers, the MDSS was taken out of service whilst investigations were undertaken and security was upgraded on all NCI systems.
There will be an electrical outage at the ANU Huxley data centre on Saturday 17th September. As a result there will be a downtime from 17th Sept (12am) to 18th Sept (12am). A new UPS is to be installed in the data centre and power to all equipment except the Vayu and the Xe supercomputers will need to be turned off to install the electrical feed for this UPS. All Vayu and Xe jobs that need to access the mdss will be unable to run as the mdss system will be unavailable.
VMs gpu-polit and esgnode1 down for 10 minutes while being moved from a faulty VM host to a working one.
Network outage to upgrade switches. Affected all services.
The production gridftp service for the DC cluster is down at present. It went down at approx 10:00. We are currently working to get an alternate machine up and running. Users can still get to their data via the host dc.nci.org.au. Please email help@nf.nci.org.au if you have any queries.
Unscheduled downtime lasting from approximately 11:15 to 14:30. Services hosted in our Infrastructure as a Service (IaaS) environment may have appeared to be unavailable. Not all services were affected.
Scheduled downtime to merge all of the file systems into one, and increase the disk cache size. The expected benefits are a larger and more useable disk cache as a result of not being divided amongst 5 file systems. The limited number of tape drives will also be used more efficiently for the same reason, so we should expect better response to puts and gets through the copyq queue than we are currently experiencing.
The downtime is expected to take all day to dump, restore and merge the file systems. Copyq jobs with -lother=mdss can still be queued, but they will not be run while the MDSS is not available. (Note the importance of including the -lother=mdss qsub option in this circumstance.)
The change to the filesystems will be transparent to users of mdss get and put, or /massdata/ pathnames.
A major component of the DC storage subsystem had an unexpected outage from 11th May (2am) until 12th May (8am). A detailed analysis revealed a known bug in the underlying ZFS storage target. A patch was applied and all filesystems were checked for data integrity. The system was returned to service by 9:30am, 12th May.
The assda.anu.edu.au website was unavailable from 5:00 PM to 7:30 PM on Monday 5th April 2010 due to an operator error.
On Saturday 30th January 2010, the NCI National Facility received a major upgrade to its core routing and switching network equipment. The outage windows for the NCI National Facility services were:
- 09:00-13:00 Loss of network connectivity to all National Facility services including Vayu and XE cluster head nodes.
- 08:00-15:00 Expected outage of DC services
- 08:00-17:00 Possible outage of DC services
- 17:00-20:00 Fixing link to remote silo (new C3 stack had a port-span problem)
The SAM-QFS cluster filesystem which supports all DC data services was down from 2000hrs (20th) to 0800hrs (21st). The following services were affected as a result: the dc2 webserver, all MDSS data transfers, the dc0 login node, the DC gridftp server.
There was a downtime from 1300hrs to 1500hrs which affected the VMWare ESX cluster owing to firmware problems on the Sun 6140 fibre channel array.
On Tuesday 21st July the NCI National Facility will be down whilst the network and data cloud are being moved to a new location in the machine room.
Although most NCI servers will remain up and running, all NCI and ANUSF services, including web sites, compute systems, and the data cloud will be uncontactable during the move, starting 8am.
On the same day, many of the components of the data cloud and mass data storage system (dc,mdss) will also be moved into new racks. This move involves a lot of hardware and is expected to take most of the day, starting from 7:30am.
The data cloud, and any service dependent upon it, will be down for most of the day. The network is expected to be up at around 10am.
Any further queries regarding this downtime should be directed to help@nf.nci.org.au.
Update @ 9:30pm: The equipment is now all in place, and the software services are being tested.
The SAM-QFS cluster filesystem, which provides HSM storage for the DC cluster, was unavailable. Two independent hardware RAID controllers on two different 6140 arrays failed. Solaris clients were patched to the latest level which included multipathing support fixes.
The system availability has been limited due to problems with client fibrechannel failover support with the new firmware. Full service was finally restored after careful checking of the filesystem and with Sun support resolving the issues.
The DC cluster was scheduled for a downtime to move to another area of the machine room. During this downtime the RAID firmware on the 6140 will also be upgraded.
The DC cluster suffered some problems due to an NFS server issue. Nodes with NFS mounts were affected while nodes with QFS mounts or ESX volumes continued without problems.
The DC cluster will be unavailable from 10am-12pm as underlying services associated with the data cluster (dc) will be migrated onto newer hardware. This in turn will affect MDSS and NFS shares. Web services (Apache, Ruby-on-rails etc.) originating from dc2.apac.edu.au will also be affected, as a faulty hardware component for this host needs replacement.
Affected DC file systems are home, projects, short, vizlab, and all the massdata directories. Users on the compute systems, i.e. AC and LC, will not be able to use the mdss command to access the MDSS.
Mass Data Storage System (dc cluster and store) will be down from 9am-11am to install 3 new 16Tbyte disk trays, and update the firmware on all of the existing disk trays.
The partition tables on the home, opt and vizlab QFS file systems disappeared, and those file systems were offline for around 5 hours whilst we debugged and fixed the ensuing problems. The probable cause was either an ESX server install, or the reinstall of Solaris on dcmds0 during that time.
One of the metadata servers crashed with a kernel panic when it was the current metadata server. It rebooted automatically after the crash. There was a reboot the following day to fix some lingering problems.
Store will be rebooted this morning (Friday 28th March) at 10am to apply the latest Daylight Savings changes. Daylight savings in NSW and ACT now ends on the first Sunday in April, rather than the last Sunday in March.
The tape drives have been offline for the afternoon due to a hardware failure in the ACSLS tape management system. The system has been replaced with a new system, and is currently auditing the tape library. Tape movements will gradually increase in responsiveness once the audit is completed.
On Monday morning the DC Data Storage cluster was down from 11:30am-1:30pm to replace a broken CPU in the metadata server. Unfortunately, the CPU replacement did not fix the problem. There will probably be another downtime in the near future to replace the motherboard.
On Tuesday morning the Mass Data Storage System (store) will be down from 11:30am-1:30pm to install a new T10000 tape drive. The T10000 tape drive has a larger capacity (500GB per tape) and faster throughput (120MB/s), compared to our current 9940B tape drives (200GB, and 70MB/s).
At 11pm a severe hail storm caused the power to fail in the machine room. Recovery commenced 8am 28/02/2007. File system integrity checking took until around 10pm to complete.
On Monday morning the Mass Data Storage System (store) will be down from 10am-12pm for system maintenance. During this downtime the disk and tape fibre fabric is being reorganised for the new data cluster system.
APAC-NF Downtime: Saturday May 20th, 8am-6pm. The ANU machine room will be powered off at 8am May 20th to allow the replacement of the UPS system. All APAC-NF services (AC, LC, MDSS, web services, etc) will be unavailable for the day. The work is scheduled to be completed by 6pm but may be earlier. Jobs that will not complete before the downtime are not being started.
All MDSS services will be unavailable on Friday 19th May from 9:30am to 12:30pm for preventative maintenance on the Fibre Channel switches that connect the MDSS to its disk array and tape drives. The actual period of the downtime may be shorter if there are no unforeseen problems.
The MDSS was rebooted at 4:25 this afternoon (Friday 21 April 2006) to fix up a hardware problem with one of the tape drives. It was down for around 10 minutes.
Starting at 10:00am the MDSS will be down to update the firmware on the fibre channel disk arrays. We expect it to be down for around three hours. All access to the MDSS will be closed during this time.
Returned to service at 15:35.
8 March, 7:30pm - Both of the tape robot's hands went off line, causing all tape mounts to fail. Disk operations are working normally.
9 March, pm - It is expected that one of the hands will be replaced by 5pm this afternoon, enabling tape mounts to resume.
10 March, am - The second hand will be replaced in the morning, disabling tape mounts for approximately 1 hour.
10 March, 5pm - The robot is on-line again and is now allowing staging from tape.
1600 - Loss of power to the machine room caused all of lc, ac and the MDSS to go offline. The systems are being tested before full return to service.
Starting at 10:00am the MDSS will be down to update the firmware on all fibre channel devices (tape drives, switches and disk arrays). A new tape robot control station will also be installed. We expect it to be down for around four hours. The copyqs on the ac and lc will be closed during this time.
Store will be down to allow work on machine room power supplies.
Store is down due to a robot software failure.
Store is down having an emergency hardware robot repair. The data resident in the disk cache is accessible but offline data is not available until the mechanism has been repaired.
Store was down having an emergency hardware robot repair. The data resident in the disk cache was accessible but offline data was not available until the mechanism had been repaired.
Store was rebooted to clear some rogue system processes.
Starting at 11am, the store was down for approximately 5 hours to
- remove a 9940B drive which is being placed in our second silo and reconfigure the robot server ACSLS;
- configure persistent binding on the fibre fabric; and
- apply a patch to SAM-FS.
The persistent binding implementation on the Qlogic cards caused some grief so the initial planned downtime of 2 hours was extended to resolve the issues.
Starting at 2:15pm the MDSS will be down to install a new fibre driver and to configure some fibre switches. We expect the machine to be down for just over an hour. The copyqs on the sc and lc will be closed during this time.
Starting at 6:30am the machine room went down for a power upgrade. During the downtime we also upgraded SAM-FS to version 4.2.
Store will be down from 3pm to install several operating system patches. We expect that the machine will be down for 1 hour.
Store will be down from 10 to 11 a.m. Monday the 1st of March 2004. The downtime is to replace some hardware associated with the problem from 29th of January (the centerplane).
Store will be down from 2 to 3 p.m. Tuesday the 17th of February 2004. The downtime is for rearrangements to maximise performance.
Store has been down several times over the last few days due to crashes and attempts to fix the problem causing the crashes. We have implemented a work-around suggested by Sun and hope that Store is now stable.
Today we were seeing some intermittent hardware problems with the new Sun server. It was investigated by a hardware engineer and we are monitoring for further problems. We may close the sc and lc copyqs if reboots are necessary.
Scheduled downtime to upgrade system hardware, and reorganize SAMFS disk cache. Store is being upgraded from its current 6x336MHz cpu 6GB Sun E6500 to a new 4x900MHz cpu 8GB Sun Fire V480R. An extra 1.5 terabytes of disk cache is also being added into SAMFS. Back in service at 8pm on the 21st.
Scheduled downtime to relocate network switches and fibre terminations.
The store was down to update some references to the tape drives.
Update: The downtime was extended to 10pm to allow time for final testing and improving the performance.
The Mass Data Storage System will have a downtime on Thursday from 9am to 3pm (coinciding with the SC downtime). During this downtime the new StorageTek D280 3 Terabyte disk array will be installed. This will replace the old Sun A3500 300 GB disk array.
This new disk array will become the disk cache for all the SAMFS file systems on the Mass Data Storage System. All the existing SAMFS file systems will be copied on to the new disk cache, as well as being reorganized to more effectively use the resulting extra space.
A problem with the network connection to store.anu.edu.au has been resolved after some teething problems with network routes. People contacting store.apac.edu.au, or connecting from the SC were not affected.
The copyq on the sc was started again this morning after yesterday's issues with store were resolved. The problem was linked to a system monitoring program called BigBrother that was recently installed on the system. The sam05 filesystem used by the macho and stromlo projects will continue to be busy while we continue migration onto the 9940B tape drives.
Store is being rebooted to attach a replacement tape drive.
Store was rebooted to clear a problem with a hung tape drive.
The Mass Data Store crashed last night with disk errors on the RAID array. We are awaiting parts from Sun and expect them to arrive early tomorrow morning (20/2/2003). Some data loss may have occurred, but the extent is unknown at this time. Apologies for the interruption to the service.
The Mass Data Store crashed last night and early this morning. Apologies for the interruption to the service.
The Mass Data Store will be down for around half an hour to install a modified version of the SAMFS file system software.
An error in the tape system caused the tape drives to be off-line last night.
A permissions problem with the macho data catalog has been resolved.
Store is back in production service. The problems were finally resolved after we diagnosed a conflict between 1Gbit and 2Gbit JNI fibre cards driving the 9940B tape drives. The 1 Gbit cards have been replaced with another 2Gbit card.
Store is down again to investigate a new instance of problems on the filesystems. We are diagnosing the problems, but expect the system will not be back in service before the start of the week.
Store has returned to service for APAC-NF users (12:30pm). We apologise for this unplanned and long downtime.
Store is down. We are working on bringing it back to service as soon as possible.
There are some hardware problems on store at the moment. Work is in progress, although we don't have an end time yet.
Unfortunately the power to the sc and store will be cut for most of Sunday December 15 to allow electrical work to be done in a neighbouring building. There will therefore have to be a downtime from around 7am. We expect that the power will be returned to the building in the late afternoon/evening and the system will then be brought back into service.
The NF web server and networking will be available using the UPS battery system and generator. Any updates to the downtime will be posted on this web site.
This downtime was to allow patches to the operating system to come into effect, a network card to be installed, and changes in network topology.
Two disks failed in the Mass Data Store's disk array. They were replaced and the filesystems restored from backup. We also took the opportunity to reconfigure the tape devices for increased reliability.
There will be a network outage due to an operating system upgrade of the ACT Regional Network Organisation router.
The Mass Data Store will be down to add in 3 new high capacity linear tape drives.
The Mass Data Store will be down Wednesday 15 May from 9am to noon to prepare the system for more tape drives. This will affect /massdata mounts on the APAC-NF SC.
Update
We got underway a little late this morning. The system will be up at ~2:30pm.
The mass storage system (/massdata) will be offline to apply a SAM-FS patch.
The Store HSM disk caches (/massdata) were unavailable due to a hardware failure on the controllers of the RAID5 disk array. The service was returned at 4:30pm.
The Mass Storage System will be taken down to upgrade all the SCSI cards in the system. This will hopefully fix the hardware problems we have been having over the past few weeks.
The Mass Storage System is experiencing a hardware problem.
The Mass Storage System had a hardware failure last night. This problem was isolated on Saturday morning (3/11), and the service was returned to production.
The Mass Storage System is being rebooted to clear a system failure that occurred overnight.
The Mass Storage System is being rebooted to clear a system failure that occurred overnight.
The Mass Data Store will be down for most of the day to have its Operating System upgraded to Solaris 8.
The Sun server went down at 5pm this afternoon. We are investigating the cause.
The store will be powered down again as part of ongoing work to put all storage system components onto the new power grid. Unfortunately the number and type of circuits required to power the Mass Storage System will mean another downtime next week as well.
The Leonard Huxley Machine Room was powered down to work on the power grid. A second downtime will be required to complete this work, either Thursday or Friday.
The Mass Data Store will be down for approximately 1 hour to fix up alignment problems with the robot arm. At the same time two new gigabit ethernet cards will be installed in preparation for the new APAC installation.
The Mass Data Store will be down for approximately 1 hour to upgrade the storage and archive management software (SAMFS) to the latest version.
A configuration error meant that the catalog of tapes was not on line until 8:30p.m.
**** Due to the late arrival of the STK engineers, the downtime will now start at 11am. **** The Mass Data Store will be down for approximately 3 hours to remove the two borrowed Redwood tape drives and the last of the old Timberline linear tape drives.
The Mass Data Store will be down for approximately 3 hours to remove the two borrowed Redwood tape drives and the last of the old Timberline linear tape drives.
All ANUSF machinery was powered off in compliance with ITS Y2K policy.
The Mass Storage System was rebooted to fix a problem with exported filesystems.
The Mass Storage System is down to replace a poorly performing SCSI card.
The Mass Storage System is down to fix a robot hand and to rebuild the arrays with better tuning parameters.
A slight problem was experienced on the reconfigure. The downtime has been extended to 10pm.
The Mass Storage System was rebooted to re-install an FDDI card and attach it properly to the network.
The Mass Storage System was returned to service. Some new problems appeared in the communication path between the data server (store) and robot server (acsls). The problems were mainly due to a communication change in the new acsls software that was needed to run the new 9840 tape drives.
The Mass Storage System will be down all day to install the second phase of the planned upgrades. In this downtime the robotics will be upgraded to a STK PowderHorn, and 8 new 9840 drives will be installed. The Silverton drives will also be removed. Data on the older linear technology will be transparently migrated to the new tapes.
The SUN SparcCenter 2000 data server will be replaced by a new SUN 6500E server with an increased disk cache. The installation of the new server and disk cache is part of a two stage upgrade to increase the performance and capacity of the Mass Data Storage System.
Maintenance was performed on the robotics and tape drives. The system was unavailable during this period.
The FDDI interface died last night at 1am. The system was rebooted to restore network connectivity at 7am.
Store is down for airconditioning work in the machine room.
A small fix was applied to SAM-FS. Some routine maintenance was also performed.
Entries before this date are no longer kept on-line