The SAM-QFS cluster filesystem, which provides HSM storage for the DC cluster, was unavailable. Two independent hardware RAID controllers on two different 6140 arrays failed. Solaris clients were patched to the latest level, which included multipathing support fixes.
System availability was limited by problems with client Fibre Channel failover support under the new firmware. Full service was finally restored after careful checking of the filesystem and after Sun support resolved the issues.
The DC cluster was scheduled for a downtime to move to another area of the machine room. During this downtime the RAID firmware on the 6140 will also be upgraded.
The DC cluster suffered some problems due to an NFS server issue. Nodes with NFS mounts were affected while nodes with QFS mounts or ESX volumes continued without problems.
The DC cluster will be unavailable from 10am-12pm as underlying services associated with the data cluster (dc) will be migrated onto newer hardware. This in turn will affect MDSS and NFS shares. Web services (Apache, Ruby on Rails, etc.) originating from dc2.apac.edu.au will also be affected, as a faulty hardware component in this host needs replacement. Affected DC file systems are home, projects, short, vizlab, and all the massdata directories. Users on the compute systems, i.e. AC and LC, will not be able to use the mdss command to access the MDSS.
The Mass Data Storage System (dc cluster and store) will be down from 9am-11am to install 3 new 16 TB disk trays and update the firmware on all of the existing disk trays.
The partition tables on the home, opt and vizlab QFS file systems disappeared, and those file systems were offline for around 5 hours whilst we debugged and fixed the ensuing problems. The probable cause was either an ESX server install or the reinstall of Solaris on dcmds0 during that time.
One of the metadata servers crashed with a kernel panic while it was the current metadata server. It rebooted automatically after the crash. There was a further reboot the following day to fix some lingering problems.
Store will be rebooted this morning (Friday 28th March) at 10am to apply the latest Daylight Savings changes. Daylight savings in NSW and ACT now ends the first Sunday in April, rather than the last Sunday in March.
The tape drives have been offline for the afternoon due to a hardware failure in the ACSLS tape management system. The system has been replaced with a new system, which is currently auditing the tape library. Tape movements will gradually increase in responsiveness once the audit is completed.
On Monday morning the DC Data Storage cluster was down from 11:30am-1:30pm to replace a broken CPU in the metadata server. Unfortunately, the CPU replacement did not fix the problem. There will probably be another downtime in the near future to replace the motherboard.
On Tuesday morning the Mass Data Storage System (store) will be down from 11:30am-1:30pm to install a new T10000 tape drive. The T10000 tape drive has a larger capacity (500GB per tape) and faster throughput (120MB/s), compared to our current 9940B tape drives (200GB, and 70MB/s).
At 11pm a severe hail storm caused the power to fail in the machine room. Recovery commenced 8am 28/02/2007. File system integrity checking took until around 10pm to complete.
On Monday morning the Mass Data Storage System (store) will be down from 10am-12pm for system maintenance. During this downtime the disk and tape fibre fabric is being reorganised for the new data cluster system.
APAC-NF Downtime: Saturday May 20th, 8am-6pm. The ANU machine room will be powered off at 8am May 20th to allow the replacement of the UPS system. All APAC-NF services (AC, LC, MDSS, web services, etc) will be unavailable for the day. The work is scheduled to be completed by 6pm but may be earlier. Jobs that will not complete before the downtime are not being started.
All MDSS services will be unavailable on Friday 19th May from 9:30am to 12:30pm for preventative maintenance on the Fibre Channel switches that connect the MDSS to its disk array and tape drives. The actual period of the downtime may be shorter if there are no unforeseen problems.
The MDSS was rebooted at 4:25 this afternoon (Friday 21 April 2006) to fix up a hardware problem with one of the tape drives. It was down for around 10 minutes.
Starting at 10:00am the MDSS will be down to update the firmware on the fibre channel disk arrays. We expect it to be down for around three hours. All access to the MDSS will be closed during this time. Returned to service at 15:35.
8 March, 7:30pm - Both of the tape robot's hands went off line, causing all tape mounts to fail. Disk operations are working normally. 9 March, pm - It is expected that one of the hands will be replaced by 5pm this afternoon, enabling tape mounts to resume. 10 March, am - The second hand will be replaced in the morning, disabling tape mounts for approximately 1 hour. 10 March 5pm - The robot is on-line again and is now allowing staging from tape.
1600 - Loss of power to the machine room caused all of lc, ac and the MDSS to go offline. The systems are being tested before full return to service.
Starting at 10:00am the MDSS will be down to update the firmware on all fibre channel devices (tape drives, switches and disk arrays). A new tape robot control station will also be installed. We expect it to be down for around four hours. The copyqs on the ac and lc will be closed during this time.
Store will be down to allow work on machine room power supplies.
Store is down due to a robot software failure.
Store is down for an emergency hardware repair to the robot. Data resident in the disk cache is accessible, but offline data is not available until the mechanism has been repaired.
Store was down for an emergency hardware repair to the robot. Data resident in the disk cache was accessible, but offline data was not available until the mechanism had been repaired.
Store was rebooted to clear some rogue system processes.
Starting at 11am, the store was down for approximately 5 hours to
- remove a 9940B drive which is being placed in our second silo and reconfigure the robot server ACSLS;
- configure persistent binding on the fibre fabric; and
- apply a patch to SAM-FS.
The persistent binding implementation on the QLogic cards caused some grief, so the initially planned downtime of 2 hours was extended to resolve the issues.
Starting at 2:15pm the MDSS will be down to install a new fibre driver and to configure some fibre switches. We expect the machine to be down for just over an hour. The copyqs on the sc and lc will be closed during this time.
Starting at 6:30am the machine room went down for a power upgrade. During the downtime we also upgraded SAM-FS to version 4.2.
Store will be down from 3pm to install several operating system patches. We expect that the machine will be down for 1 hour.
Store will be down from 10 to 11 a.m. Monday the 1st of March 2004. The downtime is to replace some hardware associated with the problem from 29th of January (the centerplane).
Store will be down from 2 to 3 p.m. Tuesday the 17th of February 2004. The downtime is for rearrangements to maximise performance.
Store has been down several times over the last few days due to crashes and attempts to fix the problem causing the crashes. We have implemented a work-around suggested by Sun and hope that Store is now stable.
Today we were seeing some intermittent hardware problems with the new Sun server. It was investigated by a hardware engineer and we are monitoring for further problems. We may close the sc and lc copyqs if reboots are necessary.
Scheduled downtime to upgrade system hardware and reorganize the SAMFS disk cache. Store is being upgraded from its current 6x336MHz cpu 6GB Sun E6500 to a new 4x900MHz cpu 8GB Sun Fire V480R. An extra 1.5 terabytes of disk cache is also being added into SAMFS. Back in service at 8pm on the 21st.
Scheduled downtime to relocate network switches and fibre terminations.
The store was down to update some references to the tape drives.
Update: The downtime was extended to 10pm to allow time for final testing and improving the performance. The Mass Data Storage System will have a downtime on Thursday from 9am to 3pm (coinciding with the SC downtime). During this downtime the new StorageTek D280 3 Terabyte disk array will be installed. This will replace the old Sun A3500 300 GB disk array.
This new disk array will become the disk cache for all the SAMFS file systems on the Mass Data Storage System. All the existing SAMFS file systems will be copied on to the new disk cache, as well as being reorganized to more effectively use the resulting extra space.
A problem with the network connection to store.anu.edu.au has been resolved after some teething problems with network routes. People contacting store.apac.edu.au, or connecting from the SC, were not affected.
The copyq on the sc was started again this morning after yesterday's issues with store were resolved. The problem was linked to a system monitoring program called BigBrother that was recently installed on the system. The sam05 filesystem used by the macho and stromlo projects will continue to be busy while we continue migration onto the 9940B tape drives.
Store is being rebooted to attach a replacement tape drive.
Store was rebooted to clear a problem with a hung tape drive.
The Mass Data Store crashed last night with disk errors on the RAID array. We are awaiting parts from Sun and expect them to arrive early tomorrow morning (20/2/2003). Some data loss may have occurred, but the extent is unknown at this time. Apologies for the interruption to the service.
The Mass Data Store crashed last night and early this morning. Apologies for the interruption to the service.
The Mass Data Store will be down for around half an hour to install a modified version of the SAMFS file system software.
An error in the tape system caused the tape drives to be off-line last night.
A permissions problem with the macho data catalog has been resolved.
Store is back in production service. The problems were finally resolved after we diagnosed a conflict between 1Gbit and 2Gbit JNI fibre cards driving the 9940B tape drives. The 1 Gbit cards have been replaced with another 2Gbit card.
Store is down again to investigate a new instance of problems on the filesystems. We are diagnosing the problems, but expect the system will not be back in service before the start of the week.
Store has returned to service for APAC-NF users (12:30pm). We apologise for this unplanned and long downtime.
Store is down. We are working on bringing it back to service as soon as possible.
There are some hardware problems on store at the moment. Work is in progress, although we don't yet have an estimated end time.
Unfortunately the power to the sc and store will be cut for most of Sunday December 15 to allow electrical work to be done in a neighbouring building. There will therefore have to be a downtime from around 7am. We expect that the power will be returned to the building in the late afternoon/evening and the system will then be brought back into service.
The NF web server and networking will be available using the UPS battery system and generator. Any updates to the downtime will be posted on this web site.
This downtime was to allow patches to the operating system to come into effect, a network card to be installed, and changes in network topology.
Two disks failed in the Mass Data Store's disk array. They were replaced and the filesystems restored from backup. We also took the opportunity to reconfigure the tape devices for increased reliability.
There will be a network outage due to an operating system upgrade of the ACT Regional Network Organisation router.
The Mass Data Store will be down to add in 3 new high capacity linear tape drives.
The Mass Data Store will be down Wednesday 15 May from 9am to noon to prepare the system for more tape drives. This will affect /massdata mounts on the APAC-NF SC.
Update
We got underway a little late this morning. The system will be up at ~2:30pm.
The mass storage system (/massdata) will be offline to apply a SAM-FS patch.
The Store HSM disk caches (/massdata) were unavailable due to a hardware failure on the controllers of the RAID5 disk array. The service was returned at 4:30pm.
The Mass Storage System will be taken down to upgrade all the SCSI cards in the system. This will hopefully fix the hardware problems we have been having over the past few weeks.
The Mass Storage System is experiencing a hardware problem.
The Mass Storage System had a hardware failure last night. This problem was isolated on Saturday morning (3/11), and the service was returned to production.
The Mass Storage System is being rebooted to clear a system failure that occurred overnight.
The Mass Storage System is being rebooted to clear a system failure that occurred overnight.
The Mass Data Store will be down for most of the day to have its Operating System upgraded to Solaris 8.
The Sun server went down at 5pm this afternoon. We are investigating the cause.
The store will be powered down again as part of ongoing work to put all storage system components onto the new power grid. Unfortunately the number and type of circuits required to power the Mass Storage System will mean another downtime next week as well.
The Leonard Huxley Machine Room was powered down to work on the power grid. A second downtime will be required to complete this work, either Thursday or Friday.
The Mass Data Store will be down for approximately 1 hour to fix up alignment problems with the robot arm. At the same time two new gigabit ethernet cards will be installed in preparation for the new APAC installation.
The Mass Data Store will be down for approximately 1 hour to upgrade the storage and archive management software (SAMFS) to the latest version.
A configuration error meant that the catalog of tapes was not on line until 8:30 p.m.
**** Due to the late arrival of the STK engineers, the downtime will now start at 11am. **** The Mass Data Store will be down for approximately 3 hours to remove the two borrowed Redwood tape drives and the last of the old Timberline linear tape drives.
The Mass Data Store will be down for approximately 3 hours to remove the two borrowed Redwood tape drives and the last of the old Timberline linear tape drives.
All ANUSF machinery was powered off in compliance with ITS Y2K policy.
The Mass Storage System was rebooted to fix a problem with exported filesystems.
The Mass Storage System is down to replace a poorly performing SCSI card.
The Mass Storage System is down to fix a robot hand and to rebuild the arrays with better tuning parameters.
A slight problem was experienced during the reconfiguration. The downtime has been extended to 10pm.
The Mass Storage System was rebooted to re-install an FDDI card and attach it properly to the network.
The Mass Storage System was returned to service. Some new problems appeared in the communication path between the data server (store) and robot server (acsls). The problems were mainly due to a communication change in the new acsls software that was needed to run the new 9840 tape drives.
The Mass Storage System will be down all day to install the second phase of the planned upgrades. In this downtime the robotics will be upgraded to an STK PowderHorn, and 8 new 9840 drives will be installed. The Silverton drives will also be removed. Data on the older linear technology will be transparently migrated to the new tapes.
The SUN SparcCenter 2000 data server will be replaced by a new SUN 6500E server with an increased disk cache. The installation of the new server and disk cache is part of a two stage upgrade to increase the performance and capacity of the Mass Data Storage System.
Maintenance was performed on the robotics and tape drives. The system was unavailable during this period.
The FDDI interface died last night at 1am. The system was rebooted to restore network connectivity at 7am.
Store is down for air-conditioning work in the machine room.
A small fix was applied to SAM-FS. Some routine maintenance was also performed.
Entries before this date are no longer kept on-line