Incident Report Number: 2017-001

GSB UPS #4 Failure

Ticket Number: INC0081383

What happened?

Uninterruptible Power Supply (UPS) #4, located in the General Services Building (GSB) data centre, experienced a hardware failure. This caused several servers without redundant power supplies to become unavailable. As a result, several services were either unavailable or slow to access.

Who was affected?

Campus Computing ID (CCID), Judas, LDAP (Lightweight Directory Access Protocol), MX mail and Samba users were potentially affected by this outage.

What was the impact?

Affected users were unable to access, or experienced delays with, the following services: CCID, Judas, LDAP, Mail, Pubcookie and Samba.

What was the timeline of the incident?

Start: 2017/05/24 11:32 - Monitoring systems began to alarm on UPS #4 in the GSB data centre.
2017/05/24 11:35 - IST analysts contacted Facilities and Operations to confirm whether there had been any power bumps, spikes or surges in the GSB building. They also inquired about input power to the GSB data centre. No alarms were reported on their end.
2017/05/24 11:38 - IST analysts contacted the vendor, Schneider Electric, and opened a case requesting immediate assistance. Work began to identify the scope of the outage, as several devices were affected.
2017/05/24 11:50 - IST analysts began transferring critical devices to alternate power sources to restore service.
End: 2017/05/24 13:25 - Impacted services were restored.

What was the root cause of the incident?

UPS #4 experienced two separate hardware failures. The UPS and its transformer both failed at the same time, resulting in a total loss of power to connected devices.

What was the workaround and resolution for the incident?

Workaround

The affected servers were moved to alternate power supplies (utility power or alternate UPSs) to restore service.

Resolution

Over the next few weeks, vendor support technicians will perform repair work to return UPS #4 and its transformer to full health.

What are any recommendations to prevent this incident from occurring again?

None, as UPS #4 and its transformer will be overhauled and repaired.

Updates

None.