Incident Report Number: 2018-001

UAlberta Login

Ticket Number: INC0113811

What happened?

UAlberta Login, the service clients use to log in to some university services, experienced an outage.

Who was affected?

All Campus Computing ID (CCID) account holders trying to log in to some university services, such as Google and PeopleSoft, were affected by this outage. Clients already logged in to these services were not affected.

What was the impact?

The affected clients were presented with an error message when trying to log in using their CCIDs.

What was the timeline of the incident?

Start: 2018/09/11 23:30 – Monitoring systems began alerting that the size of a database used by the UAlberta Login service had reached a threshold, but the UAlberta Login service remained available. IST support analysts began investigating.
2018/09/12 01:00 – A large number of sessions were found in the database, which continued to grow slowly in size.
2018/09/12 01:30 – The largest sessions were analyzed and found to be a result of a bug in the software used by UAlberta Login.
2018/09/12 02:00 – IST support analysts identified the IP addresses of the client computers with the large sessions. Their requests to UAlberta Login showed that these clients were sending authentication requests at an exceedingly high rate and that the requests were not completing properly.
2018/09/12 02:02 – Affected client IPs were blocked on UAlberta Login and the corresponding sessions were removed from the database.
2018/09/12 03:00 – The database was no longer growing at the rapid rate.
2018/09/12 03:15 – The database continued to process a backlog of transactions, freeing up space both within the database and in temp files. The backlog needed several hours of processing time to catch up. No abnormally large sessions remained in the database.
2018/09/12 13:00 – Some clients began reporting that they were receiving error messages when trying to log in using their CCIDs. Clients already logged in to the service were not affected.
2018/09/12 13:05 – IST support analysts began working on the issue. 
2018/09/12 13:10 – The issue identified overnight was found again, this time originating from new client IP addresses. The new IPs were blocked on UAlberta Login.
2018/09/12 13:20 – The database ran out of space due to the heavy volume of new authentication requests. Space was increased and all client sessions were dropped from the database to restore service.
End: 2018/09/12 13:30 – Service was confirmed restored.
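
The 02:00 and 13:10 steps in the timeline above, finding client IPs that send many authentication requests that do not complete, can be sketched as a sliding-window count. This is only an illustration and not the procedure IST used. The record layout (timestamp, client IP, completed flag), the function name, and the threshold values are all assumptions.

```python
from collections import defaultdict, deque


def find_noisy_clients(events, window_seconds=60, max_incomplete=100):
    """Return client IPs with more than max_incomplete incomplete
    authentication requests inside any window of window_seconds.

    events: iterable of (timestamp_seconds, client_ip, completed) tuples,
    sorted by timestamp. This record layout is hypothetical.
    """
    recent = defaultdict(deque)  # ip -> timestamps of recent incomplete requests
    flagged = set()
    for ts, ip, completed in events:
        if completed:
            continue  # only requests that failed to complete are counted
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_incomplete:
            flagged.add(ip)
    return flagged
```

The flagged set would then be a candidate list for review before blocking. Suitable values for the window and threshold depend on normal login traffic, which this report does not describe.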

 

What was the root cause of the incident?

This issue was the result of multiple factors. A number of client computers were sending a significant volume of authentication requests to the UAlberta Login service. A software bug allowed these authentication requests to grow to an abnormal size, and database logs also grew abnormally. Both contributed to the database running out of space.

What was the workaround and resolution for the incident?
Workaround

Client computers sending a large volume of requests were blocked from UAlberta Login. The database size was increased and all active client sessions were removed.



Resolution

The issue was resolved with the workaround listed above.

What are any recommendations to prevent this incident from occurring again?

The database configuration was modified on 2018/09/18 so that log files would not grow so large. Enhanced monitoring and preventive actions around client sessions are currently being investigated.
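
As an illustration of what session monitoring could look like, the sketch below flags abnormally large sessions and estimates how long the database has before it runs out of space at a constant growth rate. It does not describe the monitoring IST is investigating. The function names, the mapping of session IDs to sizes, and the limits are hypothetical.

```python
def oversized_sessions(session_sizes, max_bytes):
    """Return IDs of sessions larger than max_bytes, largest first.

    session_sizes: dict mapping session ID to stored size in bytes
    (a hypothetical layout).
    """
    big = [(size, sid) for sid, size in session_sizes.items() if size > max_bytes]
    return [sid for size, sid in sorted(big, reverse=True)]


def hours_until_full(used_bytes, capacity_bytes, growth_bytes_per_hour):
    """Estimate hours until the database is full at a constant growth rate.

    Returns None when the database is not growing.
    """
    if growth_bytes_per_hour <= 0:
        return None
    return max(capacity_bytes - used_bytes, 0) / growth_bytes_per_hour
```

An alert based on the second function could fire while several hours of headroom remain, rather than only when a fixed size threshold is reached.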

Updates

None.