Incident Report for Thursday Morning Outage (22/12/2016)

By Sam Bailey in Release Notes

Overview

At 12:30am AEDT on the 22nd of December we experienced a significant service disruption. The disruption lasted approximately 6 hours and 30 minutes, until 7:00am AEDT.

At 12:30am our systems deployed an approved update to one of the PHP modules that Schoolbox relies on to function. The update contained an unexpected change that caused the module to fail to install correctly. As a result, all Schoolbox instances became inaccessible: no one could log in, and notifications triggered during this time were not sent to users.

One of our clients, Saint Kentigern, contacted our out-of-hours emergency support line at 5:45am AEDT (7:45am NZDT). Our team began working to identify the cause and brought more team members into the resolution process. By 6:20am AEDT we had identified the issue, and we then began deploying the fix to all servers. The fix used the configuration management system to redeploy the original version of the PHP module.

All servers were back operating correctly as of 7:00am AEDT (9:00am NZDT).
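The times quoted above can be cross-checked with a short sketch. It uses only the timestamps given in this report; the variable and function names are illustrative, and the fixed UTC offsets (+11 for AEDT, +13 for NZDT) are the standard daylight-saving offsets for those zones.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the two daylight-saving zones named in the report.
AEDT = timezone(timedelta(hours=11), "AEDT")
NZDT = timezone(timedelta(hours=13), "NZDT")

# Timestamps taken from the report (22 December 2016, AEDT).
start = datetime(2016, 12, 22, 0, 30, tzinfo=AEDT)      # update deployed
reported = datetime(2016, 12, 22, 5, 45, tzinfo=AEDT)   # client called support
identified = datetime(2016, 12, 22, 6, 20, tzinfo=AEDT) # cause identified
resolved = datetime(2016, 12, 22, 7, 0, tzinfo=AEDT)    # all servers restored

outage = resolved - start            # total disruption
time_to_report = reported - start    # undetected period
time_to_fix = resolved - reported    # from first report to resolution


def as_nzdt(moment):
    """Format an aware datetime as a 12-hour clock time in NZDT."""
    return moment.astimezone(NZDT).strftime("%I:%M%p").lstrip("0").lower()


print("Outage duration:", outage)
print("Time until first report:", time_to_report)
print("Time from report to resolution:", time_to_fix)
print("Reported at (NZDT):", as_nzdt(reported))
print("Resolved at (NZDT):", as_nzdt(resolved))
```

Laid out this way, the disruption ran for about five and a quarter hours before it was first reported, and it was resolved an hour and a quarter after that.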

Key Notes

  • The service disruption began between 12:30am and 1:00am AEDT and ended between 6:30am and 7:00am AEDT.
  • The disruption meant that users were unable to log in to Schoolbox during the affected times.
  • There was no data loss or corruption.
  • There was no change to the configuration of the servers.

Further Actions

We have discussed this incident internally and have formulated the following actions to prevent this kind of error occurring in the future:

  • Audit the configuration management system to ensure no other modules or packages could be affected by the same issue (completed).
  • Investigate additional automated and/or manual checks to run before an update or its dependencies are approved, as this incident was caused by an update being insufficiently checked before approval.
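
As one illustration of what an automated pre-approval check could look like, the sketch below confirms that a set of required PHP modules is still loaded on a test host after a candidate update is installed. It is not a description of the checks we actually run: the module names in REQUIRED_MODULES are hypothetical (this report does not name the affected module), and the function names are invented for the example. The only external behaviour it relies on is that `php -m` lists the loaded modules.

```python
import subprocess

# Hypothetical list of modules the application needs; the report does not
# name the module that failed, so these are placeholders.
REQUIRED_MODULES = {"json", "mbstring"}


def parse_loaded_modules(php_m_output):
    """Parse the text printed by `php -m` into a set of lowercase names."""
    modules = set()
    for line in php_m_output.splitlines():
        name = line.strip()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if not name or name.startswith("["):
            continue
        modules.add(name.lower())
    return modules


def missing_modules(php_m_output, required=REQUIRED_MODULES):
    """Return the required modules that are absent from the `php -m` output."""
    loaded = parse_loaded_modules(php_m_output)
    return sorted(m.lower() for m in required if m.lower() not in loaded)


def check_test_host():
    """Run `php -m` on the current host and report any missing modules."""
    result = subprocess.run(
        ["php", "-m"], capture_output=True, text=True, check=True
    )
    return missing_modules(result.stdout)
```

A check along these lines would be run on a test server after installing the candidate update, with approval withheld if `check_test_host()` returns a non-empty list.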

We sincerely apologise for the disruption to the service during this time. Although this disruption was detected and resolved outside of our customers' primary operating hours (and during the holiday period), we appreciate that this may not have been the case if the issue had occurred at a different time of day.

Our team strives to provide the best service availability possible, and we will continue to make ongoing improvements across our systems to achieve this goal.

Schoolbox Support and Operations Team
