Summary
On 15 May 2018 at 15:15 PDT, SEDNA experienced an outage on our North American cluster that prevented inbound and outbound messages from being delivered. No emails were lost. All messages sent to your mailboxes during the outage have been queued and will be delivered. You will see all new messages as they arrive.
We understand the seriousness of our responsibility to your communications. We are sorry for the disruption to your business caused by today’s outage. We review these types of incidents in depth, reflect on our systems, and adapt them in order to constantly improve our reliability.
Why this happened
An application code deployment included a database change. The change had previously run successfully in a test environment, but that environment holds far less data than production, too little to trigger this kind of failure. In production, the database change did not complete in a timely manner, which caused a backlog of other database changes that were waiting their turn. Tasks such as sending and receiving email were stuck waiting for those database changes to complete.
Actions and opportunities for improvement
Changes involving the database are particularly difficult to recover from and require exceptional care. The biggest opportunity for improvement in this particular case is to perform a trial run of database changes on a database as similar to production as possible. This has now been established as policy. We also have action items to speed up monitoring and the alerting of our technical team, in support of our response procedures.
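For technically minded readers, one way a trial-run policy like this could be enforced is with a size check before the rehearsal counts as valid. The sketch below is illustrative only and is not SEDNA's actual tooling: the function name, the table names, the row counts, and the 80% threshold are all assumptions made for the example.

```python
from typing import Dict, List


def undersized_tables(
    production_rows: Dict[str, int],
    rehearsal_rows: Dict[str, int],
    min_ratio: float = 0.8,  # assumed threshold, not a stated policy value
) -> List[str]:
    """Return tables whose rehearsal copy is too small to be representative."""
    too_small = []
    for table, prod_count in sorted(production_rows.items()):
        # A table missing from the rehearsal database counts as empty.
        rehearsal_count = rehearsal_rows.get(table, 0)
        # An empty production table cannot be under-represented.
        if prod_count > 0 and rehearsal_count < min_ratio * prod_count:
            too_small.append(table)
    return too_small


# Hypothetical row counts, for illustration only.
production = {"messages": 50_000_000, "mailboxes": 12_000}
rehearsal = {"messages": 40_000, "mailboxes": 11_500}

# Lists the tables that would make a trial run unrepresentative.
print(undersized_tables(production, rehearsal))
```

A check along these lines would flag a test environment that has far less data than production before a database change is rehearsed on it, which is the gap described under "Why this happened".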