This post is about an incident in the 'Platform as a Service for government' production environment for hosting applications.
At 1:30pm UTC on Friday 3 June 2016, a program to delete Cloud Foundry (CF) development environments was accidentally run in the production environment. As a result, the platform suffered a complete outage.
How users were affected
At the time of this incident, the applications deployed were for testing and demonstration purposes only.
The production environment was completely down for 46 minutes; CF was inaccessible and applications were deleted.
All users had to re-push their original source code to reinstate their applications. We informed them they could do this after 2 days, 23 hours and 46 minutes. We later discovered that applications that specified a buildpack could have been pushed successfully after the initial 46 minutes of downtime.
How we responded
We began the process of recreating the production environment. This ran for approximately 21 minutes but was only partially successful. The platform was successfully reinstated, but an application used for monitoring the platform failed to deploy.
We investigated this failure but, given there were no live applications using the platform, we decided not to work through the weekend to resolve it. Instead, we informed the users that the platform would be unstable over the weekend.
The following Monday, we resumed our investigation and concluded that the buildpack detection process was broken. This meant that applications that didn’t specify a buildpack couldn’t be deployed. We resolved this by removing a spurious test buildpack from the first slot of the detection list.
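The effect of a stray buildpack at the front of the detection list can be illustrated with a minimal sketch. It assumes detection tries buildpacks in list order and uses the first one that claims the application, and that the spurious buildpack claimed every application; the post does not state the exact failure mode, so treat that as an assumption. The names `Buildpack`, `select_buildpack` and `test-buildpack` are hypothetical and are not part of the CF API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Set


@dataclass(frozen=True)
class Buildpack:
    name: str
    # Given the app's file names, does this buildpack claim the app?
    detect: Callable[[Set[str]], bool]


def select_buildpack(
    app_files: Set[str],
    detection_list: Sequence[Buildpack],
    requested: Optional[str] = None,
) -> Optional[str]:
    """Return the name of the buildpack that would be used for an app."""
    if requested is not None:
        # An explicitly specified buildpack skips detection entirely.
        return requested
    for buildpack in detection_list:
        # First match wins, so list order matters.
        if buildpack.detect(app_files):
            return buildpack.name
    return None


ruby = Buildpack("ruby", lambda files: "Gemfile" in files)
python_bp = Buildpack("python", lambda files: "requirements.txt" in files)
# Assumption for illustration: the stray buildpack claims every app.
spurious = Buildpack("test-buildpack", lambda files: True)

app = {"requirements.txt", "app.py"}
healthy_list = [ruby, python_bp]
broken_list = [spurious] + healthy_list

print(select_buildpack(app, healthy_list))
print(select_buildpack(app, broken_list))
print(select_buildpack(app, broken_list, requested="python"))
```

Under these assumptions, only applications that rely on detection are affected, which is consistent with applications that specified a buildpack still being pushable.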
Once all automated tests were successful, we were confident the problem had been resolved and we told users the platform was available.
Why it occurred
The team uses short-lived development environments, which we create and delete on a regular basis. These are an extremely useful way of testing changes we’re preparing for the platform.
At the time the incident occurred, every environment had a web interface that allowed a team member to delete it. The web interface for deleting production looked identical to the one for deleting a development environment. Members of the team often have several browser tabs open with web interfaces for managing different environments.
How we're preventing this from happening again
We'll stop this incident from happening again by:
- removing the ability to run destructive pipelines in persistent environments
- investigating a way to make the Concourse user interface and command line interface for persistent environments look different from development environments
- not running acceptance tests that modify global state, such as buildpacks
- investigating methods for protecting data in production and testing environments, for example, requiring multi-factor authentication (MFA) when deleting data from buckets
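The first item in this list could take the form of a guard that refuses destructive actions against persistent environments. The following is only a sketch of the idea, not how our pipelines are implemented: the environment names in `PERSISTENT_ENVIRONMENTS` and the functions `check_delete_allowed` and `delete_deployment` are hypothetical.

```python
from typing import Callable

# Hypothetical names for environments that must never be deleted.
PERSISTENT_ENVIRONMENTS = frozenset({"prod", "staging"})


class DestructiveActionBlocked(Exception):
    """Raised when a destructive action targets a persistent environment."""


def check_delete_allowed(environment: str) -> None:
    """Raise if the environment is persistent; otherwise do nothing."""
    if environment.strip().lower() in PERSISTENT_ENVIRONMENTS:
        raise DestructiveActionBlocked(
            f"refusing to delete persistent environment {environment!r}"
        )


def delete_deployment(environment: str, delete_fn: Callable[[str], None]) -> None:
    """Run delete_fn only after the guard has passed."""
    check_delete_allowed(environment)
    delete_fn(environment)


deleted = []
delete_deployment("dev-alice", deleted.append)  # allowed: short-lived environment

try:
    delete_deployment("prod", deleted.append)  # blocked before delete_fn runs
except DestructiveActionBlocked as error:
    print(error)
```

Putting the check in front of the delete step, rather than relying on the interface looking different, means a mistaken click in the wrong browser tab fails safely.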
Improving our ability to resolve this type of problem
We'll change our processes to resolve similar incidents more quickly and efficiently by:
- checking that the frequency of service monitoring alerts from Pingdom is appropriate
- defining the incident response process, for example, who is responsible for which areas, what the escalation points are, and what information we need to capture
- agreeing the communications process and incident details for users and tenants
- prioritising system monitoring for more visibility of component failures and unexpected state
- investigating backup and restore options for persistent data, for example, Simple Storage Service (S3) and Relational Database Service (RDS)
- investigating why the RDS snapshot failed
Detailed timeline for developers
All times are in UTC, unless otherwise stated.
3 June 2016 at 1:30pm
Delete CF pipeline triggered by accident
3 June 2016 at 1:32pm
BOSH task delete deployment CF was triggered from the manually started delete deployment pipeline
3 June 2016 at 1:37pm
Healthcheck.cloudapps.digital check on Pingdom failed
3 June 2016 at 1:42pm
Healthcheck.cloudapps.digital/db check on Pingdom failed
3 June 2016 at 1:46pm
BOSH delete-deployment finished, meaning that all APIs and apps were down
3 June 2016 at 1:46pm
The pipeline failed while trying to create a DB snapshot
3 June 2016 at 2:00pm
Create CF pipeline on prod was triggered manually to force redeployment
3 June 2016 at 2:21pm
Create-bosh-cf pipeline failure during deploy graphite-nozzle task
3 June 2016 at 4:00pm
Decision was made to inform tenants that the platform wouldn't be available during the weekend
3 June 2016 at 4:49pm
Email sent to tenants
6 June 2016 at 8:45am
Investigation resumed after weekend
6 June 2016 at 10:20am
Spurious buildpack spotted in the first detection slot. It was deleted and `cf push` worked again
6 June 2016 at 10:24am
cf-deploy job was re-run to re-deploy graphite-nozzle+healthcheck and run tests
6 June 2016 at 10:37am
Healthcheck.cloudapps.digital/db Pingdom check up
6 June 2016 at 10:37am
Healthcheck.cloudapps.digital Pingdom check up
6 June 2016 at 11:27am
Acceptance tests failed because CATS persistent app had not been restaged
6 June 2016 at 12:15pm
Deleted CATS persistent app and re-ran acceptance tests which re-deployed the app
6 June 2016 at 1:03pm
Acceptance tests passed
6 June 2016 at 1:32pm
Email to tenants to say that service was resumed and that they needed to re-push their apps
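As a cross-check on the timeline, the figure of 2 days, 23 hours and 46 minutes quoted earlier appears to match the gap between the BOSH delete-deployment finishing and the email telling tenants that service had resumed. That pairing of endpoints is an assumption; the post does not state which two events the figure was measured between.

```python
from datetime import datetime, timedelta

# Timestamps taken from the timeline above (UTC).
delete_finished = datetime(2016, 6, 3, 13, 46)   # BOSH delete-deployment finished
resumed_email = datetime(2016, 6, 6, 13, 32)     # email saying service was resumed

gap = resumed_email - delete_finished
print(gap)
```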