Incident report: GOV.UK Notify

This post outlines a recent production issue on GOV.UK Notify and how it was resolved.

What happened

On 5 August at 10:38am we deployed a database schema update to the GOV.UK Notify production environment. As a result, the application failed to send some notifications while the deployment was in progress.

How users were affected

Some attempts to send notifications between 10:38am and 10:46am (by which time the deployment was complete) were unsuccessful. API users would still have seen a success response, even though the error prevented the system from storing those notifications.

How we responded

Once the deployment had finished and all instances of the application were running the new code, the issue resolved itself.

We checked the logs for unsent notifications (there were 12 in total) and told the relevant service teams.

Why it occurred

GOV.UK Notify runs on Amazon Web Services (AWS) across several availability zones. There’s a single instance of each of the 4 server types in every availability zone. During a code deployment we update servers one at a time to avoid any downtime.

When a deployment needs a database change, the first server will make the change and from that point all instances refer to the new database schema. In this particular case the schema change involved removing a column.

Once the first server had made the change, the other servers, which accessed the same schema but had not yet been updated, were running application code that no longer matched the database schema. This mismatch then caused errors when notifications were being sent.
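
The mismatch can be reproduced in miniature. The sketch below is only an illustration: SQLite stands in for the real database, and the table, the column names and the function names are all invented for the example rather than taken from Notify.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications (id INTEGER PRIMARY KEY, recipient TEXT, legacy_ref TEXT)"
)


def old_code_save(db, recipient):
    """Old application code: still writes to the column that is about to be removed."""
    db.execute(
        "INSERT INTO notifications (recipient, legacy_ref) VALUES (?, ?)",
        (recipient, "ref-1"),
    )


def migration_remove_column(db):
    """Migration applied by the first updated server: rebuild the table without legacy_ref."""
    db.execute("CREATE TABLE notifications_new (id INTEGER PRIMARY KEY, recipient TEXT)")
    db.execute(
        "INSERT INTO notifications_new (id, recipient) SELECT id, recipient FROM notifications"
    )
    db.execute("DROP TABLE notifications")
    db.execute("ALTER TABLE notifications_new RENAME TO notifications")


old_code_save(conn, "before@example.com")  # works: code and schema agree
migration_remove_column(conn)

try:
    # A server that has not been updated yet still runs the old code.
    old_code_save(conn, "during@example.com")
    mismatch_error = None
except sqlite3.OperationalError as exc:
    mismatch_error = str(exc)

print(mismatch_error)
```

The second call fails because the old code names a column that the shared schema no longer has, which is the situation every not-yet-updated server was in during the deployment.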

API calls, however, would have succeeded because they were queued up to be processed. When the queue was processed and the application attempted to save the data to the database, the notification failed. By then the user had already received a success response.
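
A rough sketch of that sequence, assuming a simple in-memory queue in place of the real task queue. The function names and the `schema_matches_code` flag are hypothetical and exist only to simulate the failure.

```python
from collections import deque

queue = deque()              # stands in for the real task queue
saved = []                   # stands in for the notifications table
schema_matches_code = False  # simulates the mid-deployment mismatch


def api_send(notification):
    """Accept the request by queueing it; nothing is written to the database yet."""
    queue.append(notification)
    return {"status": "success"}


def save_to_database(notification):
    """Store the notification, failing while code and schema disagree."""
    if not schema_matches_code:
        raise RuntimeError("schema mismatch")
    saved.append(notification)


def process_queue():
    """Worker: try to store each queued notification, collecting the ones that fail."""
    failed = []
    while queue:
        notification = queue.popleft()
        try:
            save_to_database(notification)
        except RuntimeError:
            failed.append(notification)
    return failed


response = api_send({"to": "user@example.com"})  # the caller sees success here
failed = process_queue()                         # the failure only shows up later
```

The success response is produced before the save is attempted, so the caller has no way of learning that the notification was never stored.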

Once the code was deployed to all the servers, the issue was fixed.

What we’re doing to prevent this from happening again

We’ve decided that in future, all database migration scripts must be deployable on their own.

To make sure that all schema changes are backwards compatible, code deployments containing a database migration should be split into two steps:

apply the migration script to the database
deploy the new code, which relies on the schema change
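
As a minimal sketch of the two steps, the example below uses SQLite and an additive change (a new nullable column). The table, the `reference` column and the function names are assumptions made for the illustration, not Notify's actual schema or code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notifications (id INTEGER PRIMARY KEY, recipient TEXT)")


def old_code_save(db, recipient):
    """Code running on servers that have not been updated yet."""
    db.execute("INSERT INTO notifications (recipient) VALUES (?)", (recipient,))


def new_code_save(db, recipient, reference):
    """New code, which relies on the column added by the migration."""
    db.execute(
        "INSERT INTO notifications (recipient, reference) VALUES (?, ?)",
        (recipient, reference),
    )


# Step 1: apply the migration on its own. The new column is nullable,
# so servers still running the old code keep working.
conn.execute("ALTER TABLE notifications ADD COLUMN reference TEXT")
old_code_save(conn, "old@example.com")

# Step 2: deploy the new code, which relies on the schema change.
new_code_save(conn, "new@example.com", "ref-42")

rows = conn.execute(
    "SELECT recipient, reference FROM notifications ORDER BY id"
).fetchall()
```

Because the old code still works after step 1, there is no window in which a running server disagrees with the schema.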

We’ll also look at using GitHub hooks to make sure deployments are split into these 2 steps, and we’ll set up automated alerts.
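
One way such a check might work is to reject any change that touches both migration scripts and other files. This is only a sketch: the `migrations/` directory is a hypothetical location, the function name is invented, and the list of changed paths is assumed to be supplied by whatever hook mechanism is used.

```python
MIGRATIONS_DIR = "migrations/"  # hypothetical location of migration scripts


def mixes_migration_and_code(changed_files):
    """Return True if a change set touches both migration scripts and other files."""
    has_migration = any(path.startswith(MIGRATIONS_DIR) for path in changed_files)
    has_other = any(not path.startswith(MIGRATIONS_DIR) for path in changed_files)
    return has_migration and has_other
```

A hook could call this with the paths changed in a pull request and block the merge when it returns True.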

Database errors trigger further attempts (or retries) to send a notification. So we’ve extended the time period of these retries to 25 minutes, which should allow enough time for any errors to be fixed.
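
A time-limited retry of this kind could look roughly like the sketch below. The 25-minute window comes from this post; the 60-second gap between attempts, the function name and the injectable `clock` and `sleep` parameters are assumptions added for the example.

```python
import time

RETRY_WINDOW_SECONDS = 25 * 60  # the 25-minute retry period described above
RETRY_DELAY_SECONDS = 60        # assumed gap between attempts


def save_with_retries(save, notification, clock=time.monotonic, sleep=time.sleep):
    """Call save(notification) until it succeeds or the retry window has passed."""
    deadline = clock() + RETRY_WINDOW_SECONDS
    while True:
        try:
            return save(notification)
        except Exception:
            if clock() + RETRY_DELAY_SECONDS > deadline:
                raise  # out of time: give up and surface the last error
            sleep(RETRY_DELAY_SECONDS)
```

Passing the clock and sleep functions in makes the behaviour easy to test without waiting for real time to pass.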

We’re also thinking about extending the retry period beyond 25 minutes, to give the team time to fix issues without losing any notifications. We already have a retry limit of 4 hours for passing messages to our delivery partners. Additionally, we’ll look at using dead letter queues for tasks that repeatedly fail.
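
The idea behind a dead letter queue is that a task which keeps failing is set aside for inspection instead of being dropped. The sketch below shows the pattern in memory only; the limit of 3 attempts and the function name are assumptions, and a real setup would rely on the queueing system’s own dead letter support.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit; the post does not specify one


def run_tasks(tasks, handler):
    """Process tasks, moving any that fail MAX_ATTEMPTS times to a dead letter queue."""
    pending = deque((task, 0) for task in tasks)
    dead_letters = []
    while pending:
        task, attempts = pending.popleft()
        try:
            handler(task)
        except Exception as exc:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                # Keep the task and the error for later inspection.
                dead_letters.append((task, str(exc)))
            else:
                pending.append((task, attempts))
    return dead_letters
```

Anything returned in the dead letter list can be examined and replayed once the underlying problem has been fixed.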

Finally, we’re considering saving notification requests to the database as soon as we receive them, not after they’ve been queued. This will ensure that a success response from the API means the notification has been saved to our database.
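
A sketch of that ordering, again with SQLite and an in-memory queue standing in for the real components. The table layout, the `created` status value and the function name are hypothetical.

```python
import sqlite3
from collections import deque

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications (id INTEGER PRIMARY KEY, recipient TEXT, status TEXT)"
)
delivery_queue = deque()  # stands in for the real task queue


def api_send(db, recipient):
    """Store the notification first; only report success once the row is committed."""
    try:
        with db:  # commits on success, rolls back on error
            cursor = db.execute(
                "INSERT INTO notifications (recipient, status) VALUES (?, 'created')",
                (recipient,),
            )
    except sqlite3.Error:
        return {"status": "error"}
    # Queue the saved notification for delivery only after the commit.
    delivery_queue.append(cursor.lastrowid)
    return {"status": "success", "id": cursor.lastrowid}
```

With this ordering a database error is reported to the caller straight away, at the cost of a database write on every API call.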

Getting better at fixing this type of problem

We’ve improved logging, particularly how we log retries. And we’re also working on our monitoring and alerting processes.

We’re looking at allowing retries to take place over a longer period of time to make sure we have enough time to respond to errors.

We’ll save the notification requests earlier during an API call. And we’ll check this update to see how it affects API performance.

Detailed timeline for developers

5 August at 10:38am
Build commences. First server starts to update and the database update is applied.

5 August at 10:46am
Build complete. All servers fully functional.

Follow Martyn on Twitter and don't forget to sign up for email alerts.
