Incident Report: GOV.UK Platform as a Service

This post outlines a recent issue with the GOV.UK Platform as a Service (PaaS), and how it was resolved.

What happened

On 15 December 2016, Amazon Web Services (AWS) experienced a DNS issue that caused problems for the PaaS and the applications running on it.

How users of the platform were affected

Applications deployed on the PaaS suffered intermittent failures when resolving DNS queries. These failures may have resulted in services not being able to connect to databases, backend services and APIs.

It’s likely that any tenants (or service teams) managing applications would have experienced problems. But given the time of the outage, it's probable that the problems only affected automated scheduled tasks.

How end-users were affected

Some users would have seen intermittent slowness and errors if they were accessing applications deployed on the PaaS between 1am and 3.17am.

Why it occurred

The underlying cause was a problem with EC2 DNS resolvers in eu-west-1.

To work correctly, many components and operations on the PaaS need DNS resolution to be fully functioning.

How we responded

At the time we weren't providing 24/7 support for the PaaS, but we will be soon, in line with the needs of the services adopting the platform.

When we started work on 15 December 2016, we noticed that some monitoring alerts had been issued. We investigated the incident and sent a notification email to users.

We'll prevent this from happening again

Amazon have plans to prevent similar problems reoccurring in their DNS resolution services. These plans are being worked on now.

We’ll improve how we respond

We’ll stop emails being sent from our monitoring system if disk monitoring data is not being received. Large numbers of emails may make it difficult to spot any other notifications.

We’ll make it easier and quicker to notify tenants about outages.

Finally, we’ll soon be implementing alert notifications for out-of-hours support. We will use this incident as an example to shape those alerts.

We considered setting up a local DNS resolver cache, rather than connecting to EC2 resolvers directly. We rejected this option, because we don’t believe it would have prevented the problems; there would have been failures for uncached records.
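The reasoning about uncached records can be illustrated with a minimal sketch. This is not our production setup: `FakeUpstream`, `CachingResolver`, the hostnames and the addresses are all hypothetical, and the upstream is a stand-in for the EC2 resolver that can be switched into a failing state. It shows that a cache only protects names that were looked up before the outage and have not yet expired.

```python
import time
from typing import Callable, Dict, Tuple


class ResolutionError(Exception):
    """Raised when the upstream resolver cannot answer a query."""


class FakeUpstream:
    """Hypothetical stand-in for the EC2 resolver; can be switched to failing."""

    def __init__(self, records: Dict[str, str]) -> None:
        self.records = records
        self.healthy = True

    def lookup(self, name: str) -> str:
        if not self.healthy or name not in self.records:
            raise ResolutionError(f"cannot resolve {name}")
        return self.records[name]


class CachingResolver:
    """Answers from a TTL cache and falls back to the upstream on a miss."""

    def __init__(
        self,
        upstream: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        self._clock = clock
        # name -> (address, expiry time)
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: no upstream query needed
        # Miss or expired entry: this call fails if the upstream is down.
        address = self._upstream(name)
        self._cache[name] = (address, now + self._ttl)
        return address


def demo() -> Dict[str, str]:
    """Warm one name, break the upstream, then look up both names."""
    upstream = FakeUpstream(
        {"db.example.internal": "192.0.2.10", "api.example.internal": "192.0.2.20"}
    )
    resolver = CachingResolver(upstream.lookup)
    resolver.resolve("db.example.internal")  # cached before the outage
    upstream.healthy = False  # simulated resolver outage
    results = {}
    for name in ("db.example.internal", "api.example.internal"):
        try:
            results[name] = resolver.resolve(name)
        except ResolutionError:
            results[name] = "FAILED"
    return results
```

In this sketch, `demo()` returns the cached address for `db.example.internal` and `"FAILED"` for `api.example.internal`, which was never cached. Once the cached entry passes its TTL it fails in the same way, so a long outage would eventually affect cached names too.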

We also discussed adding more resolvers, such as Google or OpenDNS. These are free, open resolvers aimed at end users. They implement measures such as rate limiting, and as a team we think using them might introduce new issues.

We may revisit these ideas if there are any further incidents relating to DNS resolution.

Detailed timeline for developers

First DNS resolution errors logged on the VMs cell/4, cell/5, cell/8, consul/2 and etcd/2. All of these VMs are in zone 3 of AWS for the production account.

These were mainly errors resolving Datadog monitoring, repeating every 10 seconds. The Datadog agent was not able to send metrics.

First error detected by our Pingdom monitoring of the DB endpoint of our healthcheck app, with 3 minutes of downtime. This endpoint checks if the application can connect to a backing database.

Followed by 5 failure events between 01:24 and 02:18, each lasting from 1 to 6 minutes.

One smoke test CF API operation failed due to DNS resolution of UAA from VM api/1. 6 more errors happened at 01:37, 01:42, 02:11, 02:21, 02:31 and 02:46.

First monitor triggered in Datadog due to "no-data" disk monitors from the previous VMs.

Followed by 68 Alert/Recover events from 6 hosts (cell/4, cell/5, cell/8, consul/2, elasticsearch_master/2, etcd/2) until 02:58.

AWS officially acknowledges DNS resolution issues on some instances on their status page: "We are investigating DNS resolution issues from some instances in the EU-WEST-1 Region."

Status updated as still being investigated at 02:29, 03:00 and 03:07. Reported as resolved at 03:17.

Last Pingdom failure. Recovered after 6 minutes at 02:18:24.

Rate of DNS resolution errors logged on the VMs drops.

Last DNS resolution error logged in the VMs

Last "no-data" monitor recovered in Datadog

AWS reports the issue as completely recovered

A PaaS team member noticed the warnings and checked monitoring to confirm that the issue was no longer having an impact.

Notification about the issue sent to users of the platform
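For developers, the kind of database check described in the timeline can be sketched as follows. This is an illustration only, not the code of our healthcheck app: the function name `check_db_endpoint`, its parameters and the returned dictionary are assumptions. The sketch separates the DNS lookup from the TCP connection, so a failure like the one in this incident is reported as a name resolution problem rather than a database problem.

```python
import socket
from typing import Any, Callable, Dict


def check_db_endpoint(
    host: str,
    port: int,
    timeout: float = 2.0,
    resolve: Callable[..., Any] = socket.getaddrinfo,
) -> Dict[str, Any]:
    """Report whether a backing database is reachable, and where it failed.

    `resolve` is injectable so that a DNS failure can be simulated in tests.
    """
    # Stage 1: name resolution. This is the step that failed during the incident.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return {"ok": False, "stage": "dns", "error": str(exc)}

    # Stage 2: TCP connection to the first resolved address.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        return {"ok": False, "stage": "connect", "error": str(exc)}

    return {"ok": True, "stage": "connected", "error": None}
```

A check like this only proves that the port accepts connections. A fuller version would probably also authenticate and run a trivial query, which is left out here to keep the sketch free of any database driver.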