Incident review: Service outage on 25 October 2020

Last edited Jun 2024

On Sunday 25th October we experienced a full service outage, lasting 22 minutes from 07:32 to 07:54 GMT. During this period the GoCardless API, dashboard and payment pages were unavailable, as were operations performed via partner integrations.

This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.

As a payments company, with a goal of providing consistently high levels of availability, we take incidents like this extremely seriously. As such, we're providing the following review as an insight into the technical issues and to demonstrate the action taken to prevent similar occurrences.

Secrets management: An overview

Similar to most online applications, our systems need access to a number of 'secret' values in order to securely communicate with other systems; for example, credentials to connect to a database, or an API key to access a third-party service.

At GoCardless we aim to keep velocity high by ensuring that this secret data can be managed by the engineers building these services, rather than relying on an operational team to make changes. To this end, we've implemented a system based on Hashicorp's Vault product.

We run an instance of Vault that all engineers can authenticate with using their Google identity. Engineers have permission to write secrets into Vault, but not to read secret data back out. We additionally have policies configured in Vault that ensure our applications, authenticated via their Kubernetes service account token, can read secret data belonging to that application only.
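As a rough illustration of the two kinds of policy described above, the sketch below renders Vault policy documents as strings. The mount name "secret", the KV version 2 style "data/" path layout and the per-application path prefix are assumptions made for the example, not details of the GoCardless configuration.

```python
def engineer_policy(mount: str = "secret") -> str:
    """Policy letting engineers write secrets without reading them back.

    The mount name and path layout are hypothetical.
    """
    return (
        f'path "{mount}/data/*" {{\n'
        '  capabilities = ["create", "update"]\n'
        "}\n"
    )


def application_policy(app: str, mount: str = "secret") -> str:
    """Policy letting one application read only its own secrets."""
    return (
        f'path "{mount}/data/{app}/*" {{\n'
        '  capabilities = ["read"]\n'
        "}\n"
    )


if __name__ == "__main__":
    print(engineer_policy())
    print(application_policy("payments"))  # "payments" is a made-up app name
```

The point of the split is that the write-only policy never lists the read capability, while the application policy is scoped to a single path prefix.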

The end result of this, combined with running Vault as a highly-available cluster and using Google Cloud Storage as a secrets engine backend, is a system that we can expose to our applications as a secure store that complies with the principle of least privilege.

What went wrong?

The root cause of this outage was ultimately very simple: an expired TLS certificate. This is the certificate that Vault was serving to secure its API and user interface, ensuring that nobody can impersonate our instance of Vault.

In actuality, the certificate had been renewed, but Vault didn't know about this.

We run Vault within a Kubernetes cluster and make use of the excellent cert-manager controller to manage these certificates for us. This allows us to have short-lived certificates, issued by Let's Encrypt, for each of our individual services including Vault, without the financial or operational cost of purchasing and rotating TLS certificates regularly. This is also arguably more secure than provisioning long-lived certificates that are shared across multiple services through the use of wildcards or Subject Alternative Names (SANs).

When cert-manager renews the certificate, as it did in this instance, the new certificate data, which is mounted into the running container, is updated automatically. However, as we discovered, Vault does not dynamically detect this change and must be manually restarted, or reloaded via a SIGHUP signal, in order to load the new certificate. We didn't have automation in place for this, so our Vault server continued to serve the old certificate after it had been renewed, until that old certificate finally expired.
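The kind of automation that was missing can be sketched in a few lines: watch the mounted certificate file and send SIGHUP to the server process when its contents change. This is a minimal sketch only. The class name, the polling approach and the way the process ID is obtained are all assumptions for illustration, and it is not a description of how GoCardless runs Vault.

```python
import hashlib
import os
import signal
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return a SHA-256 digest of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


class CertReloader:
    """Hypothetical helper: signal a process when a certificate file changes."""

    def __init__(self, cert_path, pid, send_signal=os.kill):
        self.cert_path = Path(cert_path)
        self.pid = pid
        self.send_signal = send_signal  # injectable so it can be faked in tests
        self.last_seen = fingerprint(self.cert_path)

    def check(self) -> bool:
        """Send SIGHUP if the certificate changed since the last check."""
        current = fingerprint(self.cert_path)
        if current == self.last_seen:
            return False
        self.send_signal(self.pid, signal.SIGHUP)
        self.last_seen = current
        return True
```

In use, check() would be called on a timer or from a file-watch callback. Comparing content digests rather than modification times avoids reloading when the file is rewritten with identical data.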

The Vault instance becoming unavailable is certainly bad, but much worse than this is the near-immediate failure of all applications that depended upon it. In infrastructure engineering we try to limit the single points of failure and the 'blast radius' of any given system failure whenever possible. This is an example where insufficient testing of these kinds of failure scenarios led to us re-learning this lesson the hard way.

In order to pull secrets from Vault and inject them into our application, we wrote a program called theatre-envconsul which authenticates against Vault and then executes the official envconsul tool.

Within this tool we'd previously made a choice to run envconsul with the -once flag, causing it to execute its main loop only once, and not run as a daemon. Without this flag, envconsul would restart the process if there was any change in Vault data, which is undesirable if the application in question is serving traffic – this would cause us to serve errors until the process became healthy again.

In addition, making this change would cause our workloads to fail fast whenever Vault was unavailable. This would block new deployments from rolling out while Vault was down, but this seemed a worthwhile tradeoff for more predictable pod behaviour.

What we did not realise was that, even in non-daemon mode, the envconsul program would still periodically renew its leases with Vault. And if this failed, then envconsul would return an error and exit – taking the otherwise healthy process it was managing with it.

As we had disabled Vault retries within envconsul to combat other issues that we'd encountered, this caused all our healthy pods serving production traffic to die one-by-one, as their Vault leases expired and envconsul attempted a renewal. This left us with no capacity to serve requests, causing this incident.
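To make the one-by-one failure pattern concrete, here is a toy model of it. It assumes that a pod dies the moment its lease renewal falls inside the Vault outage, with no retries, and it ignores restart attempts, which would also fail while Vault is unreachable. The pod names and renewal times are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    next_renewal: int  # minute at which the pod next renews its Vault lease


def running_pods(pods, vault_down_from, now):
    """Names of pods still running at `now` when a failed renewal is fatal."""
    running = []
    for pod in pods:
        # With retries disabled, one failed renewal kills the managed process.
        renewal_failed = vault_down_from <= pod.next_renewal <= now
        if not renewal_failed:
            running.append(pod.name)
    return running


# Hypothetical pods whose leases come up for renewal at different minutes.
pods = [Pod("api-1", 3), Pod("api-2", 5), Pod("api-3", 9)]

if __name__ == "__main__":
    for minute in (0, 3, 5, 9):
        print(minute, running_pods(pods, vault_down_from=0, now=minute))
```

Because renewal times are spread out, capacity drains gradually rather than all at once, which matches the gap between the certificate expiring and every backend becoming unhealthy.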

Incident timeline and immediate reaction

Timeline

07:22: The Vault certificate expires.

07:25: Containers begin entering a crash-loop, due to envconsul terminating the process and then being unable to establish a new connection with Vault.

07:31: All requests to the GoCardless API begin failing, due to no healthy backends being available.

07:32: Automated monitoring detects the issue and pages the on-call engineer.

07:42: In the belief that the TLS certificate has expired, manual action is taken to force cert-manager to recreate the certificate.

07:48: The new certificate has been generated, and the Vault cluster is manually restarted in order to load this.

07:52: Backend services are returned to a healthy state and begin serving requests.

Recovery notes

One mistake in the response process was to assume that the underlying certificate had expired and to try forcing its recreation. This was incorrect: the certificate had already been renewed, but Vault had not reloaded the new certificate.

The recreation of the certificate, while automated, does take several minutes as it depends upon DNS validation and this can be slow due to record propagation. In hindsight, we could have avoided this step.

Once Vault had restarted, the process of returning services to health was almost entirely hands-free. The self-healing properties of Kubernetes ensured that new containers were spun up, and as soon as these became ready we were back to serving requests again.

The only manual intervention taken here was to explicitly 'delete' some pods that represented critical workloads, in order to short-circuit the 5-minute exponential backoff delay that they had reached. This returned the service to full health a couple of minutes earlier than it otherwise would have been.
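The backoff delay mentioned above can be sketched as follows, assuming the documented kubelet defaults of a 10-second initial delay that doubles after each restart and is capped at 5 minutes. The function name is hypothetical.

```python
def crash_loop_delays(restarts: int, initial: int = 10, cap: int = 300) -> list:
    """Delay in seconds before each restart, doubling up to the cap."""
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


if __name__ == "__main__":
    print(crash_loop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

Under these assumptions a container reaches the 5-minute ceiling on its sixth restart. A pod that has been crash-looping for a while may therefore wait several minutes before its next attempt, whereas a pod that is deleted and recreated starts again from the shortest delay.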

How are we fixing this?

There are several key areas that we've been improving, to protect against this or similar scenarios in the future:

Monitoring: While we already had monitoring in place for the expiration of TLS certificates used by our Layer-7 load balancers, this monitoring did not extend to the simple TCP load balancer that Vault uses. We've now fixed this gap and use the Blackbox exporter to regularly check these too.
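For illustration, the check that this kind of monitoring performs can be sketched with the Python standard library. This is not the Blackbox exporter, only a minimal stand-in under stated assumptions: the function names and the 7-day threshold are made up, and the probe relies on normal certificate verification, so an already expired certificate raises an error during the handshake instead of returning a date.

```python
import socket
import ssl


def seconds_until_expiry(not_after: str, now: float) -> float:
    """Seconds between `now` (epoch seconds) and a certificate's notAfter."""
    return ssl.cert_time_to_seconds(not_after) - now


def expires_soon(not_after: str, now: float, threshold_days: int = 7) -> bool:
    """True if the certificate expires within the (assumed) threshold."""
    return seconds_until_expiry(not_after, now) < threshold_days * 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the served certificate's notAfter field.

    Raises ssl.SSLCertVerificationError if the certificate is already
    expired or otherwise invalid, which a caller should treat as an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

The important property is that the probe inspects the certificate the server is actually serving, not the one stored on disk. In this incident the two differed.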

Prevention: We'd like the Vault server to automatically reload its certificate if it has changed. As this feature is not available, we've deployed the Reloader controller to trigger rolling updates of Vault when the certificate data changes.

Mitigation: Even if we can ensure that Vault will never serve an expired certificate again, there are other failure modes, such as network partitions to Vault, that we don’t want to affect other running services. For this, we are deploying a change to theatre-envconsul that will stop envconsul from managing the child process, while still allowing us to inject secret environment variables at container startup.
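The general shape of this mitigation, fetching secrets once at startup and then handing control to the application so that nothing keeps supervising it, can be sketched as below. This is an illustration of the pattern only, not the actual theatre-envconsul change. The function names are hypothetical, and the secrets are passed in as a plain mapping where a real tool would read them from Vault.

```python
import os
from typing import Mapping, Optional, Sequence


def build_env(base: Mapping[str, str], secrets: Mapping[str, str]) -> dict:
    """Return a new environment with secrets layered over the base values."""
    env = dict(base)
    env.update(secrets)
    return env


def exec_with_secrets(
    argv: Sequence[str],
    secrets: Mapping[str, str],
    base: Optional[Mapping[str, str]] = None,
) -> None:
    """Replace the current process with `argv`, injecting the secrets.

    os.execvpe does not return on success: the application takes over this
    process, so no parent is left behind that could later terminate it.
    """
    env = build_env(os.environ if base is None else base, secrets)
    os.execvpe(argv[0], list(argv), env)
```

The trade-off is that secrets are only as fresh as the last container start, so a changed secret takes effect on the next restart or deployment.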

GoCardless is still growing rapidly, and with this so are the challenges and opportunities in building performant and reliable systems.

If you've found this interesting then check out our careers page, where you'll find openings for Site Reliability Engineers as well as other disciplines!