Update on service disruption: 27 June, 2019
By Pete Hamilton, Jul 2019
On Thursday 27 June between 12:41 and 13:21 (BST), we experienced a 40-minute service outage.
Our API and Dashboard were unavailable and users were unable to set up a Direct Debit via our payment pages or connect their account to a partner integration through our OAuth flow.
Submissions to the banks to collect payments and pay out collected funds were unaffected.
We’d like to apologise for any inconvenience caused. We know how important reliability is to you and we take incidents like this extremely seriously.
We’ll be completing a more detailed internal review of the incident in the coming weeks but we want to provide more details on exactly what happened and what we’re doing about it.
First, some context
Our dashboard displays a merchant’s balance, letting them know how much money they’ve collected recently and how much they can expect in their next payout.
We compute these balances whenever a merchant logs into their dashboard and cache them. This means we temporarily store them so that if the user reloads the page, we can quickly show them the existing balance, without them having to wait.
We keep these cached balances in a data store called Redis.
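To make the caching idea concrete, here is a minimal sketch of the "compute once, then reuse" pattern described above. It is an illustration only, not the code running in production: the `BalanceCache` class, the `get_or_compute` method and the expiry time (`ttl_seconds`) are all hypothetical, and a plain in-memory dictionary stands in for Redis so the example runs on its own.

```python
import time
from typing import Callable, Dict, Tuple


class BalanceCache:
    """In-memory stand-in for a cache such as Redis (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds  # assumed expiry; the post does not state one
        self._clock = clock
        # merchant_id -> (expiry time, balance in pence)
        self._entries: Dict[str, Tuple[float, int]] = {}

    def get_or_compute(self, merchant_id: str, compute: Callable[[str], int]) -> int:
        now = self._clock()
        entry = self._entries.get(merchant_id)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the slow computation
        balance = compute(merchant_id)  # cache miss or expired: do the slow work
        self._entries[merchant_id] = (now + self._ttl, balance)
        return balance
```

Note that every computed balance adds an entry to the cache, so computing balances for a very large number of merchants in one go grows the cache by the same amount.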
On Thursday at 12:17, as part of an internal reporting exercise, we ran a one-off task to compute balances for all our merchants. As each balance was computed, it was also automatically cached, resulting in a significant increase in the number of cached balances stored in Redis, compared to what we’d usually expect to see.
By 12:40, the machine running Redis had almost used up all its available memory, which caused it to automatically restart.
We had previously put measures in place to ensure that our service can cope if Redis isn’t available. Aside from a slight delay in updating balances, our service should have remained online.
Instead our API, dashboard and payment pages experienced downtime.
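As an illustration of what "coping without the cache" can look like, the sketch below falls back to computing the balance directly whenever the cache cannot be reached. It is a simplified example under stated assumptions, not a description of the actual safeguards: `CacheUnavailable`, `balance_with_fallback` and the `cache_get`, `cache_set` and `compute` callables are all hypothetical names.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the store cannot be reached."""


def balance_with_fallback(
    merchant_id: str,
    cache_get: Callable[[str], Optional[int]],
    cache_set: Callable[[str, int], None],
    compute: Callable[[str], int],
) -> int:
    try:
        cached = cache_get(merchant_id)
        if cached is not None:
            return cached  # cache hit
    except CacheUnavailable:
        # Cache is down: compute directly (slower, but the request succeeds).
        return compute(merchant_id)

    balance = compute(merchant_id)
    try:
        cache_set(merchant_id, balance)
    except CacheUnavailable:
        pass  # failing to store the value must not fail the request
    return balance
```

The design choice here is that the cache is treated as an optimisation: any failure to read from or write to it is absorbed, and the caller still gets a balance.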
How we responded
At 12:42, our monitoring flagged that our API was offline and by 12:43, our on-call engineers had received an alert and begun investigating.
They quickly established that Redis was experiencing issues but it wasn’t immediately obvious what was causing them, or why it had taken the rest of our service offline with it.
After some further investigation, we established that Redis was attempting to bring itself online automatically but that it was taking a long time to do so. Our systems assumed this meant it wasn’t working properly and rebooted it again, which prevented Redis from ever finishing loading.
Observing this, our infrastructure engineers concentrated on trying to break the cycle by updating our monitoring to allow Redis enough time to come back online fully.
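The restart cycle and the way out of it can be sketched as a single decision: a failed health check only triggers a restart once the process has had long enough to start up. This is a hedged illustration of the general technique, not the actual monitoring configuration; the function name `should_restart` and the parameter `startup_grace_seconds` are hypothetical.

```python
def should_restart(
    probe_ok: bool,
    seconds_since_start: float,
    startup_grace_seconds: float,
) -> bool:
    """Decide whether a supervisor should restart a process after a health probe."""
    if probe_ok:
        return False  # healthy: leave it alone
    # A failed probe inside the grace period is treated as "still starting up".
    return seconds_since_start >= startup_grace_seconds
```

For example, with a process that is 30 seconds into a slow start-up, `should_restart(False, 30, 10)` returns `True` and the restart cycle continues, whereas `should_restart(False, 30, 300)` returns `False` and the process is given time to finish loading.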
This was successful and at 13:24 Redis came back online. By 13:26, the rest of our service was restored.
However, once Redis came back online, it still contained the previously cached balances, meaning we were still at risk of further downtime. The team monitored the situation closely and set about exploring how best to clear the existing values from Redis to buy us back more capacity without causing any further downtime.
Unfortunately, whilst we were doing this, at 17:00, Redis once again exceeded its memory limit, resulting in a further 4 minutes of downtime. Shortly after, we cleared the unnecessary balances, putting Redis back into a stable state.
What we’re doing next
We’ve already learned a few valuable lessons from this experience.
Having identified why our service failed when Redis became unavailable, we've applied a fix that means we'll be unaffected if this were to happen in future, and we plan to verify this on an ongoing basis.
In the next few days, we’ll also be running a more formal “post-mortem” internally, to examine both the incident and our response to it in further detail.