All fun and games until you start with GameDays

As a payments company, our APIs need to have as close to 100% availability as possible. We therefore need to ensure we’re ready for whatever comes our way: from losing a server without bringing the API down, to knowing how to react if a company laptop is compromised.

To accomplish this we run GameDay exercises. What you will read below is our version of a GameDay. We hope that by sharing how we do GameDays we can give you a starting point for running your first GameDay.

What is a GameDay?

Failure is inevitable. The best way to prepare for it is to know how it can and will occur. A GameDay allows us to measure the resiliency of systems, applications, and people and test how we would react to failure in a controlled way.

The GameDay is not just about resolving issues we may face, but also about finding the best way of approaching them. It helps us improve our internal and external communication and guides us into the unknown - that realm most companies are too fearful to explore until it is too late. At GoCardless, however, we embrace failure. It is how we learn. It is how we become a better, more prepared company.

How do we run ours?

A GameDay has the potential to affect real systems, but we want to experience failures without impacting our customers. In our GameDays so far, we have used our staging environment - a replica of production. As we gain confidence in our defenses, we will start doing GameDays in production.

Initial Meeting

We give GameDays some structure to ensure their success. We run an initial meeting some days before the event to assign roles, create a rough schedule for the day, and decide which information we want to gather. Depending on what we want to test during that GameDay, we may also send a heads-up email to the engineering team.

Roles: There are two important roles during a GameDay at GoCardless:

  1. Villain: spends the day introducing failure and generally causing havoc.

  2. First responders: serve as communication hubs to the rest of the teams. This role already exists in normal operation - each team has a first responder who handles issues during working hours, rotating weekly.

Schedule: We define a rough start and end time for the day.[1] Because these are controlled events, we want to run them during work hours - we wouldn't be doing anyone any favours by running them into the late hours.

Setup & Discussion: The SRE team and the Villain will set up any user accounts needed and discuss the tools and services the Villain should use and target. This isn't a comprehensive conversation as we want the Villain to employ their best creative self and do things we wouldn't expect.


Issues

Part of the Villain's role is to think about what issues to create in order to gather the most relevant information. This is an important part of the role - the Villain needs to be careful to balance the issues they create with the amount the team will learn from them. Causing the first responders extra work just for the sake of it is usually pointless, especially if it's an issue that would be unlikely to occur in reality.

During GameDay we cover technical, process, and trust-related issues.

Technical

Technical incidents are usually very easy to create and easy to deal with. We can cause these by slowing traffic between servers, or by shutting down the primary node in our PostgreSQL cluster. Running pkill -9 -f redis on your primary Redis node can quickly show whether you rely on Redis always being up and responding, or whether you can handle a transient failure properly.
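As an illustration of what "handling a transient failure properly" might mean on the application side, here is a minimal Python sketch of a retry with exponential backoff. It is not how GoCardless does it: FlakyCache is a hypothetical stand-in for a Redis client whose server has just been killed, and call_with_retry, the attempt count, and the delays are all assumptions made for the example.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


class FlakyCache:
    """Hypothetical stand-in for a cache client whose server was just killed."""

    def __init__(self, failures):
        self.failures = failures  # number of calls that fail before recovery

    def get(self, key):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("cache unavailable")
        return f"value-for-{key}"


cache = FlakyCache(failures=2)
delays = []  # record the backoff delays instead of actually sleeping
result = call_with_retry(lambda: cache.get("session"), sleep=delays.append)
print(result, delays)
```

If the failure outlasts the retries, the error is re-raised, which is the point at which a GameDay would show whether the rest of the application degrades gracefully or falls over.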

Process

Process issues may be some of the trickiest things you’ll uncover. Often, a process will go untested for a long period of time, and will no longer work by the time you need it. For example, say you run into an issue with a third-party provider and need to speak with a support agent. Imagine for a moment that the support agent won’t speak to you until you’ve set up security questions on your account so that they can confirm the answers with you. You do not want this to happen during a real incident at 2am. It did.

Trust

Trust is a tricky one. You want people to trust each other, but you also want employees to be vigilant for things like phishing emails or compromised accounts. Finding a balance and helping people know when and what to question is hard. We have used several tactics to train our employees, from phishing[2] to leaving USB drives around the office, and sometimes even plugging them into computers during lunchtime. Afterwards we brief the whole company on why and how we did it, and we communicate best practices.

GameDay T-1 day

On the day before GameDay we test if the Villain has the correct access in the environments they need.

If the GameDay is being run in staging, we change our alerting settings before the GameDay officially starts so that we are paged about failures just as we would be if they happened in production.

Lastly, we gather with the Villain and run through a list of documented instructions on how to induce various kinds of failures. The list includes things like:

  • how to reboot servers
  • how to add or remove firewall rules
  • how to slow traffic on a server
  • how to stop or kill processes

We also discuss things the Villain wants to do but isn't sure should or could be done. For example, phishing attacks on GoCardless employees, or setting off the fire alarm (which may be illegal).

GameDay

During GameDay, teams work as usual. We don't make additional requirements of people across the company. As mentioned previously, our first responder role exists even outside of GameDay. The only extra role during GameDay is the Villain.

To make sure we get the most out of each GameDay, we pay special attention to documentation and communication.

Documentation

For every incident that occurs we start by creating a shared document. Both the first responders and the Villain take separate notes while keeping two things in mind:

  1. timestamp everything that happens
  2. be as detailed as possible

Here is an entry from the Villain:

09:34 - turned off search01 and search02
09:38 - back up intermittently (assuming due to ES03)
09:43 - seems to be back up

On the flip side, here is an entry from one of our first responders:

[2016/11/16 09:36] Elasticsearch down

- 2 nodes (search0[12]) were shut down - one can only assume mistake
- search02 didn’t rejoin the elasticsearch cluster
- had to stop elasticsearch daemon and restart for it to pick-up the correct master
- fully resolved at 09:49

By writing detailed notes with timestamps, we can easily compare the first responders' notes against the Villain's. This tells us how long it took for the first responder to notice the issue. With this information, we can improve our monitoring and alerting to achieve a better MTTR.[3]

In the two examples above we timestamped our notes inconsistently. An extra take-away from that day was that we should agree on a timestamp format.
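To make the comparison concrete, here is a small Python sketch that normalises the two timestamp styles from the notes above and computes how long detection and recovery took. It makes several assumptions: the Villain's entries are from the same date as the first responder's, the time on the first responder's entry is when they noticed the problem, and the helper names are invented for this example.

```python
from datetime import datetime

GAMEDAY_DATE = "2016/11/16"  # assumed: same date as the first responder's entry


def parse_villain(stamp):
    # Villain notes carry only HH:MM, so the GameDay date is filled in.
    return datetime.strptime(f"{GAMEDAY_DATE} {stamp}", "%Y/%m/%d %H:%M")


def parse_responder(stamp):
    # First responder notes look like "[2016/11/16 09:36]".
    return datetime.strptime(stamp.strip("[]"), "%Y/%m/%d %H:%M")


induced = parse_villain("09:34")                 # "turned off search01 and search02"
noticed = parse_responder("[2016/11/16 09:36]")  # "Elasticsearch down"
resolved = parse_villain("09:49")                # "fully resolved at 09:49"

minutes_to_detect = (noticed - induced).total_seconds() / 60
minutes_to_recover = (resolved - induced).total_seconds() / 60
print(minutes_to_detect, minutes_to_recover)
```

With a single agreed timestamp format, the two parsing helpers would collapse into one and the notes could be compared without guessing the date.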

Communication

We strive for clear communication even when we are running a GameDay in a staging environment. For example, our SRE team has a flag[4] that shows people who they should go to if they need assistance.

We also indicate who our first responders are in each team’s chat channel. This way everyone knows how to reach whoever they need during incidents.

During our last GameDay we found that we didn't communicate clearly to our first responders. Some of this was on purpose - we want to test people’s reactions to issues they aren’t expecting! There were some things we could have done better though.

For example, recent joiners had never heard of GameDay or what it entailed. This caused frustration in the team and we had to adjust how we were running the day on the fly. The lesson for us was to not go overboard with GameDays and, when in doubt, to over-communicate.

GameDay T+1 day

Regardless of the environment we are using, we open tickets for all incidents created[5] during GameDay that need follow-up.

We tag all of these tickets with "Game Day". This makes them stand out from other issues so we can prioritise getting them fixed. It also makes it easier for a Villain to go back to issues that have been marked as Fixed and re-test them as part of later GameDays.

We produce a post-mortem[6] for any significant issues we hit, even for GameDays run in staging. If you're already running your GameDays in production, we’d certainly hope that you’d write a post-mortem anyway, as a matter of course. They are a great way to spread knowledge and to make sure critical issues get fixed permanently.

Takeaway

The main purpose of GameDays is to prepare the company for failure. Not just to accept failure but to embrace it. It is never about the if but the when.

During GameDay you may encounter issues you hadn't planned for. For example:

  • Key people being unavailable due to meetings or illness
  • Changes unrelated to GameDay causing issues that send you on a wild goose chase
  • Unexpected office problems like losing WiFi
  • Real production issues that require the team's attention

Don't get frustrated by these things. Consider them all as part of the game. That's why you do it. You want to learn about your company's processes and people as much as about its systems and applications.

Even if you know exactly how everything works today, that may not be true tomorrow. A technology company is constantly changing, so you need to make sure its people and systems are evolving and adapting healthily to that change.

Be careful with attacks that may undermine your employees' trust in one another, such as targeted phishing.[7] We are very explicit about attacks being blameless and only for educational purposes. Our goal is to expose people to them so they know what they look like and become more vigilant, not to make them start questioning and ignoring every email they receive. A good motto is "Trust, but verify."[8]

Acknowledgments

A GameDay isn't possible without the buy-in from the whole company. GoCardless not only provides us with space for running these GameDay events but also encourages them.

Special thanks to the incredibly dedicated Villains and to the first responders who put out all the fires created by them.

Caveats

Our First GameDay

A couple of months before our first GameDay we, exceptionally, gathered various engineering team leads to document anything we already knew was definitely going to fail if we were to do a GameDay. This was to give those teams a chance to work on those issues before GameDay.

No safe word - beware

We ran our first two GameDay events without a safe word - it wasn’t something we thought about when we first planned them! We had a couple of instances of "Is this part of GameDay or are we having a serious incident?" which delayed dealing with the issue. For all future GameDays, we’ll have a safe word set up beforehand so that people are able to prioritise what they react to.

Due respect to the Villain in our last GameDay for coming up with the idea of getting people to add #gameday to any GameDay-related messages. This made things much easier to prioritise.
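The tagging convention also lends itself to simple tooling. The Python sketch below shows one way messages could be split by the #gameday tag. The messages are invented for illustration and the triage helper is hypothetical, not something described in our process.

```python
import re

# Match "#gameday" as a standalone tag, in any letter case.
GAMEDAY_TAG = re.compile(r"(?<!\w)#gameday\b", re.IGNORECASE)


def triage(messages):
    """Split messages into (gameday, possibly_real) lists."""
    gameday, possibly_real = [], []
    for message in messages:
        if GAMEDAY_TAG.search(message):
            gameday.append(message)
        else:
            possibly_real.append(message)
    return gameday, possibly_real


# Invented example messages
messages = [
    "search01 is unreachable #gameday",
    "Payments API latency is climbing",
    "#GameDay: rebooting the staging load balancer",
]
gameday, possibly_real = triage(messages)
print(len(gameday), possibly_real)
```

Anything without the tag is treated as potentially real, which errs on the side of caution when a genuine incident happens in the middle of a GameDay.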


  1. In some situations it is helpful to start events outside of the proposed schedule. We do that to test our on-call procedures and communication channels. Since this can only be tested by reaching people unexpectedly, we do this (conservatively) during GameDays. 

  2. https://en.wikipedia.org/wiki/Phishing 

  3. MTTR - Mean time to recovery 

  4. https://www.amazon.co.uk/European-Union-Blue-Stars-Table/dp/B009L4LZCS/ 

  5. Notice the word created, not found. We may not have found all the issues created by the Villain. This is another important bit of information you'll extract from the Villain's detailed notes.

  6. Post-mortems at GoCardless are blameless, as we believe they should be. 

  7. https://en.wikipedia.org/wiki/Phishing#Spear_phishing 

  8. https://en.wikipedia.org/wiki/Trust,_but_verify 
