Managed runs on an AWS cluster spread across three separate Availability Zones (AZs) in AP-Southeast-2 (Sydney). We use many different AWS services, each of which has slightly different retention and redundancy.

  • Retention - the amount of time that backups are stored. If there were an issue, this determines how far back we could go when restoring data.
  • Redundancy - how many independent instances a service runs on. A single instance has poor redundancy. Spreading over multiple instances gives better redundancy, since one can be lost while the others remain.

There are two areas we need to consider for disaster planning. The first is the infrastructure itself: the server instances, databases, object stores and so on. Our databases have a 3-day retention policy. They are all multi-AZ, which means that even if one AZ went down, there would be no impact.
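As a rough illustration of what a retention policy means in practice, the sketch below checks whether a given point in time can still be restored. The function name and the exact window semantics are assumptions made for illustration; only the 3-day figure comes from the policy described above.

```python
from datetime import datetime, timedelta, timezone

# The 3-day database retention stated above. How the window is
# evaluated here is an illustrative assumption.
RETENTION = timedelta(days=3)


def can_restore_to(target: datetime, now: datetime,
                   retention: timedelta = RETENTION) -> bool:
    """Return True if `target` falls inside the backup retention window."""
    return now - retention <= target <= now


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
    print(can_restore_to(now - timedelta(days=2), now))  # inside the window
    print(can_restore_to(now - timedelta(days=4), now))  # older than retention
```

With a 3-day window, a restore point from two days ago is still available, while one from four days ago is not.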

The servers themselves are “ephemeral”, and we don’t try to back them up. This is intentional: each time we deploy a new version, we destroy all the existing servers and replace them with new, updated images (using Docker containers). This means that if we unexpectedly lost some servers, our system would automatically create new ones in a working AZ. In fact, this happens all the time as our platform adjusts to load and new code updates.
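The self-healing behaviour described above can be pictured as a reconciliation loop: compare the servers that are actually running with the number we want, and start replacements in healthy zones. The following is a simplified sketch of that idea, not the actual orchestration logic. The `Server` type, the `reconcile` function and the zone names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    zone: str


def reconcile(servers, desired_count, healthy_zones):
    """Drop servers in failed zones, then add replacements in healthy
    zones until `desired_count` servers are running."""
    zones = sorted(healthy_zones)
    if not zones:
        raise ValueError("no healthy zone available")

    # Servers in a failed zone are treated as lost; nothing is restored
    # from backup because servers are ephemeral.
    alive = [s for s in servers if s.zone in zones]

    replacement_id = 0
    while len(alive) < desired_count:
        # Place each replacement in the healthy zone with the fewest servers.
        zone = min(zones, key=lambda z: sum(s.zone == z for s in alive))
        replacement_id += 1
        alive.append(Server(f"replacement-{replacement_id}", zone))
    return alive


if __name__ == "__main__":
    fleet = [
        Server("a", "ap-southeast-2a"),
        Server("b", "ap-southeast-2b"),
        Server("c", "ap-southeast-2c"),
    ]
    # Simulate losing zone 2c: server "c" is replaced in a surviving zone.
    for server in reconcile(fleet, 3, ["ap-southeast-2a", "ap-southeast-2b"]):
        print(server)
```

Because the same routine handles deployments, scaling and failures, losing a server is an ordinary event rather than a special recovery case.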

The second is the software that runs in and on this infrastructure. The code itself is stored in a versioned code repository (GitHub). If we were to lose the code from our servers, the right version could be quickly restored from GitHub. The software that runs our infrastructure (Kubernetes, etc.) is configured via specific files. These also live on GitHub, and if we were to lose them, they could be rapidly restored.

We also depend on certain third-party services, such as payment processing by Assembly and MoneyTech. We’ve designed our platform so that it does not depend on these services being continuously available: if payment processing were disrupted, no data would be lost. Our platform simply waits and then continues processing once the payment processor comes back online.
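One common way to achieve this kind of behaviour is to hold payments in a queue and only remove each one after the processor has accepted it. The sketch below shows the general idea under that assumption. `PaymentQueue`, `ProcessorUnavailable` and the `send` callback are hypothetical names, and a real system would keep the queue in durable storage rather than in memory.

```python
from collections import deque


class ProcessorUnavailable(Exception):
    """Raised by the send callback when the payment processor is offline."""


class PaymentQueue:
    def __init__(self, send):
        self._send = send        # callable that submits one payment
        self._pending = deque()  # payments waiting to be processed

    def submit(self, payment):
        self._pending.append(payment)

    def flush(self):
        """Send pending payments in order; stop at the first outage.

        A payment is removed from the queue only after it has been sent,
        so an outage never loses data. Returns the number sent.
        """
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ProcessorUnavailable:
                break  # leave everything queued and try again later
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

Calling `flush()` periodically means queued payments go through, in their original order, as soon as the processor is reachable again.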

Disaster scenarios:

  1. Our offices become unavailable or unusable
     • Continuity impact: none
     • Recovery process: engineers work remotely while premises and computers are replaced
     • Data loss: none
  2. Payment processor unavailable
     • Continuity impact: limited payment flow for the duration of the outage
     • Recovery process: payments queue and restart once the processor is back
     • Data loss: none
  3. AWS single zone outage
     • Continuity impact: none
     • Recovery process: assets in the remaining zones autoscale up to take on the load
     • Data loss: none
  4. AWS total region outage
     • Continuity impact: severe
     • Recovery process: rebuild the cluster in an alternate region, e.g. AP-Southeast-1 (Singapore)
     • Data loss: up to 24 hours (data is recreated from the last snapshot)
  5. AWS total outage
     • Continuity impact: catastrophic
     • Recovery process: rebuild the cluster with an alternate vendor (Google Cloud, Azure)
     • Data loss: up to 24 hours (data is recreated from the last snapshot). An outage on this scale could also impact many other services, such as GitHub and Assembly.
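To make the "max 24h" figure concrete, the amount of data lost when restoring from a snapshot is the time between the last snapshot and the start of the outage. The sketch below works through that arithmetic. It assumes snapshots are taken once every 24 hours, which is inferred from the maximum stated above, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: one snapshot every 24 hours, inferred from "max 24h".
SNAPSHOT_INTERVAL = timedelta(hours=24)


def data_lost(last_snapshot: datetime, outage_start: datetime) -> timedelta:
    """Return how much recent data a restore from the last snapshot loses."""
    if outage_start < last_snapshot:
        raise ValueError("outage cannot start before the last snapshot")
    return outage_start - last_snapshot


if __name__ == "__main__":
    lost = data_lost(datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 17, 30))
    print(lost)                       # 15:30:00
    print(lost <= SNAPSHOT_INTERVAL)  # True
```

For example, if the last snapshot was taken at 02:00 and the region failed at 17:30, the restore would lose 15.5 hours of changes. The worst case is an outage just before the next snapshot, which approaches the full 24 hours.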