Today was undoubtedly a special day for internet startups and businesses. One of the biggest and most famous “cloud providers,” Amazon Web Services, went partially down, taking with it famous websites like Quora, FormSpring, and dozens of others.
The worst part of the story is that, at the time of writing, some of these websites have been down for more than 10 hours.
LearnBoost is hosted in the same cloud as these services, but our teachers, parents and students enjoyed their gradebooks with the same quality of service today as every other day. I’m going to take this opportunity to explain what decisions we made to make this possible, what technologies we employed, and, in the spirit of open source, how others can achieve the same.
Short introduction to the Cloud
The Amazon Web Services Cloud offers solutions for running systems and storing data in a scalable infrastructure. It provides room for growth (even explosive) that would otherwise be costly and complex in a traditional infrastructure.
AWS has multiple data centers (which Amazon calls regions), each with multiple “availability zones.” The data center that experienced problems today, “us-east,” is located in Northern Virginia. An instance (and its persistent storage units) belongs to exactly one availability zone, but instances can communicate with instances in other availability zones of the same data center over fast, low-latency links.
Data replication with MongoDB
At LearnBoost we’re extremely happy with our decision to make MongoDB one of our primary storage systems. The MongoDB team always encourages developers and system administrators to deploy their databases in a replicated way.
A few months ago, before MongoDB supported single-server durability, people often wondered why you were required to boot up more than one server. Today should be a great example of why it’s not a good idea to have single points of failure.
When we made the decision to deploy to Amazon, we structured our architecture in such a way that:
- Database servers are separate from application servers.
- Our databases are distributed across four availability zones.
Whenever a teacher introduces a change to our database, it’s replicated to 4 servers in different availability zones before a “success message” is displayed in the frontend.
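In MongoDB, this kind of "wait until N servers have the data" behavior is expressed as a write concern. The following is a minimal sketch of the idea, assuming the Node.js driver is used; the helper name `buildWriteConcern`, the `grades` collection and the 5 second timeout are hypothetical and only illustrate the mechanism.

```javascript
// Build a write concern that waits for `replicas` members to acknowledge a write.
// Hypothetical helper; `w` and `wtimeout` are standard MongoDB write concern fields.
function buildWriteConcern(replicas, timeoutMs) {
  if (!Number.isInteger(replicas) || replicas < 1) {
    throw new RangeError('replicas must be a positive integer');
  }
  return { w: replicas, wtimeout: timeoutMs };
}

const writeConcern = buildWriteConcern(4, 5000);

// With the Node.js driver, this would be passed along with the write, e.g.:
//   await grades.insertOne(doc, { writeConcern });
// The call resolves once 4 members have confirmed the write, or reports a
// write concern error if that has not happened within 5 seconds.
```

The trade-off is latency: every write waits for the slowest of the acknowledging members, which is why fast links between availability zones matter.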
In addition, MongoDB allows us to have a delayed replica: if data is corrupted for whatever reason (human error or computer error), we can still fall back to a very recent replication state. We also leverage the excellent incremental EBS snapshots on that machine for data backups.
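A delayed replica is declared in the replica set configuration. The sketch below shows what such a member entry could look like; the helper name, host name and one-hour delay are assumptions for illustration, and the delay field is called `slaveDelay` in older MongoDB versions and `secondaryDelaySecs` in MongoDB 5.0 and later.

```javascript
// Describe a delayed, hidden replica set member (hypothetical helper).
function delayedMember(id, host, delaySeconds) {
  return {
    _id: id,
    host: host,
    priority: 0,              // a delayed member must never become primary
    hidden: true,             // keep application reads away from stale data
    slaveDelay: delaySeconds  // secondaryDelaySecs in MongoDB 5.0 and later
  };
}

// Placeholder host name; the member applies writes one hour behind the primary.
const member = delayedMember(4, 'db-delayed.internal:27017', 3600);
```

An entry like this would sit alongside the regular members in the `members` array of the replica set configuration.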
As a matter of fact, two of our replicas failed today (as reported by the `rs.status()` command), and the failover (recovery) was automatic.
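For readers unfamiliar with `rs.status()`, it returns a document with a `members` array in which each entry carries a `name`, a `health` flag (1 or 0) and a human-readable `stateStr`. The sketch below summarizes such a document; the `summarizeStatus` helper and the sample data are made up for illustration and are not output from a real cluster.

```javascript
// Summarize the `members` array of an rs.status() document.
function summarizeStatus(status) {
  const healthy = status.members.filter((m) => m.health === 1);
  const primary = healthy.find((m) => m.stateStr === 'PRIMARY');
  return {
    total: status.members.length,
    healthy: healthy.length,
    primary: primary ? primary.name : null
  };
}

// Hypothetical sample: four members, two of them unreachable.
const sample = {
  set: 'example',
  members: [
    { name: 'a:27017', health: 1, stateStr: 'PRIMARY' },
    { name: 'b:27017', health: 1, stateStr: 'SECONDARY' },
    { name: 'c:27017', health: 0, stateStr: '(not reachable/healthy)' },
    { name: 'd:27017', health: 0, stateStr: '(not reachable/healthy)' }
  ]
};

const summary = summarizeStatus(sample);
```

As long as a majority of voting members can still see each other, the set keeps (or elects) a primary without manual intervention.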
Setting up MongoDB replica sets
If your database of choice is MongoDB, creating a redundant system like this should be simple.
As a first step, make sure that application servers are easy to spin up. To this end we leverage:
- npm (the Node.js package manager): we keep all our dependencies (public and private) in npm, so that dependency resolution and updates are seamless.
- Git: to retrieve our codebase and bring it up to date with the latest production branch.
- Bash scripts: to perform the installation of services like Redis.
As a second step, leverage replica sets. We do this in production and development, so that we can best simulate the real conditions of our live codebase.
The following is a snippet from our Makefile that sets up 3 replica nodes:
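The Makefile itself is not reproduced here. As a rough stand-in, the sketch below prints the kind of `mongod` invocations that start three local nodes of the same replica set; the set name `rs0`, the ports and the data directories are all placeholders, and the `mongodCommand` helper is hypothetical.

```javascript
// Build the mongod command line for one local replica set node.
// All values (set name, ports, paths) are placeholders.
function mongodCommand(setName, port, dataDir) {
  return [
    'mongod',
    '--replSet', setName,                  // every node must use the same set name
    '--port', String(port),
    '--dbpath', dataDir,                   // the directory must exist beforehand
    '--logpath', dataDir + '/mongod.log',  // --fork requires a log destination
    '--fork'                               // run in the background
  ].join(' ');
}

const commands = [27017, 27018, 27019].map((port, i) =>
  mongodCommand('rs0', port, './data/node' + (i + 1))
);

commands.forEach((cmd) => console.log(cmd));
```

Each printed line could then be used as the recipe of a Makefile target or run directly from a shell.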
Then simply connect to the first node and initialize the replica set:
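A minimal sketch of that initialization, assuming three nodes on one host: the `replicaSetConfig` helper, the set name `rs0` and the ports are hypothetical, while `rs.initiate()` and `rs.status()` are the standard mongo shell helpers.

```javascript
// Build the configuration document that rs.initiate() expects.
function replicaSetConfig(setName, hosts) {
  return {
    _id: setName, // must match the --replSet name the nodes were started with
    members: hosts.map((host, i) => ({ _id: i, host: host }))
  };
}

const HOST = 'localhost'; // placeholder host name
const config = replicaSetConfig('rs0', [
  HOST + ':27017',
  HOST + ':27018',
  HOST + ':27019'
]);

// In a mongo shell connected to the first node:
//   rs.initiate(config)
//   rs.status()   // after a few seconds, one PRIMARY and two SECONDARY members
```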
Make sure to adapt the relative path locations to your needs.
If you’re deploying this to Amazon, make sure to use the internal DNS (which you can get from your administrative console) for the HOST values. This is crucial to ensure nodes can identify themselves and others.
At LearnBoost, a lot of work has gone into making sure our users’ education data is safe and available. We’re soon going to post more information about different projects we use to guarantee the stability of our applications, monitoring, error reporting (both on the client and server side), and more.