(This post is a part of a series. See original)
As the world’s largest marketplace for small services, Fiverr deals with gigantic amounts of rich user-generated content at any given moment. We run our own billing, messaging, search and order management systems. Our systems were designed for maximum data integrity, so most non ACID NoSQL solutions are out of the question. In addition, unlike blogs or news sites where data remains static after it has been posted, our data continues to change as users interact with their orders, sellers and gigs constantly. This makes our databases write-intensive and much harder to scale out.
The database stack is always our toughest challenge in scaling Fiverr. On our first year, we spent countless hours optimising the database to handle the rapid growth of the site. As it turns out, optimisation can only take you so far, and eventually you’ll reach a glass ceiling that can’t be crossed only by making your queries better. We hit that barrier on our second year. It was then, that we knew we had to rethink everything about how our database works.
If you’re starting a new project or getting closer to your own glass ceiling, it might be a good idea for you to check some of these pointers out:
- Manage your database connections in one place
Make sure your queries go through a centralised connection manager and that your production database connection config files is kept in a separate file, outside of your main codebase. While you’re at it, make sure your db config file only points to hostnames and not hardcoded IPs.
Having a separate, well configured, external db connection file will make adding new slaves and changing stuff much easier. Regardless of the size and complexity of your app, making it aware of architecture changes will only take seconds. (If you’re using Rails, this would be your database.yml file). I can’t stress enough how important this is for your app’s flexibility. We even try to maintain this approach with our NoSQL data sources such as MongoDB and Redis.
To achieve fine grain control over which servers handle specific writes and reads we use the db_charmer gem. This gem lets you very easily assign specific queries to specific slaves. For example, you might want to have dedicated slaves (that do not serve live traffic) handle your heavy queries asynchronously.
Load balance your database servers
Real traffic spikes come unannounced, creating a ripple effect in your system. The first to get hit is usually the database. Overwhelmed by the amount of connections it has to serve, it simply fails to respond in time, causing a traffic jam of application server workers. At that time, users fail to get a response from your site and go into refresh frenzy. From this point, site downtime is imminent.
Going through a scenario like this without a having a database load balancer is a real bummer. At Fiverr, all queries go through a dedicated db balancer. In practice, this means that your application only gets to know two* db hostnames (one master and one slave). Each of these database hosts will eventually map to a set of slave servers and one master in the load balancer’s configuration. The distinction between the master and slaves is usually done using a designated port.
* In a logical sharding configuration this would look a little different, as you will have a master and a slave hostname for each shard.
Ouch Tip: Make sure you configure slaves for read-only operation. This will protect you from replication hell if someone configures something wrong in the balancer.
ActiveRecord is for pussies
One of the first things people love about rails is ActiveRecord and they way it magically ‘writes’ database queries for them. It’s easy to learn, maintain and read. It automatically maps your data to models and really allow you to focus on functionality and less on how things actually work under the hood.
But wait, there’s a problem. ActiveRecord will never be as smart as you when it comes to building real-world queries. A good example for this is ActiveRecord’s poor utilisation of Indexes, specially in the case of Compound ones. A query written in an order that doesn’t match your Compound Index’s order will simply get ignored. You can try to force ActiveRecord to use indexes the way you want, but it’s not trivial and takes the charm out of the entire thing.
Named Scopes are also great for building stuff quickly. However, they strength is also their weakness, as they are so easy to temper with. At one occasion, a developer in our team changed an innocent looking Named Scope which was originally only meant for use in display. Turned out that the change affected something deeper which caused serious damage and took weeks to fix.
We are constantly optimising our most heaviest named-scopes by turning them into hand crafted optimised SQL. As a result, we see dramatic improvements in performance, with queries dropping from several seconds to a few milliseconds.
Ouch Tip: Remember to sanitize your pure SQL queries before executing them to protect yourself against SQL injections. You can do that very easily using ActiveRecord sanitize features.
Slicing and dicing
For write intensive applications like Fiverr, scaling eventually means having more than one master. A typical scenario is when write operations start taking longer and longer to complete with each day that goes by. This also has a tremendous effect on replication times as slaves fall behind the master.
When this happens, it might mean that you’ve reached the maximum potential of the master database. Scaling up the master server will not do much other than buy you some time.
There are several ways to scale-out MySQL masters. The two most common ones are master-master replication and sharding. For us, sharding was the lower hanging fruit and we chose to start with it. Moving to sharded master-slave trees was a game changer for Fiverr. Suddenly, MySQL is no longer the bottle neck.
There is more than one way to shard your data. We decided to approach sharding gradually. We started by creating Logical Database Trees (LDTs), each consisting of a load balancer, a master server and a bunch of slaves. Each LDT is responsible for a single logical application unit such as Gigs, Users and Orders.
Making this change on a live system was not easy, but we achieved it with absolutely no downtime thanks to the way our system is configured (see previous points)
Monitor and Diagnose
Pareto’s principle claims that 80% of the problems come from 20% of a system. From our experience, in real life this is even more extreme.
Before you reach scalability nirvana, a single poorly written query can bring down your entire app. That’s why it’s so important to have eyes on your system at all times, and to be able to dig deeper when needed. At Fiverr, we rely on several tools to monitor our app. I’ll probably post a separate article about that, but let me just say that if I had to choose one tool to prevent us from running blind, that would be NewRelic.
New Relic allows us to diagnose, track and get alerts for our entire stack. Starting from the servers hardware, through database optimisation, ruby code, and up to client side rendering performance.
There is no magic pill to making your system scalable. These guidelines are just the tip of the iceberg. However, done right, they can get you a long way into making your app more durable and flexible for scale out changes - from day one.