How to create very reliable web services

Monitoring matrix

Over the years I have learnt a lot about creating reliable web services, and I have scaled a startup from 0 to millions of users and billions of page views. I want to share some tips with you that you can apply to any language and any website.

The general strategy

I think creating scalable and reliable services boils down to the following:

  • Monitor everything: Have a clear picture of what's going on at any time. Keep a log of your past performance, a log of errors and a log of key metrics in your application. Have visualizations of every computer in your network, of response times and so on. Google Analytics isn't enough; you need much more powerful tools.
  • Be proactive: Do backups and plan for future load before you need them. If your load is close to maxing out today, don't wait until your website is down before you begin optimizing. Know your systems well enough to make an educated guess about how many more users you can handle. Think about which strategies you could apply to scale to 10x, 100x or 1000x your current load.
    Snippet from Proactive vs. Reactive scaling.
  • Be notified when crashes happen: Getting SMS notifications is easy using a service like Pingdom or an open-source library like crash_hound.

The tools that you can use

I rely on a lot of existing tools, and I highly recommend the ones below. There is no need to reinvent the wheel unless necessary!

Realtime monitoring of anything

I read about statsd and Graphite in Keeping Instagram up with over a million new users in twelve hours.

After I read that blog post I was excited and I spent a day getting statsd+Graphite up and running. Graphite is a bit complex to set up, but the investment was worth it and I must say that it changed my life! This setup lets you track anything in realtime with practically no performance penalty (since statsd uses non-blocking UDP communication). I track the average response times, number of errors, number of signups, number of sync commands, average response time per sync command and so on. I track these things across servers and across services.

This is especially useful when you do multiple deployments per day, as you can quickly see if your latest deployment is introducing errors or performance issues.
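
To show why sending a metric costs so little, here is a minimal sketch of a statsd client written with nothing but Python's standard library. It is not the client we use in production. The class name, the metric names and the host are assumptions made for illustration; the `name:value|type` datagram format and port 8125 are statsd's usual defaults.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build a statsd datagram such as b"signups:1|c"."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


class StatsdClient:
    """Fire-and-forget client (illustrative): one UDP datagram per metric."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Never let a slow or missing statsd daemon block a request.
        self.sock.setblocking(False)

    def _send(self, payload: bytes) -> None:
        try:
            self.sock.sendto(payload, self.address)
        except OSError:
            # Monitoring must never break the code being monitored.
            pass

    def incr(self, name: str, count: int = 1) -> None:
        """Increment a counter, e.g. number of signups or errors."""
        self._send(format_metric(name, count, "c"))

    def timing(self, name: str, millis: int) -> None:
        """Record a duration in milliseconds, e.g. a response time."""
        self._send(format_metric(name, millis, "ms"))
```

A request handler would call something like `stats.incr("signups")` or `stats.timing("sync.response_time", 42)`. Because UDP needs no connection and no reply, the call returns immediately even if nobody is listening.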

Here's one of our Graphite dashboards:

Graphite dashboard

Monitor performance and get notified when something crashes

Pingdom is quite useful as it can track performance from multiple locations in the world. It can also notify you by SMS when your website crashes. Pingdom reports can also be made public, which lets your users track your uptime and performance (a very useful service for your users).

For more powerful stuff we use crash_hound, which lets you program notifications! For example, I get an SMS if a worker queue is not being processed. Most of our notifications are set up using small scripts that are less than 10 lines of code.

Here's what one of our Pingdom monitor pages looks like:
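
As a rough idea of what such a small check can look like, here is a sketch of a "worker queue is stuck" test in plain Python. It does not use crash_hound's API; the function name, its parameters and the five-minute threshold are assumptions for illustration. A notification tool would run a check like this periodically and send the message when it fails.

```python
import time
from typing import Optional, Tuple


def check_queue_progress(
    last_processed_at: float,
    pending_jobs: int,
    max_idle_seconds: float = 300.0,
    now: Optional[float] = None,
) -> Tuple[bool, str]:
    """Return (ok, message). ok is False when jobs are waiting
    but no job has finished within max_idle_seconds."""
    current = time.time() if now is None else now
    idle = current - last_processed_at
    if pending_jobs > 0 and idle > max_idle_seconds:
        return False, (
            f"Worker queue stalled: {pending_jobs} jobs pending, "
            f"idle for {idle:.0f}s"
        )
    return True, "Worker queue OK"
```

An empty queue is treated as healthy no matter how long it has been idle, so the check only fires when there is work that nobody is picking up.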

Pingdom's graphs

Log every error

We have a central logger that logs every error and displays it in a central location. With each error we can see the request's environment. It's a must-have tool for any web service since it makes tracking down errors and bugs much easier.

Here's what our central logger looks like:

Central logger

I am sure there are some open-source tools that can do this as well (I did not find any that matched our style so we implemented our own).
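
If you want to try the idea before picking a tool, here is a minimal sketch built on Python's standard `logging` module. It is not our central logger. The handler keeps entries in an in-memory list as a stand-in for a central store, and the names `CentralErrorHandler`, `handle_request` and the `environ` field are assumptions for illustration.

```python
import logging
import traceback
from typing import Any, Dict, List


class CentralErrorHandler(logging.Handler):
    """Collects every error together with the request environment."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        # Stand-in for a database or service shared by all servers.
        self.entries: List[Dict[str, Any]] = []

    def emit(self, record: logging.LogRecord) -> None:
        trace = ""
        if record.exc_info:
            trace = "".join(traceback.format_exception(*record.exc_info))
        self.entries.append({
            "message": record.getMessage(),
            "logger": record.name,
            # Set via extra={"environ": ...} at the call site.
            "environ": getattr(record, "environ", {}),
            "traceback": trace,
        })


logger = logging.getLogger("webapp")
handler = CentralErrorHandler()
logger.addHandler(handler)


def handle_request(environ: Dict[str, Any]) -> None:
    """Hypothetical request handler that fails and logs the error."""
    try:
        1 / 0
    except ZeroDivisionError:
        logger.exception("Request failed", extra={"environ": environ})
```

Each stored entry then carries the message, the traceback and the environment of the request that triggered it, which is what you need to reproduce the bug.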

SQL and request logger

We have a SQL logger that aggregates queries and shows which are run most often and which are the most expensive. We also have a request logger that shows the same things for requests.

These two tools are essential when you want to know which parts of your codebase are worth optimizing.

I open-sourced the request logger some time ago (it's very simple and quite hackish!).

Here's a screenshot of what it looks like:
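
The aggregation itself can be sketched in a few lines. This is not our SQL logger, just a minimal illustration: the `normalize` function and the `QueryStats` class are hypothetical names, and the literal stripping is a crude regex that only handles simple string and number literals.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(sql: str) -> str:
    """Replace literals with '?' so that queries differing only
    in their values are grouped together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip()


class QueryStats:
    """Counts how often each query shape runs and its total time."""

    def __init__(self) -> None:
        self.count: Dict[str, int] = defaultdict(int)
        self.total_ms: Dict[str, float] = defaultdict(float)

    def record(self, sql: str, duration_ms: float) -> None:
        key = normalize(sql)
        self.count[key] += 1
        self.total_ms[key] += duration_ms

    def most_frequent(self, n: int = 5) -> List[Tuple[str, int]]:
        ranked = sorted(self.count.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]

    def most_expensive(self, n: int = 5) -> List[Tuple[str, float]]:
        ranked = sorted(self.total_ms.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]
```

You would call `record` from wherever your database layer executes a query. The two rankings often disagree: a cheap query that runs thousands of times can matter as much as one slow report.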

Request logger

Monitor your servers

Being hosted on Amazon AWS is a privilege since they provide a lot of great monitoring tools (such as CPU usage and alarms if your servers start to use more resources than necessary).

There are some open-source tools available as well. At Plurk we used Cacti and Nagios. They are complex and powerful tools, but definitely worth your time since they can give you a much better picture of the servers and services you are running.

Here is what some of Amazon's monitoring tools look like:

Amazon AWS server monitoring

I hope you found my tips useful! I might open-source some of my tools such as the central logger or SQL monitor in the near future.

Stay tuned and as always: happy hacking!

3. Aug 2012 Code · Code improvement · Tips
© Amir Salihefendic