How to create very reliable web services
Over the years I have learnt a lot about creating reliable web services, having scaled a startup from zero to millions of users and billions of page views. I want to share some tips that you can apply to any language and any website.
The general strategy
I think creating scalable and reliable services boils down to the following:
The tools that you can use
I rely on a lot of tools and I highly recommend them. There is no need to reinvent the wheel unless it's truly necessary!
Realtime monitoring of anything
I read about statsd and Graphite in Keeping Instagram up with over a million new users in twelve hours.
After reading that blog post I was excited, and I spent a day getting statsd+Graphite up and running. Graphite is a bit complex to set up, but the investment paid off and I must say it changed my life! This setup lets you track anything in realtime without a performance penalty, since statsd uses non-blocking UDP communication. I track average response times, the number of errors, the number of signups, the number of sync commands, the average response time per sync command, and so on. I track these things across servers and across services.
This is especially useful when you deploy multiple times per day, as you can quickly see whether your latest deployment introduces errors or performance issues.
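To make the "non-blocking UDP" point concrete, here is a minimal sketch of statsd's plain-text wire protocol. The host, port, and metric names are illustrative; in practice you would use an existing statsd client library rather than rolling your own.

```python
# Minimal sketch of statsd's wire protocol: metrics are fire-and-forget
# UDP datagrams, so instrumentation never blocks the request path.
import socket
import time


def format_metric(name, value, metric_type):
    """Encode a metric in statsd's text format, e.g. 'signups:1|c'."""
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    def __init__(self, host="localhost", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, count=1):
        # '|c' marks a counter (e.g. number of signups or errors)
        self._send(format_metric(name, count, "c"))

    def timing(self, name, ms):
        # '|ms' marks a timer (e.g. response time in milliseconds)
        self._send(format_metric(name, int(ms), "ms"))

    def _send(self, payload):
        # UDP never waits for an answer; a lost packet is acceptable,
        # which is why tracking adds no measurable latency.
        self.sock.sendto(payload.encode("ascii"), self.addr)


client = StatsdClient()
start = time.time()
# ... handle a request ...
client.incr("web.signups")
client.timing("web.response_time", (time.time() - start) * 1000)
```

Because the send is fire-and-forget, you can sprinkle counters and timers everywhere in the codebase without worrying about the metrics backend being slow or down.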
Here's one of our Graphite dashboards:
Monitor performance and get notified when something crashes
Pingdom is quite useful as it can track performance from multiple locations in the world. It can also notify you by SMS when your website crashes. Pingdom's status pages are also public, letting your users track your uptime and performance (a very useful service for your users).
For more powerful stuff we use crash_hound, which lets you program your own notifications! For example, I get an SMS if a worker queue is not being processed. Most of our notifications are set up using small scripts of fewer than 10 lines of code.
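A programmable notification like the worker-queue check could look something like the sketch below. `get_queue_size` and `send_sms` are hypothetical stand-ins for your queue backend and SMS provider, and the threshold is made up; the point is that the whole check fits in a few lines.

```python
# Sketch of a crash_hound-style check: alert if a worker queue is
# backing up. The callables are injected so the script stays tiny and
# testable; swap in your real queue API and SMS gateway.
QUEUE_ALERT_THRESHOLD = 1000  # illustrative value


def check_worker_queue(get_queue_size, send_sms):
    """Return True (and fire an SMS) if the queue looks stuck."""
    size = get_queue_size("sync_commands")
    if size > QUEUE_ALERT_THRESHOLD:
        send_sms(f"Worker queue 'sync_commands' is backed up: {size} jobs")
        return True
    return False
```

Run a script like this from cron every few minutes and you have a notification system without any heavyweight monitoring framework.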
Here's how one of our Pingdom monitor pages looks:
Log every error
We have a central logger that logs every error and displays it in a central location. With each error we can see the request's environment. It's a must-have tool for any web service, since it makes tracking errors and bugs much easier.
Here's how our central logger looks:
I am sure there are open-source tools that can do this as well (I did not find any that matched our style, so we implemented our own).
SQL and request logger
We have a SQL logger that aggregates queries and shows which run most often and which are the most expensive. We also have a request logger that shows the same things for requests.
These two tools are essential when you want to know which parts of your codebase are worth optimizing.
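A sketch of the aggregation idea: strip literals out of each SQL statement so repeated queries group together, then tally call counts and total time. This is an assumption about how such a logger works, not the author's actual implementation; the regexes are deliberately simplistic.

```python
# Sketch of a SQL query aggregator: normalize literals out of each
# statement, then rank query shapes by total time spent.
import re
from collections import defaultdict

stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0})


def normalize(sql):
    """Collapse literals so 'WHERE id = 7' and 'WHERE id = 9' group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals
    return sql


def record(sql, duration_ms):
    entry = stats[normalize(sql)]
    entry["count"] += 1
    entry["total_ms"] += duration_ms


def most_expensive(n=10):
    """Top n query shapes by total time, i.e. the best optimization targets."""
    return sorted(stats.items(),
                  key=lambda kv: kv[1]["total_ms"], reverse=True)[:n]


record("SELECT * FROM users WHERE id = 7", 1.2)
record("SELECT * FROM users WHERE id = 9", 1.5)
record("SELECT * FROM posts WHERE user_id = 7", 40.0)
```

Ranking by total time (count times cost) rather than by single-query cost is what surfaces the cheap-but-hot queries that are often the best optimization targets.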
I open-sourced the request logger some time ago (it's very simple and quite hackish!).
Here's how it looks:
Monitor your servers
Being hosted on Amazon AWS is an advantage, since AWS provides a lot of great monitoring tools (such as CPU usage graphs and alarms that fire when your servers start using more resources than they should).
There are some open-source tools available as well. At Plurk we used Cacti and Nagios. Both are complex but powerful tools, and definitely worth your time, since they can give you a much better picture of your servers and the services you are running.
Here is how some of Amazon's monitoring tools look:
I hope you found these tips useful! I might open-source some of my tools, such as the central logger or the SQL monitor, in the near future.
Stay tuned and as always: happy hacking!