How Plurk does systems monitoring
I think that monitoring and having an overview are among the most important things in running a healthy web application. As far as I can see, the monitoring toolset is pretty underdeveloped, so I want to showcase Plurk's monitoring tools, which we will probably open source in the future.
First of all, we at Plurk use the standard monitoring tools: Cacti and Nagios.
Both Cacti and Nagios are essential in our toolchain, but they fail to answer some essential questions, such as which SQL queries are slowest, which requests take the longest, and which errors occur most often.
Once you have answers to these questions, it's much easier to optimize and fix errors, and you'll feel blind whenever you don't have them at your disposal.
Let's look at how these different questions are answered.
SQL Monitor aggregates queries, groups them, times them and runs a SQL EXPLAIN on them. It's like MySQL Enterprise SQL monitor (which costs $595 per server per year!). Our SQL Monitor can log queries from all our servers and has a web front end that can sort results by average execution time, number of times run, etc.
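To make the grouping idea concrete, here is a minimal sketch in Python. It is not Plurk's actual code: the names `normalize` and `QueryStats` are made up for illustration, and the regular expressions are a rough assumption about how literals could be stripped so that similar queries fall into one group. It leaves out the EXPLAIN step and the web front end.

```python
import re
from collections import defaultdict

# Hypothetical normalization rules: replace literal values with "?" so that
# queries differing only in their parameters end up in the same group.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_SPACE = re.compile(r"\s+")


def normalize(sql):
    """Return a grouping key for a query (illustrative, not a real SQL parser)."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return _SPACE.sub(" ", sql).strip().lower()


class QueryStats:
    """Aggregates run counts and total execution time per normalized query."""

    def __init__(self):
        self._stats = defaultdict(lambda: {"count": 0, "total": 0.0})

    def record(self, sql, seconds):
        entry = self._stats[normalize(sql)]
        entry["count"] += 1
        entry["total"] += seconds

    def by_average_time(self):
        """Rows of (query, times run, average seconds), slowest first."""
        rows = [
            (query, s["count"], s["total"] / s["count"])
            for query, s in self._stats.items()
        ]
        return sorted(rows, key=lambda row: row[2], reverse=True)
```

Each server would call `record()` with the query text and its measured duration, and a front end could render `by_average_time()` as a sortable table.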
An example screenshot of Plurk's SQL Monitor:
I wrote the request logger some time ago, and it's open source. Basically, it logs requests, times them, groups them and provides a web interface so one can easily see stats about the requests.
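The same log, time and group pattern can be sketched as a small WSGI middleware. This is an illustration under assumptions, not the open-sourced request logger itself: the class name `RequestTimer` and the choice to group by HTTP method and path are hypothetical.

```python
import time
from collections import defaultdict


class RequestTimer:
    """WSGI middleware that times requests and groups them by method and path."""

    def __init__(self, app, clock=time.perf_counter):
        self.app = app
        self.clock = clock  # injectable so the timing can be tested
        self.stats = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})

    def __call__(self, environ, start_response):
        key = (environ.get("REQUEST_METHOD", "GET"), environ.get("PATH_INFO", "/"))
        started = self.clock()
        try:
            # Only the time until the app returns its response iterable is
            # measured; streaming the body afterwards is not included.
            return self.app(environ, start_response)
        finally:
            elapsed = self.clock() - started
            entry = self.stats[key]
            entry["count"] += 1
            entry["total"] += elapsed
            entry["max"] = max(entry["max"], elapsed)

    def report(self):
        """Rows of (method, path, count, average seconds, max seconds), slowest first."""
        rows = [
            (method, path, s["count"], s["total"] / s["count"], s["max"])
            for (method, path), s in self.stats.items()
        ]
        return sorted(rows, key=lambda row: row[3], reverse=True)
```

Wrapping an application as `app = RequestTimer(app)` is enough to start collecting numbers; a real deployment would ship them to a central store rather than keep them in process memory.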
An example screenshot of Plurk's Request Monitor:
Like SQL Monitor and Request Monitor, the central logger gathers data from all our servers: it logs their errors and groups them together. It can quickly answer which errors are most common and gives debug information about them.
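Grouping errors could look roughly like the sketch below. It assumes errors are grouped by exception type and the function that raised them; that fingerprint, and the `ErrorAggregator` name, are illustrative choices rather than a description of how Plurk's central logger works.

```python
import traceback
from collections import Counter


class ErrorAggregator:
    """Groups reported exceptions by (exception type, raising function)."""

    def __init__(self):
        self.counts = Counter()
        self.hosts = {}    # group key -> set of hosts that reported it
        self.samples = {}  # group key -> most recent formatted traceback

    def report(self, host, exc):
        frames = traceback.extract_tb(exc.__traceback__)
        where = frames[-1].name if frames else "<unknown>"
        key = (type(exc).__name__, where)
        self.counts[key] += 1
        self.hosts.setdefault(key, set()).add(host)
        # Keep the latest traceback as debug information for the group.
        self.samples[key] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        return key

    def most_common(self, n=10):
        """The n most frequent error groups as ((type, function), count) pairs."""
        return self.counts.most_common(n)
```

In practice each server would send its errors over the network to one aggregator; here `report()` is called directly to keep the example self-contained.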
An example screenshot of Plurk's Central logger:
Most of these tools are not that hard to make, but they provide essential information that gives you an overview and helps you create a more robust and healthy product.