For the past few weeks, alongside preparing the stable release of LTTng 2.0 (especially the Babeltrace API part) and developing LTTngTop, I have been experimenting with Google Rocksteady.
This application is the monitoring solution used at Google, so it is designed to be highly scalable and distributed.
The basic concept is that the servers run some daemons (not open source) which generate metrics (based on every criterion we can imagine on a system and its applications). These metrics are sent to a message queueing daemon (RabbitMQ in this case) and relayed to Rocksteady.
From the website: "Rocksteady is a java application that reads metrics from RabbitMQ, parse them and turn them into events so Esper(CEP) can query against those metric and react to events match by the query."
I haven't started experimenting with CEP yet, but it is my next step. For now I have tested Graphite, the graphing component that comes with Rocksteady, and it is really interesting.
To do that, I created a new branch of lttng-graph (on my public git) named rocksteady, combined with a script that records a kernel trace for 1 minute. Once the 1-minute trace is recorded, lttng-graph reads the trace, counts the number of events of each type and the number of events per process, sends all of that to the message queue, and the tracing restarts. The graphing tool Graphite listens on that queue and creates a graph for each "metric". The metrics are in the format:
For example:
As of now, the parsing is really not efficient (the trace is written to disk and then parsed at 100% CPU), but the proof of concept is really interesting.
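To make the counting step more concrete, here is a minimal sketch in Python. It is not the lttng-graph code: the function names, the `prefix` parameter and the metric paths are all hypothetical, and the output lines assume Graphite's plaintext convention of `path value timestamp`, which may differ from the format actually used in the rocksteady branch. It only shows how per-event-type and per-process counts from a parsed trace could be turned into one metric line each.

```python
import re
from collections import Counter


def sanitize(name):
    """Replace characters that would break a dotted metric path."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", name)


def count_events(events):
    """Count events per type and per process.

    `events` is an iterable of (event_name, process_name) tuples,
    a stand-in for whatever the trace reader yields.
    """
    per_type = Counter()
    per_process = Counter()
    for event_name, process_name in events:
        per_type[event_name] += 1
        per_process[process_name] += 1
    return per_type, per_process


def to_metric_lines(prefix, per_type, per_process, timestamp):
    """Format the counters as 'path value timestamp' lines (assumed format)."""
    lines = []
    for name, count in sorted(per_type.items()):
        lines.append(f"{prefix}.events.{sanitize(name)} {count} {timestamp}")
    for name, count in sorted(per_process.items()):
        lines.append(f"{prefix}.processes.{sanitize(name)} {count} {timestamp}")
    return lines


if __name__ == "__main__":
    # Hypothetical sample standing in for one minute of kernel trace.
    sample = [
        ("sched_switch", "bash"),
        ("sys_read", "bash"),
        ("sched_switch", "firefox"),
    ]
    per_type, per_process = count_events(sample)
    for line in to_metric_lines("host1", per_type, per_process, 1000):
        print(line)
```

Publishing these lines to the message queue is left out on purpose, since the exchange and routing setup depends on how RabbitMQ and Rocksteady are configured.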
As we can see in the attached file, the web CLI of Graphite lets us quickly display a graph with the events we want, so we can visually correlate virtually any combination of metrics.
The next steps are:
- efficient parsing (live parsing; the code will be shared with LTTngTop as soon as I figure out how to hack a semi-clean lttng-tools API)
- reading about and experimenting with CEP, because with this tool we can correlate multiple metrics programmatically and generate alerts on triggers. For example, they use it at Google to alert when the page display latency is too high: an alert is sent quickly along with the factors that could impact the delay, so they instantly know why something went wrong and can even predict the next occurrence. I really want to have that based on kernel tracing!
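To illustrate the kind of rule described in the last item, here is a rough sketch in plain Python. It does not use Esper or its query language, and it is not how Rocksteady works internally: the class name, the metric names and the thresholds are all made up for the example. The idea is simply to keep the metrics seen during a short time window and, when a latency metric crosses a threshold, to return an alert that carries those recent metrics as context.

```python
from collections import deque


class LatencyRule:
    """Alert when a latency metric exceeds a threshold, with recent context.

    All names and values are hypothetical; a real CEP engine would express
    this as a declarative query over event streams.
    """

    def __init__(self, latency_metric, threshold, window):
        self.latency_metric = latency_metric
        self.threshold = threshold
        self.window = window  # seconds of context to keep
        self.recent = deque()  # (timestamp, name, value) of other metrics

    def on_metric(self, timestamp, name, value):
        """Feed one metric; return an alert dict or None."""
        # Drop context entries that fell out of the time window.
        while self.recent and self.recent[0][0] < timestamp - self.window:
            self.recent.popleft()

        if name == self.latency_metric:
            if value > self.threshold:
                return {
                    "timestamp": timestamp,
                    "latency": value,
                    "context": [(n, v) for _, n, v in self.recent],
                }
            return None

        # Any other metric is kept as potential context for a later alert.
        self.recent.append((timestamp, name, value))
        return None


if __name__ == "__main__":
    rule = LatencyRule("page.latency", threshold=100, window=10)
    rule.on_metric(0, "cpu.load", 90)
    rule.on_metric(5, "io.wait", 40)
    print(rule.on_metric(12, "page.latency", 250))
```

In this sketch the context is just "whatever else was reported recently". Deciding which of those metrics actually explain the latency is the part a CEP query would have to express.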