I recently went hunting for an open-source monitoring framework to monitor a mix of external and internal servers.
Initially, I was interested in Bosun, which seems quite impressive and powerful. But I quickly ran into a few shortcomings for my specific purpose. Specifically, its web ui has no built-in users or security. I was hoping to host this remotely (on Digital Ocean) and so needed a login. That’s obviously possible using just the security settings in your web server, but then the same question applies to the message queue, or to whatever other route stats take into the system. Also, at the time, Prometheus didn’t actually turn up in my searching - maybe they need to watch their SEO.
Wait, you say, why not Nagios? We used Nagios at my previous org, and while many of the other projects out there borrow at least some concepts from it (Sensu included), I just couldn’t bring myself to install and run it again. It would have absolutely done the job, but a mix of wanting to try something new and feeling there had to be something better out there led me to set an early criterion: I would absolutely not use Nagios.
So somewhere along the way I found Sensu, and this looked quite interesting. It had a lot of things that interested me:
- Built on Ruby, a language I know really well
- Provides straight monitoring (run a script; if the output is bad, raise an alarm), stats collection, and monitoring of those stats (raise an alarm based on a trend)
- Already has an OSS web ui, Uchiwa
- Community/Enterprise model (pay if you want/need to)
- That often missing gem, particularly compared to some other similar projects out there: really clear documentation
Enterprise vs Community
Sensu uses an Enterprise vs Community business model. There is an Enterprise version that sounds like it comes with a better dashboard and many plugins already built in. I absolutely don’t mind paying for that type of support, but in this particular case my budget was 0, since this was essentially a pet project of mine and not a company-backed endeavor. Plus, the community version is a great way to try before you buy, to see whether you really do like the software. I add this only to say that, after a few months of using it, I am very happy with Sensu, and if you need support and backing for your monitoring software, I would definitely recommend going with their Enterprise version.
For those on the community side though, below I will take a quick look at what/how I installed to mold the community version into what I wanted.
Installing
I fired up an Ubuntu 14.04 instance on Digital Ocean, as I wanted to host this monitoring outside of my own domain. That way, it’s monitoring things like my ISP connection and still has a way to email me if that goes down - it’s not trapped behind my own firewall, unable to notify me if that very same firewall is the thing not working. Setting it up remotely adds one complexity: it’s possible that the virtual machine or network connection at Digital Ocean is the thing that is down, but in my case that risk was acceptable. If that risk is not acceptable for you, you can install a second monitoring node inside your firewall to “watch the watcher”.
Install Sensu Itself
Here is where Sensu’s own clear and understandable documentation really comes into play. They have an excellent installation guide, so start there.
That should help you get installed:
- Sensu Core, the monitoring daemon itself
- RabbitMQ, which Sensu uses for message queuing
- Redis, which Sensu uses on the back end
- Sensu Client, which is the same software you run on each node you want to monitor
- Uchiwa, if you choose, as a dashboard and web ui to Sensu Core (the Sensu install docs link over to the Uchiwa install docs)
During the install tutorial, the documentation will have you use a few simple shell scripts to do things like cpu or memory monitoring. It’s good to install those and learn your way around Sensu Core, but don’t worry, you won’t be stuck with them. There is actually a large collection of ready-to-use Sensu plugins, in the form of ruby gems, that do a more thorough job of that type of monitoring. That is where I went next.
More Monitors and plugins
Of course, probably the next thing you will want is to monitor more things than the install tutorial shows you. The Sensu Plugins home on github has them all. Some have fairly sparse documentation, but they will start to make more sense the more you get to know how Sensu works. Most include both monitoring scripts, which you will use directly in your /etc/sensu/conf.d folder via json configurations, and metric scripts, which output metrics in graphite format for import directly into graphite.
Since they all come as ruby gems, installing them is just a matter of a gem install. Remember that once you decide on your suite of monitors, every client that you monitor will need these same gems installed. I will talk about that a little more later.
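To make the json configuration part a little more concrete, here is a minimal sketch that generates a standard check definition. The check name, the check-cpu.rb thresholds, the "base" subscription and the "mailer" handler are all assumptions chosen for illustration, so adjust them to whatever plugins and handlers you actually install.

```python
import json


def build_check(name, command, subscribers, interval=60, handlers=("default",)):
    """Return a Sensu check definition wrapped in the top-level "checks" key."""
    return {
        "checks": {
            name: {
                "command": command,
                "subscribers": list(subscribers),
                "interval": interval,  # seconds between check runs
                "handlers": list(handlers),
            }
        }
    }


# Hypothetical example values: script, thresholds, subscription and handler
# are assumptions, not taken from a real setup.
cpu_check = build_check(
    name="check_cpu",
    command="check-cpu.rb -w 80 -c 90",
    subscribers=["base"],
    handlers=["mailer"],
)

if __name__ == "__main__":
    # Save the printed JSON as, for example, /etc/sensu/conf.d/check_cpu.json
    print(json.dumps(cpu_check, indent=2))
```

Generating the files from a script like this is optional, but it keeps the definitions consistent once you have more than a handful of checks.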
Graphite and Stats
After the steps above, you reach the end of what the Sensu tutorial itself will teach you. But you still have some things you want to get going: email alerts, metrics in graphite, etc. I found those pretty easy to get going, via the rough steps below.
Install Graphite
You could have Sensu import stats collected from the clients into anything that understands graphite format stats (which is an extremely straightforward format). I opted to just use Graphite initially, rather than jump to some alternative or to even place statsd between Sensu and Graphite. I wanted to start there and grow into anything more complex that I needed.
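To show just how straightforward the format is: each metric is one line of text containing a dotted path, a value, and a Unix timestamp. A minimal sketch, with a made-up metric path:

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as "<path> <value> <unix timestamp>"."""
    if timestamp is None:
        timestamp = int(time.time())
    return "{} {} {}".format(path, value, int(timestamp))


# "web01.cpu.user" is a hypothetical metric path used only for illustration.
line = graphite_line("web01.cpu.user", 12.5, 1450000000)
print(line)  # prints: web01.cpu.user 12.5 1450000000
```

Anything that can print lines like this can feed graphite, which is why so many of the Sensu metric plugins are tiny scripts.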
I did find the initial setup of graphite a little challenging, though; the graphite manual on its own was a little too broad. For instance, I initially tried to use SQLite for the user database, and for whatever reason that was not working for me. I quickly found an alternative set of instructions from Digital Ocean, though, and these are crystal clear and focused on Ubuntu 14.04 - easy!
Getting Metrics from Sensu into Graphite
There are a number of ways to do this, but a little googling turned up Sensu Metrics Relay, I guess at one time also called Wizard Van (go figure). It looks a little abandoned but actually works just fine for me as-is. The github page has install steps - I git cloned it and followed those.
If anyone has a better recommendation for this link in the chain, or a more official way, I would love to hear about it! But as I said, this method worked for me. Now, in the /etc/sensu/conf.d json file for anything measuring metrics, I just include a handler of relay and the stats go into graphite.
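As a rough sketch, a metric check differs from a normal check in two attributes: a type of metric, so every result is handled rather than only failures, and the relay handler. The check name, the metrics-cpu.rb script and the "base" subscription below are assumptions for illustration.

```python
import json

# Hypothetical metric check: name, command and subscription are assumptions.
metric_check = {
    "checks": {
        "metrics_cpu": {
            "type": "metric",       # pass every result to the handler, not just failures
            "command": "metrics-cpu.rb",
            "subscribers": ["base"],
            "interval": 60,
            "handlers": ["relay"],  # the handler name mentioned above
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(metric_check, indent=2))
```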
Configure Graphite
I also, of course, had to configure /etc/carbon/storage-schemas.conf for what I wanted my storage policies to be. This is one downside of hosting on Digital Ocean: depending on how many clients you are monitoring and on your settings in storage-schemas, you can chew up storage space fairly fast.
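It helps to estimate the disk cost of a retention policy before committing to it. Whisper files are fixed size: a 16-byte file header, 12 bytes of header per archive, and 12 bytes per stored point. The sketch below applies that arithmetic to a retentions string. The policy "60s:7d,10m:1y" and the figure of 20 metrics per client are assumptions for illustration, and the parser only handles the "precision:duration" form with unit suffixes.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(token):
    """Convert a token such as "10s" or "7d" to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", token)
    if not match:
        raise ValueError("unsupported token: " + token)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def whisper_bytes(retentions):
    """Estimate the size in bytes of one Whisper file for a retentions string."""
    archives = retentions.split(",")
    points = 0
    for archive in archives:
        precision, duration = archive.strip().split(":")
        points += to_seconds(duration) // to_seconds(precision)
    # 16-byte file header + 12 bytes per archive header + 12 bytes per point
    return 16 + 12 * len(archives) + 12 * points


if __name__ == "__main__":
    per_metric = whisper_bytes("60s:7d,10m:1y")  # hypothetical policy
    clients, metrics_per_client = 70, 20         # 20 per client is an assumption
    total = per_metric * clients * metrics_per_client
    print(per_metric, "bytes per metric,", total, "bytes in total")
```

With those assumed numbers, each metric costs about 750 KB and 1,400 metrics come to roughly 1 GB, allocated up front as soon as each metric is first seen.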
Firewalls and security
A last step for me, since I am hosting remotely, was to configure ufw with what I wanted and needed open, including the ports for RabbitMQ so remote clients can send messages. I made a few application config files in /etc/ufw/applications.d to cover rabbitmq, uchiwa, and apache, which was hosting the graphite web ui. Note you can set an authentication step for RabbitMQ too, so while you have the port open through ufw, each client still has to authenticate to RabbitMQ before it sends its messages.
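The application files are small INI-style profiles with a title, a description and a ports line. Here is a minimal sketch that renders one. The profile name and port 5671/tcp (RabbitMQ's usual TLS port; plain AMQP normally listens on 5672) are assumptions, so use whatever your RabbitMQ listener is actually bound to.

```python
import configparser
import io


def ufw_profile(name, title, description, ports):
    """Render a ufw application profile as INI text."""
    parser = configparser.ConfigParser()
    parser[name] = {"title": title, "description": description, "ports": ports}
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the profiles that ship with ufw.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Hypothetical profile: name, text and port are assumptions for illustration.
profile = ufw_profile(
    name="RabbitMQ",
    title="RabbitMQ",
    description="Message queue used by Sensu clients",
    ports="5671/tcp",
)

if __name__ == "__main__":
    print(profile)
```

Saved as a file under /etc/ufw/applications.d, a profile like this should then show up in `ufw app list` and can be opened with `ufw allow RabbitMQ`.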
Graphite’s Web UI
While graphite has users and roles for its little web ui, the guest user seems to actually have read-only access to everything. So, I added an authentication block to my apache config file to require a login to the graphite web ui.
My alert method of choice was going to be email, so I installed sensu-plugins-mailer from the plugins repo above. However, that requires an actual mailer installed. Once again, Digital Ocean has a really great tutorial on setting up postfix on Ubuntu 14.04, so I followed that.
Setting Up More Clients
By this point, I seemed to be in pretty good shape: monitor, stats, hosted remotely but secured. Next step is to get all this installed on all the remote clients I wanted to monitor.
I had a mix of Ubuntu 12.04, 14.04 and 15.x clients out there. That means they have access to different versions of ruby, if I install from apt. I also knew that most of these clients did not have ruby installed to start.
What I opted to do was write a bootstrap script in Python, which would:
- Detect the Ubuntu version, and install the appropriate version of Ruby from apt
- Install Sensu Client, and the list of gems for the monitoring and stats plugins I wanted
- Write out a config file to /etc/sensu to get started
This ended up working out really well: the install process is just a matter of fetching the script and running it, something that can be expressed in a single line of shell. It went through a few revisions, as I will mention below, but for the most part I didn’t have to circle back to many servers to tweak the setup.
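A minimal sketch of that kind of bootstrap logic might look like the following. It is an illustration rather than the original script: the release-to-package mapping, the gem name, the "base" subscription, and the assumption that the Sensu apt repository is already configured are all hypothetical, so check them against your own systems. It only prints the commands unless dry_run is turned off.

```python
import json
import subprocess

# Hypothetical mapping of Ubuntu release to the apt package providing Ruby.
RUBY_PACKAGES = {"12.04": "ruby1.9.1", "14.04": "ruby2.0"}
DEFAULT_RUBY_PACKAGE = "ruby"


def parse_release(lsb_text):
    """Extract DISTRIB_RELEASE (e.g. "14.04") from /etc/lsb-release contents."""
    for line in lsb_text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "DISTRIB_RELEASE":
            return value.strip().strip('"')
    raise ValueError("DISTRIB_RELEASE not found")


def plan_commands(release, gems):
    """Return the list of commands to run for a given Ubuntu release."""
    ruby_package = RUBY_PACKAGES.get(release, DEFAULT_RUBY_PACKAGE)
    commands = [
        ["apt-get", "update"],
        # Assumes the Sensu apt repository has already been added.
        ["apt-get", "install", "-y", ruby_package, "sensu"],
    ]
    commands += [["gem", "install", gem] for gem in gems]
    return commands


def client_config(name, address, subscriptions):
    """Build the JSON for a starter client config file under /etc/sensu."""
    client = {"name": name, "address": address, "subscriptions": list(subscriptions)}
    return json.dumps({"client": client}, indent=2)


def run(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command)


if __name__ == "__main__":
    sample = "DISTRIB_ID=Ubuntu\nDISTRIB_RELEASE=14.04\n"  # stand-in for /etc/lsb-release
    run(plan_commands(parse_release(sample), ["sensu-plugins-cpu-checks"]))
    print(client_config("web01", "10.0.0.5", ["base"]))
```

Keeping the planning step separate from the execution step makes it easy to check what a new Ubuntu release would do before running it on a real server.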
Troubleshooting
Of course, as with all things, this setup wasn’t without its problems. One issue I ran into was some of the clients failing their heartbeat check, despite Sensu Client running just fine on them. It turns out this was due to differences in system clocks - the heartbeat looked old by the time it arrived at Sensu Core because the client’s clock was wrong. So, I added an ntp install to my bootstrap script above. Those clients need the correct time anyway, so ultimately this was a good catch!
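The failure mode is easy to reproduce on paper. The heartbeat carries a timestamp from the client’s clock, and the server compares it with its own clock, so a slow client clock looks exactly like a late heartbeat. A minimal sketch, assuming warning and critical thresholds of 120 and 180 seconds (illustrative values, so check your own keepalive settings):

```python
def heartbeat_age(server_now, client_timestamp):
    """Age of a heartbeat as the server sees it, in seconds."""
    return server_now - client_timestamp


def heartbeat_state(age, warning=120, critical=180):
    """Classify a heartbeat age against the (assumed) thresholds."""
    if age >= critical:
        return "critical"
    if age >= warning:
        return "warning"
    return "ok"


server_now = 1450000000    # server clock when the heartbeat arrives
true_send_time = server_now - 1  # one second in transit
clock_skew = 200           # client clock runs 200 seconds slow

# Accurate client clock: the heartbeat is one second old.
print(heartbeat_state(heartbeat_age(server_now, true_send_time)))               # prints: ok
# Skewed client clock: the same heartbeat appears 201 seconds old.
print(heartbeat_state(heartbeat_age(server_now, true_send_time - clock_skew)))  # prints: critical
```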
Network Metrics
While the TCP metrics collection from the Sensu Plugins repo above is good, it does not include average bit rate. This is because to measure bit rate, you must measure it over a period of time. You could possibly use Carbon or Statsd to do this - receive a point-in-time measurement of bits being delivered, and turn that over a period of time into bit rate. I opted to go another way though, which is to use aggregation on the client (via the vnstat library) and then input the already-aggregated bit rate into graphite. I wrote a sensu plugin of my own to do that, using vnstat.
There is a downside here: I have pushed the work of the aggregation onto each of my client cpus. Admittedly, that work is tiny, so it’s not that big a deal, but in my list of possible future improvements below I do include the idea of implementing what I mention above - have each client report a true point-in-time measurement and do the aggregation work on the central graphite server.
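Whichever side does the work, the calculation itself is just the difference between two readings of a byte counter, divided by the time between them. A minimal sketch follows; this is not the vnstat-based plugin described above. The helper reads the kernel’s per-interface counter from /sys/class/net on Linux, and the interface name is an assumption.

```python
def read_rx_bytes(interface="eth0"):
    """Read the received-bytes counter for an interface (Linux only).

    The interface name is an assumption; adjust it for your system.
    """
    path = "/sys/class/net/{}/statistics/rx_bytes".format(interface)
    with open(path) as counter:
        return int(counter.read().strip())


def bits_per_second(bytes_before, bytes_after, seconds):
    """Average bit rate between two counter samples taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if bytes_after < bytes_before:
        # The counter wrapped or the interface was reset; skip this interval.
        raise ValueError("counter went backwards")
    return (bytes_after - bytes_before) * 8 / seconds


# 3,750,000 bytes received over 60 seconds
print(bits_per_second(1000000, 4750000, 60))  # prints: 500000.0
```

Reporting the raw counter and letting the central server take the difference is the point-in-time approach; doing the subtraction on the client is the aggregated one.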
Future improvements
So with all that, we have a pretty functional monitoring and stats system:
- Now monitoring about 70 servers for me
- Monitoring mem usage, cpu spiking, key pids, etc. on each server
- Have historic stats of cpu usage, memory usage, network usage
- Can view the stats via the graphite dashboard, can view the state of alerts via Uchiwa
But there are plenty of things that could be improved. Here are some thoughts on my own next steps.
Less Intensive Stats Gathering
Running a separate ruby script for every stat I want to collect is taking a small toll on the cpu of each client. I have measured it at around 1% of a quad core. I think this could be reduced using something like collectl in place of all those scripts.
I think it would be possible to have collectl input directly to the local sensu-client and have the sensu-client relay that into graphite over the existing RabbitMQ. This would reduce the needs on firewall and port setups between each client and sensu.
Better Dashboard for Graphs
Obviously the dashboard included with graphite is a bit cumbersome - it has a few more features, ironically enough, than I really need. I really just want the same few graphs for every system - such as seeing cpu usage across every system to spot overworked systems. So a quick little web interface to do just that for a few key graphs seems like a good idea.
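Such an interface could be little more than a page of image links to graphite’s render endpoint. A minimal sketch of building one of those links, where the base URL, the host names and the "servers.<host>.cpu.user" metric naming are all assumptions for illustration:

```python
from urllib.parse import urlencode


def render_url(base_url, targets, window="-24h", width=800, height=300):
    """Build a graphite /render URL that draws all targets on one graph."""
    params = [("target", target) for target in targets]
    params += [("from", window), ("width", width), ("height", height)]
    return base_url.rstrip("/") + "/render?" + urlencode(params)


# Hypothetical hosts and metric paths.
hosts = ["web01", "web02", "db01"]
cpu_targets = ["servers.{}.cpu.user".format(host) for host in hosts]
url = render_url("https://graphite.example.com", cpu_targets)

if __name__ == "__main__":
    print(url)
```

One URL per key metric, each covering every host, would give the at-a-glance view of overworked systems without the rest of the dashboard.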
Dead Man’s Switch
In the sensu docs they share a thought on how to make a Dead Man’s Switch using the system. That would be great for a few reporting and backup jobs I have running.
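As I understand the idea, a job reports its own success to the local sensu-client socket along with a ttl, and Sensu raises an alert if no fresh result arrives before the ttl expires. A minimal sketch, assuming the client socket is listening on its usual local port 3030; the job name and the 25 hour ttl are made up for illustration.

```python
import json
import socket


def build_result(name, ttl_seconds, output="job completed", status=0):
    """Build a check result that goes stale after ttl_seconds."""
    return json.dumps(
        {"name": name, "output": output, "status": status, "ttl": ttl_seconds}
    )


def send_result(payload, host="127.0.0.1", port=3030):
    """Send the result to the local sensu-client socket (port is an assumption)."""
    with socket.create_connection((host, port), timeout=5) as connection:
        connection.sendall(payload.encode("utf-8"))


# Hypothetical nightly job: alert if it has not reported within 25 hours.
payload = build_result("nightly_backup", ttl_seconds=25 * 3600)

if __name__ == "__main__":
    print(payload)
    # Call send_result(payload) as the last step of the job itself, so a job
    # that never finishes never reports in.
```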
Conclusions
After more than 6 months of use, I am very happy with Sensu. It has scaled for me with little effort to 70 monitored systems. I can definitely see where scaling higher might require a little work (or at least a larger Digital Ocean instance), but for my scale it works great.