Friday, December 12, 2014

Home Stretch

Demo

Since my employer has an interest in Sensu and I have been doing my work within their network, I had the opportunity to explain what I have worked on so far and how the environment is configured, and to discuss next steps and future planning. This project will definitely continue after my course ends next week, and it has been a valuable learning experience. Here is a summary of what was covered in the demo:

What is Sensu: Sensu is a system monitoring solution designed for the cloud. Sensu scales easily and provides a simple web UI to visually 'connect the dots' on what is going on within the monitored environments. One big advantage is that, since it was designed for the cloud, clients can automatically check themselves in and add themselves to your monitoring. The Sensu project has come a long way and, compared to other monitoring solutions like Nagios, does many things better, but it may be lacking in a few aspects. (This may just be due to a lack of knowledge on my part, or I may need to explore additional docs on what Sensu is capable of.)

How is Sensu set up: The Sensu environment I have set up is an 8-machine environment that includes 3 Sensu servers (which also run the Sensu API and the Uchiwa dashboard), 2 RabbitMQ servers, and 3 Redis servers.
Currently Sensu is set up in a centralized fashion with directly connected clients. This means that the clients run checks locally (standalone checks) and publish the results to RabbitMQ, where they are picked up by the Sensu server, which handles notification routing. The check data is also stored in Redis, and the dashboard (Uchiwa) is updated through the Sensu API. Setting up Sensu this way prevents arbitrary and potentially malicious checks from being pushed out to clients after a system breach, and it also decreases infrastructure complexity.
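
To make the standalone idea concrete, here is a minimal sketch of what a standalone check definition might look like, built in Python and printed as JSON. The check name, the plugin path, the interval, and the handler name are placeholders I picked for illustration and are not taken from my actual configuration.

```python
import json


def standalone_check(name, command, interval, handlers):
    """Build a Sensu-style check definition that the client schedules itself."""
    return {
        "checks": {
            name: {
                "command": command,      # run locally on the client
                "standalone": True,      # the client schedules it, not the server
                "interval": interval,    # seconds between runs
                "handlers": handlers,    # handlers are applied server side
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    definition = standalone_check(
        name="check_disk",
        command="/etc/sensu/plugins/check-disk.rb",
        interval=60,
        handlers=["mailer"],
    )
    print(json.dumps(definition, indent=2))
```

Because the definition lives on the client, the server never has to publish a command for the client to execute, which is the property that matters for the breach scenario above.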

High Availability: High availability is achieved by:
  • Sensu: Master election is internal to Sensu, with no additional configuration needed from the user. It works because the Sensu servers are aware of each other through their connections to RabbitMQ and Redis. Master election and failover are relatively seamless from the user's point of view.
  • RabbitMQ: RabbitMQ has cluster failover built in, and Sensu supports it. Unfortunately, the Sensu Puppet module does not, and a fairly significant rework would be needed to get it functioning properly. As a workaround, load balancing the 2 RabbitMQ instances will provide HA.
  • Redis: To achieve a cluster-like result, we are using Sentinel to provide master election and failover. Sentinel is built into newer versions of Redis and uses a quorum-based election to choose a new master when the current master fails.
  • NOTE: These three components will all eventually be load balanced to provide additional protection against machine or service failure.
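
The quorum idea behind Sentinel can be sketched in a few lines of Python. This is a simplified model of the voting rules as I understand them, not Sentinel's actual implementation, and the numbers in the example (3 Sentinels, quorum of 2) are assumptions rather than my real settings.

```python
def objectively_down(down_votes, quorum):
    """The master is flagged as down once at least `quorum` Sentinels agree."""
    return down_votes >= quorum


def failover_authorized(leader_votes, total_sentinels, quorum):
    """A failover needs a majority of all Sentinels, and never fewer than the quorum."""
    majority = total_sentinels // 2 + 1
    return leader_votes >= max(majority, quorum)


if __name__ == "__main__":
    # Assumed setup: 3 Sentinels with a quorum of 2.
    print(objectively_down(down_votes=2, quorum=2))
    print(failover_authorized(leader_votes=2, total_sentinels=3, quorum=2))
    print(failover_authorized(leader_votes=1, total_sentinels=3, quorum=2))
```

The point of the second function is that a single Sentinel cut off from the others cannot promote a new master on its own.
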
The Dashboard (Uchiwa): I went over all the tabs within Uchiwa and the different features the service provides. Some noteworthy features are:
  • Overall Look/Feel: The look and feel of Uchiwa is very slick, modern, and simple. It provides useful information while taking a minimalist approach. 
Going from a Nagios UI

To Uchiwa's

  • Stashes: Stashes are used to schedule downtime, and can be used to silence a check or a machine from alert notification. While a check/machine is silenced (stashed) the event data will still show, but all alert handling is turned off to reduce noise during things such as maintenance. 
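
As a rough illustration, a silence stash is just a small JSON document stored under a path that names the client and, optionally, the check. The sketch below only builds that document; it does not send it anywhere. The "silence/<client>/<check>" path convention and the "expire" field reflect my understanding of the Sensu API's stashes endpoint, and the client name, check name, and content fields in the example are made up.

```python
import json
import time


def silence_stash(client, check=None, reason="", expire=None):
    """Build a stash payload that silences a whole client or one of its checks."""
    path = "silence/" + client
    if check:
        path += "/" + check
    payload = {
        "path": path,
        # Stash content is free-form; these keys are illustrative.
        "content": {"reason": reason, "timestamp": int(time.time())},
    }
    if expire is not None:
        payload["expire"] = expire  # seconds until the stash is removed
    return payload


if __name__ == "__main__":
    # Hypothetical client and check names.
    stash = silence_stash("web01", "check_disk", reason="maintenance", expire=3600)
    print(json.dumps(stash, indent=2))
```

Giving the stash an expiry means a forgotten maintenance silence does not hide alerts indefinitely.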

Moving Forward: We discussed that if we choose to move forward with a larger implementation of Sensu, it will take a pilot or test run to ensure Sensu can do what we need it to do. This test run would include standing up a production-like instance of Sensu and hammering on it with realistic events and event handling to put it through its paces. We would also like to replace the auth provided by Uchiwa, which supports only a single user, by putting LDAP with Apache in front of Uchiwa for authentication and then turning off Uchiwa's own auth. Since I have been able to get a couple of different handlers working (mailer, pagerduty) and have a good handle on checks, my supervisors and coworkers would like me to look at how to send event data to Graphite through Sensu to compare with our current approach. They would also like me to look into how Sensu checks handle dependencies.


Wednesday, December 3, 2014

December Update

Checks/Handlers

Finally, after a lot of troubleshooting and falling down some rabbit holes, I've been able to get checks and handlers working. Some problems arose while using some of the Yelp handlers within my environment. Some of these were the result of Yelp-specific code that still existed within the module, and others were due to misconfigurations on my end. With some help I was able to get the checks to call a handler that sends emails for alerts and resolutions. Currently only 5 checks are implemented across the 8 Sensu dev machines, and only one handler has been configured for the environment.
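
For anyone unfamiliar with how a handler fits in, here is a minimal sketch of the idea in Python: a pipe handler receives the event as JSON on standard input and decides what to do with it. This is not the Yelp handler or the mailer I am actually using; the event fields reflect my understanding of the Sensu event format, and the sample event is made up.

```python
import io
import json
import sys

# Check exit codes follow the Nagios convention.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def email_subject(event):
    """Turn a Sensu-style event into a one-line email subject."""
    client = event["client"]["name"]
    check = event["check"]["name"]
    if event.get("action") == "resolve":
        state = "RESOLVED"
    else:
        state = STATUS_NAMES.get(event["check"]["status"], "UNKNOWN")
    return "{} - {}/{}".format(state, client, check)


def handle(stream=sys.stdin):
    """Read one event from the stream, as a pipe handler would from stdin."""
    return email_subject(json.load(stream))


if __name__ == "__main__":
    # Hypothetical event, fed through a StringIO instead of real stdin.
    sample = {
        "client": {"name": "web01"},
        "check": {"name": "check_disk", "status": 2, "output": "disk 95% full"},
        "action": "create",
    }
    print(handle(io.StringIO(json.dumps(sample))))
```

A real handler would go on to send the email; the sketch stops at building the subject so it stays self-contained.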

HA Testing

Testing for High Availability within the environment was easy, but I ran into a few snags. With both Sensu and Redis Sentinel there were some issues that I caused through oversights. I had forgotten to lock Puppet on a few of the boxes, so the Puppet master would run against them, overwrite the firewall rules, and cause connection errors. Once I fixed that, master election in Sensu and Sentinel failover for Redis were working great.

Moving Forward

Checks/Handlers:
I'd like to get 2 more handlers up and running before presenting this project (Dec. 17th), and also add the Sensu client role to a random machine to make sure that client discovery is working and to test the ease with which Sensu can be implemented.
HA:
Get Sensu, Redis, and RabbitMQ load balanced to ensure failover and ease of access to the Uchiwa dashboard.
Security:
Implement LDAP with Uchiwa. From my understanding, this is done by putting LDAP and Apache in front of Uchiwa. Once LDAP and Apache are set up for auth, the auth provided by Uchiwa can be turned off. I'd also like to re-investigate some RabbitMQ-specific items. There are currently some issues on GitHub that I've been following, and I need to check on their progress.