Friday, December 12, 2014

Home Stretch

Demo

Since my employer has an interest in Sensu and I have been doing my work within their network, I had the opportunity to explain what I have worked on so far and how the environment is configured, and to discuss next steps and future planning. This project will definitely continue after my course is completed next week, and it has been a valuable learning experience. Here is a summary of what was covered in the demo:

What is Sensu: Sensu is a system monitoring solution designed for the cloud. Sensu scales easily and provides a simple web UI to visually 'connect the dots' with what is going on within the monitored environments. One big advantage is that, since it was designed for the cloud, clients can automatically check themselves in and add themselves to your monitoring. The Sensu project has come a long way and, compared to other monitoring solutions like Nagios, does things better, but may be lacking in a few aspects. (This may just be due to a lack of knowledge on my part, or I may need to explore additional docs for what Sensu is capable of.)

How is Sensu set up: The Sensu environment I have set up is an 8 machine environment that includes: 3 Sensu servers (which also run the Sensu API and Uchiwa dashboard), 2 RabbitMQ servers, and 3 Redis servers.
Currently Sensu is set up in a centralized fashion with directly connected clients. What this means is that the clients run checks locally (standalone checks) and publish results up to RabbitMQ (RMQ), where they are picked up by the Sensu server, which handles notification routing. The check data is also stored in Redis, and the dashboard (Uchiwa) is updated through the Sensu API. Setting up Sensu in this fashion prevents arbitrary and potentially malicious checks from being run as the result of a system breach, and it also decreases infrastructure complexity.
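
To make the standalone model a bit more concrete, here is a minimal sketch that builds the kind of JSON check definition a client would carry locally. The check name, plugin path, interval, and handler name are placeholders invented for the example, not values taken from this environment.

```python
import json


def standalone_check(name, command, interval, handlers):
    """Build a Sensu-style check definition that the client schedules itself."""
    return {
        "checks": {
            name: {
                "command": command,
                # standalone: the client schedules the check locally instead of
                # waiting for the server to publish a check request.
                "standalone": True,
                "interval": interval,  # seconds between runs
                "handlers": handlers,
            }
        }
    }


# Hypothetical example values; adjust to the real plugin path and handler.
definition = standalone_check(
    name="check_disk",
    command="/etc/sensu/plugins/check-disk.rb -w 80 -c 90",
    interval=60,
    handlers=["mailer"],
)
print(json.dumps(definition, indent=2))
```

The printed JSON is what would be dropped into the client's configuration directory (in our case by Puppet), so the server never has to tell the client what to run.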

High Availability: High availability is achieved by:
  • Sensu: master election is internal within Sensu with no additional configuration needed by the user. This is achieved by Sensu servers being aware of each other through checking connections to RabbitMQ and Redis. Master election and failover are relatively seamless in the eyes of the user.
  • RabbitMQ: RabbitMQ has cluster failover built in and Sensu supports it; unfortunately the Sensu Puppet module does not, and a fairly significant rework is needed to get it to function properly. To work around this pitfall, load balancing the 2 RabbitMQ instances will provide HA.
  • Redis: To achieve a clustering-like end goal, we are using Sentinel to provide master election and failover. Sentinel is built into newer versions of Redis and uses a quorum-based election to choose a new master in the event of a master failure.
  • NOTE: these three components will all be load balanced eventually to provide for additional protection during machine or service failure.
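
The quorum idea behind Sentinel can be sketched in a few lines. Sentinel flags a master as objectively down once at least a configured quorum of Sentinels agree, and a failover additionally needs a majority of all Sentinels to authorize it. The quorum value of 2 below is an assumption for illustration, and the sketch simplifies things by treating the Sentinels that report the failure as the ones available to vote.

```python
def objectively_down(down_reports, quorum):
    """True when enough Sentinels agree the master is unreachable."""
    return down_reports >= quorum


def can_fail_over(reachable_sentinels, total_sentinels):
    """A failover also needs a majority of all Sentinels to authorize it."""
    return reachable_sentinels > total_sentinels // 2


TOTAL = 3   # one Sentinel per Redis box
QUORUM = 2  # assumed value for illustration

for reports in range(TOTAL + 1):
    ok = objectively_down(reports, QUORUM) and can_fail_over(reports, TOTAL)
    print(f"{reports} of {TOTAL} Sentinels report the master down -> failover: {ok}")
```

With 3 Sentinels, losing a single box still leaves enough voters to elect a new master, which is the main reason for running an odd number of them.
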
The Dashboard - Uchiwa: Basically I went over all the tabs that reside within Uchiwa and the different features provided by the service. Some noteworthy features are:
  • Overall Look/Feel: The look and feel of Uchiwa is very slick, modern, and simple. It provides useful information while taking a minimalist approach. 
Going from a Nagios UI

To Uchiwa's

  • Stashes: Stashes are used to schedule downtime, and can be used to silence a check or a machine from alert notification. While a check/machine is silenced (stashed) the event data will still show, but all alert handling is turned off to reduce noise during things such as maintenance. 
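
As a rough sketch of what a silencing stash looks like, the snippet below builds a payload of the kind that would be POSTed to the Sensu API's /stashes endpoint. The "silence/<client>/<check>" path convention and the content fields reflect my understanding of how Uchiwa silences things, so treat them as assumptions; the client and check names are hypothetical.

```python
import json
import time


def silence_stash(client, check=None, reason="", expire=None):
    """Build a stash payload that silences a client, or one check on it."""
    path = f"silence/{client}" if check is None else f"silence/{client}/{check}"
    payload = {
        "path": path,
        "content": {"reason": reason, "timestamp": int(time.time())},
    }
    if expire is not None:
        payload["expire"] = expire  # seconds until the stash is removed
    return payload


# Hypothetical client and check names.
stash = silence_stash("sensu-mq1", "check_disk", "maintenance", 3600)
print(json.dumps(stash, indent=2))
```

Setting an expiry is handy for maintenance windows, since the silence lifts on its own instead of being forgotten.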

Moving Forward: We discussed that if we choose to move forward with a larger implementation of Sensu, it will take a pilot or test run to ensure Sensu can do what we need it to do. This test run would include standing up a production-like instance of Sensu and hammering on it with realistic events and event handling to put it through its paces. Also, moving forward, we would like to replace the auth provided by Uchiwa, which only allows for a single user, by putting LDAP with Apache in front of Uchiwa for authentication and then turning off Uchiwa's own. Since I have been able to get a couple of different handlers working (mailer, pagerduty) and have a good handle on checks, my supervisors/coworkers would like me to look at how to send event data to Graphite through Sensu to compare with our current approach. They would also like me to look into how Sensu checks handle dependencies.
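
For the Graphite piece, whatever route the data takes through Sensu, it eventually has to reach Graphite in its plaintext format: one "path value timestamp" line per metric. The sketch below only shows that formatting step; the "sensu." prefix, client name, and metric names are made up for the example.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}"


def check_result_to_metrics(client, check, values, timestamp):
    """Turn a dict of measurements into Graphite lines under a per-client prefix."""
    # Dots separate path components in Graphite, so replace them in hostnames.
    prefix = f"sensu.{client.replace('.', '_')}.{check}"
    return [graphite_line(f"{prefix}.{k}", v, timestamp) for k, v in sorted(values.items())]


# Hypothetical client, check, and values.
lines = check_result_to_metrics(
    "sensu1.example.com", "load", {"one": 0.42, "five": 0.37}, 1418342400
)
for line in lines:
    print(line)
```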


Wednesday, December 3, 2014

December Update

Checks/Handlers

Finally, after a lot of troubleshooting and falling down some rabbit holes, I've been able to get checks and handlers working. Some problems arose while using some of the Yelp handlers within my environment. Some of these were the result of Yelp-specific code that still existed within the module, and others were due to misconfigurations. With some help I was able to get the checks to call a handler that sends emails for alerts/resolutions. Currently only 5 checks are implemented across the 8 Sensu dev machines, and only one handler has been configured for the environment.
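
For anyone unfamiliar with how a handler sees an alert or resolution: Sensu pipes the event to the handler as JSON on STDIN. The sketch below is not the Yelp mailer, just a minimal illustration of reading an event and building an email subject from it; the sample event is invented, and a real handler would read from sys.stdin rather than a StringIO.

```python
import io
import json

# Nagios-style exit codes that Sensu checks use.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def email_subject(event):
    """Build a subject line from a Sensu event (alert or resolution)."""
    client = event["client"]["name"]
    check = event["check"]["name"]
    status = STATUS_NAMES.get(event["check"]["status"], "UNKNOWN")
    # Sensu sets action to "resolve" when a check returns to OK.
    prefix = "RESOLVED" if event.get("action") == "resolve" else "ALERT"
    return f"{prefix} - {client}/{check}: {status}"


def handle(stream):
    """Read one event as JSON from a stream and return (subject, body)."""
    event = json.load(stream)
    return email_subject(event), event["check"].get("output", "").strip()


# Invented sample event standing in for what Sensu would send on STDIN.
sample = io.StringIO(json.dumps({
    "action": "create",
    "client": {"name": "sensu-redis1"},
    "check": {"name": "check_disk", "status": 2, "output": "CRITICAL: / is 95% full\n"},
}))
subject, body = handle(sample)
print(subject)
print(body)
```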

HA Testing

Testing for High Availability within the environment was easy, but there were a few snags I ran into. With both Sensu and Redis Sentinel there were some issues that I caused through oversights. Basically, I forgot to lock Puppet on a few of the boxes, so the master would run against them, screw up the firewall, and cause connection errors. Once I fixed that, master election in Sensu and Sentinel failover for Redis were working great.

Moving Forward

Checks/Handlers:
I'd like to get 2 more handlers up and running before presenting this project (Dec. 17th), and also add the Sensu client role to a random machine to make sure that client discovery is working and to test the ease with which Sensu can be implemented.
HA:
Get Sensu, Redis, and RabbitMQ load balanced to ensure failover and for ease of access to the Uchiwa dashboard.
Security:
Implement LDAP with Uchiwa. From my understanding this is done through LDAP and Apache in front of Uchiwa. Once LDAP and Apache are set up for auth, the auth provided by Uchiwa can be turned off. Also, I'd like to re-investigate some RabbitMQ-specific stuff. There are currently some issues on GitHub I've been following, and I need to check the progress on those issues.

Thursday, November 20, 2014

Update

RabbitMQ Failover


So as of this post there is still no fix to support multiple broker connections within the Sensu Puppet module. I submitted an issue via GitHub and was asked if I could take a look at creating a PR to implement this update, but after looking at the module and talking with my coworker for some insight, it seems a considerable amount of work would be needed to get this fix in place, and it would more than likely introduce breaking changes within the module. So for right now RabbitMQ is in the same boat as Redis and Sensu and will need to be load balanced if I want failover and high availability. As of right now, load balancing these components has taken a back seat, and implementing checks and handlers has become my new focus.
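
To show what "multiple broker connections" means in practice, here is a small sketch of the general idea: hold a list of brokers and fall through to the next one when a connection fails. This is not how Sensu or the Puppet module implements it. The connect function is a stand-in for a real AMQP connection call, and the first broker is simply hard-coded to be "down".

```python
def connect_with_failover(brokers, connect):
    """Try each broker in order; return (broker, connection) for the first that works."""
    errors = []
    for broker in brokers:
        try:
            return broker, connect(broker)
        except ConnectionError as exc:
            errors.append(f"{broker}: {exc}")
    raise ConnectionError("all brokers failed: " + "; ".join(errors))


# Stand-in for a real AMQP connect call; here the first broker is "down".
def fake_connect(broker):
    if broker == "sensu-mq1":
        raise ConnectionError("connection refused")
    return f"connection to {broker}"


broker, conn = connect_with_failover(["sensu-mq1", "sensu-mq2"], fake_connect)
print(broker, "->", conn)
```

A load balancer in front of the brokers gets a similar result without the client needing to know about more than one address, which is why it works as a stopgap.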

Checks/Handlers

In order to implement checks/handlers I've decided (with guidance from my supervisor) to borrow from Yelp's sensu_handlers GitHub project to create handlers, and to use a style similar to the one Yelp uses in their puppet_monitoring_check project for checks. This will enable me to create a Hiera file with team-specific information and define which team each check applies to; that team is then passed to the handler to drive notification routing and event processing. This will also solve, at least in my opinion, the problem that Sensu Core has with not providing contact routing configuration out of the box.
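
The routing idea can be sketched as follows: team data lives in one place, each check carries a team name as a custom attribute, and the handler looks the team up to decide where to send things. The dictionary below stands in for the Hiera data, and the team names, addresses, and the "team" attribute name are all hypothetical rather than Yelp's actual schema.

```python
# Stand-in for the team data that would live in a Hiera file.
TEAMS = {
    "ops": {"email": "ops@example.com", "pagerduty": True},
    "dev": {"email": "dev@example.com", "pagerduty": False},
}


def notification_targets(event, teams=TEAMS):
    """Look up the check's team and return where its alerts should go."""
    team_name = event["check"].get("team")
    team = teams.get(team_name)
    if team is None:
        return []  # unknown or missing team: nothing to route to
    targets = [("mailer", team["email"])]
    # Page only for CRITICAL (status 2) and only for teams that opted in.
    if team["pagerduty"] and event["check"]["status"] == 2:
        targets.append(("pagerduty", team_name))
    return targets


event = {"check": {"name": "check_disk", "status": 2, "team": "ops"}}
print(notification_targets(event))
```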

Wednesday, November 5, 2014

Sensu Enterprise Announced

Sensu Enterprise
With the latest update to Sensu (0.16), there is now the option of purchasing Sensu Enterprise. Sensu Enterprise offers a slew of features not provided in the free version (Sensu Core), and it frustrates me a bit that some of these features need to be purchased. One such example is contact routing. This feature, provided only in Sensu Enterprise, allows you to create a contact or contact group, specify the medium used to reach that individual or group, and then add that contact to a check. I think of this feature as a basic necessity when it comes to system monitoring rather than an Enterprise feature. I do think some of the features provided in Enterprise are reasonable though, such as: support for alternate transport brokers, "out of the box" support for third-party integrations, and the support offerings that are expected with enterprise-level software packages (deployment assistance, premium support, etc.). The fact that contact routing is not supported in the Core software, in my personal opinion, just seems like bad business. If contact routing were supported in Core, it would make for a more complete open source system monitoring solution and give it another positive mark for those considering migrating from other monitoring software such as Nagios.

Monday, November 3, 2014

Milestones

Project Milestones

Below is a list of milestones that I have set for my project. I hadn't posted any of this yet because my milestones kept evolving with my project, but now I believe that I have enough done and a good enough handle on things to set my milestones down more officially. This list will continue to change, but hopefully I will just be adding details to the incomplete milestones as needed rather than completely restructuring the list.
  • [DONE] Read Sensu docs and get a feel for what it is and how it works
    • sensuapp.org/docs/latest/guide
  • [DONE] Setup basic 2 machine Sensu environment manually (1 server and 1 client)
    • [DONE] Sensu Server
      • RabbitMQ
      • Sensu API
      • Sensu client
      • Sensu server
      • Redis
    • [DONE] Sensu Client
      • Sensu client
      • connectivity to RabbitMQ
  • [DONE] Add Uchiwa dashboard to server for web UI.
    • provided in Sensu repo
  • [DONE] Puppet implementation and dev environment setup
    • [DONE] Create VMs via foreman
      • sensu[1,2,3]
        • sensu-server
        • sensu-api
        • sensu-client
        • uchiwa
      • sensu-mq[1,2]
        • RabbitMQ
        • sensu-client
      • sensu-redis[1,2,3]
        • Redis
        • Redis Sentinel
        • sensu-client  
    • [DONE (tentatively)] Initial Puppet Setup
      • Clone sensu puppet module from github
      • Clone uchiwa puppet module from github
      • Create sensu-server profile/role
      • Create sensu-client profile/role
      • Create sensu-redis profile/role
      • Create sensu-rabbitmq profile/role
      • Update existing redis module to allow for sentinel
    • [DONE (tentatively)] Initial Hiera Setup
      •  Create necessary files for each machine
  • Setup and implement Sensu check/handler structure in Puppet/Hiera
    • Modify Yelp's sensu_handlers to conform to our needs, and add any that may be missing.
      • [DONE] mailer
      • hipchat
      • pagerduty
    • Create custom sensu checks to work with custom handlers (borrow from Yelp's style)
    • [DONE] Create package for sensu handlers/plugins so we can manage the installation of these components easily via Puppet (there is an issue filed with sensu to create these packages, but no timeline or projected date has been given)
  • Testing/Troubleshooting
    • Run puppet (successfully) on all VMs (1st iteration with initial configs is done, need to implement checks/handlers before this can be considered 100% done.)
    • [DONE] Ensure necessary connections can be established
    • Ensure master selection/failover is working properly for RabbitMQ
    • [DONE] Ensure master selection/failover is working for Redis + Sentinel
      • Needs to be load balanced
    • [DONE (tentatively)] Ensure master selection/failover is working for Sensu-servers
      • Needs to be load balanced
    • Ensure checks/handlers are working as expected
      • Really hammer on handlers to make sure exponential backoff is working, or that handlers aren't blowing up hipchat/email/jira/etc.
    • Ensure Uchiwa dashboard is displaying information correctly
  • LDAP + Uchiwa 
    • Accomplished by using Apache as a proxy for authentication and turning off Uchiwa's native auth.
  • Code review(s)
  • Clean-up/Refactor Code
  • Explore Graphite integrations with Uchiwa
    • You can embed graphs within Uchiwa and jazz up the interface; more details on this can be found at http://roobert.github.io/
  • Explore integrations with Consul?
  • Explore creating a Sensu API command line tool via Ruby's OptParse?
  • Demo Sensu?
  • Presentation preparation
    • Provide documentation on how to start up Sensu manually, and provide well documented code in Puppet if clarification is needed.
    • Presentation (ppt?, oral?, whiteboard outline?, project map?)
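
On the exponential backoff mentioned in the testing milestones above: one simple way to get that behavior is to notify only when an event's consecutive occurrence count is a power of two. This is a generic sketch of the idea, not necessarily how the Yelp handlers implement it.

```python
def should_notify(occurrences):
    """Notify on the 1st, 2nd, 4th, 8th, ... consecutive occurrence of an event."""
    # A positive integer is a power of two when it has exactly one bit set.
    return occurrences > 0 and occurrences & (occurrences - 1) == 0


notified = [n for n in range(1, 33) if should_notify(n)]
print(notified)  # [1, 2, 4, 8, 16, 32]
```

With a check running every minute, that means 6 notifications in the first 32 minutes of an outage instead of 32, which is the kind of thing the "hammering" test should confirm.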

Monday, October 27, 2014

The Puppet Journey

Puppet implementation

Implementing Puppet started out as something that I believed would go fairly smoothly. However, there were several hiccups along the road that caused me some headaches, most of which were due to my own errors and the environment I was working in. Now I think I have it all fairly buttoned down and have a basic 3 machine Sensu environment set up that includes a Sensu server, a RabbitMQ server, and a Redis server (managed with Sentinel). The next step will be to expand this infrastructure to include 3 Sensu servers, 2 RabbitMQ servers, and 3 Redis servers (with Sentinel). This 8 machine environment will become the dev environment and will initially serve as a testing platform to get a feel for the ease of use of Sensu, and possibly to integrate Sensu with Consul later on. Getting the 8 VM environment up and running (correctly) will also mark the next milestone in my project.

Friday, October 3, 2014

Puppet and Sensu

Moving Forward: Implementing Puppet

For the most part I have a fairly decent handle on how Sensu works in terms of how events are routed and managed, and the different ways to implement checks and handlers. I've decided to move forward with my project and start implementing the Puppet module that is provided by Sensu on GitHub. This will mean creating Sensu profiles for both the server and client, as well as profiles to manage RabbitMQ and Redis. On a side note, after listening in on the Sensu presentation at PuppetConf, I've decided to use Redis Sentinel, which will provide cluster management for Redis and failover in the event the Redis master goes down.