Friday, December 12, 2014

Home Stretch

Demo

Since my employer has an interest in Sensu and I have been doing my work within their network, I had the opportunity to explain what I have worked on so far and how the environment is configured, and to discuss next steps and future planning. This project will definitely continue after my course ends next week, and it has been a valuable learning experience. Here is a summary of what was covered in the demo:

What is Sensu: Sensu is a system monitoring solution designed for the cloud. Sensu scales easily and provides a simple web UI to visually 'connect the dots' with what is going on within the monitored environments. One big advantage is that, since it was designed for the cloud, clients can automatically check themselves in and add themselves to your monitoring. The Sensu project has come a long way and, compared to other monitoring solutions like Nagios, does things better but may be lacking in a few aspects. (This may just be due to a lack of knowledge on my part, or I may need to explore additional docs for what Sensu is capable of.)

How is Sensu set up: The Sensu environment I have set up is an 8 machine environment that includes 3 Sensu servers (which also run the Sensu API and the Uchiwa dashboard), 2 RabbitMQ servers, and 3 Redis servers.
Currently Sensu is set up in a centralized fashion with directly connected clients. This means that the clients run checks locally (standalone checks) and publish the results to RabbitMQ (RMQ), where they are picked up by the Sensu server, which handles notification routing. The check data is also stored in Redis, and the dashboard (Uchiwa) is updated through the Sensu API. Setting up Sensu in this fashion prevents arbitrary and potentially malicious checks from being run as a result of a system breach, and it also decreases infrastructure complexity.
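As a rough sketch of what a standalone check looks like, the snippet below builds a check definition of the kind Sensu reads from its JSON config files. The check name, the plugin path, and the "mailer" handler name are placeholders chosen for illustration, not the definitions used in this environment.

```python
import json


def standalone_check(name, command, interval=60, handlers=("mailer",)):
    """Build a Sensu check definition that the client schedules itself."""
    return {
        "checks": {
            name: {
                "command": command,
                # standalone: the client runs the check on its own schedule
                # instead of waiting for a request published by the server
                "standalone": True,
                "interval": interval,  # seconds between runs
                "handlers": list(handlers),
            }
        }
    }


# Hypothetical check name and plugin path, for illustration only.
definition = standalone_check("check_example", "/etc/sensu/plugins/check_example.sh")
print(json.dumps(definition, indent=2))
```

Because the definition lives on the client, a compromised server cannot push an arbitrary command to it, which is the security benefit described above.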

High Availability: High availability is achieved by:
  • Sensu: master election is internal to Sensu, with no additional configuration needed by the user. This is achieved by the Sensu servers being aware of each other through their connections to RabbitMQ and Redis. Master election and failover are relatively seamless in the eyes of the user.
  • RabbitMQ: RabbitMQ provides cluster failover built in, and Sensu does support this feature. Unfortunately the Sensu Puppet module does not, and a fairly significant rework is needed to get it to function properly. To work around this pitfall, load balancing the 2 RabbitMQ instances will provide HA.
  • Redis: To achieve a clustering-like end goal, we are using Sentinel to provide master election and failover. Sentinel is built in to newer versions of Redis and uses a quorum election to choose a new master in the event of a master failure.
  • NOTE: these three components will all be load balanced eventually to provide for additional protection during machine or service failure.
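To make the Sentinel quorum idea a little more concrete, here is a minimal sketch of the two conditions as I understand them from the Redis Sentinel documentation: a master is flagged as down once at least `quorum` Sentinels agree, and a failover can only proceed when a majority of all Sentinels can vote. The quorum value of 2 is an assumption for illustration; the 3 Sentinels match the 3 Redis machines in this environment.

```python
def master_flagged_down(sentinels_reporting_down, quorum):
    """True once enough Sentinels agree that the master is unreachable."""
    return sentinels_reporting_down >= quorum


def failover_authorized(reachable_sentinels, total_sentinels):
    """True when a strict majority of all Sentinels can take part in the vote."""
    return reachable_sentinels > total_sentinels // 2


TOTAL_SENTINELS = 3
QUORUM = 2  # assumed value, not taken from the real configuration

# One Redis box (and its Sentinel) dies: the 2 survivors can still fail over.
print(master_flagged_down(2, QUORUM), failover_authorized(2, TOTAL_SENTINELS))
# Two boxes die: the lone survivor cannot promote a new master on its own.
print(master_flagged_down(1, QUORUM), failover_authorized(1, TOTAL_SENTINELS))
```

This is why an odd number of Sentinels is the usual choice: with 3, the environment tolerates the loss of one machine.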
The Dashboard - Uchiwa: Basically I went over all the tabs that reside within Uchiwa and the different features provided by the service. Some noteworthy features are:
  • Overall Look/Feel: The look and feel of Uchiwa is very slick, modern, and simple. It provides useful information while taking a minimalist approach. 
Going from a Nagios UI

To Uchiwa's

  • Stashes: Stashes are used to schedule downtime, and can be used to silence a check or a machine from alert notification. While a check/machine is silenced (stashed) the event data will still show, but all alert handling is turned off to reduce noise during things such as maintenance. 
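As a sketch of what a silencing stash might look like, the snippet below builds the JSON body that would be POSTed to the Sensu API's /stashes endpoint. The `silence/<client>/<check>` path convention and the `expire` field are how I understand stashes to work in this version of Sensu; the client name, check name, and content fields are made up for the example, and no request is actually sent.

```python
import json
import time


def silence_stash(client, check=None, reason="maintenance", expire=None):
    """Build a stash that silences a whole client or a single check on it."""
    path = "silence/" + client
    if check:
        path += "/" + check
    stash = {
        "path": path,
        # Free-form content; these two keys are illustrative choices.
        "content": {"reason": reason, "timestamp": int(time.time())},
    }
    if expire is not None:
        stash["expire"] = expire  # seconds until the stash removes itself
    return stash


# Hypothetical client and check names.
print(json.dumps(silence_stash("sensu-mq1", "check_example", expire=3600), indent=2))
```

Setting an expiry is a handy safety net so that a machine silenced for maintenance does not stay silent forever.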

Moving Forward: We discussed that if we choose to move forward with a larger implementation of Sensu, it will take a pilot or test run to ensure Sensu can do what we need it to do. This test run would include standing up a production-like instance of Sensu and hammering on it with realistic events and event handling to put it through its paces. Also, moving forward, we would like to replace the auth provided by Uchiwa, which supports only a single user, by putting LDAP with Apache in front of Uchiwa for authentication and then turning off Uchiwa's own auth. Since I have been able to get a couple of different handlers working (mailer, pagerduty) and have a good handle on checks, my supervisors/coworkers would like me to look at how to send event data to Graphite through Sensu to compare with our current approach. They would also like me to look into how Sensu checks handle dependencies.
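For the Graphite item, the data format itself is simple: Graphite's plaintext protocol takes one "path value timestamp" line per data point, by default on TCP port 2003. The sketch below only illustrates that format; the metric path and host are hypothetical, and how Sensu would actually be wired up to emit these lines is still something to investigate.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return "%s %s %d\n" % (path, value, ts)


def send_to_graphite(lines, host, port=2003):
    """Send already formatted lines to a Graphite (carbon) listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric path; nothing is sent over the network here.
print(graphite_line("sensu.dev.sensu1.load.one", 0.42, 1418342400), end="")
```

Calling `send_to_graphite([...], "graphite.example.com")` would ship the lines, assuming a carbon listener is reachable at that (made up) host.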


Wednesday, December 3, 2014

December Update

Checks/Handlers

Finally, after a lot of troubleshooting and falling down some rabbit holes, I've been able to get checks and handlers working. There were some problems that arose while using some of the Yelp handlers within my environment. Some of these problems were the result of Yelp-specific code that still existed within the module, and others were caused by misconfigurations. With some help I was able to get the checks to call a handler that sends emails for alerts/resolutions. Currently only 5 checks are implemented across the 8 Sensu dev machines, and only one handler has been configured for the environment.

HA Testing

Testing for High Availability within the environment was easy, but there were a few snags I ran into. With both Sensu and Redis Sentinel there were some issues that I caused through oversights. Basically, I forgot to lock Puppet on a few of the boxes, so the Puppet master would run against them, overwrite the firewall rules, and cause connection errors. Once I fixed that, master election in Sensu and Sentinel failover for Redis were working great.

Moving Forward

Checks/Handlers:
I'd like to get 2 more handlers up and running before presenting this project (Dec. 17th), and also add the Sensu client role to a random machine to make sure that client discovery is working and to test the ease with which Sensu can be implemented.
HA:
Get Sensu, Redis, and RabbitMQ load balanced to ensure failover and ease of access for the Uchiwa dashboard.
Security:
Implement LDAP with Uchiwa. From my understanding this is done by putting LDAP and Apache in front of Uchiwa. Once LDAP and Apache are set up for auth, the auth provided by Uchiwa can be turned off. Also I'd like to re-investigate some RabbitMQ-specific stuff. There are currently some issues on GitHub that I've been following, and I need to check the progress on them.

Thursday, November 20, 2014

Update

RabbitMQ Failover


So as of this post there is still no fix to support multiple broker connections within the Sensu Puppet module. I submitted an issue via GitHub and was asked if I could take a look at creating a PR to implement this update. After looking at the module and talking with my coworker for some insight, it seems a considerable amount of work would be needed to get this fix in place, and it would more than likely introduce breaking changes within the module. So for right now RabbitMQ is in the same boat as Redis and Sensu and will need to be load balanced if I want failover and high availability. As of right now load balancing these components has taken a back seat, and implementing checks and handlers has become my new focus.

Checks/Handlers

In order to implement checks/handlers I've decided (with guidance from my supervisor) to borrow from Yelp's sensu_handlers GitHub project to create handlers, and to use a style similar to the one Yelp uses in their puppet_monitoring_check project for checks. This will enable me to create a Hiera file with team-specific information and define which team a check applies to. The team is then passed to the handler, which uses it for notification routing and event processing. This will also solve, at least in my opinion, the problem that Sensu Core has with not providing contact routing configuration out of the box.
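The team-based routing idea can be sketched roughly as follows. This is not Yelp's code: the team table stands in for what would live in Hiera, and the `team` attribute on the check, the addresses, and the rule that only critical events page someone are all assumptions made for the example.

```python
# Stand-in for the team data that would come from a Hiera file.
TEAMS = {
    "ops": {"email": "ops@example.com", "pagerduty": True},
    "web": {"email": "web@example.com", "pagerduty": False},
}

CRITICAL = 2  # Sensu/Nagios exit status for a critical result


def route_event(event, teams=TEAMS):
    """Return the (channel, target) pairs that should be notified for an event."""
    team_name = event["check"].get("team")
    team = teams.get(team_name)
    if team is None:
        raise KeyError("no contact data for team %r" % team_name)
    targets = [("email", team["email"])]
    # Assumed policy: only page the team for critical events.
    if team.get("pagerduty") and event["check"]["status"] == CRITICAL:
        targets.append(("pagerduty", team_name))
    return targets


event = {
    "client": {"name": "sensu-redis1"},
    "check": {"name": "check_example", "status": 2, "team": "ops"},
}
print(route_event(event))
```

The check only names a team; everything about how to reach that team stays in one place, which is the gap left by the missing contact routing in Sensu Core.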

Wednesday, November 5, 2014

Sensu Enterprise Announced

Sensu Enterprise
With the latest update to Sensu (0.16), there is now the option of purchasing and installing Sensu Enterprise. Sensu Enterprise offers a slew of features not provided in the free version (Sensu Core), and it frustrates me that some of these features need to be purchased. One such example is contact routing. This feature, provided only in Sensu Enterprise, allows you to create a contact or contact group, specify the medium used to reach that individual or group, and then add that contact to a check. I think of this feature as a basic necessity for system monitoring rather than an Enterprise feature. I do think some of the features provided in Enterprise are reasonable though, such as support for alternate transport brokers, "out of the box" support for third-party integrations, and the support features that are expected in enterprise-level software packages (deployment assistance, premium support, etc.). The fact that contact routing is not supported in the Core software, in my personal opinion, just seems like bad business. If contact routing were supported in Core, it would make for a more complete open source system monitoring solution and give it another positive mark for those considering migrating from other monitoring software such as Nagios.

Monday, November 3, 2014

Milestones

Project Milestones

Below is a list of Milestones that I have set forward for my project. I hadn't posted any of this yet because my milestones kept evolving with my project, but now I believe that I have enough done and a good enough handle on things to set my milestones down more officially. This list will continue to change but hopefully I will just be adding details to the incomplete milestones as needed, rather than complete restructuring of this list.
  • [DONE] Read Sensu docs and get a feel for what it is and how it works
    • sensuapp.org/docs/latest/guide
  • [DONE] Setup basic 2 machine Sensu environment manually (1 server and 1 client)
    • [DONE] Sensu Server
      • RabbitMQ
      • Sensu API
      • Sensu client
      • Sensu server
      • Redis
    • [DONE] Sensu Client
      • Sensu client
      • connectivity to RabbitMQ
  • [DONE] Add Uchiwa dashboard to server for web UI.
    • provided in Sensu repo
  • [DONE] Puppet implementation and dev environment setup
    • [DONE] Create VMs via foreman
      • sensu[1,2,3]
        • sensu-server
        • sensu-api
        • sensu-client
        • uchiwa
      • sensu-mq[1,2]
        • RabbitMQ
        • sensu-client
      • sensu-redis[1,2,3]
        • Redis
        • Redis Sentinel
        • sensu-client  
    • [DONE (tentatively)] Initial Puppet Setup
      • Clone sensu puppet module from github
      • Clone uchiwa puppet module from github
      • Create sensu-server profile/role
      • Create sensu-client profile/role
      • Create sensu-redis profile/role
      • Create sensu-rabbitmq profile/role
      • Update existing redis module to allow for sentinel
    • [DONE (tentatively)] Initial Hiera Setup
      •  Create necessary files for each machine
  • Setup and implement Sensu check/handler structure in Puppet/Hiera
    • Modify Yelp's sensu_handlers to conform to our needs, and add any that may be missing.
      • [DONE] mailer
      • hipchat
      • pagerduty
    • Create custom sensu checks to work with custom handlers (borrow from Yelp's style)
    • [DONE] Create package for sensu handlers/plugins so we can manage the installation of these components easily via Puppet (there is an issue filed with sensu to create these packages, but no timeline or projected date has been given)
  • Testing/Troubleshooting
    • Run puppet (successfully) on all VMs (1st iteration with initial configs is done, need to implement checks/handlers before this can be considered 100% done.)
    • [DONE] Ensure necessary connections can be established
    • Ensure master selection/failover is working properly for RabbitMQ
    • [DONE] Ensure master selection/failover is working for Redis + Sentinel
      • Needs load balanced
    • [DONE (tentatively)] Ensure master selection/failover is working for Sensu-servers
      • Needs load balanced
    • Ensure checks/handlers are working as expected
      • Really hammer on handlers to make sure exponential backoff is working, or that handlers aren't blowing up hipchat/email/jira/etc.
    • Ensure Uchiwa dashboard is displaying information correctly
  • LDAP + Uchiwa 
    • Accomplished by using Apache as a proxy for authentication and turning off Uchiwa's native auth.
  • Code review(s)
  • Clean-up/Refactor Code
  • Explore Graphite integrations with Uchiwa
    • You can embed graphs within Uchiwa and jazz up the interface; more details on this can be found at http://roobert.github.io/
  • Explore integrations with Consul?
  • Explore creating a Sensu API command line tool via Ruby's OptParse?
  • Demo Sensu?
Presentation preparation
  • Provide documentation on how to start up Sensu manually, and provide well documented code in Puppet if clarification is needed.
  • Presentation (ppt?, oral?, whiteboard outline?, project map?)
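For the "exponential backoff" item in the testing milestone above, one simple way to think about it is to notify only when the number of consecutive occurrences of an event is a power of two (1, 2, 4, 8, ...). This is an illustrative rule of my own choosing, not necessarily the one implemented in Yelp's handlers.

```python
def should_notify(occurrences):
    """Notify on the 1st, 2nd, 4th, 8th, ... consecutive occurrence of an event."""
    return occurrences >= 1 and (occurrences & (occurrences - 1)) == 0


# Which of the first 20 occurrences would actually send a notification?
notified = [n for n in range(1, 21) if should_notify(n)]
print(notified)  # [1, 2, 4, 8, 16]
```

With a check that fires every minute, this turns 20 minutes of failures into 5 notifications instead of 20.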

Monday, October 27, 2014

The Puppet Journey

Puppet implementation

Implementing Puppet started out as something that I believed would go fairly smoothly. However, there were several hiccups along the road that caused me some headaches, most of which were due to my own error and the environment I was working in. Now I think I have it all fairly buttoned down and have a basic 3 machine Sensu environment set up that includes a Sensu server, a RabbitMQ server, and a Redis server (managed with Sentinel). The next step will be to expand this infrastructure to include 3 Sensu servers, 2 RabbitMQ servers, and 3 Redis servers (with Sentinel). This 8 machine environment will become the dev environment and will initially serve as a testing platform to get a feel for the ease of use of Sensu, and possibly to integrate Sensu with Consul later on. Getting the 8 VM environment up and running (correctly) will also mark the next milestone in my project.

Friday, October 3, 2014

Puppet and Sensu

Moving Forward: Implementing Puppet

For the most part I have a fairly decent handle on how Sensu works in terms of how events are routed and managed and different ways to implement checks and handlers. I've decided to move forward with my project and start implementing the Puppet module that is provided by Sensu on Github. This will mean creating Sensu profiles for both the server and client, as well as profiles to manage RabbitMQ and Redis. On a side note after listening in on the Sensu presentation at PuppetConf I've decided to use Redis Sentinel, which will provide clustering management for Redis and failover procedures in the event the Redis master goes down.

Monday, September 29, 2014

PuppetConf 2014

This last Tuesday (the 23rd) I took the opportunity to listen in on PuppetConf via live web stream. In particular I listened in on the segment on Sensu, which was presented by Tomas Duran, a Yelp employee. During this segment Tomas went over the challenges of monitoring and how Yelp needed to evolve their own system monitoring in order to meet business needs. He laid out how monitoring stereotypically works in today's work environments and then moved on to what Sensu is and how it helps break these stereotypes. The part that I focused on the most was how Yelp is using Sensu. They do not use all the components of Sensu, and have tried to simplify the workflow by only having *standalone checks run on all their clients. This helps reduce complexity when using configuration management tools such as Puppet, and also tightens security. Tomas also talked about using custom variables within checks that handlers would use for notification and escalation through tools such as PagerDuty and Atlassian's JIRA. Also during this segment Tomas presented how Yelp uses a custom 'monitoring_check' Puppet module to build and manage all service checks on their nodes, which has become a research point within my project. I would like to utilize their monitoring check module, or something similar, in order to make service management easier across all nodes, as Tomas presented.

* A standalone check is like a normal check, but instead of the server pushing a request to the client to run the check, the client runs the check locally and then publishes the result up to the server.

Friday, September 19, 2014

Install/Setup

Installation and Setup


So it occurred to me that I completely forgot to post an abbreviated installation and setup for Sensu, which in my opinion is a pretty important point. So the following is a brief checklist of steps that need to be taken in order to install and setup a basic Sensu framework (CentOS). This information was summarized from http://sensuapp.org/docs/latest/guide.

  • To begin you will need two machines for a basic setup. These two machines will be referred to as the 'server' and the 'client'.
  • On the 'server' machine, you will need to generate SSL certificates for Sensu and install both RabbitMQ and Redis.
    • Installing RabbitMQ will involve installing dependencies, installing RabbitMQ, configuring SSL, and adding user credentials.
    • Installing Redis is very simple: a yum install is all that is needed (given the dependencies are already installed).
  • On both the 'server' and 'client' systems you will then need to install Sensu and create the appropriate SSL locations and move the previously generated certificate files into place.
  • Now you will configure the Sensu connection, Sensu API, and Sensu clients. The Sensu connection and Sensu clients will be set up on both systems, while the Sensu API is configured on the 'server' machine only. Configuration is done entirely by creating/editing JSON files within /etc/sensu/
  • After these steps have been completed (correctly) you should be able to enable and start up the sensu-server, sensu-api, and sensu-client on the 'server' machine; and enable/start the sensu-client on the 'client' machine.
  • Dashboard: In previous versions of Sensu a dashboard was provided within the package. However, the more current releases no longer include this feature and the Sensu Installation Guide recommends using Uchiwa as the Sensu dashboard, as it is up to date with the latest Sensu API changes.
    • Uchiwa is provided via a yum package and is configured in a JSON file under /etc/sensu/
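To give a feel for the JSON configuration mentioned in the steps above, here is a small sketch that generates the two pieces every machine needs: the RabbitMQ connection and the client definition. The key names follow the Sensu guide as I remember it; the host names, password, address, subscription, and certificate file names are placeholders, so treat this as an outline rather than a drop-in config.

```python
import json


def rabbitmq_config(host, password, ssl_dir="/etc/sensu/ssl"):
    """Connection settings shared by the Sensu server, API, and clients."""
    return {
        "rabbitmq": {
            "ssl": {
                # Placeholder file names for the generated certificate and key.
                "cert_chain_file": ssl_dir + "/cert.pem",
                "private_key_file": ssl_dir + "/key.pem",
            },
            "host": host,
            "port": 5671,  # RabbitMQ's SSL listener port
            "vhost": "/sensu",
            "user": "sensu",
            "password": password,
        }
    }


def client_config(name, address, subscriptions):
    """Identity of one Sensu client and the subscriptions it listens to."""
    return {
        "client": {
            "name": name,
            "address": address,
            "subscriptions": list(subscriptions),
        }
    }


# Placeholder values throughout.
print(json.dumps(rabbitmq_config("server.example.com", "changeme"), indent=2))
print(json.dumps(client_config("client1", "192.0.2.10", ["all"]), indent=2))
```

Each dictionary would be saved as its own JSON file under /etc/sensu/ on the relevant machine.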

Thursday, September 11, 2014

Spammed!

Oops!

So I was getting the first checks on my Sensu client to report back to the server, and these checks were handled via an email handler. I had a ping check running every minute (I thought) that was causing some issues due to permissions, and I thought I had it resolved by the time I got done working on it. It turns out that the issue persisted and I left the check on, and in reality the check was running every second. So when I started up my computer and opened my email, I was massacred by around 65,000 emails all saying that the ping check failed (18 hours * 60 minutes * 60 seconds = 64,800 ~ 65,000).



So the lesson that I learned was to double check that checks are reporting correctly before leaving for the day, and if they are not, to turn off the handler so my email doesn't get blown up by failed check reports.
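The arithmetic is worth spelling out, since the difference between the interval I intended and the one I actually had is a factor of 60. The helper name below is just for illustration.

```python
def check_runs(hours, interval_seconds):
    """Number of times a check runs in the given number of hours."""
    return int(hours * 3600 // interval_seconds)


print(check_runs(18, 1))   # 64800 runs with a 1 second interval
print(check_runs(18, 60))  # 1080 runs with the intended 60 second interval
```

Even at the intended interval, a failing check with an unfiltered email handler would have produced over a thousand emails overnight.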

Tuesday, September 9, 2014

Sensu

Overview

According to sensuapp.org, Sensu is
"often described as the “monitoring router”. Essentially, Sensu takes the results of “check” scripts run across many systems, and if certain conditions are met; passes their information to one or more “handlers”. Checks are used, for example, to determine if a service like Apache is up or down. Checks can also be used to collect data, such as MySQL query statistics or Rails application metrics. Handlers take actions, using result information, such as sending an email, messaging a chat room, or adding a data point to a graph."
 Sensu is written in Ruby, can use existing Nagios plugins, is configured entirely in JSON, has a message oriented architecture, using RabbitMQ and JSON payloads, and is available via 'omnibus' packages.
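Because Sensu can run Nagios-style plugins, a check is just a script that prints a one line message and exits with a status code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. Here is a minimal sketch of a disk usage check built on that convention; the thresholds and function names are illustrative choices.

```python
import shutil

# Exit statuses from the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (status, message) pair."""
    if percent_used >= crit:
        return CRITICAL, "CRITICAL: disk %.1f%% used" % percent_used
    if percent_used >= warn:
        return WARNING, "WARNING: disk %.1f%% used" % percent_used
    return OK, "OK: disk %.1f%% used" % percent_used


def check_disk(path="/"):
    """Measure usage of the filesystem holding path and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        return UNKNOWN, "UNKNOWN: %s" % error
    if usage.total == 0:
        return UNKNOWN, "UNKNOWN: filesystem reports zero size"
    return evaluate(100.0 * usage.used / usage.total)


print(evaluate(95.0))
```

Used as a real check, the script would print the message from `check_disk()` and pass the status to `sys.exit()`, which is what the Sensu client reads to decide whether an event should be created.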

There are only a few dependencies required to run Sensu. The Sensu server and API must be connected to a RabbitMQ instance for message routing and to Redis for storing persistent data. The Sensu client will also need access to the RabbitMQ instance in order to get messages to the server. The following diagram shows a basic workflow for Sensu checks and responses.


 

Project Discovery

Project Ideas

Below is a list of projects that I considered while brainstorming:
  • Using Sensu as a system monitoring solution
  • Set up a virtualized development environment using Packer, VirtualBox, and Vagrant
  • Research continuous integration tools and give detailed analysis of my findings
  • Build and implement an ELK stack (Elasticsearch, Logstash, Kibana)
  • Working with Ruby to solve a problem common to sysadmins/devops
After considering the implications, extent, and difficulty of each, I decided on using Sensu as a system monitoring solution for my senior project. I will be maintaining my project at my place of employment (Buckle Inc.), as they have some interest in Sensu.