Tuesday, December 17, 2013

Logstash Metrics Filter and Graphite Output

Not many people have published advanced configurations for Logstash's metrics filter. After spending a day with the examples and source code, I have a more advanced configuration to share. NOTE: I saw weird behavior with Logstash 1.2.2. I'm not sure if it was caused by my in-progress configuration at the time, but after upgrading to 1.3.1 everything worked as expected.

The problem: We are trying to get metrics on our API usage by user. We were already logging the operation and the user to disk and picking it up in Logstash. Now we want metrics on how frequently each user makes each call.

After quite a bit of googling, I couldn't find an example where the metric name includes more than one field. The last message in this thread had some pieces of the puzzle, as we needed to translate some special characters to be more Graphite-friendly. But I was confused about what the metric event contains. The metrics filter creates a new event, and until I threw that event to a file output and saw it, I didn't realize the new metric event has no knowledge of any field of the message that generated it. (Actually, the "meter" option can interpolate any field from the original event, but no other option can; I tried to use add_field inside the metrics filter and it didn't work.) So instead of mutating the metric event, I have to mutate the original event. Also, you can't gsub a newly added field in the same mutate block, so I had to move the gsub to a second mutate.


  if [type] == "apicalls" {
    mutate {
      add_field => [ "modhost", "%{host}" ]
      add_field => [ "modorgname", "%{org_name}" ]
    }
    mutate {
      gsub => [ "modhost", "\.", "_", "modorgname", "[\.\,\ ]", "_" ]
    }
    metrics {
      meter => [ "apioperations.%{pod}.%{modhost}.%{operation}.byOrg.%{modorgname}" ]
      add_tag => [ "apiopmetric" ]
    }
  }

Now when I send those metric events to a file output, I see the resulting event is:

{
"@timestamp":"2013-12-17T19:08:42.968Z",
"@version":"1",
"message":"server.name",
"apioperations.qa1.host_domain.Login.byOrg.my_org_name.count":11,
"apioperations.qa1.host_domain.Login.byOrg.my_org_name.rate_1m":0.0,
"apioperations.qa1.host_domain.Login.byOrg.my_org_name.rate_5m":0.0,
"apioperations.qa1.host_domain.Login.byOrg.my_org_name.rate_15m":0.0,
"apioperations.qa1.host_domain.GetModifiedRecipients.byOrg.my_org_name.count":2,
"apioperations.qa1.host_domain.GetModifiedRecipients.byOrg.my_org_name.rate_1m":0.0,
"apioperations.qa1.host_domain.GetModifiedRecipients.byOrg.my_org_name.rate_5m":0.0,
"apioperations.qa1.host_domain.GetModifiedRecipients.byOrg.my_org_name.rate_15m":0.0,
"tags":["apiopmetric"]
}

Next was to figure out how to get those to Graphite. Since there is an unknown number of operations and users, I had to go to the source code to really figure out how all the options to the graphite output work. It turns out you can't use the "metrics" option, as that would require enumerating each metric name. The magical "fields_are_metrics" option sends all the fields in the event to Graphite; all you need to do is use "include_metrics" or "exclude_metrics" to send just what you want. Our graphite output looks like this (the file output was for debugging only and is turned off now that it works):

  if "apiopmetric" in [tags] {
    graphite {
      host => [ "10.1.1.1" ]
      include_metrics => [ "apioperations.*" ]
      fields_are_metrics => true
    }
    file {
      path => "/var/log/logstash/apidebug.log"
    }
  }
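
For reference on what actually crosses the wire: with "fields_are_metrics" enabled, each field that passes "include_metrics" is written to Carbon using Graphite's plaintext protocol, one `name value timestamp` line per field (timestamp in epoch seconds). I didn't capture the raw traffic, but based on the event above the lines would look roughly like:

```
apioperations.qa1.host_domain.Login.byOrg.my_org_name.count 11 1387307322
apioperations.qa1.host_domain.Login.byOrg.my_org_name.rate_1m 0.0 1387307322
apioperations.qa1.host_domain.GetModifiedRecipients.byOrg.my_org_name.count 2 1387307322
```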

And Shazam! All your metrics start flowing into Graphite!

Tuesday, April 9, 2013

Marching Off the Map

The title is not a new one, but it is a great image of what I feel the Devops community is doing. The Map is the way businesses have been run for the last 100 years, which the IT industry adopted in the 80's and mostly kept even through the dotcom days. In the 80's and 90's, Enterprise is what everyone wanted. In the 00's, as the large web operations started growing, "Enterprise" became a dirty word among the cool kids. Now, Enterprise does describe some very large companies, but many of the Enterprise ways live on in many smaller companies (generally older ones). Damon Edwards used the term "Classic Organization", and I think that is a much more inclusive and less emotionally charged term than Enterprise, so I will use it to mean "organizations operating with culture and processes akin to Enterprise". Classic Organizations are the epitome of the "before" picture in the Devops transformation. Devops (building on Lean and Agile and others) is marching off the map of business models and, I think, incorporating much of the best of the past into new models to lead us into the future.

Recently, I realized my personal life has been paralleling my professional life in many ways. I'm seeing the core principles of Devops echoed throughout my life. Many people are discovering that things happening in the tech industry work in running a household or other community too. I'm not sure if it is behind a paywall, but the Wall Street Journal ran a story by Bill Gates in January where he describes what sounds a lot like Lean thinking as a solution to global problems: http://online.wsj.com/article/SB10001424127887323539804578261780648285770.html, along with some interesting replies that generally uphold Lean principles and illustrate the challenges of applying Lean in a "Classic" culture: http://online.wsj.com/article/SB10001424127887324156204578275993802414124.html. There are also many articles around the internet on running a household on Agile principles.

Then I heard a few podcasts from Growing Leaders describing the need to look for new ways of communicating with and educating young people today: http://growingleaders.com/blog/podcast-7-an-interview-with-dan-pink/, http://growingleaders.com/blog/podcast-8-the-benefits-of-a-gap-year/. One theme they share is that in school you are measured on (roughly) 75% IQ and 25% EQ, but in the workforce the proportions are reversed. This tweet illustrates that shift.
The conclusion is that school is not teaching people how to be productive workers. For Devops and Lean to work, there needs to be more focus on EQ development. It is said that your IQ is relatively fixed from birth, but that EQ can be trained and developed. When you have your technical people thinking more with their "Right Brain" (big picture, context, synthesis) you should see the culture fall into place much more easily. The "Left Brain" logical, analytical stuff is so easy, probably too easy, that we use it as a crutch to avoid working on people, culture, empathy, systems thinking, and such. Just read some of the stories about the Etsy Hacker Grant program and its effects. "Right Brain" thinking can be developed and learned.

This post is rambling a bit through multiple topics but my main point is that I feel Devops is on the right path because its driving principles are echoed throughout life and so many cultures. I put a lot of weight on "uncommon, common sense" where you re-discover eternal truths that are built into human nature (respect, empathy, purpose, quality) and build on top of those. The name "Devops" doesn't really matter and will pass away, but the principles behind it should always be the foundation of all we do. I'm marching off the map at work and marching my kids off the map in their education at home. It's a little scary, but exciting to be doing something new and discovering a vibrant community around you to let you know you are not alone.

Wednesday, January 23, 2013

Upgrading Chef Server from 0.10.8 to 10.18.2

Here is my story of upgrading Chef from 0.10.8 to 10.18.2 while moving to a new server and updated OS. Please comment and tell me where I could have done better.

So, we are running Chef Server 0.10.8 on CentOS 5.4 with Ruby 1.8.7 and I want to upgrade to latest release of Chef and go to Centos 6.3 and Ruby 1.9.3 at the same time. So I couldn't just do an in-place upgrade on the existing server. I needed to migrate my Chef server to a new system and upgrade everything.

My first plan was to take the cautious route, since I wasn't sure if Chef could be updated that many revs, so I tried to export all data as JSON, build a new Chef 10.18.2 server, then import all the JSON. It worked perfectly EXCEPT none of the clients could authenticate to the server, even though I had imported their public keys. I could create a new client key on the Chef server and the node could authenticate, but it wouldn't with an imported key. I spent about a day on this with no resolution. Maybe someone else will have better results.
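
In case anyone wants to retry the export/import route: the dump/load cycle I used looked roughly like this with stock knife subcommands (the directory name is illustrative; the same pattern applies to roles, environments, data bags, and clients):

```shell
# On the old server: dump every node to a JSON file
mkdir -p nodes
for n in $(knife node list); do
  knife node show "$n" -F json > "nodes/$n.json"
done

# On the new server: load the JSON back in
for f in nodes/*.json; do
  knife node from file "$f"
done
```

Note this is exactly the path that left me with unusable client keys, so treat it as a starting point, not a recipe.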

Next I tried to just copy the couchDB database. Unfortunately I flubbed things up a few times and spun my wheels for a few days because things didn't work (mostly my fault). Finally I found this method that works:

1) Compile Ruby 1.9.3 and rubygems 1.8.23
2) Install Chef via chef-solo http://wiki.opscode.com/display/chef/Installing+Chef+Server+using+Chef+Solo
3) Fix the CentOS 6.3 rabbitmq init bug documented at https://bugzilla.redhat.com/show_bug.cgi?id=878030. We decided to change the rabbitmq init script to work around it:

CONTROLPROG=/usr/sbin/rabbitmqctl
CONTROL="sudo -u ${USER} ${CONTROLPROG}"

4) Add the chef queues, because rabbitmq was broken when chef-solo tried to do it. Also change the Solr maxFieldLength to 100000 to work around the problem of indexing nodes with lots of attributes.

/usr/sbin/rabbitmqctl add_vhost /chef
/usr/sbin/rabbitmqctl add_user chef testing
/usr/sbin/rabbitmqctl set_permissions -p /chef chef ".*" ".*" ".*"

ex /var/lib/chef/solr/home/conf/solrconfig.xml
:%s/<maxFieldLength>10000/<maxFieldLength>100000/g
:wq
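
The same substitution can be scripted non-interactively with sed instead of ex; here's a sketch exercised on a scratch copy (the real file is /var/lib/chef/solr/home/conf/solrconfig.xml):

```shell
# Scratch copy standing in for solrconfig.xml
printf '<maxFieldLength>10000</maxFieldLength>\n' > /tmp/solrconfig.xml

# Bump the limit from 10000 to 100000 in place
sed -i 's|<maxFieldLength>10000<|<maxFieldLength>100000<|' /tmp/solrconfig.xml

cat /tmp/solrconfig.xml
```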

5) Shut down couchdb and rename /var/lib/couchdb/chef.couch to chef.couch.bak
6) Copy the couchdb database from the old server (can still be running)
7) chown chef.couch to be "couchdb:couchdb"
8) Start couchdb back up
9) Start rabbitmq, chef-solr, chef-expander, chef-server, chef-server-webui (in that order)
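
Steps 5-9 can be scripted; a sketch under the assumption of init-script service names from the chef-solo install ("oldserver" is a placeholder for your old Chef server's hostname):

```shell
# Step 5: stop couchdb and set the empty database aside
/etc/init.d/couchdb stop
mv /var/lib/couchdb/chef.couch /var/lib/couchdb/chef.couch.bak

# Step 6: copy the database from the old server (it can still be running)
scp oldserver:/var/lib/couchdb/chef.couch /var/lib/couchdb/

# Steps 7-8: fix ownership and bring couchdb back up
chown couchdb:couchdb /var/lib/couchdb/chef.couch
/etc/init.d/couchdb start

# Step 9: bring the Chef stack up in order
for svc in rabbitmq-server chef-solr chef-expander chef-server chef-server-webui; do
  /etc/init.d/$svc start
done
```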

Now I wish I could say it's working at this point, but all the cookbooks were broken. Maybe someone from Opscode can tell me where the cache of cookbook files is that could be copied. I tried to load all the cookbooks with knife cookbook upload -a -d, but that still didn't give me working cookbooks. In the UI, when you click a cookbook it says "end of file reached" and shows no data; the name and version are there, but no contents. I had to knife cookbook delete -p each cookbook, and when it was added back MOST of the cookbooks worked. Some still gave the "end of file" error, and I had to purge them one-by-one and upload one at a time.

I hope this helps someone. Let me know if you want more detail on any step. I still haven't done extensive testing on the new server but the few clients I tested seem happy. I'm really looking forward to an omnibus install for Chef Server 11 and hope the migration is not painful.

Thursday, January 3, 2013

Talking Deming with my Dad

Driving home from a hunting trip with my father, conversation turned towards work. I started trying to explain how we're trying to adapt "some old concepts from the manufacturing industry, now called Lean" to the IT industry and how it fits remarkably well and people are excited to find that when you boil down the patterns that make the best IT companies tick, you re-discover patterns that were spelled out in the manufacturing industry a half-century ago. As I'm feebly trying to put this in words my father stops me and says "Back at Best Foods I was in the Quality Control department and our driving principle was 'Quality is conformance to specification'. That came from a guy named .. umm.." And I pipe up "Deming?" And he lights up, "Yeah, Deming." He then goes on to explain how Best Foods (maker of Skippy Peanut Butter and Hellmann's Mayonnaise) was a great company to work for and the QC department had the ability to stop the line and were an integral part of the business. Unfortunately they closed the plant he worked in and he was unwilling to move out of state so he went to Anderson Clayton Foods (now owned by Kraft) to work in QC there. At Anderson Clayton (ACF) they had the alternate definition where "Quality is fitness for use." He's not sure if it was the definition of quality, or the fact that the QC department at ACF reported to the Plant Manager and the Plant Manager's incentives were based on product shipped. I quote: "We shipped some marginal product."

Where am I going with this? I'm not exactly sure. It was great bonding with my father over Deming and quality and how the things he dealt with are big concerns in the IT industry today. What struck me was the stark contrast between how he sounded talking about how great it was to work at Best Foods and how lifeless he was talking about Anderson Clayton.

I guess my conclusion is that I am adding another data point that I believe itrevolution.com is on the right path looking for patterns from Deming and Lean. When I asked my father if he thought based on his experience that Deming would be a good pattern for us to follow he answered without hesitation "Yes". He summed it up by saying "So you're trying to translate Deming from widgets to digits."