Tuesday, August 30, 2011

Automation Getting Started Guide Chapter Q

I don't know if I'm actually writing an entire Getting Started Guide to automation tools, but here is a chapter of the book that is being written throughout the community.

After you have installed your Configuration Management tool (e.g. Chef, Puppet, etc...), done your proof of concept, decided it's really cool and going to be really useful, and put your first set of Kanban stickies on the board to write some code, read this.

But before you read this, read the chapter called Infrastructure as Code in "Test-Driven Infrastructure with Chef" by Stephen Nelson-Smith, specifically the section titled "The Risks of Infrastructure as Code." It will validate where you are, set the appropriate tone of sobriety for your automation endeavors, and get you off to the right start.

One thing that hit me full force was the need to emphasize the CODE part of Infrastructure as Code. If you are a sysadmin like me, strong coding practice is not in your DNA. Reach out to your CM Team or whoever manages your source code repository and get some lessons in code management from them. Then reach out to a developer or two and get some lessons in (very) basic design patterns and a few "What Not To Do" tips.

Now, when you write some automation code (cookbook, manifest, etc...) that is destined for production, after you get the first draft written so that it compiles and runs without error, your first job of refactoring is to answer the following questions. Repeat this exercise for every block of code you write (generically described below as a "function"):
  1. What will happen if this function runs on a brand-new system? (What are the prerequisites?)
  2. What will happen if this function runs on an existing system but has never been run before? (different prerequisites?)
  3. What will happen if this function's behavior is changed from the last run?
  4. What will happen if this function is re-run on the same system with no other changes? (idempotence)
  5. What will happen if the prior run failed? How will the function recover from a failed or partial run?
We found from experience that until all of the above questions have been considered, your code is not ready for production, because you are at risk of unexpected behavior. The key to automation is predictable behavior. You will be amazed at how many ways automation can be unpredictable because it was coded poorly.

Not every question needs to be answered every time, and they are most important when new code is written.
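To make that concrete, here is a minimal Chef-style sketch (the package, service, paths, and attribute names are all invented for illustration) of how declaring desired state, and guarding anything that is not naturally idempotent, helps answer those five questions before the code reaches production:

    # Install the package if it is missing (questions 1 and 2: behaves the same
    # on a brand-new system and on an existing one that has never seen this code).
    package 'myapp' do
      action :install
    end

    # Rewrite the config file only when the template or its inputs change
    # (question 3), and restart the service only when something actually changed.
    template '/etc/myapp/myapp.conf' do
      source 'myapp.conf.erb'
      variables(:port => node['myapp']['port'])
      notifies :restart, 'service[myapp]'
    end

    service 'myapp' do
      action [:enable, :start]
    end

    # A raw command is not idempotent on its own; guard it so a re-run is a
    # no-op (question 4) and a failed or partial run can simply be retried
    # (question 5). This assumes the hypothetical myapp-init script writes the
    # marker file only when it completes successfully.
    execute 'initialize-myapp-data' do
      command '/usr/local/bin/myapp-init'
      not_if { ::File.exist?('/var/lib/myapp/.initialized') }
    end

This is a sketch, not a prescription; the point is that every resource either describes an end state the tool can verify, or carries an explicit guard.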

There is some great discussion on this topic in the devops-toolchain Google Group, notably advice on using as many features of your tool as possible to make the above questions non-issues and to avoid code repetition. There is also some interesting discussion on Greenfield systems (question 1) and Brownfield systems (question 2).

What are some of your getting started lessons learned? Care to add a chapter to the guide?

Tuesday, July 19, 2011

Done Means Deployed

John Willis of DTO Solutions was shaking out their "Devops Workshop" here in Atlanta. Trying to cover 2 days of material in 1 day was a herculean effort. I took a bunch of notes, but one sentence resonated strongly with me.

Done Means Deployed.

If you read Agile books and blogs, or if you work in a software shop, you have probably read or heard discussions about "Done-Done". We have them fairly regularly. For us, Done-Done means QA has accepted all available tests; I don't think regression is even a requirement (not all code gets regression tests). Done-Done means the code is ready to be deployed to production.

When I heard John say "Done means Deployed" it all clicked. If your Development Department believes "Done-Done" is "QA Accepted", then your developers have the classic "throw it over the wall" mentality discussed on the dev2ops blog.

If your developers adopt a "Done means Deployed (to production)" mentality, then they are invested in the code all the way until it is in front of the customer. They can't disengage from it until it passes the ultimate QA (the end-user). We haven't achieved this here, but I would bet that anyone who has achieved it has seen a dramatic increase in software quality as a result.

To do this you need, at a minimum, frequent deployments, and ideally something approaching continuous deployment. If your deployment cycles are on the order of months, it is impossible, because too much happens between check-in and deployment. You can't wait 3 months to call something done, and you can't expect a software developer to stay fresh on that much code.

We're working on speeding up our deployment cycle. I'm keeping this idea in front of us as a stretch goal.

Wednesday, June 1, 2011

What makes a good Generalist?

I am glad that Devops is bringing generalists out of the closet and showing how valuable they are and how much companies need them. Also, I would bet money that the majority of successful "specialists" have a fairly broad set of knowledge outside of their specific job function. So, your best generalist is a competent specialist, and your best specialist is a competent generalist.

So, "What makes a good generalist?"

Since we can only learn one thing at a time, I think most generalists started out as specialists (Solaris System Administrator, got CCNA certified, passed the MCSE test, Oracle certified, etc...). But, over time, they gained knowledge outside of their special area in order to solve a problem. That accumulation of knowledge became patterns of understanding they can use to synthesize information and make decisions. Finally, they gain the wisdom to have insight into situations and to project into the future.

I think one of the greatest contributions Devops makes is in shining a strong light on the fact that there is only one problem: The Business Problem. There is not a system problem and a software problem and a network problem and a security problem; there is only the business problem. The more everyone in the business knows about the rest of the business, the more they can understand how the parts relate, and ultimately make wise decisions to help the business succeed.

You can develop knowledge, understanding, and wisdom from others. One way to foster generalist growth is to rotate new employees through various departments when they are hired, to give them a full picture of what their peers do. It's not quite cross-training, because you don't need them to be competent in the job. It's more like cross-exposure, so they can feel some of the pain and absorb some of the knowledge, understanding, and wisdom of the other departments. Every department has wisdom worth knowing. The level 1 CS person knows a lot about the software after the 10th customer calls up about a feature that doesn't work like it should. The Ops person gains knowledge as they sift through all the "normal" errors in the log file to find the one little "important" error buried in the noise. And the developer knows how to work with a team sharing a software repository, knows which version of the application is the right one, and knows why a lifecycle matters for any code.

Our company rotates new developers into a support role, which is a good start. I think it would be better if every Dev or Ops employee spent a week in a rotation like Customer Support -> QA -> Security -> Development -> Operations to see the full context of their work, how it affects all departments, and how all departments affect what they do.

When we have a broad knowledge of our business and feel some of the pain of our peers, I think we will be more successful at whatever specific role we have.

Monday, May 23, 2011

Two Approaches to Devops

I know I'm a little late in posting my follow-up from the April Devops Meetup in Atlanta, but it was a great morning and I wanted to share some of the things discussed. Not everything in this post was explicitly discussed at the meetup; some of it is my own thinking related to those topics.

Artisan Server Crafting
First, I put out a challenge/request for John Christian to record his routine on "Artisan Server Crafting". He talks about how the traditional system administrator "crafts" each server, gives them personal names, and treats them like family. "Oh Look! Gandalf has been up for 365 days, let's throw a party!" The cloud makes the family so big that you can't give each of your children the attention they "deserve". I guess we just have to treat our machines like, well, machines. Automation, not art.

Devops, bottom-up
Next, John talked about how Devops got started inside his company. They started with one person from Dev and one from Ops working together on some automation to improve their lives. This is a common vector for Devops in companies: start small as a skunkworks project, produce some results that show business value, get management buy-in to continue the work and hopefully dedicate more time to the project, show more business value from your incremental success, and then the business is hooked and you can't go back. I'll ask John to write a guest post to go into more detail.

The challenge of the bottom-up process is that it is hard to get past 1.0. You start out with a few energetic people who get things going, but scaling up can only happen with management's support. How do you cope with the original team moving on to something else? Are the new people going to be able to sustain progress on energy and enthusiasm alone? To move beyond 1.0 you have to show business value and show how the goals of Devops are aligned with the goals of the company. Also, you need to maintain the balance of Dev and Ops: a dominant personality can sway projects one way or the other and alienate one side of the team.

Bottom-up does work and has the potential to create a great deal of cohesion between Dev and Ops. Just be aware that at some point someone from the team is going to have to sell the story to management and get the business bought in. Devops is not complete unless the alignment goes all the way to the top of the management chain.

The Devops Team

A second way of introducing Devops occurred at my company. We unintentionally, through a series of reorgs, created a "Devops Department" without really knowing it. We created an Ops team inside Dev to take care of the non-production (Dev, QA, Load Test, CM, etc...) systems. Since this team reported up to the Dev executive and was chartered to take care of Dev's needs, there was a natural alignment of goals. This team and the Configuration Management Team got together and started automating deployments. After about a year building up ControlTier to deploy the code and succeeding in automating all deployments from Dev through Production, the focus shifted to configuration. We have a suite of Java apps that are "overconfigured". Our current project is automating the configuration of those applications.

Automation and "Devops" got started with a standalone team inside Dev, but through a reorg that team merged with Production Operations. This was actually the best thing that could possibly happen because the biggest risk of a dedicated Devops Team in an organization with separate executives for Dev and Ops is that the team must naturally report up one silo and not have an affinity to the other silo. Also there is a more subtle factor that comes to play. The Devops team is not a part of any one Dev team, and not a part of any one Ops team, so Dev and Ops both think the team is an outsider. The whole point of Devops is lost. Dev doesn't have any ownership and Ops doesn't have any ownership. The team spends a lot of time trying to sell to the bottom and to the top.

Now, we were fortunate that the original Devops Team was populated with some of the senior people in the company who had deep relationships inside Dev and Ops, so the selling wasn't too hard. But if you are considering a Devops team, it will have to have strong support from both Dev and Ops executives and the ability to roam freely within both organizations. (This assumes Dev and Ops have different executives.)

So our Devops 1.0 was a standalone team inside Dev, and after the reorg its members are in Ops. But we have the benefit that the first project was automating deployments, which helps both teams, and the second was automating configuration, which simplifies life in Ops and gives more ownership to Dev (that's just how things have been over here). Our third phase now fully branches, with Ops embracing automation for system configuration and Dev thinking of the code in terms of operational impact and how it can be run and maintained more easily and with more automation.

But, you argue, Devops is about people first, then process, then tools. I think if we analyze the stories from companies, we will find that the tools are the gateway drug that brings in Devops. You can stop at tools and just have some automation, or you can show the business value your tools bring and start a revolution that will ultimately encompass people and process. We have our tools now; we are in the long battle to align the people and the process.

I'll conclude with my wholehearted belief that Devops will be most successful if Dev and Ops report to the same executive below the CEO. If you have two silos and they don't share goals, Devops will remain a bottom-up battle.

Thursday, January 27, 2011

Chef Is My Documentation

We have an ongoing project to automate the management of our custom software's configuration files. There is a hierarchy and there are groupings of configuration data, so we wanted to define configuration at the highest level possible and have it be inherited at lower levels and by groups. We looked at all the open source "configuration management" software and decided that Chef had the right flexibility for our needs. It wasn't a perfect match, but it was the closest thing available.
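As a rough sketch of how that hierarchy maps onto Chef (the cookbook, attribute, and role names below are invented for illustration): a cookbook attribute file holds the highest-level defaults, and a role overrides them for a particular group of nodes. We keep our roles as JSON files (more on that below), but the Ruby form reads the same way:

    # cookbooks/myapp/attributes/default.rb -- the highest-level defaults
    default['myapp']['port'] = 8080
    default['myapp']['log_level'] = 'info'

    # roles/myapp_loadtest.rb -- a group of nodes that needs different values
    name 'myapp_loadtest'
    description 'Myapp servers in the load test environment'
    run_list 'recipe[myapp]'
    override_attributes 'myapp' => { 'log_level' => 'debug' }

Chef merges these at run time, with the role override winning over the cookbook default, so a value only has to be stated at the one level where it differs.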

When we started the project to automate configuration I was in Dev, but I have since moved back into Ops. Since one of Chef's main purposes is "system configuration management", the work we have done for configuration files is directly applicable to production operations' system management. So we're training the Ops and Infrastructure teams on the tool and "selling" it as more than an application configuration tool. As I was putting the finishing touches on a presentation giving an overview of the effort and the rationale behind a new configuration management system, I came across a post by Jez Humble on the Agile Web Operations blog, where he said:

"Effective configuration management – including automation of the build, deploy, test and release process, provisioning and maintenance of infrastructure, and data management – make the whole delivery process completely transparent. As any good auditor will verify, there is no better documentation than a fully automated system that is responsible for making all changes to your systems, and a version control repository that contains everything required to take you from bare metal to a working system."

That quote summed up the presentation I wrote on why we needed to automate our configuration file management. I didn't have as cogent a thought when I wrote my presentation, but I am thankful to Jez for framing the problem so well. The words "there is no better documentation" jumped out at me, and I used them to shift my thinking and reframe the rationale behind why we are automating configurations.

Chef is our documentation

Everyone makes an attempt at documenting their configuration. Between wiki pages and emails you can probably piece together 80% of your documentation. The problem comes when you make changes: you have to keep your documentation up to date, and in the heat of battle documentation almost always gets left behind. Then you are tasked with building another system, and you spend weeks of trial and error, copy and paste, and search and replace to build it. Your documentation never seems to be complete or up-to-date enough.

I've lived there in all my ops jobs, and we're now fixing that problem. Our documentation is runnable configuration. The emphasis is on runnable. If your documentation is a copy of your configuration (a wiki), you will never be able to keep it up to date. If your documentation is runnable, it will always be up to date. Now if that "good auditor" comes by and asks for documentation of how our systems and apps are configured, we have complete, accurate documentation at all times. We don't have to scramble at the last minute to update a bunch of wiki pages. As soon as a new release of software goes into production, the act of configuring the software for the deployment is also the act of updating the documentation.
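A small, hypothetical example of what "runnable" means (the file and attribute names are invented): the template below is both the instructions Chef follows and the current record of what ends up in the file, because it is filled in from the same attributes Chef actually applies.

    <%# cookbooks/myapp/templates/default/myapp.conf.erb %>
    port = <%= node['myapp']['port'] %>
    log_level = <%= node['myapp']['log_level'] %>

Change the attribute in version control and the next Chef run updates both the system and, by definition, the documentation.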

The way we are doing it, the Chef database is the presentation of the documentation but not the source of the documentation. All configurations are saved in version-controlled JSON files (either roles with attributes or data bags), so all configuration is versioned, and even if the Chef database gets destroyed we can re-create it from the source JSON. The files are named, scoped, versioned, and updated in a way that requires the fewest places to make changes while maintaining adequate clarity and re-use.
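For example (the bag name, item name, and values here are hypothetical), a data bag item is nothing more than a JSON file checked in next to the cookbooks, say at data_bags/myapp/config.json:

    {
      "id": "config",
      "db_host": "db01.example.com",
      "port": 8080
    }

If the Chef server's copy is ever lost, commands along the lines of knife data bag from file myapp data_bags/myapp/config.json (and knife role from file for the roles) push the versioned source back up; the repository, not the server, is the system of record.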

I'll follow up with a post on some of the technical details of what we are doing and the decisions we made along the way. We just discovered a few behaviors of Chef that somewhat complicate this plan, but nothing severe enough to be a showstopper.

Let me know what you are doing or what you think about the plan.