Tuesday, August 30, 2011

Automation Getting Started Guide Chapter Q

I don't know if I'm actually writing an entire Getting Started Guide to automation tools, but here is a chapter in the book that is getting written all through the community.

After you have installed your Configuration Management tool (e.g. Chef, Puppet, etc.), finished your proof of concept, decided it's really cool and is going to be really useful, and put your first set of Kanban stickies on the board to write some code, read this.

But before you read this, read the chapter called Infrastructure as Code in "Test-Driven Infrastructure with Chef" by Stephen Nelson-Smith, specifically the section titled "The Risks of Infrastructure as Code." It will validate where you are, set an appropriately sober tone for your automation endeavors, and get you off to the right start.

One thing that hit me full force was the need to emphasize the CODE part of Infrastructure as Code. If you are a sysadmin like me, strong coding practice is not in your DNA. Reach out to your CM Team or whoever manages your source code repository and get some lessons in code management from them. Then reach out to a developer or two and get some lessons in (very) basic design patterns and a few "What Not To Do" tips.

Now, when you write some automation code (cookbook, manifest, etc.) that is destined for production, and you have a first draft that compiles and runs without error, your first refactoring job is to answer the following questions. Repeat this exercise for every block of code you write (generically described below as a "function"):
  1. What will happen if this function runs on a brand-new system? (What are the prerequisites?)
  2. What will happen if this function runs on an existing system where it has never been run before? (different prerequisites?)
  3. What will happen if this function's behavior is changed from the last run?
  4. What will happen if this function is re-run on the same system with no other changes? (idempotence)
  5. What will happen if the prior run failed? How will the function recover from a failed or partial run?
We found from experience that until all of the above questions have been considered, your code is not ready for production, because you are at risk of unexpected behavior. The key to automation is predictable behavior. You will be amazed at how many ways automation can be unpredictable when it is coded poorly.
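
To make the five questions concrete, here is a minimal sketch in plain Python rather than in any particular tool's language. The function name ensure_file_content and the file it manages are hypothetical, invented for illustration; Chef and Puppet already ship resources that do this kind of work for you. The sketch only shows one way a single "function" might answer each question.

```python
import os
import tempfile


def ensure_file_content(path, desired):
    """Converge the file at `path` to `desired`; return True if a change was made.

    Hypothetical helper for illustration only, not part of any CM tool.
    """
    # Questions 1 and 2: create missing prerequisites instead of assuming them.
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)

    # Questions 3 and 4: compare current state with desired state first, so a
    # changed definition converges and an unchanged re-run does nothing.
    try:
        with open(path, "r", encoding="utf-8") as handle:
            if handle.read() == desired:
                return False
    except FileNotFoundError:
        pass  # brand-new system: nothing to compare against

    # Question 5: write to a temporary file and rename it into place, so a
    # failed run does not leave a half-written file at `path`.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(desired)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Called twice with the same arguments, the first call returns True and the second returns False, which is the idempotence that question 4 asks about. Real automation code has more to worry about (ownership, permissions, restarting services), so treat this as a sketch of the thinking, not a finished resource.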

Not every question will apply every time; they matter most when new code is written.

There is some great discussion on this topic in the devops-toolchain Google Group, notably advice on using as many features of your tool as possible to make the above questions non-issues and to avoid code repetition. There is also some interesting discussion of greenfield systems (question 1) and brownfield systems (question 2).

What are some of your getting started lessons learned? Care to add a chapter to the guide?