Thursday, May 17, 2012

Automating Application Configuration

A lot of what I will discuss below is useful for anyone beginning any configuration automation project. I will point out where I feel application configuration has specific challenges that differ from system configuration.

Before You Begin

The first and most important thing anyone must do before embarking on an automation project is to create the standards for the environment. You must have a consistent naming scheme, packaging scheme, and directory layout. There has to be a simple way to derive "any file of this type shall be named like this and stored in a place like this". Have you ever written up a presentation for your boss after an outage in which one slide is titled "Everything is Different", and then gone on to explain how much room for human error exists because you have no standards?

But consistent naming is not enough.

Model Your Environment

What caught us off guard when implementing Chef was the amount of effort it took to model our environment. At first we didn't even know we needed to model our environment. We had a great set of standards for naming conventions, but the thing Chef offered us was now a hierarchy of attributes. We could define something once and have it used repeatedly. That always sounds great on paper, but when you get down to implementing it, it opens up cans of worms you never knew you had. We had to invent an appropriate hierarchy of configuration data. Even more, it needed to exist in a way compatible with a tool we knew little about.

I think modeling is more involved for application configuration automation than for system automation. (Please correct me in the comments if you think otherwise.) For system config automation you focus on the standard dev/qa/prod hierarchy of systems. For application config automation you turn that on its head and your "system" is now an instance of your application suite. The suite is the unit of configuration: you stamp out a fully running suite in dev, qa, or prod. Servers are secondary. The suite resides on a set of servers working as a unit and needs to be aware of all the servers participating in that instance, but your primary focus is on configuring the application suite.

We built a cookbook called "derivations" that has a bunch of recipes that derive other sets of attributes. Yes, we could have done all the searches on the fly, but our thinking right now is that we like seeing our configuration persisted as node attributes. So our derivations cookbook makes a bunch of node attributes that describe various collections of servers needed to configure the application suite. Each node has attributes describing what all the other nodes in the suite are called and some things about what they do. Examples: "all app tier servers", "all memcached servers", etc. Various recipes can use those node attributes to write configuration files. Know that all of this is point-in-time: if you add a new server to the environment, you need to run chef-client on all the other nodes so they discover their new neighbor.
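
To make the idea of a derived attribute concrete, here is a minimal sketch in plain Ruby. It is not our derivations cookbook: the node list is a hypothetical hard-coded array standing in for whatever a search against the Chef server would return, and the names `derive_tier_members`, `pod` and `tiers` are assumptions made for illustration.

```ruby
# Build a "tier => sorted list of node names" map for one pod.
# The input is a hypothetical inventory; in a real run it would come from
# the Chef server rather than a literal array.
def derive_tier_members(nodes, pod)
  members = Hash.new { |hash, tier| hash[tier] = [] }
  nodes.each do |node|
    next unless node[:pod] == pod
    node[:tiers].each { |tier| members[tier] << node[:name] }
  end
  # Sort so the derived attribute is stable from one run to the next.
  members.transform_values(&:sort)
end

# Hypothetical nodes, named only for this example.
nodes = [
  { name: "app01.pod1.qa", pod: "pod1", tiers: ["app"] },
  { name: "app02.pod1.qa", pod: "pod1", tiers: ["app", "memcached"] },
  { name: "db01.pod1.qa",  pod: "pod1", tiers: ["data"] },
  { name: "app01.pod2.qa", pod: "pod2", tiers: ["app"] }
]

derived = derive_tier_members(nodes, "pod1")
puts derived["app"].inspect
puts derived["memcached"].inspect
```

The result is the kind of thing you would persist on each node, which is also why it goes stale until the next chef-client run on that node.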

How Much is Just Enough?

First you come up with an elaborate hierarchy that's perfect, until you realize that six months from now an actual person who needs to add an attribute is going to spend half a day debating where it belongs. Or someone needs to figure out where something is defined and has to look through dozens of files and spend hours tracing complex paths of inheritance. DBAs have wrestled with this issue: just google "normalize too much" for lively discussion. As Adam Jacob said (paraphrasing), "If you can get to a 98% solution you will find that you likely can change something external so that the 2% edge-cases go away."

Add to this challenge that you barely know how this new tool works. How does the tool handle hierarchy and inheritance? How do I keep from going down a design path that isn't supported by the tool? At some point you have to work from both ends and check your thinking against the tool(s) you are going to use.

We honestly spent several months coming up with a model of our environment with what we believed was "Just Enough" hierarchy: enough to meet our design goal of reducing the sources of configuration data to as few locations as possible, while keeping it "human" enough that someone a year later can figure out where things are. I encourage you to spend a lot of time on this. Getting your hierarchy close to right the first time will pay dividends later. If you skimp here, you may spend a lot of time reworking code to a new convention. (More below on the "Third Time" rule.)

For our model we chose our nodes to have a run list that is all roles. Most roles have only attributes in them to give us our "just enough" hierarchy of configuration. Only one type of role has a run list with actual recipes in it. It looks like:

  • datacenter - a few attributes defining the datacenter - useful for searches like "Find everything in datacenter X". Exclusively attributes, no run list.
  • logicalsite - the term we came up with that is roughly equivalent to environment. We group our servers inside a private DNS Top Level Domain to differentiate environments (dev, qa, load test, staging, production, etc.) so our logicalsite name is the DNS TLD of the environment. A datacenter may have many logicalsites in it. Almost exclusively attributes, and the place where we get the most benefit from high-level attributes.
  • pod - our name for one complete, running instance of our entire product suite. A logicalsite may have multiple pods running in it. Exclusively attributes, no run list.
  • tier - based on standard tiers like "app" and "data" but used to break apart the suite into deployed code. In our tier role we set the run list to satisfy all the dependencies to build a node of that tier. A node can have multiple tiers, and we specifically designed them so that one server can be all tiers or all tiers can be spread among individual servers (here is where your naming convention gets tested). A tier role is mostly run list and only a few attributes.
  • constants - Not a role, but a Chef data bag. Some things you need to be universal constants across every possible environment. Here are things like IP Ports, mbean names and such. Now you know that every environment will have each service listening on the same port without variation.
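
As a rough illustration of why the layering matters, the sketch below merges attribute layers from most general to most specific in plain Ruby. It is a simplification and not how Chef resolves attributes (Chef has its own precedence rules); the helper names `deep_merge` and `resolve_attributes` and every attribute key are hypothetical.

```ruby
# Recursively merge two attribute hashes; the more specific layer wins.
def deep_merge(base, override)
  base.merge(override) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Apply the layers in order: datacenter, logicalsite, pod, tier.
def resolve_attributes(*layers)
  layers.reduce({}) { |merged, layer| deep_merge(merged, layer) }
end

# Hypothetical role attributes, invented for this example.
datacenter  = { "datacenter" => { "name" => "dc-east" } }
logicalsite = { "logicalsite" => { "name" => "qa" }, "jvm" => { "heap_mb" => 512 } }
pod         = { "pod" => { "name" => "pod1" }, "jvm" => { "heap_mb" => 1024 } }
tier        = { "tier" => { "names" => ["app"] } }

attributes = resolve_attributes(datacenter, logicalsite, pod, tier)
puts attributes["jvm"]["heap_mb"]      # the pod value overrides the logicalsite value
puts attributes["datacenter"]["name"]  # defined once, visible everywhere below
```

The fewer layers there are, the fewer places someone has to look to find out why a value is what it is.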

The Product Manifest - The Secret Sauce

All of the above configuration is great, and in some ways is not really specific to application or system configuration. We have one additional set of attributes that comes into the Chef run from the outside and ties everything together. When our product suite is built one artifact that comes with it is what we call the "product manifest" (props to our awesome CM team who built this). It is a JSON file that describes every piece of code that needs to run and lots of metadata about it. In one blow I know everything about every piece of code that needs to run (build stamp so we get the right version, tier to deploy it on the right servers, dependencies like java or tomcat version). Now I have the ability to say "Deploy manifest version X to this environment" and the right code goes on the right server types with the right configuration data. There is no "dev manifest" and "prod manifest". It is one manifest used for all environments with no variation. Your variation comes only in your Chef roles named above, and that variation is as little as possible (URL names, memory settings and such).
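
The real manifest format is not shown here, so the following is only a sketch of the idea under stated assumptions: the JSON layout, the field names (`components`, `tier`, `build`, `requires`) and the helper `components_for_tiers` are all hypothetical. It shows how one manifest can tell a node which pieces of code belong on it, based only on the tiers that node carries.

```ruby
require "json"

# Hypothetical manifest; the real one is produced by the build, not by hand.
manifest_json = <<~JSON
  {
    "version": "X",
    "components": [
      { "name": "storefront", "tier": "app",  "build": "b100", "requires": { "java": "any" } },
      { "name": "reporting",  "tier": "app",  "build": "b100", "requires": { "tomcat": "any" } },
      { "name": "schema",     "tier": "data", "build": "b100", "requires": {} }
    ]
  }
JSON

# Select the components that belong on a node with the given tiers.
def components_for_tiers(manifest, tiers)
  manifest.fetch("components").select do |component|
    tiers.include?(component.fetch("tier"))
  end
end

manifest = JSON.parse(manifest_json)
components_for_tiers(manifest, ["app"]).each do |component|
  puts "#{component['name']} build #{component['build']}"
end
```

The same manifest and the same selection logic apply in every environment; only the role attributes differ.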

Third Time Rule

The first time you are so excited simply that it works.
The second time it works again, but you may have some misgivings about imperfections.
The third time the flaws in the design become clear. You realize "We should have called this something else" or "We should have grouped this way" or ... By the third time, you figure out what is really important, and it's often not what you thought the first time.

Be patient and diligent. Refactor fast and often. Don't let bad code languish. Stamp out technical debt while it's fresh in your mind. Bad automation is REALLY BAD! The tool can implement something destructive really fast across all your servers. (See below about testing.) You have a limited window to get buy-in and the more you have to stammer "ummm, well, it really should be ..." the harder it is.

What Next?

First, determine where you are on the spectrum of standardization. You may not have one naming standard, but instead have a dozen evolutionary standards set by multiple admins, datacenter moves and company acquisitions. This is the hardest case because you will be implementing one new standard along with new automation software. The new standard has implications on monitoring and daily administration and can be very invasive. The "Third Time" rule will likely bite you frequently as well, because you don't know what you don't know. You may have a good file, package and directory structure, but lack a coherent model of hierarchy (this is where we were). Or you may have great standards and were just waiting to plug them into a tool.

Spend a lot of time working out your standards on paper. I would expect it to take over 3 months to come up with standards and a model. In this time you aren't writing a single line of code. Resist the urge to code. Learn the tool inasmuch as it will help you make standards, models, and hierarchy. You're asking questions like "What should I call this?" "How does this fit with feature X?"

Iterate

Yes, I just told you to spend a lot of time modeling your environment, but I assume some of that time is learning a new tool, or comparing multiple tools. Once you have your tool and your model I think the Agile philosophy is great for development. Start with a small problem and solve it. Every release should be production ready. When you have one thing done go to the next. Early on, "Production" is your test system, but before long you will have real value and want to start using your shiny new automation everywhere.

As Ops people we wanted to plan for every possible future scenario. Here is another use of "Just Enough". We decided to iterate and make just enough tuneable to match our current world. If we need more, we will add it later. We didn't want to spend all our time coding for situations where we don't have an immediate need. Be diligent about writing "Just Enough" code. It's easy to fall into the trap of "but we might need this feature..."

Try to minimize the need for backward compatibility by fixing as many standards as you can first, but you will surely find yourself needing to support some one-off situation with compatibility logic in code. The good thing about automation is that you can turn it all into code: "If OSVERSION=X then install pkgX, else install pkgY". Make a plan to fix the one-off in your environment and remove the code as soon as possible.
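
One way to keep such compatibility logic easy to find and easy to delete is to hold the exceptions in a single table instead of scattering conditionals. This is a sketch only; `package_for`, the `one_offs` table and the placeholder names "X", "pkgX" and "pkgY" are taken from the example sentence above or invented for illustration.

```ruby
# Return the package to install for an OS version.
# one_offs is the table of temporary exceptions; when the environment is
# fixed, the entry is deleted and everything falls through to the default.
def package_for(os_version, one_offs: { "X" => "pkgX" }, default: "pkgY")
  one_offs.fetch(os_version, default)
end

puts package_for("X")
puts package_for("anything-else")
```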

Follow Your Development Lifecycle

Caveat: We are not a continuous deployment shop. If you are continuous deployment, your lifecycle will probably be very different.

The other thing that caught us off guard was the strong need to follow our product's development lifecycle. It helped immensely that the group implementing Chef were Ops guys inside Dev and we sat really close to the software CM team. If you are in Ops, make friends with the team managing your source code. I can't say it strongly enough: If you are automating application configuration, the automation will follow to a high degree the lifecycle of the product. You need to know the release cycle and branching strategy.

With Chef, our Chef Environments are primarily Product Release Versions, not dev/qa/prod. We branch many of our cookbooks with the product, and we release some new Chef features with product releases. With a focus on application configuration we have found we use Chef in ways not common in the community.

Automation needs its own QA

Again it helped that we were inside Dev and managing the dev and QA systems, so we had a clear sense of building and testing the automation before releasing to production. I know how fallible I am when it comes to making changes in production. I, personally, have been responsible for building tools for production only and never implementing them in Dev and QA. Picture in your head what it would be like if your lowliest dev server and your biggest production server were all configured by the same templates. Birds sing and unicorns dance. It is possible. Don't settle for less. You have one shot at this. (Re-read the section above about the product manifest.)

Practice, practice, practice. Test your automation on a bare system and on built, running systems.

Your "development" Chef server does not manage your development systems. It is a Chef server that manages your sandbox environments, where you can write new code and break stuff. Your product development systems are still "production" to someone, so you need to develop and test your Chef code in a way that won't break any environment that the business depends on.

The Devops Problem - Application vs. System

What if you want to keep system management and application management separate events? Out of the box you think about Chef in a one node, one run list way where system and application configuration converge in one run. There is a way to separate the events where you can let Chef be used by the systems team for system work, and the application team use it for application work and have as little overlap as possible. It turns out to be quite useful (not perfect, but quite workable) to maintain that separation where you can have the ability to implement a "system" change event or a "product" change event without having to coordinate between them.

In Chef, one server can have multiple nodes defined on it. One node is used for "system" management and the other for "product" management. They have separate run lists so you can tell them exactly what needs to be done. Given this, the app support team can now update application configuration without impacting any system configuration. Some system events may need to have a "not_if" tied to the running application, because some system configuration can have negative impact on running code.

Here you find ways to tie into your orchestration tool. We write out application config files to a directory named "predeploy" and the orchestration tools are responsible for copying the files to the running location at the right time. Chef makes the application config files but does not implement the changes into the applications.
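
The hand-off can be sketched in plain Ruby as two separate steps. This is not our actual tooling: `stage_config` stands in for what the Chef run does, `promote_configs` stands in for the orchestration tool, and the live directory name `conf` is an assumption.

```ruby
require "fileutils"
require "tmpdir"

# Step 1 (the Chef side): write a rendered file into <app_root>/predeploy.
def stage_config(app_root, file_name, content)
  predeploy_dir = File.join(app_root, "predeploy")
  FileUtils.mkdir_p(predeploy_dir)
  staged_path = File.join(predeploy_dir, file_name)
  File.write(staged_path, content)
  staged_path
end

# Step 2 (the orchestration side): copy staged files to the running location.
# Returns the list of live paths that were written.
def promote_configs(app_root)
  live_dir = File.join(app_root, "conf") # hypothetical running location
  FileUtils.mkdir_p(live_dir)
  Dir.glob(File.join(app_root, "predeploy", "*")).sort.map do |staged|
    FileUtils.cp(staged, live_dir)
    File.join(live_dir, File.basename(staged))
  end
end

Dir.mktmpdir do |app_root|
  stage_config(app_root, "app.properties", "cache.hosts=app02.pod1.qa\n")
  # Nothing is live yet; the application is untouched until promotion.
  puts promote_configs(app_root).inspect
end
```

Keeping the two steps apart is what lets the configuration be generated at any time while the change itself waits for the right moment.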

There is a gray area here. In the perfect world your product run list will have all dependencies needed for the product (including system dependencies), but then you find yourself in the one run list model. We settled on a tight-loose coupling between system and product. When a server is bootstrapped the system node bootstrap goes first then the product node bootstrap, then there is a running system. After that they are loosely coupled and can iterate at their own pace. There may be some cases where one may depend on another (e.g. big datacenter-wide structure change), but by taking the 98% lesson, try to make those go away instead of coding for them or ensure they are so seldom they can be a one-off event.

Lessons Learned

It was a huge victory the day someone said "server X isn't working the way it should" and I immediately, without effort, threw away the notion that it was configured wrong, and started looking elsewhere. Think of the mental energy you save when you can eliminate one whole set of variables with a wave of a hand.

The Toss Test. When we needed to upgrade the OS, we re-kicked the boxes and bootstrapped from scratch. Throw the old box away and build it fresh.

If you automate something, you have to automate it 100%. If automation manages a file, put a banner at the top saying "Managed by Chef. All manual changes will be overwritten!!!!"
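
A small sketch of the banner idea using Ruby's standard ERB library, which is also the template language Chef templates use. The helper names, the sample template and the assumption that `#` starts a comment in the target file format are all illustrative.

```ruby
require "erb"

# Banner text from the advice above; "#" assumes a properties-style file.
def managed_banner
  "# Managed by Chef. All manual changes will be overwritten!!!!"
end

# Render a template from a hash of variables and put the banner on top.
def render_managed_file(template_source, variables)
  body = ERB.new(template_source).result_with_hash(variables)
  "#{managed_banner}\n#{body}"
end

# Hypothetical template and values.
template = "memcached.servers=<%= servers.join(',') %>\n"
puts render_managed_file(template, { servers: ["app02.pod1.qa"] })
```

In practice the banner usually lives in the template itself, so nobody can forget to add it.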

As scary as it sounds you do want chef-client to run often (you define often). The less change between runs the better. Once all change is converged, subsequent runs will do nothing and are safe.
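
The reason repeated runs are safe is that a converged run compares desired state with actual state and acts only on the difference. The sketch below illustrates that principle for a single file in plain Ruby; it is not Chef's implementation, and `converge_file` is a hypothetical helper.

```ruby
require "tmpdir"

# Write the file only when its content differs from the desired content.
# Returns :updated when a change was made and :unchanged otherwise.
def converge_file(path, desired_content)
  if File.exist?(path) && File.read(path) == desired_content
    :unchanged
  else
    File.write(path, desired_content)
    :updated
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "app.properties")
  puts converge_file(path, "port=11211\n") # first run makes the change
  puts converge_file(path, "port=11211\n") # second run has nothing to do
end
```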