Wednesday, January 23, 2013

Upgrading Chef Server from 0.10.8 to 10.18.2

Here is my story of upgrading Chef from 0.10.8 to 10.18.2 while moving to a new server and a newer OS. Please comment and tell me where I could have done better.

We were running Chef Server 0.10.8 on CentOS 5.4 with Ruby 1.8.7, and I wanted to upgrade to the latest release of Chef while moving to CentOS 6.3 and Ruby 1.9.3 at the same time. That ruled out an in-place upgrade on the existing server: I needed to migrate my Chef server to a new system and upgrade everything.

My first plan was the cautious route, since I wasn't sure Chef could be upgraded across that many revisions: export all the data as JSON, build a new Chef 10.18.2 server, then import the JSON. It worked perfectly EXCEPT that none of the clients could authenticate to the server, even though I had imported their public keys. I could create a new client key on the Chef server and the node could authenticate with it, but it wouldn't with an imported key. I spent about a day on this without finding a resolution. Maybe someone else will have better results.

Next I tried to just copy the CouchDB database. Unfortunately I flubbed things a few times and spun my wheels for a few days because things didn't work (mostly my fault). Finally I found a method that works:

1) Compile Ruby 1.9.3 and rubygems 1.8.23
2) Install Chef via chef-solo http://wiki.opscode.com/display/chef/Installing+Chef+Server+using+Chef+Solo
3) Work around the CentOS 6.3 bug that breaks the rabbitmq init script, documented at https://bugzilla.redhat.com/show_bug.cgi?id=878030. We decided to change these two lines in the rabbitmq init script:

CONTROLPROG=/usr/sbin/rabbitmqctl
CONTROL="sudo -u ${USER} ${CONTROLPROG}"

4) Add the chef vhost, user and permissions to rabbitmq by hand, because rabbitmq was broken when chef-solo tried to do it. Also change the Solr maxFieldLength to 100000 to work around the problem of indexing nodes with lots of attributes.

/usr/sbin/rabbitmqctl add_vhost /chef
/usr/sbin/rabbitmqctl add_user chef testing
/usr/sbin/rabbitmqctl set_permissions -p /chef chef ".*" ".*" ".*"

ex /var/lib/chef/solr/home/conf/solrconfig.xml
:%s/<maxFieldLength>10000/<maxFieldLength>100000/g
:wq
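
The ex edit above can also be done non-interactively, which is handy if you end up rebuilding the server a few times like I did. This is only a minimal sketch: the function name bump_max_field_length is made up for illustration, and it assumes the setting appears in the file as a plain <maxFieldLength>10000</maxFieldLength> element.

```shell
# Hypothetical helper: raise Solr's maxFieldLength from 10000 to 100000
# in the solrconfig.xml given as the first argument.
bump_max_field_length() {
    conf="$1"
    tmp="$conf.tmp.$$"
    # Match the closing tag too, so a second run leaves 100000 alone
    # instead of turning it into 1000000.
    sed 's|<maxFieldLength>10000</maxFieldLength>|<maxFieldLength>100000</maxFieldLength>|g' \
        "$conf" > "$tmp" &&
    # Write back through the existing file so its owner and mode are kept.
    cat "$tmp" > "$conf" &&
    rm -f "$tmp"
}
```

Usage would be something like: bump_max_field_length /var/lib/chef/solr/home/conf/solrconfig.xml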

5) Shut down couchdb and rename /var/lib/couchdb/chef.couch to chef.couch.bak
6) Copy the couchdb database (chef.couch) from the old server into /var/lib/couchdb (the old server can still be running)
7) chown chef.couch to be "couchdb:couchdb"
8) Start couchdb back up
9) Start rabbitmq, chef-solr, chef-expander, chef-server, chef-server-webui (in that order)
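
Steps 5 through 9 can be sketched as a small script. Treat this as a rough outline and not exactly what I ran: the function names, the root@ login, the assumption that the old server keeps chef.couch in the same path, and the init service names (couchdb, rabbitmq-server, chef-solr and so on) are all assumptions to adjust for your install. Set DRY_RUN=1 to print the commands instead of running them.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: swap in the CouchDB file from the old server ($1)
# and bring the Chef services back up in order.
migrate_couchdb() {
    old_host="$1"
    db=/var/lib/couchdb/chef.couch

    run service couchdb stop &&
    run mv "$db" "$db.bak" &&
    run scp "root@$old_host:$db" "$db" &&
    run chown couchdb:couchdb "$db" &&
    run service couchdb start || return 1

    # Service names are assumed; check /etc/init.d on your system.
    for svc in rabbitmq-server chef-solr chef-expander chef-server chef-server-webui; do
        run service "$svc" start || return 1
    done
}
```

A dry run would look like: DRY_RUN=1 migrate_couchdb old-chef.example.com (the hostname is a placeholder).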

Now I wish I could say it was working at this point, but all the cookbooks were broken. Maybe someone from Opscode can tell me where the cache of cookbook files lives so it can be copied over. I tried re-uploading all the cookbooks with knife cookbook upload -a -d, but that still didn't give me working cookbooks. In the UI, clicking a cookbook said "end of file reached" and showed no data: the name and version were there, but no contents. I had to knife cookbook delete -p each cookbook and upload it again, after which MOST of the cookbooks worked. Some still gave the "end of file" error, and I had to purge and upload those one at a time.
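
The purge and re-upload can be looped over the cookbook list. A minimal sketch, assuming every cookbook is available locally in your knife cookbook_path; the function name reupload_cookbooks and the KNIFE override variable are made up for illustration. Note that --purge removes the cookbook's files from the server for all versions, so make sure you have local copies first.

```shell
# Hypothetical helper: purge and re-upload each cookbook named on stdin
# (one per line; anything after the first word, such as a version, is ignored).
# Set KNIFE=echo to preview the commands without touching the server.
reupload_cookbooks() {
    knife_cmd="${KNIFE:-knife}"
    while read -r name rest; do
        [ -n "$name" ] || continue
        # </dev/null keeps knife from swallowing the rest of the list.
        $knife_cmd cookbook delete "$name" --all --purge --yes </dev/null &&
        $knife_cmd cookbook upload "$name" </dev/null ||
        echo "FAILED: $name" >&2
    done
}
```

Usage would be something like: knife cookbook list | reupload_cookbooks. Anything reported as FAILED still needs the one-at-a-time treatment.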

I hope this helps someone. Let me know if you want more detail on any step. I still haven't done extensive testing on the new server but the few clients I tested seem happy. I'm really looking forward to an omnibus install for Chef Server 11 and hope the migration is not painful.

Thursday, January 3, 2013

Talking Deming with my Dad

Driving home from a hunting trip with my father, the conversation turned to work. I started trying to explain how we're adapting "some old concepts from the manufacturing industry, now called Lean" to the IT industry, how remarkably well it fits, and how people are excited to find that when you boil down the patterns that make the best IT companies tick, you rediscover patterns that were spelled out in the manufacturing industry half a century ago. As I'm feebly trying to put this into words, my father stops me and says, "Back at Best Foods I was in the Quality Control department and our driving principle was 'Quality is conformance to specification'. That came from a guy named... umm..." And I pipe up, "Deming?" And he lights up: "Yeah, Deming." He then goes on to explain how Best Foods (maker of Skippy Peanut Butter and Hellmann's Mayonnaise) was a great company to work for, and how the QC department had the ability to stop the line and was an integral part of the business. Unfortunately they closed the plant he worked in, and he was unwilling to move out of state, so he went to Anderson Clayton Foods (now owned by Kraft) to work in QC there. At Anderson Clayton (ACF) they used the alternate definition, "Quality is fitness for use." He's not sure whether it was the definition of quality that made the difference, or the fact that the QC department at ACF reported to the Plant Manager and the Plant Manager's incentives were based on product shipped. I quote: "We shipped some marginal product."

Where am I going with this? I'm not exactly sure. It was great bonding with my father while talking about Deming and quality, and about how the things he dealt with are big concerns in the IT industry today. What struck me was the stark contrast between how he sounded talking about how great it was to work at Best Foods and how lifeless he was talking about Anderson Clayton.

I guess my conclusion is that this is another data point for my belief that itrevolution.com is on the right path in looking for patterns from Deming and Lean. When I asked my father whether, based on his experience, Deming would be a good pattern for us to follow, he answered without hesitation: "Yes". He summed it up by saying, "So you're trying to translate Deming from widgets to digits."