Pets vs Cattle vs complexity

So way back in November 2014 I started an experiment in configuration management. It might have started earlier, but that’s the first commit date in the git repo. Basically I was motivated (somehow) by the realisation that the pets vs cattle analogy worked really well for me. My handful of machines were more bespoke and unique (pets) than they needed to be, and it would be a good idea to make them more throwaway (cattle).

At the time I was already PXE booting kickstart scripts, with a fairly complete base build coming out of the kickstart process. My media PC was entirely configured this way, and could be rebuilt on demand in about 15 minutes elapsed time. So if anything went bad with a package update, it was already cattle and not a pet. Other machines (desktops and VMs) were built with kickstart but were less cattle and more pet-like. So the pets vs cattle thing had room for improvement, and the other thing this methodology needed was a configuration management tool. Kickstart scripts were not it, as they didn’t work for cloud, where you build from a cloned image.

After some reading around and talking to people, a few options stood out. Some people loved Puppet, others liked Chef, and a newcomer (at the time) was Salt, which was gaining some interest. All of these needed agents installed on the destination, and I think all needed a server (application) to drive them. This is ignoring that one was written in Java, one in Ruby (and Erlang) and one in Python, so they also needed their base language installed on the destination to function. To me this meant I couldn’t escape the kickstart script completely, as it would have to install some software beyond the minimum, plus the agent software.

Then I found Ansible, which Red Hat was sponsoring and Fedora was using. Ansible only needed ssh on the destination to work - no agent at all. However, it did benefit from having Python on the destination for most of its functionality.

The methodology of each of these tools varied a bit.

Puppet worked on a model approach and tried to make the destination realise the model. Scripts were written in a custom language and called manifests.
Chef used the model idea too and applied the recipe to fit the target to the model. Recipes were written in a custom Ruby-style language.
Salt I think was the same again, so I didn’t look too closely.
Ansible was pretty much a top-down script of custom modules. The modules (mostly) had checks so they could flag whether they needed to do anything, and track success - idempotent scripts were the key. Your stuff is written in YAML documents called plays, and they are arranged into playbooks.

So I started off with Ansible, trying to translate my kickstart scripts into Ansible roles and playbooks. I split out the common bits which apply to all machines into a common role, which even worked across software versions and distributions (various releases of CentOS and Fedora, and later Debian). Each system type then had several roles assigned, which apply the steps in the playbook in a top-down fashion. My kickstart script shrank to a totally minimal CentOS/Fedora install which adds a user and an ssh key only. From there Ansible could connect and run the playbooks to turn a machine into any system type.
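As a rough sketch of that layout (not the actual playbooks from the repo), a top-level playbook might assign roles like this. The file name `site.yml`, the `mediapc` host group and the role names are all made up for illustration.

```yaml
# site.yml - hypothetical top-level playbook; group and role names are assumptions
- hosts: all
  become: yes
  roles:
    - common        # bits shared by every machine, whatever the distribution

- hosts: mediapc
  become: yes
  roles:
    - mediapc       # extras that turn the minimal install into the media PC
```

Running something like `ansible-playbook -i inventory site.yml` against a freshly kickstarted machine would then apply the roles in order, top down.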

Early teething issues annoyed me, like not being able to do multiple things in one task step. So you end up with heaps of tasks in a playbook, each doing one thing. The exception was anything that could be done repetitively from a list (so multiple calls to the same module could be parameterised from a list/dict of items). Playbooks could be included and passed variables, so some high-level automation was possible. Ultimately it was a very verbose way of doing things.
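The list exception looks something like the sketch below, using `with_items` to call one module several times. The package names are just examples, not what my roles actually install.

```yaml
# One task, one module, many items; the package list is illustrative only
- name: install common packages
  yum:
    name: "{{ item }}"
    state: present
  with_items:
    - tmux
    - rsync
    - privoxy
```

This only helps when every item goes through the same module. As soon as the steps use different modules, each one needs its own task.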

You end up having to do this


- name: setup privoxy
  lineinfile: dest=/etc/privoxy/config state=present regexp="^listen-address" line="listen-address {{ ansible_default_ipv4.address }}:8118"

- name: insert firewalld rule for privoxy
  firewalld: service=privoxy  permanent=yes state=enabled immediate=yes

- name: enable privoxy
  service: name=privoxy state=started enabled=yes

rather than what made more sense


- name: setup privoxy
  lineinfile: dest=/etc/privoxy/config state=present regexp="^listen-address" line="listen-address {{ ansible_default_ipv4.address }}:8118"
  firewalld: service=privoxy  permanent=yes state=enabled immediate=yes
  service: name=privoxy state=started enabled=yes

though there’s a new keyword since 2.x, block, which I need to look at. It might let me do this.
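For what it’s worth, here is a sketch of what the privoxy tasks might look like under `block`, as far as I understand it. A block doesn’t let one task call several modules; it groups tasks so that things like `when` or `become` are written once for the lot. The `when` condition here is only an example of something worth sharing.

```yaml
# Three tasks still, but grouped; the condition applies to every task in the block
- block:
    - name: set privoxy listen address
      lineinfile:
        dest: /etc/privoxy/config
        state: present
        regexp: "^listen-address"
        line: "listen-address {{ ansible_default_ipv4.address }}:8118"

    - name: insert firewalld rule for privoxy
      firewalld:
        service: privoxy
        permanent: yes
        state: enabled
        immediate: yes

    - name: enable privoxy
      service:
        name: privoxy
        state: started
        enabled: yes
  when: ansible_os_family == "RedHat"
```

So it tidies up the repeated conditions, but it doesn’t make the playbook any less verbose.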

As time went on (I think I started with Ansible 1.6), I hit issues where modules lacked the one ability I needed, or changed their behaviour. Then other system things changed - yum to dnf, iptables to firewalld. These necessitated using conditions on tasks to check the distribution or release version (which was easy, but meant doubling up on tasks, one for each way of doing it). It seemed ok, and I plodded on. Each release of Ansible got better: 1.9 was good, 2.0 was a big improvement and now I’m on 2.3. With each iteration more modules have been added, issues have been fixed and it’s got more powerful, which is great.
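The doubling up looks roughly like this: the same step written twice, each guarded by a condition on a gathered fact. This is a minimal sketch using privoxy as the example package, not a copy of my real tasks.

```yaml
# Same intent, two tasks: one per package manager
- name: install privoxy with yum (CentOS)
  yum:
    name: privoxy
    state: present
  when: ansible_distribution == "CentOS"

- name: install privoxy with dnf (Fedora)
  dnf:
    name: privoxy
    state: present
  when: ansible_distribution == "Fedora"
```

On any given host one of the two tasks is reported as skipped, which is harmless but adds to the noise.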

I expanded my playbooks to include my omnios host server and the package repositories on there. I created a parameterised play which was given the release name and a tcp port, and it would create the source repo, populate it and start the service for it. Rerunning the playbook would update the repo. Happy days.
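A parameterised include along these lines is one way to get that effect. The file name `pkg_repo.yml`, the variable names and the values are all hypothetical; the included file would hold the tasks that create, populate and start the repo.

```yaml
# Hypothetical task include; pkg_repo.yml, release_name and repo_port are made-up names
- include: pkg_repo.yml
  vars:
    release_name: r151022
    repo_port: 10022
```

Adding another release is then one more include with different values. Note that plain `include` is what the 2.3-era releases use; later versions split it into other keywords.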

iPXE worked really well. Simply include another playbook and the iPXE boot menu was updated. Change a variable for which Fedora release I wanted and it would download the pxeboot files (kernel + initrd) to the appropriate web server (iPXE rocks by booting from http) and update the menu. Easy. Except it wouldn’t clean up the old files unless you wrote a task to do that - disk is cheap anyway.
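A sketch of how such tasks could look is below. The variables `fedora_release`, `fedora_mirror` and `ipxe_web_root`, the template name and the directory layout on the mirror are all assumptions for illustration, so check them against a real mirror before relying on them.

```yaml
# fedora_release, fedora_mirror and ipxe_web_root are hypothetical variables
- name: create directory for this Fedora release
  file:
    path: "{{ ipxe_web_root }}/fedora/{{ fedora_release }}"
    state: directory

- name: download pxeboot kernel and initrd
  get_url:
    # mirror path layout is an assumption
    url: "{{ fedora_mirror }}/releases/{{ fedora_release }}/Everything/x86_64/os/images/pxeboot/{{ item }}"
    dest: "{{ ipxe_web_root }}/fedora/{{ fedora_release }}/{{ item }}"
  with_items:
    - vmlinuz
    - initrd.img

- name: update the iPXE boot menu
  template:
    src: menu.ipxe.j2        # hypothetical template name
    dest: "{{ ipxe_web_root }}/menu.ipxe"
```

Nothing here removes the previous release’s directory. That would need an extra `file` task with `state: absent`, which is exactly the cleanup task I never bothered to write.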

Then I tried my router - a VyOS VM. I had a templated config for this, so I looked at applying a script by playbook. Some initial success spurred me on, but eventually I hit an issue with changes. The script just couldn’t be applied in an idempotent way. The router needed to delete ALL firewall rules and run the script inside one transaction to handle deletes or changes. This meant EVERY run would dump and reload the firewall, even if no change was present. So I stopped there and kept on elsewhere.

Ansible modules had changed over this time (2 years), so I could clean up some old hacks that were there. I’d marked them so they were easy to find. Firewalld no longer needed a service reload, as the change was immediate. Clean up here and there. I also wasn’t using “old” CentOS any more, so I could dump some old hacks I had for CentOS 6 now that everything worked on CentOS 7. This still left CentOS on yum and Fedora on dnf for packages. The “unified” package module didn’t exist yet.

More apps came along and it was easy to automate them. OnCommand Insight was easy to automate the install without interaction on centos. I even got the playbook to hit the API to install the license key.

Sounds like a great success. Except now I have a mess of playbooks written in YAML which need testing regularly to ensure upstream changes don’t break them: changes both in Ansible modules and in distribution packages. So I set up a good way to test them on VMware: clone a base image and apply the playbook, over and over. This way I didn’t need to PXE boot the VM manually to test. I never got to the point where I felt comfortable that rerunning the playbook carried no risk of damaging/trashing the proper machine, so testing was required. I’m not sure how close I got either. It might have been just one more round of cleanup and happy days, or it could have been heaps - I just didn’t have any data points to draw a conclusion from.

And I still netboot and kickstart ESX host builds.

I’d failed. I’d automated my pets again.

Automatic cat feeder