Configuration Management for Continuous Delivery (part 2)

Containment and test-driven infrastructure

In the first part of this blog, I argued that the evolution of product designs into separate branches (where differences can be isolated and contained), although a simple approach to versioning, can also lead to an explosion of complexity for both developers and consumers. This brings cognitive and maintenance burdens, and hence increasing uncertainty. By collapsing these `many worlds' or branches into a finite number of desired end-states (or semantic buckets), one effectively merges multiple worlds back into a single one, and avoids Deming's random walk of "tampering".
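The collapse of many possible states into a few semantic buckets can be sketched as a classification: any observed state maps to whichever desired end-state it satisfies. A minimal illustration in Python; the bucket names and state flags here are invented for the example, not taken from any real tool.

```python
# Collapse 'many worlds' of observed states into a finite set of semantic
# buckets (desired end-states). Bucket names and flags are hypothetical.

BUCKETS = {
    "webserver": {"nginx_installed", "port_80_open"},
    "database":  {"postgres_installed", "port_5432_open"},
}

def bucket_of(observed):
    """Classify an observed state by the desired end-state it satisfies."""
    for name, required in BUCKETS.items():
        if required <= observed:       # all required promises are kept
            return name
    return None                        # state matches no defined bucket

# Many different observed states land in the same bucket:
print(bucket_of({"nginx_installed", "port_80_open", "stray_file"}))
print(bucket_of({"postgres_installed", "port_5432_open"}))
```

However many incidental differences two hosts accumulate, they count as "the same" if they fall into the same bucket, which is what makes the buckets testable promises rather than snapshots.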

This view is a different way of looking at the directly practical advice on Continuous Delivery expounded by Jez Humble and co-workers, in their excellent book. I phrase it this way, because it then slots neatly into the theory of configuration management.

The key thing to ensure is a set of carefully defined semantic buckets, which form the basis of testable promises about function.

The many faces of continuous delivery

Not all Continuous Delivery pipelines are about software development. The airline industry, for example, does a pretty good job of delivering continuous back-to-back flights by scheduling rather different infrastructure slots. But even within IT services, you don't have to make your own software to profit from what Continuous Delivery is all about. You might be deploying someone else's software, purchased off the shelf or downloaded for free. You might be writing number-crunching jobs.

CD is really just about the merger of two ideas: dynamically scheduling change, and avoiding semantic complexity while doing it. Last year, I wrote about how dynamics and semantics of systems can be understood together in my book In Search of Certainty. And I want to add this appendix to that discussion.

Regardless of whether the source is internal development, external purchase, or batch jobs, there is a need for testing and closed-loop convergence to achieve stable functionality.

Testable configurations

In one sense, everything is configuration management (even airline traffic). We schedule resources (planes, virtual machines, sockets, etc.) to keep some kind of promises. What model-based configuration (at least CFEngine's idea of convergence) does is to fold testing and deployment into one unified process, where patterns can be exploited. Since every single configuration promise is tested and verified in real time, and repeated continuously through an automation cycle (usually its 5-minute default cycle), it takes away the burden of policing multitudes of things.
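The fold of testing and deployment into one process can be sketched as a loop in which every promise carries both a test and a repair, and the agent re-verifies all of them on every cycle. This is a minimal sketch in Python, not CFEngine's actual syntax; the promise names and state keys are invented.

```python
# A convergent agent pass: verify every promise, repairing only what drifted.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Promise:
    name: str
    is_kept: Callable[[dict], bool]   # test: is the desired state present?
    repair: Callable[[dict], None]    # convergent fix, applied only on drift

def run_cycle(state: dict, promises: list) -> list:
    """One cycle of the agent: testing and deployment are the same act."""
    repaired = []
    for p in promises:
        if not p.is_kept(state):
            p.repair(state)
            repaired.append(p.name)
    return repaired

promises = [
    Promise("ntp",  lambda s: s["ntp_server"] == "ntp.example",
                    lambda s: s.update(ntp_server="ntp.example")),
    Promise("perm", lambda s: s["mode"] == "0644",
                    lambda s: s.update(mode="0644")),
]

state = {"ntp_server": "wrong.example", "mode": "0644"}
print(run_cycle(state, promises))   # first pass repairs the drifted promise
print(run_cycle(state, promises))   # second pass finds nothing to repair
```

Because each pass both verifies and enforces, the second run is a no-op: that idempotence is what makes the continuous cycle safe to repeat every few minutes.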

Configuration separates intent from implementation: design from build, and policy from enforcement, because it models through automatable primitives. There is not even much need for version labels in the dynamics of configuration, no need for rollback. Versioning can be entirely isolated to the semantic part (policy or intent), because you don't have to care about what comes later. Convergence to a desired state will get you there with automated error-correction.

If you make a mistake, you correct the forward semantic process (policy design), and the dynamical follow-up (running the configuration agent) will self-heal through hands-free automation. It's directly analogous to correcting code which gets picked up in automated builds.

This is even more important in infrastructure than in software development, because very few changes can be isolated as meaningfully reversible transactions in infrastructure. Once applied, there is no guaranteed `undo' in a folding timeline without a convergence model. You can undo the policy design change (like software), but then you've got to implement it. There are no real closures in infrastructure (not even virtual machines are closures). So, one has no choice but to abandon `rollback' as an artificial approach that tries to use relative-change for error correction (Deming's tampering model). The forward-moving `zero' model of absolute change through targeted convergence will always work (Shannon error correction).
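The difference between rollback and convergence can be made concrete: a relative change (a delta) is only correct if you know exactly what state you started from, whereas an absolute change lands on the target regardless of history. A toy illustration in Python, in the spirit of Deming's tampering versus Shannon-style correction to a fixed symbol; not any tool's real behaviour.

```python
# Relative change ('rollback' thinking) versus absolute change (convergence).

def apply_delta(value, delta):
    """Relative change: correctness depends on the state you started from."""
    return value + delta

def converge(value, desired):
    """Absolute change: the outcome is the desired state, whatever came before."""
    return desired

# Rollback: undoing a +10 only restores the original if nothing else changed.
v = apply_delta(90, +10)       # 100, as intended
v = apply_delta(v, -10)        # 'undo' works here only by luck of history

# Convergence: whatever drifted state we find, we land on the target.
for drifted in (37, -5, 100):
    assert converge(drifted, 100) == 100
```

The rollback path silently assumes it knows the last change; the convergent path makes no such assumption, which is why it keeps working when infrastructure has no real `undo'.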

At least, this is true as long as you don't deprecate or change the semantic buckets (or `zeroes') of the product design, breaking the symbols of convergence. This leaves us with two kinds of change:

  1. Changes in the semantic targets of a design (signal).
  2. Errors relative to those targets, `regressions' (noise).

Containers are the new cron

The return of the `container' (e.g. Solaris Zones, Docker, etc.) has made scheduling jobs new again (containers have existed for a long time). Container management is a scheduling problem; it is essentially like batch processing. In software development we might think of scheduling processors, memory, etc. In `ops', we think more about the machine-level abstraction and communications.

When I started CFEngine in 1993, a major use-case was to manage all the cron jobs on different machines. One approach was to copy or edit a separate cron file for every host from a central repository (basically a golden-image approach). But that doesn't take advantage of a model. A better way was to have only one cron job on every machine: to run CFEngine, which could then identify which patterns the host belonged to at any given time. Then all of the possible jobs across the entire orchestrated domain could be in a single policy file, under a common model. 200 things suddenly become one.
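The `one cron job runs the agent' pattern can be sketched as follows: every host evaluates the same shared policy, and pattern-matching on host classes decides which jobs apply. A minimal sketch in Python; the class names and classifier are invented, and this is not CFEngine's policy language.

```python
# One shared policy for the whole domain; each host selects its own jobs
# by the classes (patterns) it belongs to. Class names are hypothetical.

POLICY = [
    ("rotate_logs", {"any"}),
    ("backup_db",   {"db_servers"}),
    ("purge_cache", {"web_servers"}),
]

def classes_for(hostname):
    """Toy classifier: derive the classes a host belongs to from its name."""
    classes = {"any"}
    if hostname.startswith("db"):
        classes.add("db_servers")
    if hostname.startswith("web"):
        classes.add("web_servers")
    return classes

def jobs_for(hostname):
    """Every host evaluates the same policy; patterns decide what applies."""
    classes = classes_for(hostname)
    return [job for job, wanted in POLICY if wanted & classes]

print(jobs_for("db01"))    # the db host picks up backup_db plus common jobs
print(jobs_for("web07"))   # the web host picks up purge_cache instead
```

Adding a host means nothing: it classifies itself. Adding a job means one line in the shared policy, not an edit to every crontab.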

The same issue is back with us in the shape of containers. Today, some are arguing again that we don't need configuration management, because we can keep it simple by using fixed immutable images. It's true, you can do that. You can also empty a lake with a spoon, but would you? It depends what it costs you.

DIY economics versus patterns

Adrian Cockcroft pointed out to me that my aversion to branching in infrastructure comes from thinking more like an operator than a developer. At Netflix, there are many small services that work together, and the authors of a particular micro-service are responsible for making and managing their own golden infrastructure. They make their own `immutable servers', which live for just as long as they need to, mirroring the branching of code. In a world where there are many autonomously responsible small parts, one can spread the cost of templated images relatively easily, just using the brute force of autonomous agents. This is like putting a human to do what a CFEngine agent does. It is scalable (if more error-prone), as long as each human doesn't end up with too much to do.

Promise theory suggests that such autonomous collaboration is fundamentally simple, and a natural way to approach scaling, but with it also comes a collaboration cost, and risk of fragility. It puts all the agents on the defensive, which Adrian well knows, hence his championing of methods for combatting that with a culture of `anti-fragility'. But it puts all of the cognitive burden onto humans, where much of it could be modelled and automated.

Adrian was right that my automation view is `ops-centric' (in fact we don't disagree). In a developer-centric view, people want to think in terms of push-button control, which supports DIY thinking. But I argue that it is a little selfish to put what we want ahead of what could be best, when money and even lives are at stake. `Dev' and `ops' are basically the same thing, just with a different set of promises.

We should not be confused: in a collaboration of microservices, like Netflix's, the separation of promises is not branching. One should not confuse branching infrastructure containers with the breakdown into microservices; co-existence of micro-services is not the same as branching, because the point is precisely to make a collaborative system. Separation is what I refer to as `pulling things apart to keep them together' in my book. It doesn't remove the need for continuity. Indeed, it magnifies it, because with a greater number of parts, the combinations are quadratically larger. Ultimately, we must bring the parts together like a distributed `make' at the level of intent.

A semantically noisy channel

Enough of parallelism; what about the serial pipeline of steps we have to take to deliver a product? Amdahl's law tells us that this is the bottleneck of any process. A lot has been written about this around DevOps, so suffice it to say that configuration `make' can also play a role here (which I will demonstrate in my next blog on Kindle publishing with LaTeX using CFEngine).

Here, it's not just errors of final intent, but sequential errors in the transformations that incrementally assemble data and product. These stand in the way of smooth delivery, whether it is stages of construction on a car, or format-encoding conversions on data. The tools we have available do not always do exactly what we need, so we have to massage what comes out of one before feeding it into the next.
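The massaging between stages can be sketched as explicit adapters composed into the pipeline, so that each hand-off, the place where sequential errors creep in, is a named, testable step. A toy illustration in Python; the stage and adapter names are invented, not a real toolchain.

```python
# A serial pipeline where each stage's output must be massaged before the
# next stage can consume it. Stages and adapters are hypothetical.

def build(source: str) -> bytes:
    """Stage 1: a tool that emits raw bytes with stray whitespace."""
    return ("  " + source + "  ").encode("utf-8")

def fix_encoding(blob: bytes) -> str:
    """Adapter: massage stage 1's output into what stage 2 expects."""
    return blob.decode("utf-8").strip()

def package(text: str) -> dict:
    """Stage 2: expects clean text, produces the deliverable artifact."""
    return {"artifact": text, "ok": True}

def pipeline(source: str) -> dict:
    """Compose stages with explicit adapters; each hand-off is a place
    where a sequential error can arise and be corrected."""
    return package(fix_encoding(build(source)))

print(pipeline("hello world"))
```

Making each adapter a separate, testable promise means a format change in one tool breaks one named step, not the whole opaque chain.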

These tasks too can be handled with configuration (see Kindle publishing with LaTeX using CFEngine).

Continuous test-driven software on a service

Convergent, desired end-state configuration and software both benefit from having a unified model, rather than a collection of disconnected branches of templating. The short-term simplicity of images leads to uncontrolled divergence in the long run. This is why configuration was born in the first place.

With a fast configuration management engine, you can quickly build the infrastructure, test its promises, and schedule it, dynamically and semantically, without allowing complexity to run riot. You can do the same for a software build.