Universality and IT management

The death this year (14 October 2010) of the Polish-French-American mathematician Benoît Mandelbrot, best known for introducing the idea of fractal dimension to mathematics, reminded me of issues that I have moved away from over the last ten years, but which are every bit as important now as they were back then. We must learn to appreciate that scale is not merely a design issue in IT management -- there is much to be learned if we approach modern datacentre management more scientifically. It might hold some surprises in store.

The Mandelbrot set, which is surely the most iconic symbol of Mandelbrot's contribution to science, is not merely an intriguing image of immense beauty; it symbolises an important phenomenon, frequently ignored in computer science: that of instability and the critical importance of initial or boundary conditions to eventual outcomes.

Mandelbrot's work came to be associated more with the physics of complex systems (`Chaos Theory'), but by neatly compartmentalising topics like this, we merely hide them from the general consciousness where they might do some good. The significance of the Mandelbrot set is that it represents a boundary between stability and instability in a dynamical system. It shows how deceptively simple problems (in this case iterating a simple quadratic map) can defy our intuition and lead to unstable behaviour. To be surprised by the Mandelbrot set is to see why software contains so many bugs. Human-computer systems are also dynamical systems.
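
To make that sensitivity concrete, here is a minimal sketch in plain Python (illustrative only): the same trivial iteration z -> z*z + c either settles down or escapes to infinity, depending on an arbitrarily small change in the constant c.

    # Iterate z -> z*z + c and report whether the orbit stays bounded.
    def escapes(c, max_iter=1000, bound=2.0):
        """Return the step at which the orbit exceeds |z| > bound, or None if it stays bounded."""
        z = 0j
        for n in range(max_iter):
            z = z * z + c
            if abs(z) > bound:
                return n
        return None

    # Two nearby values of c, straddling the boundary of the Mandelbrot set:
    for c in (-0.75 + 0.00j, -0.75 + 0.10j):
        n = escapes(c)
        print(c, "stays bounded (stable)" if n is None else "diverges after %d iterations" % n)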

Stability first

Concerning the stability of computer systems, we seem to have learned little in the 10 years since I wrote about the primacy of stability as a management paradigm (for a full discussion, see Analytical Network and System Administration). With the exception of Cfengine and IBM's autonomics, the message about dynamics did not catch on. In the aviation industry, no one thinks twice about placing the stability of aircraft above other design criteria. There is a simple connection between stability and safety. For some reason, this is less obvious in the case of computers. There is a further twist, however: computer programs are discrete, or digital, not bulky and statistically robust like the systems of the natural world. Ironically, this makes them more sensitive to instability, not less.

The principle around which I built the `immune system' and configuration software Cfengine was (and is) this:

Forget about exactly what you want, and start with what is stable and achievable. Then find the closest stable candidate to what you want, and build on that.
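
As a minimal sketch of what a convergent operation looks like (plain Python for illustration; this is not Cfengine's own policy language, and the file path is just an example), the operation compares the actual state with the desired state and repairs only the difference, so it can be repeated safely and always ends at the same stable outcome:

    import os, stat

    def promise_file_mode(path, desired_mode=0o644):
        """Converge a file's permissions towards a desired mode; do nothing if already compliant."""
        current = stat.S_IMODE(os.stat(path).st_mode)
        if current == desired_mode:
            return "kept"             # already in the desired state: no action needed
        os.chmod(path, desired_mode)  # repair only the deviation
        return "repaired"

    # Idempotent by construction: run it once or a hundred times and the result is the same
    # stable state, e.g. promise_file_mode("/tmp/example.conf")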

Some critics still rail against this viewpoint. I often hear `we don't want notions of convergence and probability, just change the computer to the state we want by command -- make it so!'. The field of deontic logic (essentially the study of rules and obligations) has made a career of examining the idea that one can impose restrictions on external entities and agents. If one looks at what this field has achieved in its fifty or so years, it amounts to very little. (The failure of deontic logic to say anything useful about anything is one of the motivators behind Promise Theory.)

`Just make it so!' This is a crude and naive point of view. Of course, one can try. We can make any kind of policy and insist that it be obeyed. Imagine passing a law that people are only allowed to stand on one leg, and must keep the left foot at least 20cm above the ground at all times. This might satisfy someone's view of correctness, but it would be hard to enforce, because it insists on a highly unstable state.

Scale and stability -- facing the unmanageable

In physics, one of the most important realizations of the past 50 years has been the importance of scale in understanding behaviour. Not only does the world look different at different scales (from 10,000 feet, or through a microscope), but it also behaves differently. This scale-dependence explains many phenomena in the natural world, like why aluminium bends, water flows and glass shatters, to mention a few; and we have to believe that it can explain phenomena in computer systems too. Or, to put it another way, we ignore it at our peril.

The concept of renormalization was originally used in physics and statistics to grapple with large numbers. The idea goes basically like this: suppose we start counting something of interest; when the numbers are small, we can easily see the difference between the numbers: 1 is different from 2 and 3, no doubt about that. But what about 1237821499273642299992773 and 1237821499273642299992783? What about the difference in height between two mountains next to each other, towering above us?

When numbers get big, we can't easily see the difference between them, or see the wood for the trees. The answer is to renormalize the numbers, e.g. by cutting off the mountain bases and keeping only the top few metres. Then you can compare them more easily. Similarly, you can subtract 1237821499273642299992700 from the numbers above and instead compare 73 and 83. To compare two specks of dust, you would magnify them 1000 times, and so on. You simply change your notion of scale, like recalibrating a set of weighing scales. Then you will see the relevant phenomena better. These small differences, although hard to see at one scale, can be amplified into significant differences at other scales.
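
In code, the renormalization trick is nothing more than removing the common scale before comparing (a toy illustration in Python):

    a = 1237821499273642299992773
    b = 1237821499273642299992783

    baseline = 1237821499273642299992700   # chop off the common 'mountain base'
    print(a - baseline, b - baseline)       # 73 83 -- the relevant difference is now visible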

This is all well and good for simple comparisons, but what is interesting about scale is how it affects relationships, as interaction is a great amplifier of effect. We might understand the behaviour of cells on a microscope slide, but who could guess that they would clump together to form elephants and human beings at sizes many orders of magnitude greater? Is this a bug, a feature, or an inevitability? Renormalization or scaling behaviour is an approach we must embrace in IT management to separate relevant behaviour from irrelevant detail.

When a system of anything (computers, say) interacts or communicates somehow, the relative sizes of the parts can have an effect on the outcome. When a bird lands on the back of an elephant, it does not change the behaviour of the elephant significantly. But when a parrot lands on the shoulder of a person standing on one leg, it can topple the person, causing them to fall to the ground. Why? Because the initial configuration of the person was unstable.

Lying on the ground is an attractor (actually it is about lowering the centre of mass of the object), but of course there are many ways one can do it. And there are many natural phenomena that share this general feature: a stationary bicycle lying on its side, a man lying on one side eating grapes, a penguin on its back shooting downhill, etc. The details are unimportant; the general principle is that, at a given scale, a perturbation will tend to cause a system to fall into a stable state (close to the ground). This is called universality -- the general inevitability of outcomes at a particular system scale. It doesn't matter what the thing is, or who made it, or even what it was asked to do; it is going to end up in a stable state, like it or not.
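
A toy simulation makes the point (illustrative Python only; the objects and numbers are invented): however different the details, each perturbed system relaxes towards its stable resting state.

    import random

    def relax(height, damping, steps=1000):
        """Let an object's height above its resting state decay, despite random perturbations."""
        for _ in range(steps):
            height += random.uniform(-0.01, 0.01)   # small external perturbations
            height *= (1.0 - damping)               # relaxation towards the attractor at height 0
        return height

    for name, start, damping in [("bicycle", 1.0, 0.05), ("person", 1.7, 0.2), ("penguin", 0.5, 0.5)]:
        print(name, round(relax(start, damping), 3))   # all end up close to 0, the stable state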

We conclude from the universality of such behaviours that it does not make sense to try to walk a tightrope in a hurricane, or to make humans stand on one leg in a crowded shopping mall. Nor does it make sense to base IT management on the assumption that a system will not fall over, or that simple quick fixes can avoid the instabilities.

The mistake system administrators and programmers often make in computing is to underestimate the inevitability of certain behaviours. No matter what we do, universal behaviours will come to dominate the fundamental issues, beyond our control.

Stability and self-repair

The computer `immune system' Cfengine was designed to encourage system architects to take advantage of natural stability, by developing convergent behaviour -- i.e. behaviour that would naturally be attracted to a desirable state. If you base tools and policies on the notion of stability, surely the desired state of the system will persist for as long as possible (making it predictable and thus usable)? Indeed, if the desired state is `lying down at the bottom of the hill' then it will also self-repair when something perturbs it by trying to push it uphill.

Traditional approaches to management fight stability -- they are hill-climbing approaches, starting from an arbitrary point and following a fragile path to an unstable point. Whether the desired state is reached at all is sensitive to i) the choice of initial state, and ii) every step of the path going as planned. Cfengine turns this approach upside down, and makes the desired state a stable point of attraction.

Traditional system configuration starts from a single known baseline and climbs to another point; Cfengine is like a valley, insensitive to initial conditions.
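
The contrast can be sketched in a few lines of Python (a hypothetical, simplified resource model, not Cfengine itself): instead of replaying a fixed sequence of commands from a known baseline, a convergent loop compares actual state with desired state and repairs only the difference, so it arrives at the same end state from any starting point and self-repairs after a perturbation.

    desired = {"package_installed": True, "service_running": True, "config_mode": 0o644}

    def converge(system, desired):
        """Repair only the deviations from the desired state; a no-op when already compliant."""
        repairs = []
        for key, want in desired.items():
            if system.get(key) != want:
                system[key] = want        # simulated repair action
                repairs.append(key)
        return repairs

    # Works from any initial condition, and again after a perturbation:
    system = {"package_installed": False, "service_running": False, "config_mode": 0o600}
    print(converge(system, desired))   # repairs everything on the first pass
    print(converge(system, desired))   # [] -- the attractor is reached, nothing to do
    system["service_running"] = False  # perturbation: the service crashes
    print(converge(system, desired))   # ['service_running'] -- self-repair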

As we develop more tools for datacentres, I believe we should make greater use of this understanding. Intelligent systems in the future would have self-knowledge and understand the universality of behaviours. They would be able to warn architects of poor decisions, and even limit the probability of unstable behaviour by design (like Cfengine).

Cfengine introduced `model-based' configuration, for a class of models that can be expressed as stable attractors. Not just any model, but stable models. Of course, there is also the problem of patchworks. There is no single model in use. The world's computer systems might be interconnected, but they represent a patchwork of overlapping models, with different goals. Moreover, the scale of systems is growing to the point where new phenomena might occur. We need to understand these phenomena better, and design to work around them.

The Mandelbrot set's fractal image is a persuasive visualization of universality in a specific area. Surely nothing designed and programmed could deliver such infinite regularity at every possible scale. It also underlines the importance of boundaries (symmetry breaking) to understanding dynamical systems. I think system administrators need to wake up to these lessons and think less deterministically about the systems they manage. We can't program our way out of universality, so we'd better understand it at all scales and harness it to our advantage.