Three Myths Holding System Administration Back...

When travelling, I often hear: "Hey, MB, I love your stuff -- but in the real world we have to do things differently".

"Really?" I reply, quietly wondering what world that might that be.

As it happens, I have yet to see a case where these mythical real-world considerations were really true -- it's more that the IT industry is suffused with a fear of change, from something it is used to towards something that would be a big improvement. IT clings to outmoded techniques and flawed methodologies; it's "the devil they know". If pressed, few SAs truly defend the methods they use, but equally few show the leadership to drive a change (Note 1).

In this essay, I have selected three fundamental `world-views' or beliefs that underpin what very many SAs do, and I shall attempt to explain why they are flawed, and lead to poor choices.

1. Hierarchy: tree structures (divide and conquer)

In books and in society, humans have been organizing things into hierarchies for most of our collective history. A tiny number of non-hierarchical societies is known to anthropologists (in the Niger delta, for instance), but on the whole, kings and leaders with their middle managers and class-system underdogs have been the norm. Institutions and government, documents and tables of contents all present hierarchical structures. It's little wonder then that we cling to them with such tenacity.

But hierarchies are often only a stepping stone to something more useful. In biology, hierarchical phylogenetic trees are associated with classification of large numbers of species, for instance, but who actually finds this to be useful information? In chemistry, Mendeléev found a more important pattern for the periodic table, placing elements into columns according to their behaviours rather than using a hierarchy to divide and conquer (animal, vegetable, mineral, etc). Escaping the dogma of hierarchies often leads to real insights.

In books, we make tables of contents to guide broad strokes, and we use an index to jump into the book as a network. Indices are not hierarchical: they list proper names, usually alphabetically, though they often have sub-categories. The WWW broke the notion of hierarchy most convincingly by allowing us to link anything to anything else based only on relevance. Companies like Alta Vista and Google won out over Yahoo and other directories because hierarchies tend to scale poorly when the number of items is large. You run into what I call the depth-versus-breadth problem.

Arranging hosts into hierarchies is fundamentally ambiguous. Do we do it by function, by location, by operating system type? There are too many characteristics to make a simple tree-like decomposition of them -- and therein lies the weakness of hierarchies.

Hierarchies are, in fact, simply spanning trees for arbitrary coverings of sets or networks of related things. Classification is about the identification of relevant subsets of a larger set. There is no special reason or necessity to stack information like Russian dolls (Note 2).

Why is hierarchy a myth that holds system administration back? Well, by forcing systems into the badly fitting shoe of a tree structure, one causes a number of problems:

  • The Depth versus Breadth problem. Either you have too many things in each box, or too many boxes inside other boxes. Either way, it's hard to deal with.
  • It implies relationships that are not really fundamental, and fails to show others that are.
  • It conveys a false sense of simplicity and uniqueness.

The alternative is to flatten hierarchies and link things together more autonomously, as in the WWW, and then use a relational search approach to manage the knowledge. This is the approach that Cfengine has taken in building Promise Theory and using Topic Maps for knowledge representation.
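
To make this concrete, here is a minimal sketch in Python (not Cfengine's own language; the host names and tags are invented for illustration) of classifying hosts by overlapping attribute sets and selecting them by intersection, rather than by walking a single tree:

    # A minimal sketch: hosts carry overlapping sets of attributes, and a
    # query is just a set intersection. No single hierarchy (by role, by
    # site, by OS) has to be chosen up front.
    hosts = {
        "web01":  {"role:webserver", "os:linux", "site:oslo"},
        "web02":  {"role:webserver", "os:linux", "site:london"},
        "db01":   {"role:database",  "os:linux", "site:oslo"},
        "mail01": {"role:mail",      "os:bsd",   "site:london"},
    }

    def select(required):
        """Return the hosts whose attribute sets contain all required tags."""
        return [name for name, tags in hosts.items() if required <= tags]

    print(select({"role:webserver", "site:oslo"}))   # ['web01']
    print(select({"os:linux"}))                      # ['web01', 'web02', 'db01']

The same host can be reached through any combination of its properties, which is the point: relevance, not position in a tree, drives the lookup.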

The final aspect of hierarchies is that they imply centralization: many subordinates clustered around a hub. Centralization is a double-edged sword. It is a strategy for consistency, but without proper caution, it can lead to bottlenecks and `single points of failure'. (In a tree, every branching point is a possible high-impact failure point.) Centralization does not scale well because of these bottlenecks; trees have certain efficiencies (which is why they are used to span networks), but they have a lot of fragility and need to be used wisely. Network architects shouldn't jump immediately to centralized architectures just because they are familiar; there are actually few cases where centralization is necessary.

2. Determinism and rollback

The second of our myths is probably the most dangerous: a reliance on being able to do "rollback". It stems from a fundamental and misleading act of faith in something that is patently wrong: the idea that computer systems behave deterministically. The argument often goes: if you only use the right tools and program the system properly, then computers do no more or less than what we ask of them, and they need not change unless we so wish it. But if we make a mistake, then there is a safety net: we can just roll back the system to some previous state and all will be well.

The trust in `rollback' really starts with this idea of determinism. Modern, multitasking operating systems cannot in any sense be considered deterministic, because they are too interconnected. Moreover, what we see of systems is only ever observed from within, so we can't see the full picture. The implication is that what cannot be done predictably cannot be reversed predictably either -- so rollback is a fiction. Is it harmless? No. It becomes an irresponsible claim, like saying that having an insurance policy means you don't have to drive safely.

To believe that whole systems can be changed and un-changed predictably seems disrespectful to system designers. Operating system designers go to extraordinary lengths to implement `critical sections' or monitors in which single threads can be isolated for just long enough to provide transactional determinism to a high degree of approximation. However, no mechanism exists for providing such transactional security at the system level (single user mode is the closest one can get). In reality, a dozen external factors limit and influence the outcome of system changes.
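
As a toy illustration (in Python, not any operating system's actual machinery), a lock gives transactional behaviour to the few lines it guards, but nothing can serialize or reverse the side effects that escape it:

    import threading

    counter = 0
    lock = threading.Lock()

    def worker(n):
        global counter
        for _ in range(n):
            with lock:            # critical section: atomic with respect to
                counter += 1      # the other threads in *this* process
            print("tick")         # side effect outside the lock: already seen
                                  # by the outside world, cannot be rolled back

    threads = [threading.Thread(target=worker, args=(3,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                # reliably 12 -- the protected part is predictable

The protected counter behaves deterministically; everything outside the critical section is at the mercy of scheduling and of observers we do not control.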

In IT management, the term rollback is not used in a strictly correct technical sense. The idea originates in transaction processing, where critical sections enable some degree of `undo' capability for atomic operations. It is easy to show, however, that one cannot roll back an entire system in any meaningful sense: rollback is impossible because we always have incomplete knowledge about exactly what took place over any interval of time.

A reaction to the apparent lack of success in managing change is often to try a brute force `lock down' of the system to prevent any kind of change at all. This generally has one of two consequences (as government users will know only too well): 1) you break your system so that it no longer works correctly, since it no longer has the freedom to operate as intended; 2) a significant portion of the resources are used to monitor and detect unauthorized changes, but there is still no plan for what to do next, should unauthorized changes be discovered.

"We have a plan!" I hear. "We'd do a rollback! Go back to the last known good state."

Alas, this mythical state no longer exists. Let's assume that you've taken a snapshot of the filesystem, so that you know what the disk looked like at some acceptable time in the past. What about the processes? You can't just change their data and expect them to behave predictably. "We'll kill and restart them!" But what about the users who are interacting with them? Kill them too? (Just kidding.) They have already been affected, and we cannot undo those experiences. Indeed, we cannot really undo anything: the exact state before the changes cannot be reproduced, so no matter what ad hoc changes or `reversals' we attempt, the result is a new state, not the old one.

Consider the following:

  • A change causes a hole in security.
  • System is infected by a virus.
  • Restore the original settings from backup.
  • System is still infected by virus.
  • The change causes a failure in program X.
  • We wipe out the system and lose all runtime data since the event.
  • Data have been lost and the complete original state has not been recovered.

The point about trying to do rollback is:
  • You don't get back to where you were.
  • You need to clean up afterwards anyway (which is going forward).
  • The rest of the world has moved on in the meantime. The damage is already done, so you are then out of sync.

If you think you've done a successful rollback before, I maintain that either the change was small and trivial, or you are hiding from the facts. The problems of indeterminism are, of course, magnified by the IT industry's predilection for project-based management -- waiting too long to change things and then lumping big revolutionary changes together into a single deployment. No wonder things go wrong, and one looks for an exit out of sheer fear of messing up.

Belief in rollback is irresponsible. It is at best a delay, a little back-pedalling, before correcting the system to the way it should be. Until such a time as we invent time-travel, we can say that rollback does not exist.

Cfengine's innovation: state teleportation

Recently Alva Couch and I wrote a paper explaining that the basic reasons for a no-go theorem on rollback lie in simple arithmetic. We can go forward with an arbitrary level of predictability, but we can never go back. The way to look at it is to think of addition and multiplication of state (values).

Suppose you start at some state "20" and make a phased plan to get to a new state "0". First you change by -5 and arrive at 15. So far so good. Then, instead of another -5, you make a mistake or unexpected change of -2, landing you in state 13 (a very unlucky state). So you frantically try to roll back by +2 to get back to 15, so that you can continue with the plan of three more -5s leading to 0. (Now the rollback looks faintly absurd -- why not just calculate the difference instead of getting stuck on these -5 operations?)

What Cfengine brought about was the idea of convergence, which arithmetically is the equivalent of multiplication by zero. No matter what number or state you start from, a single application takes you to the desired end state: 20*0 = 0, 19*0 = 0, 13*0 = 0. No matter where you are, you can apply this operation as many times as you like and you only get where you want to go. So it is also idempotent at the end point.

Should you want to change this desired `origin' state, you rearrange your coordinates so that 0 is somewhere else and do the same thing over again. This makes going forward completely predictable, as long as you just keep applying the medicine "*0" as fast as changes occur.
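
For what it's worth, the arithmetic can be written out in a few lines of Python (purely illustrative, not Cfengine code): a relative change is only correct if you know exactly where you started, while a convergent change depends only on where you want to end up.

    def apply_delta(state, delta):
        """A phased, relative change: correct only if the starting state is known."""
        return state + delta

    def converge(state, desired):
        """A convergent change: the outcome depends only on the desired state."""
        return desired            # the `multiply by zero' step -- history is erased

    state = 20
    state = apply_delta(state, -5)    # 15, as planned
    state = apply_delta(state, -2)    # 13: a mistake has crept in
    # `Rolling back' by +2 and resuming the -5 plan only works if we know
    # exactly what went wrong. Converging does not need to know:
    state = converge(state, 0)        # 0, whether we were at 13, 15 or anywhere else
    state = converge(state, 0)        # still 0: idempotent at the end point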

Of course, we still can't go back, because division by zero makes no sense either. Zero wipes out the memory of where it came from. So you can't get back. But who cares? You don't want to go back; after all, you want to go forward to the correct place. Only fear and the lack of a sound method for control cause people to start back-pedalling to no effect.

3. Order of operations is essential

Just as we are programmed to think in hierarchies, so we are programmed to think sequentially. The two are partly related.

Perhaps it stems from thinking as an individual, rather than as a team. How can *I* do everything? I have to do it one thing at a time. Rather than: how can the team accomplish the work? Divide the tasks into independent parts and then coordinate the results.

The general purpose programming languages of the day have also taught us that if you want to accomplish a desired outcome, you have to string together a list of steps, in a particular order, one after the other, in a nice flow diagram.

Parallel programming came only later, and it is a more specialized discipline that few have the same exposure to; so-called Monte Carlo (or non-deterministic) methods of evaluation are known to only a tiny minority -- yet these randomized methods hold great promise for scheduling and solving certain kinds of problems.

Ordering is important only if a later outcome depends on the outcome of an earlier one. This is not always the case, and our intuition is not always good. A file is a sequential structure -- we don't want to fetch its words in a random order -- but disk accesses would be inefficient indeed if every block were fetched in the sequence that a user requested them: the disk head would constantly be thrashing around. It would be like shopping at the supermarket by blindly following the order of the list instead of picking up items as you pass them.

Most tools are unabashedly order-dependent. Cfengine takes a more parallelized view -- but we'll come back to that presently. Puppet, for instance, attempts to resolve dependencies in the order of execution by sorting a deterministic task-tree (make-span). This classical approach assumes the absence (or rejects the presence) of loops in the schedule -- it becomes a partial spanning tree for the actual task graph.

However, in order to sort the task graph in this way, it is necessary to know all the information at the outset. It is a static-world assumption. But for the same reason that we can't be certain of the outcome of changes, we can't be certain that we know everything in the system either, so there is a fair probability that this approach will either miss issues or limit the capability to accomplish tasks.

To use the shopping analogy, once you have a binding, sorted list, you cannot adapt to unexpected changes while running around the store, even if those changes are caused by your own actions. What if one item on the list is out of stock (someone took the last one while you were shopping), or you suddenly need an extra item (like a tin-opener) because of a change in the packaging you were not expecting? If you cannot adapt to this information dynamically, the plan will fail to execute and will need to be recreated with new information. If the information is changing constantly, you will never capture it all, so you need a different `best effort' approach to adapt to the change.

The Cfengine approach tries to solve this pragmatically. Cfengine treats configuration items as far as possible as standalone atoms (called promises), each independently maintainable (keep-able). This is done precisely to avoid such ordering problems, by guaranteeing a convergent and completable task graph. It can support a limited amount of adaptation in real time. This maximizes the likelihood of being able to keep all promises in the tree.
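
A rough sketch of the idea in Python (illustrative only; this is not Cfengine's engine, and the three promises are made up): each promise is a small convergent check-and-repair function, and the agent sweeps over the whole set repeatedly instead of executing a pre-sorted plan, so an item that cannot be kept on one pass does not block the rest, and once everything is kept, re-running changes nothing.

    system = {"package": False, "config": False, "service": False}   # toy state

    def package_installed():
        system["package"] = True          # repair unconditionally
        return True

    def config_file_ok():
        if not system["package"]:         # cannot be kept yet; just report it
            return False
        system["config"] = True
        return True

    def service_running():
        if not system["config"]:
            return False
        system["service"] = True
        return True

    def converge_all(promises, max_passes=5):
        for _ in range(max_passes):
            results = [keep() for keep in promises]   # every promise gets a turn each pass
            if all(results):
                return True               # all kept; further passes change nothing
        return False                      # report what is still unkept, rather than abort

    # Order in the list does not matter: unmet dependencies simply wait a pass.
    print(converge_all([service_running, config_file_ok, package_installed]))  # True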

Splitting up tasks into non-atomic packages is part of what data-normalization is about. It can be a rational way of delegating (or shirking) human responsibility, but it increases process fragility. The more items involved in a process, the more waiting is introduced, and the more links there are that can go awry.

In Cfengine, we avoid the problems of dependency by allowing redundancy in the presence of convergent (idempotent) actions. Repeating the same content (caching) to avoid fragility is safe as long as there is idempotence to avoid a divergence or explosion.

Software packaging systems today build fragility into their design, making it a human problem to manage that fragility. Software tools like Rpath try to get out of this conundrum, while keeping the advantages of the data normalization design.

System administrators tend to over-constrain dependencies -- fearful that the order they expect is not the order that will result. Just because the recipe says to add sugar before eggs doesn't mean you have to buy sugar before eggs. As long as the necessary elements are all in place when the time comes, there is no such restriction.

Postscript

I've picked on three issues at the very core of the industry's belief system, things that I have been arguing against for almost 20 years. Am I saying that Cfengine is the answer to all of system administration? No, but it is obviously designed as a tool that addresses many of its issues. Cfengine is proof that there is another way (that's why I created it). It rejects all three of these beliefs and demonstrably succeeds where other tools fail. Still, all is not rosy -- I hear: "MB, I love your stuff, but I can't wrap my head around it. Tool XYZ is just easier to understand." I take this seriously, but should an entire industry be limited by a lack of understanding?

Always listening, I am starting to understand why others don't understand these points, and I hope to write more on these issues, from different angles, in the future. For now, I challenge you to think it through -- there is indeed a better way, it is not that hard to grasp, and there is no more real world than the one in which we face realities.

Notes

Note 1 Being a little behind the times might seem like an innocent misdemeanour, but it is both harmful to industry progress and demoralizing to students, who learn about new and exciting research only to be forbidden to use it in this self-inflicted "real world".
Note 2 The Java class hierarchy must be a red flag to anyone who does not intuitively see the pitfalls of tree-like classification. Some things do not fall into mutually exclusive categories. Whenever functions did not fit neatly into an existing classification tree, designers simply started another one to compete with it. Pretty soon there were multiple input/output libraries, and we were no better off than with the flat "libC" library of old -- except that function names were suddenly much longer and more bureaucratic.