Infrastructure management timescales

How to avoid wasting your time in infrastructure

This is a very brief sketch of the current timescales we deal with in infrastructure management. Without a proper understanding of time and how it is used, we end up wasting a precious resource as well as looking foolish.

In any human endeavour, we need to understand the relevant timescales for what's going on. (I wrote about this fusion of dynamics and semantics in In Search of Certainty.) There are two main reasons:

  • We need to know how often to look at something to understand it: the sampling rate needed to observe a phenomenon (Nyquist's theorem).
  • We need to know how fast and how often to interact with something when it fails to keep its promises: matching the rate of change to the rate of error correction (Shannon error correction theorem, Burgess Maintenance Theorem).
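
The sampling point can be made concrete with a small sketch. It assumes the phenomenon has a known fastest period of variation, and it uses the usual reading of Nyquist's theorem: sample more than twice per period. The function names are hypothetical and chosen only for illustration.

```python
def nyquist_interval(fastest_period_s: float) -> float:
    """Upper bound on the time between samples for a phenomenon whose
    fastest variation has the given period (two samples per period)."""
    if fastest_period_s <= 0:
        raise ValueError("period must be positive")
    return fastest_period_s / 2.0


def is_undersampled(sample_interval_s: float, fastest_period_s: float) -> bool:
    """True if samples are too far apart to resolve the fastest variation.

    Sampling exactly at the bound is treated as too slow, since the
    theorem asks for strictly more than two samples per period.
    """
    return sample_interval_s >= nyquist_interval(fastest_period_s)


if __name__ == "__main__":
    # Assumed example: a load pattern that repeats every 60 seconds.
    print(nyquist_interval(60.0))        # bound on the sampling interval
    print(is_undersampled(300.0, 60.0))  # a 5-minute poll misses it
    print(is_undersampled(10.0, 60.0))   # a 10-second poll resolves it
```

A five-minute poll of a one-minute phenomenon does not just lose detail; it can report variations that are not there, which is why the sampling rate has to be chosen from the timescale of the thing observed and not from habit.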

Separation of timescales is an indication of the coupling strength in a total system. Good separation indicates weak coupling, which is good for stability. Poor separation implies strong coupling, which means fragility and instability.

It's worth laying out an overview of some main timescales (see figure).

What's in a timescale?

It's not only about performance, it's about stability.

The KT-boundary line (or kernel-thought boundary for non-geologists, see figure) is the line beyond which humans cannot be involved. Our bodies and brains simply don't work fast enough.

Notice how most of the clutter happens around the 1s timescale, where we do our thinking. This is partially coincidental, as many of the rates are increasing. What will be the boot-time of a process, or the speed of a network next year? Notice also, some timescales limit others. The time to boot a system limits the rate at which a new server can be deployed for scaling, for instance.

What are the broad timescales we need to pay attention to? These are determined partly by the speed limits of technology, but mainly by the speed limits of human interaction.

Mismatched timescales can be costly either because we miss important information, or because we dwell too much on a watched metric that never boils (busy waiting). When we plan processes, for comprehension, participation or automation, it makes no sense to try to handle a process at one timescale by a process at another.

It's not just about human inspection either. The whole issue of eventual consistency, or `data consensus' (à la Paxos, Raft etc), also known as data equilibration, arises from a simplistic view of the semantics of dynamical timescales. In the IT industry, we are at best unsophisticated when it comes to understanding relativity.

Matching dynamics as well as semantics

We are quite good at inventing software to fix (semantic) problems, but almost never try to investigate or match the dynamical process that causes the problem (and will cause it again).

State can be held in check by a counterforce. In a stochastic system, that is a statistical equilibrium. The execution time of the remediation has to be matched with the timescale of the changes it corrects, as proven in the Maintenance Theorem.

Understanding timescales allows us to see semantic confusions. The following are classic examples:

  1. Any problem that arises at a timescale T must be maintained at that timescale. Too fast is a waste of resources; too slow leads to significant cumulative error. Slow processes should not hold up fast processes in scheduling. This has been a weakness of configuration systems in the past. Compromises on remediation times for efficiency are in constant flux as the overall speed of systems increases. We may witness how even simple convergent processes have gone from hourly checks to 2.5-minute checks in CFEngine. Now, issues like container management require maintenance on the order of seconds. It is no longer expedient to do this in a monolithic agent; agents can be embedded in hypervisors, filesystems, kernels. This has already led to projects like systemd on the single system level. It could further be applied to the `orchestration' of distributed systems.
  2. The most obvious case is the monitoring alarm. An error due to a process collecting data on a timescale of microseconds is fed to a human, with a pager, to respond to on a timescale of minutes or hours. A single bulk remedy is then applied, and then one waits for it to happen again. (We should fix the microsecond mismatch.)
  3. Virtualization is a dependency of elastic scaling, and elastic scaling is constrained by the time to spin up a virtual instance. The dynamical timescale of elastic scaling (web traffic in, say, milliseconds) is much smaller than that of the average configuration management process (say, minutes), so it makes no sense to try to solve elastic scaling with a traditional `configuration management' process (like CFEngine's agent, Puppet, Chef, etc), even though there is a semantic fit. Dynamics always trump semantics. (Embed something in the process dispatcher, but note that spinning up instances is limited by the boot timescale.) You can still call it configuration management, if you like, but don't try putting it in the same process that handles minutes, hours, and days. We need to split up agents and architect around timescales to win the arms race against speed and resilience.
  4. Similarly, there is no point in `monitoring' (in the traditional sense) anything that happens faster than a human can ingest it (unless of course it is a closed feedback loop). Humans readily build time-transducers to help them consume data speedily, analyze it, and pick out summaries (analytical tools, log parsers and searches). Automation requires humans to step aside when busy stabilization is needed.
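
The trade-off in the first example (too fast wastes resources, too slow accumulates error) can be put into rough numbers. This is only a toy model, not the Maintenance Theorem itself: it assumes errors arrive steadily, one every `mean_time_between_errors_s` seconds, and that each check repairs everything outstanding. The function name and the error rate are hypothetical.

```python
def maintenance_tradeoff(mean_time_between_errors_s: float,
                         check_interval_s: float) -> dict:
    """Cost and backlog of checking every check_interval_s seconds.

    Toy assumptions: errors arrive at a steady rate, and one check
    clears the whole backlog instantly.
    """
    if mean_time_between_errors_s <= 0 or check_interval_s <= 0:
        raise ValueError("timescales must be positive")
    # Errors that pile up between two consecutive checks.
    peak_backlog = check_interval_s / mean_time_between_errors_s
    return {
        "checks_per_hour": 3600.0 / check_interval_s,  # resource cost
        "peak_backlog": peak_backlog,                  # worst-case drift
        "mean_backlog": peak_backlog / 2.0,            # average drift
    }


if __name__ == "__main__":
    # Assumed: one error every 150 s, compared at hourly and 2.5-minute checks.
    for interval_s in (3600.0, 150.0):
        print(interval_s, maintenance_tradeoff(150.0, interval_s))
```

In this model, checking much more often than errors arrive buys almost nothing, while checking much less often lets the backlog grow in proportion to the mismatch. The sensible interval sits near the timescale of the problem, which is the point of example 1.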

Misunderstandings like these are how time bottlenecks arise. It's a basic application of the queueing problem. The service time in a queue should be shorter than the time between task arrivals; otherwise the backlog grows without limit. Timescales tell us when we need to introduce automation to extend our limited faculties.
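
The queueing condition is easy to check numerically. The sketch below compares the service time with the mean time between arrivals and, as an extra assumption not made in the text, uses the textbook M/M/1 formula (random arrivals, random service times, one server) to estimate how long a task spends in the system. Function names are hypothetical.

```python
import math


def utilization(service_time_s: float, interarrival_time_s: float) -> float:
    """Fraction of time the server is busy (rho = service / inter-arrival)."""
    if service_time_s <= 0 or interarrival_time_s <= 0:
        raise ValueError("times must be positive")
    return service_time_s / interarrival_time_s


def is_stable(service_time_s: float, interarrival_time_s: float) -> bool:
    """A queue keeps up only if tasks are served faster than they arrive."""
    return utilization(service_time_s, interarrival_time_s) < 1.0


def mm1_mean_time_in_system(service_time_s: float,
                            interarrival_time_s: float) -> float:
    """Mean time a task waits plus is served, under M/M/1 assumptions:
    W = service_time / (1 - rho). Infinite if the queue cannot keep up."""
    rho = utilization(service_time_s, interarrival_time_s)
    if rho >= 1.0:
        return math.inf
    return service_time_s / (1.0 - rho)


if __name__ == "__main__":
    # Assumed: a human needs 600 s per alarm; alarms arrive every 60 s.
    print(is_stable(600.0, 60.0), mm1_mean_time_in_system(600.0, 60.0))
    # Assumed: an automated handler needs 1 s for the same alarms.
    print(is_stable(1.0, 60.0), mm1_mean_time_in_system(1.0, 60.0))
```

Note how the time in system climbs steeply as utilization approaches one, so a handler that is only slightly faster than the arrivals is still a bottleneck in practice.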

Cost of resolution: choose your equilibria wisely

Working at smaller and smaller timescales costs more in system resources and in human cognition. Our world is speeding up, but if we try to work at a generically faster rate, and solve slow problems in the same way, we will end up busy waiting. The watched server doesn't boil. This is what sleep() was invented for.
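
As a minimal sketch of the sleep() remark: the helper below waits for a slow condition by checking it at an interval matched to its timescale, instead of spinning. The name `wait_until` and its parameters are hypothetical; only `time.sleep` and `time.monotonic` come from the Python standard library.

```python
import time
from typing import Callable, Tuple


def wait_until(condition: Callable[[], bool],
               poll_interval_s: float,
               timeout_s: float) -> Tuple[bool, int]:
    """Check condition() every poll_interval_s seconds until it holds
    or timeout_s has elapsed. Returns (succeeded, number_of_checks)."""
    deadline = time.monotonic() + timeout_s
    checks = 0
    while True:
        checks += 1
        if condition():
            return True, checks
        if time.monotonic() >= deadline:
            return False, checks
        # Yield the CPU between checks instead of busy waiting.
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # Assumed example: something that becomes ready about 0.2 s from now.
    ready_at = time.monotonic() + 0.2
    ok, checks = wait_until(lambda: time.monotonic() >= ready_at,
                            poll_interval_s=0.05, timeout_s=2.0)
    print(ok, checks)
```

With a poll interval near the timescale of the thing being watched, the loop makes a handful of checks; with a poll interval of zero it would make as many as the CPU allows and learn nothing extra.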

Models that make predictions of time are important, because there is always someone to argue that you need to look in more and more detail, just in case!! Equilibria are arms races. We cannot win.

Study of metrics

The industry desperately needs an updated version of the general study of computer metrics for the present decade. Such a study would help us get out of the rut of the majority of contemporary monitoring techniques.

Mon Nov 17 14:45:32 CET 2014