Artificial reasoning about leaky pipes

How monitoring lets us down by shrugging off non-trivial causation

In a series of recent papers, I sketched out a particular way of looking at reasoning, based on the idea that space and time are at the root of all our cognitive and logical processes. It's a simple idea with deep consequences: that reasoning is about the ability to represent aggregate clusters convergently, rather than the traditional one that reasoning is about addressing and searching for nodes in a tree. The results have been implemented (as proof of concept) in my Open Source Cellibrium project. The application of these ideas has many possibilities, especially in cognitive computing. In this commentary, however, I dare to ask why IT monitoring technologies suitable for automated reasoning have failed to emerge and take hold of modern systems. In IT, we try hard to partition and separate networks and examine only the most primitive metrics, making causal reasoning hard; then we try to compensate by collecting `big data' that we brute-force search or apply awkward inference techniques to. This kind of blind aggregation has hindered the evolution of knowledge-based monitoring, and thus we have failed to make advances comparable to those that occurred, in the same timeframe, in medical scanning for instance (shortcut to the main papers).

One of the key problems we face in diagnosing systems of a certain scale is that fault finding is challenging to the point of stretching the mental faculties of humans. I have long imagined using artificial semantic reasoning (what some still like to call a form of root cause analysis) to perform such diagnostics, and this recent work makes a few breakthroughs in identifying a general approach. I find the role of space and time to be particularly interesting. When the evolution of human reasoning began, there was only space and time to bootstrap from, so if we are looking for so-called root causes for anything, these would seem to be suspicious candidates. In technology, reasoning is often attributed an almost mystical significance, due to its connection to mind and the human condition. Some will always reject any attempt to reduce it to functional explanations, let alone a root cause, preferring to elevate humanity above machinery at all costs; but I find no contradiction in these goals, as long as they are approached with diligence and humility.

Artificial reasoning is probably controversial, not least the idea that it is related to space and time at a fundamental level, but if we think about what options exist for tracing concepts and relationships all the way to the most basic or elementary phenomena in the universe, space and time (location and order) can't really be surpassed. I have written about that in The Semantic Spacetime Project.

As IT systems have developed, monitoring of their conditions and objectives has remained almost unchanged. It has clung to the most basic measurement of low-level performance counters (and only the smallest number of systems even do that right). Any causal relationship between software and these metrics is highly speculative, even with PhD-level statistical methods at the monitoring vendors' disposal. Monitoring practically rejects semantics (what systems intend and what outputs mean), preferring the easier problem of collecting any and all data at all costs, leaving users to try desperately to reconnect the unlabelled measures to the problems they face, without any assistance from the technology. We can blame the `big data' craze, driven by storage manufacturers, for this. However, we can do better by instrumenting software properly in the modern world.

The search for meaning in information and phenomena

Diagnosis of faults and conditions is one of the hallowed problems to solve in AI, because it has immediate application to the understanding of systems, or collective distributed intent. It is difficult because intent is not universal, and the definition of a `fault' or a condition is both subjective and relativistic. The systems we describe are usually artificial (intentional) designs: we can try to apply diagnosis to more directly `natural' phenomena too, but this is often more speculative, because nature has no obvious intent, other than that which emerges through us (again pointing to the subjectivity). Nonetheless, the situation is not hopeless. There are ways of addressing these challenges. We can decompose nature, and then project it into a pattern of our own intentions (e.g. we can simply claim the viewpoint that illness and disease are `faults' in the biology of humans, to be repaired or avoided, even while biology looks on, from its own perspective, disappointed that we didn't like its cool new experiment). This is all consistent and fine; however, it brings up an interesting point which, although many have written about it over the years, is still widely ignored in the search for simple deterministic answers.

Because Plain Old Monitoring is a long-standing cash cow, monitoring systems have, with few exceptions (that receive little mainstream attention), evolved mainly aesthetically. They still measure the same low-level kernel performance measures, like load average, CPU percentage, and memory usage, that were relevant in the age of singular timesharing systems; or they collect log messages from disparate sources, with random formats, that mainly babble private state-change garbage like a Cylon hybrid. No one can really know the meaning of these data without context that originates from the source. Yet, year after year, papers get written about statistical inference from these basic lifesigns, as if there were a magical solution. It makes challenging problems for PhD students, but the approaches could easily be outperformed by proper instrumentation that carries semantics. I wonder why a standard embedded library, or even smarter programming languages, have not emerged to this end. As we wrap software in increasingly sophisticated sensory-deprivation suits, it would be a great idea to re-equip it with channels for connecting and reporting useful state, matching the scale and sophistication of the systems we are now building. Particularly as we move away from shared-resource systems, past containers, to unikernel microthreaded systems (actually `pure' autonomous agents in a promise theory sense), why is there no standard logging library to output semantically meaningful information about running software?
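What such an embedded semantic logging channel might look like can be sketched in a few lines (a hypothetical illustration; the function and field names are mine, not an existing library or the Cellibrium API):

```python
import json
import time

def semantic_log(who, what, why, intent, outcome, **context):
    """Emit a structured event that carries its own semantics,
    instead of an unlabelled metric or free-text log line."""
    event = {
        "time": time.time(),
        "who": who,          # the agent/component reporting
        "what": what,        # the action or state observed
        "why": why,          # the intent behind the action
        "intent": intent,    # the promised outcome
        "outcome": outcome,  # what actually happened
        "context": context,  # any scoped, named extra detail
    }
    return json.dumps(event)

# A component reports a failure with its intent attached, so a
# reasoning system need not guess what the numbers were supposed to mean.
line = semantic_log(
    who="payment-service",
    what="db_write",
    why="persist order 1234",
    intent="write succeeds",
    outcome="timeout after 5s",
    retries=3,
)
```

The point of the sketch is only that each event promises its own context (who, what, why), so later aggregation can reason over labelled semantics rather than unlabelled counters.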

We can observe and characterize states and conditions in systems, at different scales, by aggregated observation (learning) over different locations and times. Humans observe simple mechanical causation, and we are used to being `hands on' and `causing' things to happen as extensions of ourselves. Thus, when we see something that we were not directly responsible for, we want to know how and why it happened. Indeed, we often break the world down into clues (tuples) based on: who, what, how, when, where, why. For example: `Professor Plum was murdered by Ms Scarlet in the library, with a bread knife, because she refused to marry him'. In this event, we can identify:

Who: { Professor Plum, Ms Scarlet }
How: by bread-knife
Where: The library (a specific library)
Why: Because Ms Scarlet refused to marry Professor Plum

This kind of reasoning has a natural arrow, along the axis of increasing information, and a level of detail that is quite impossible to infer from a system kernel's pulse. This is an interesting observation, because tradition prefers the opposite, i.e. the divide-and-conquer approach to explanation: breaking compound things down into smaller parts that are easier to keep in mind in one go---hence the prevalence of reductionism as a strategy for understanding. So our first question is: what is explanation? Is it breaking something down, or putting it back together? Polar opinions about the evil of reductionism notwithstanding (usually because of politicized misrepresentations of `complexity'), there is a deeper point, which is about scale. In any system of parts, we need to understand which parts (agents) at which scales influence other parts (agents), and which are not directly coupled. We are extremely poor at this kind of understanding in IT (see the discussion of scale in In Search of Certainty).
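The Cluedo tuple can be made concrete: if the event is stored as a network of semantic roles, then any particular telling of the story is just one serialization of it (a toy sketch; the representation is mine, for illustration):

```python
# A compound event decomposed into semantic roles (the Cluedo tuple).
event = {
    "who":   "Professor Plum, Ms Scarlet",
    "how":   "bread-knife",
    "where": "the library",
    "why":   "Ms Scarlet refused to marry Professor Plum",
}

def tell(event, order):
    """Serialize the same event network into one linear story.
    Different role orders give different, equally valid tellings."""
    return "; ".join(f"{role}: {event[role]}" for role in order)

story_a = tell(event, ["who", "where", "how", "why"])
story_b = tell(event, ["why", "who", "how", "where"])
# Same information; two different serializations of one relationship network.
```

The two stories contain identical parts in different orders, which is the sense in which a story is a flattening of a network rather than the network itself.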

Anomalies and explanations

Anomaly detection (figuring out when something unusual or unexpected is afoot) is already an ambiguous problem, because to know what is anomalous, we have to learn what is normal, by aggregating `ensemble' evidence over a spacetime region. This was the thinking behind simple anomaly discrimination for distributed systems, going back decades.
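The idea of learning what is normal by ensemble aggregation, and only then flagging deviations, can be sketched minimally as follows (a toy illustration using a running mean and variance, not the actual CFEngine/Cellibrium implementation):

```python
import math

class NormalLearner:
    """Learn what is `normal' for a metric by aggregating an ensemble
    of observations, then flag large deviations as anomalous."""

    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations (Welford's method)
        self.threshold = threshold

    def observe(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomalous(self, x):
        if self.n < 2:
            return False       # no notion of `normal' has been learned yet
        sigma = math.sqrt(self.m2 / (self.n - 1))
        return sigma > 0 and abs(x - self.mean) > self.threshold * sigma

learner = NormalLearner()
for value in [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]:
    learner.observe(value)

learner.is_anomalous(10)   # -> False: within the learned band
learner.is_anomalous(50)   # -> True: far outside it
```

Note the dependency the text describes: the detector can say nothing about anomaly until it has aggregated enough evidence to define normal.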

In the Cluedo example above, if we consider the murder to be an anomalous or unintended state of the system, then what is the root cause of the murder? One answer is that it is the refusal to marry. But the proposal was given by Professor Plum, so is he the root cause of his own murder (death by misadventure)? Or was Ms Scarlet wearing inappropriate clothing (cause attributed to clothing, or to male chauvinism)? Of course, we can go on and on. The only objective answer is that causal explanation is storytelling: it preserves a natural order, in which events are partially ordered by certain prerequisites, but reserves the right to pick the cherries that look appealing. Causation and causal explanation are two different things: explanation or reasoning is a serialization of what is generally a deeply parallel phenomenon. In spite of our best attempts to find uniqueness, stories are nothing but flattened spanning trees over networks of relationships, collapsed into some non-unique grammatical ordering. So, if the description of an event is so ambiguous, can there be any intrinsic meaning to events? Clearly not. What about a weaker claim, of possible meaning relative to a particular viewpoint and context? The latter seems more doable, but it is not the kind of admission computer science usually owns up to, due to its post-war affectation for binary states and deterministic models.

If we start from any system, observing it through some kinds of sensors, what missing information do we have to add to the raw data to create meaning from it? Usually meaning (as we understand it, through concepts) takes a long time to emerge or self-organize, because autonomously generated meaning is a cumulative phenomenon, based on the long-term evolution of a contextual network (see The structure of functional knowledge...). However, meaning by intent is much easier to ascertain. Context is about environment, about how things come together in space and time over many interactions, and how memory encodes a projection of those experiences (see a summary); but what some fail to mention is that networks don't have to be explored in realtime: systems with memory can cache distributed network information locally and process it at their convenience (asynchronously); thus the process of discovery and learning accumulates knowledge banks, allowing individual (yes, even temporarily autonomous) agents to be creative with their own self-contained resources, stored up for a rainy day. If you don't believe me, try writing down your dreams: our brains tell quite creative and very different stories when unshackled from realtime sensory constraints.

The source of imagination: to create is 'cumulation, to reason divide

What is the relationship between innovation and semantic networks? Innovation is about combinatoric mixing (technically, the formation of unifying hubs that bring together impulses from around the network of events). Memory provides that singular cauldron for bringing ingredients together into a conceptual primordial soup. Fragments of ideas, like genes, recombine into new ideas, either through direct realtime experience, or later by remembering over time. They are then developed (cooked, or gestated, if you will) over time and selected or rejected on taste.

[ Footnote: Managers and business clichémongers, take note: teams don't have to brainstorm in front of a blackboard. Introverts will out-perform your whiteboard teams by compressing what they have already read and experienced into a thought process, without further talking, discussion, and calibration of interpretations. Memory has precisely the function of enabling introversion in systems.]

Conceptual aggregation (not statistical inference) is the opposite of the causal discrimination that algorithmic logical methods emphasize. It seems like pure heresy against the deterministic philosophy of the Enlightenment to suggest that logic (based on the precise discrimination of pathways, by if-then-else), through some landscape of axioms, concepts, and events, is not the only `correct' way to think. But this is because we have long muddled the ideas of storytelling (of which logic is a disciplined form) and creative reasoning (which is largely about the selection of experimental hypotheses formed by `genetic' recombination of fragments collected by aggregation---the opposite of logical classification).

When we tell a story, we parse a network of concepts by trying to lay out a spanning tree in a serial, sequential linguistic stream. The inner structures of language and grammar are telling remnants of the non-unique, hierarchical ways of constructing different spanning trees for the same network. But if (as in logic) we focus only on the telling of a particular kind of story, not on how it comes about in the first place, we ignore the constructive processes of learning and aggregation, which must take place before such a story can be told. We have to bake the cake, and try it, before writing the story of its origin. The same is true of fault diagnosis and root cause stories.

Finding invariant (stable) concepts

Imagination is only the beginning of innovation, as we know from biology. Once idea genes are combined, there is a longer process of gestation in which we isolate what can be stabilized in the system from further recombination, in order to test the outcome. This is how an anomaly becomes normalized to become compatible with context. The same happens in learning. The key at every stage of exploration is aggregation not discrimination: learning is temporal aggregation; normalization is ensemble aggregation over space (frequentist) or time (Bayesian). Innovation is combinatoric specialization by accumulating attributes and nearby context, not by branching to a singular answer. Efficient memory is a search for stability or semantic invariance, which in turn relates directly to spacetime invariance.

Invariances lead to efficient storage (keeping data small, because `big data' are a useless burden to realtime cognitive processes). They also bring certainty in routing/addressing, and harmless idempotent overwriting that negates the need for pointless searches for what is already stored. Storing events as endless timeseries of every possible noise fluctuation is not a form of knowledge; it's just a sensor stream, like keeping CCTV---all the work of extracting invariant knowledge is yet to be done. We extract knowledge by generalizing or classifying into `coarse grained sets' in order to compress data. Invariance comes from semantic averaging, based on evolved criteria, in which variations are coarse-grained into blunt contexts or patterns that match many cases, generically. When we combine data into coarser grains

A + B -> (combination)

the amount of information has to increase (A and B come together; singulars become paired), but it can now be compressed into a new symbol. So semantics, learning, and all those compressed, tokenized forms of spacetime events progress along a directional gradient of increasing information. This is another way of couching the second law of thermodynamics, if we associate entropy (in Shannon's sense) with increasing (rather than decreasing) information.
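A toy illustration of this coarse-graining step (my own example, in the spirit of pair-compression schemes): fusing the most frequent adjacent pair of symbols into a single new symbol enlarges the alphabet, yet shortens the stream:

```python
from collections import Counter

def combine_most_frequent_pair(symbols, new_symbol):
    """Coarse-grain a stream: fuse the most frequent adjacent pair
    (A, B) into one new symbol, shortening the description."""
    pairs = Counter(zip(symbols, symbols[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
            out.append(new_symbol)   # A + B -> (combination)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out, (a, b)

stream = list("abababcab")
coarse, pair = combine_most_frequent_pair(stream, "X")
# The alphabet grows (a new symbol carries more information),
# but the stream itself is compressed: 9 symbols become 5.
```

Here the new symbol "X" plays the role of the compressed token: more information per symbol, fewer symbols overall.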

Note added: When scientists and technologists talk about entropy and information, it can be confusing, because physicists and computer scientists define information differently. Information in physics is usually counted implicitly in terms of the uniqueness of an experiment within an ensemble of experiments, so it measures the uniformity of statistical patterns across many episodes (external time). Information in Shannon's theory of communication is about the uniformity of states within a single episode, thus it measures the uniformity of patterns within a single experiment (over internal time). This is why information scientists would say that information is entropy, and why physicists say that it is the opposite of entropy!
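The distinction can be caricatured in code by applying the same entropy formula across an ensemble of experiments versus within a single experimental record (a deliberately crude illustration, not a rigorous statement of either definition):

```python
import math
from collections import Counter

def shannon_entropy(outcomes):
    """Shannon entropy (bits) of the empirical distribution
    over a sequence of outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# `Physics' counting: entropy across an ensemble of experiments,
# each summarized by its final outcome (external time).
ensemble = ["heads", "tails", "heads", "tails"]

# `Shannon' counting: entropy of the pattern of states within a
# single episode, i.e. one long experimental record (internal time).
episode = list("aaaaaaab")

shannon_entropy(ensemble)   # 1.0 bit: a maximally mixed ensemble
shannon_entropy(episode)    # lower: the episode is highly uniform internally
```

The formula is the same in both cases; what differs is whether the counting runs over many episodes or within one, which is the source of the confusion the note describes.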

We can only know what is normal, and thence what is anomalous, by aggregating, i.e. by accumulated learning, and then by purging unwanted semantic instances to distill invariants. In papers I-III, I argue that these are all derivative of simple spacetime interpretations.

A quiet belief in root causes

The search for a `root cause' (i.e. the idea that every `interesting' event had a single cause that preceded it) is often considered to be a form of meaning all by itself. It is an artefact of storytelling. When we design narrow-purpose `machines', usually no other meaning is needed than tracing their gears and levers back to some faulty thinking, because the functioning of the machine has already been designed or evolved into a particular meaning, based on its own particular context. If the machine is generic and reusable, that context has already been compressed into an invariant form, stripped of particular experiences, episodes, and experimental instances, coarse-grained into `classes' and `categories'. Such functional meaning is already a selective summary of evolutionary context over long timescales---as mentioned in paper III.

So why don't machines always simply do as we intend? Any backward search for meaning by looking at anomalous output is a discriminant process. We try to break something down into `smaller' (more elementary) contributory parts; yet this is partly misguided, because explanation does not refer to spacetime scales and how parts of different scale interact (intentionally or unintentionally): one could flip the entire picture around and view the root cause as the original holistic intent to build the entire machine or system in the first place. A concept realized in many parts. What the components do is constrained by the semantics of their specific combinatoric arrangements. Configurations are intended to match intents. So surely that intent is in some sense responsible for the outcomes. If we interpret an `interesting event' as a failure, then the notion of root cause equally suggests that the design could simply be flawed, in some sense, as much as it points to a particular link in the artificial spanning story-chain being unsuitable, or failing to keep a particular promise. In promise theory terms, a system design is an imposition. The system promises what its components and their arrangements promise; the imposition of a design might not match this reality.

Leakage of intent

The leakage of influence or intent is at the heart of what we are looking for in root cause analysis. While some (statisticians, singularly focused on inference) want to reject the notion of cause altogether, sabotage remains the obvious case in which cause is undeniably attributed to intent, as long as you don't spend your life thinking in reverse. The counterpoint to `no such thing as causation' is summarized by this tweet:

While this statement is taken entirely out of context (with thanks to Yaneer for permission), and is perfectly true in any context, it is also a tautology. Where things are strongly coupled, they are hard to separate: there is too much causation, mixing, and muddling. But that does not mean there are no conditions or regimes in which things can be separated. Indeed, we could not give names to the parts if we could not separate them conceptually.

Leakage of memories, from parallel models into one another, is an unexpected source of causation in which real or artificial reasoning can help to find causal triggers (prerequisites). (Pinch yourself here, if you expect complete certainty of such a method, but continue reading when you are ready to accept and make the best of the uncertainties one inevitably faces.) Leakage between causal networks happens essentially because of what we call `namespace conflicts' in IT. We muddle concepts that are superficially similar because they are not properly separated into different `autonomous' scopes. An example I use a lot is this: anomalous network traffic is measured by the network host of a sushi restaurant called Jaws Sushi. This happens because there is a film festival in town featuring the movie Jaws, and people are accidentally clicking on the wrong link. The only connection between the restaurant and the film festival is a partial name, i.e. a misunderstanding. The two systems are only causally connected by sharing a common infrastructure. There is a leaky pipe. No kernel performance event could untangle this kind of semantic connection, but source semantics from a modern embedded logging system could, as proposed in Cellibrium.
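A minimal sketch of the namespace point (event names and scopes invented for illustration, not Cellibrium's actual format): with unscoped names, the restaurant's and the festival's traffic aggregate into one muddled `jaws' signal; with an explicit scope per autonomous agent, they separate cleanly:

```python
from collections import Counter

# Unscoped names: two unrelated systems collide on the shared token
# `jaws', via a common infrastructure (the leaky pipe).
unscoped_events = ["jaws:page_hit", "jaws:page_hit", "jaws:ticket_sale"]

# Scoped names: each event carries its autonomous namespace, so
# aggregation no longer muddles the restaurant with the film festival.
scoped_events = [
    ("sushi-restaurant", "jaws", "page_hit"),
    ("film-festival",    "jaws", "page_hit"),
    ("film-festival",    "jaws", "ticket_sale"),
]

muddled = Counter(unscoped_events)                  # one blurred `jaws' signal
separated = Counter(e[0] for e in scoped_events)    # counts per namespace
```

The anomaly in the restaurant's traffic only becomes explicable once the events are attributed to their proper scopes; no amount of counting the unscoped stream can recover that separation.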

In developer culture, this latter condition might be derisively dismissed as a `bug' that simply needs fixing and then sweeping under the rug, rather than documenting and understanding. We learn from mistakes in real life. There is no need to cover tracks, or to force people to do brute-force searches through version control (that few can understand) to diagnose such mistakes. Compressed stories can capture these events. What's more, they cannot realistically be avoided. All the wishful thinking in the world, all the isolation and containerization, cannot prevent Humpty Dumpty's yolk from getting into all the fragments. Systems are connected by definition. Indeed, the very fact that we are looking for root causes is a sign that modern IT abhors aggregation and celebrates limitation of scope, as it tries to engineer determinism through isolation of causal channels. But learning and reasoning are two different things. Problem solving is about connecting together parts, by many different criteria and spacetime pathways, so we need to get used to leakage and be ready to deal with it, not just shame it by name-calling.

My recent work on semantic spacetime, and the proof-of-concept Cellibrium toys, make the point that the simplest notions of space and time lie at the heart of systemic knowledge. Very few concepts are needed to make such a semantic system generic. The modularization fetish IT celebrates is dishonest if it ties our hands in diagnosing systems at the scale of practical deployments, for the satisfaction of developer aesthetics. If we want to find diagnostic causes in IT systems, we have to apply methods of discovery that can cross over the artificial boundaries between components.

As we separate IT systems more and more into virtual boxes, containers, and even functional wrappers, we pretend that they are independent, and can fail to see the bigger picture with almost wilful negligence. Every time a new boundary layer is introduced, it alters the causal structure of the system, and the conceptual framework for describing it. Monitoring that tries to ignore all this continues to be sold on the colour of its eye-candy, but is practically useless for diagnosing issues without the ingenuity of experts completely dedicated to understanding causation in the system. If we are going to apply AI and artificial reasoning to diagnosis, we need to understand how systems really work. There is no need to lose sight of a holistic view, if we only instrument systems properly and aggregate data, by giving systems a calibrator brain (not necessarily for control, but for understanding).

MB Oslo Tue Aug 29 14:15:24 CEST 2017