Core Thesis - The promise of agentic AI rests on an unexamined assumption: that sufficient context yields understanding. A century of research in epistemology, cognitive science, and organisational theory converges on why this assumption fails. Knowledge requires conjecture, not just accumulation; recognition, not just retrieval; theory, not just data. The agent is the instrument, not the intelligence. The question was never "How do we give agents enough context?" The question was always "How do we develop the understanding that makes context useful?" I trace the implications of this reframing and close with five testable predictions for how agentic ecosystems will evolve.

I. The Seduction

The enthusiasm is understandable. Claude Code ships, and suddenly developers are delegating entire coding workflows to an agent that navigates their codebase, proposes changes, and iterates based on feedback. Context windows expand to hundreds of thousands of tokens. Retrieval-augmented generation matures. Agent memory systems proliferate. The vision crystallises: give agents enough context (decision traces, organisational knowledge, workflow histories) and competence emerges.

The architecture takes various names. Context graphs. Knowledge graphs. Agent memory. The underlying thesis is consistent: capture organisational knowledge into a structured representation, let agents query it, and watch them become effective participants in organisational life. The flywheel version sounds particularly compelling: a self-reinforcing loop where agents learn from the organisation's accumulated decisions, make new decisions that get captured, and thereby enrich the very knowledge base that guides them. Enough rotations, and the agent bootstraps its way to competence.

The hyper-scale question lurking beneath: if we capture enough decision traces, can we replace employees with agents?

Here is the first rest stop on our journey:

The promise contains its own refutation. If understanding could be captured, it would already have been. Organisations have been capturing decisions for decades ~ in documents, databases, process maps, institutional memory. The understanding didn’t emerge. What makes us think adding an LLM to the retrieval layer changes the fundamental problem?

The answer, I will argue, is that we have misunderstood what kind of problem this is. We have mistaken an epistemological problem for an engineering problem. And decades of research spanning philosophy of science, cognitive psychology, organisational theory, and systems evolution all converge on why the knowledge-graph thesis cannot work as imagined.


II. The Hidden Epistemology

Every engineering approach rests on a theory of knowledge, usually unexamined. The knowledge-graph thesis rests on this one:

Data accumulation → Pattern extraction → Knowledge → Competent action

This is inductivism wearing an engineering costume. The assumption that enough observations, properly compressed, yield understanding. That knowledge is what you get when you successfully summarise experience.

The assumption appears everywhere in the current discourse:

  • Context graphs (with LLMs): Capture decision traces, extract patterns, agent learns organisational wisdom

  • Agent memory: Store experiences, retrieve relevant ones, act wisely in new situations

  • RAG pipelines: Embed knowledge, retrieve semantically similar chunks, generate correct responses

In each case, the arrow points from data to understanding. Accumulate enough of the right observations, organise them properly (or simply feed them to an approximate retrieval engine ~ that is, an LLM), and knowledge crystallises.

Our own empirical work reveals cracks in this assumption (Physics of AI; see References). Across three generative AI case studies, we found that stakeholder requirements consistently failed to align with evaluation metrics. Automated measures (e.g., similarity scores, precision, recall) served as proxies that never quite captured whether requirements were actually achieved. A high embedding similarity between two items doesn’t guarantee they are equivalent in meaning or usefulness. It only indicates the model thinks they’re close.

This is the inductivist gap in miniature. More data, better retrieval, higher scores on automated metrics; yet still no guarantee of understanding. The gap between correlation and comprehension persists regardless of how much context you provide.
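To make the gap concrete, here is a minimal sketch, assuming the sentence-transformers library and a hypothetical source/answer pair (neither appears in our studies): embedding similarity between two near-identical statements is typically very high, yet it says nothing about whether the operationally relevant detail survived.

```python
# A minimal sketch of the proxy-metric gap, assuming the sentence-transformers
# library and a hypothetical source/answer pair; model and numbers are illustrative.
import re
from sentence_transformers import SentenceTransformer, util

source    = "Refunds are processed within 30 days of the request."
generated = "Refunds are processed within 3 days of the request."

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([source, generated])
print(f"embedding similarity: {util.cos_sim(emb[0], emb[1]).item():.3f}")  # typically close to 1.0

# Checking whether the answer is faithful to the source requires a theory of
# what matters: here, the number of days. The similarity score cannot supply it.
days = lambda text: int(re.search(r"(\d+)\s+days", text).group(1))
print("faithful to source:", days(generated) == days(source))  # False
```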

Retrieval is not recognition. An agent can fetch every decision trace in the organisation’s history and still not know which ones matter for the situation at hand.

III. Conjecture Precedes Data

The philosopher Karl Popper spent his career dismantling inductivism. His central argument: you cannot derive explanations from observations. The inference from particular observations to general theories is logically invalid—the famous problem of induction. But more than that, the very act of observation is theory-laden. What counts as relevant data, what patterns are worth noticing, what constitutes a meaningful regularity—all of this depends on prior theoretical commitments.

Knowledge, for Popper, grows through conjecture and refutation, not accumulation and compression. We propose bold theories that go beyond our evidence, then subject them to criticism. The theories that survive are not proven true but merely not-yet-falsified. Understanding is always provisional, always revisable, and always dependent on theoretical frameworks that precede the data.

W. Edwards Deming, the quality theorist who transformed Japanese manufacturing, arrived at the same insight from the factory floor: “Without theory, experience teaches nothing”. You can watch a production line for years and learn nothing if you don’t have a framework for interpreting what you see. The data is not self-interpreting. Theory tells you what to look for, what counts as signal versus noise, what variations matter.

The context-is-all-we-need thesis inverts this relationship. It assumes you can compress your way to understanding. That sufficient data, properly organised, generates the theory needed to interpret it. But compression is selection, and selection requires criteria, and criteria require theory. The theory must come first.

The assumption runs backward. We imagine that capturing enough decisions will yield the wisdom to make them. But wisdom is what tells you which decisions are worth capturing. The graph is downstream of the judgment, not its source.

My own empirical research on software evolution revealed something that illuminates this point. Conventional wisdom holds that robust systems should depend on stable components. Build on proven foundations; avoid volatile dependencies. This principle appears everywhere in software engineering best practices.

It appears to be wrong.

Analysing hundreds of open-source systems across hundreds of releases, I showed that the most depended-upon components, that is, those with the highest fan-in, were not the most stable but the most frequently modified. The very act of depending on a component increases the pressure for that component to change.

This is the stability paradox: viability requires coupling, but coupling creates instability.

This finding has been subsequently validated across multiple package ecosystems. The “stable foundations” that inductive reasoning assumes, the bedrock of established patterns and proven decisions, just do not exist in the form imagined. The system is not a static structure to be captured but a dynamic process that must be theorised.
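A sketch of the kind of analysis behind this claim is below. The data frame and column names are hypothetical stand-ins, not the original study's pipeline; the point is only that relating fan-in to change rate is a small computation once the release history is tabulated.

```python
# Sketch: relate how depended-upon a component is (fan-in) to how often it
# changes across releases. Data and column names are illustrative placeholders.
import pandas as pd

# One row per (component, release): fan-in and whether it was modified.
releases = pd.DataFrame({
    "component": ["core", "core", "core", "util", "util", "util", "ui", "ui", "ui"],
    "release":   [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "fan_in":    [40, 44, 47, 12, 12, 13, 2, 2, 2],
    "modified":  [1, 1, 1, 1, 0, 1, 0, 0, 1],
})

per_component = releases.groupby("component").agg(
    mean_fan_in=("fan_in", "mean"),
    change_rate=("modified", "mean"),   # fraction of releases in which it changed
)

# If the conventional wisdom held, high fan-in would correlate with a LOW
# change rate. The finding reported above is the opposite sign.
print(per_component)
print("correlation:", per_component["mean_fan_in"].corr(per_component["change_rate"]))
```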


IV. Knowledge as Causal Power

David Deutsch, the physicist and philosopher, extends Popper’s insights in a direction particularly relevant to our question. For Deutsch, knowledge is not merely information that correlates with reality. Knowledge is information with causal power ~ information that, when instantiated in a physical system, tends to cause itself to remain instantiated.

Consider the difference between a random string of DNA and a gene. Both are information. But the gene has causal power: it causes proteins to be produced, which cause cellular processes to unfold, which cause the organism to survive and reproduce, which causes the gene to be copied into the next generation. The gene persists because it does something in the world.

Explanatory knowledge has reach. It applies beyond the situations in which it was discovered because it captures something about the causal structure of reality, not merely surface correlations. Newton’s laws were discovered by observing falling apples and planetary orbits, but they reach to situations Newton never imagined because they capture genuine causal relationships.

This distinction illuminates two kinds of “learning”:

  • Inductive accumulation: More data → Better correlations → Higher benchmark scores

  • Explanatory growth: Conjecture → Criticism → Refined theory → Reach to new domains

Large language models excel at the first. They find statistical regularities in vast corpora and use those regularities to generate plausible continuations. This is genuinely useful. But it is not the second kind of learning. The current architecture of LLMs does not form theories about why the regularities hold, does not subject those theories to criticism, does not refine them through the pressure of counter-evidence.

The context-is-enough failure mode becomes clear through this lens. Decision traces capture what happened, not why it worked. An agent can retrieve “last time we had a revenue dip, we cut the sales team” without understanding whether that dip was signal or noise, whether the current situation is structurally similar, or whether the original decision was satisficing under arbitrary constraints rather than optimal under careful analysis.

Correlation is not cause. An agent trained on organisational decisions learns what the organisation did, not what it should have done. Every dysfunction, every reaction to noise, every garbage-can collision gets encoded with the same fidelity as genuine insight.

Note: More on the garbage-can shortly ~ just hang in there.


V. The Observer’s Cage

Stephen Wolfram’s recent work on observers adds another dimension. Any observer is computationally bounded. The observer doesn’t passively receive reality; the observer’s computational structure determines what distinctions can be made, what patterns can be recognised, what features of the environment become salient.

There is no “view from nowhere”. Every observation is from a position, shaped by the observer’s perceptual and computational constraints. What counts as “signal” versus “noise” is not an objective property of the world but a function of the observer’s capacity for distinction.

An LLM-based agent is a particular kind of observer with a particular computational structure:

  • Next-token prediction as its fundamental operation

  • Attention over context as its perceptual field

  • Statistical regularities as its mode of “understanding”

  • No causal model, only correlational patterns

  • Reasoning confined to the patterns above (or to its prior knowledge)

This structure determines what the agent can “see.” It can notice that certain words tend to follow other words. It can recognise that certain patterns of text tend to co-occur. It cannot see the causal structure that makes those patterns meaningful. No amount of context expansion changes the kind of observer it is.

Deming’s insight complements Wolfram’s: “The system cannot understand itself from inside”.

This is not merely an empirical observation but a structural limitation. An agent embedded in an organisation inherits the organisation’s blindspots. It sees what the organisation’s data practices make visible. It cannot see what the organisation cannot articulate about itself.

Consider: if an organisation systematically fails to document why decisions were made, an agent querying that organisation’s records will systematically lack access to decision rationales. If an organisation’s culture discourages acknowledging failures, the decision traces will be sanitised. If the organisation’s metrics measure the wrong things, the agent’s optimisation will be misaligned in exactly the same ways.

An agent cannot see around corners its training data doesn’t illuminate. The blindspots don’t appear as gaps; rather, they appear as confident responses drawn from partial information. The most dangerous failure mode is not ignorance but inherited bias wearing the mask of comprehensive retrieval.

In my empirical work on software evolution, I found that systems maintain organisational identity not through component-level stability but through distributional homeostasis: the preservation of characteristic statistical profiles even as individual components churn. Consecutive releases are statistically similar (distributionally speaking) for systems undergoing normal evolution, dropping only during genuine architectural shifts.

The identity is in the relations, not the components. An observer that only sees components (spanning data points, decision traces, individual records) misses the organisational pattern that makes them coherent. The pattern is not in any single observation but in the distribution across observations. And that distribution is precisely what gets lost when you extract and store individual decisions, especially when the capture was done without any deep theory.
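A sketch of what checking distributional homeostasis might look like, using a Jensen-Shannon distance over a histogram of a per-component size metric. The metric, synthetic data, and interpretation threshold are illustrative assumptions, not the measures used in the original study.

```python
# Sketch: compare the DISTRIBUTION of a per-component metric across consecutive
# releases rather than tracking individual components. Data are synthetic.
import numpy as np
from scipy.spatial.distance import jensenshannon

def profile(values, bins):
    """Histogram of a per-component metric, normalised to a probability profile."""
    hist, _ = np.histogram(values, bins=bins)
    return hist / hist.sum()

bins = np.arange(0, 110, 10)
release_n  = np.random.default_rng(0).poisson(20, size=400)   # e.g. methods per class
release_n1 = np.random.default_rng(1).poisson(21, size=430)   # components churn, shape barely moves

distance = jensenshannon(profile(release_n, bins), profile(release_n1, bins))
print(f"Jensen-Shannon distance between releases: {distance:.3f}")
# Consistently low values indicate identity preserved in the distribution;
# a spike would flag a genuine architectural shift.
```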


VI. Recognition Is Not Retrieval

Gary Klein, the cognitive psychologist who pioneered Naturalistic Decision Making, studied how experts actually make decisions in the real world. Not in laboratories, but in burning buildings, emergency rooms, and military command centres.

His finding overturns classical decision theory. Experts don’t follow the textbook model: list options, weigh pros and cons, calculate expected utility, choose optimally. Instead, they engage in what Klein calls Recognition-Primed Decision Making: see the situation, match it to something encountered before, mentally simulate the obvious response, execute if it feels right, adjust if it doesn’t. No comparison of options. No weighing of pros and cons. Experts recognise what kind of situation this is and act on the typical response for that type, automatically, drawing on thousands of hours of feedback-rich experience.

Recent research in emergency medicine confirms and extends this finding. Studies of physicians’ clinical reasoning show that expert diagnosticians differ from novices not by superior reasoning ability but by a deep base of experiential, case-based knowledge (an advantage that persists even after accounting for fluid intelligence, which may actually decline with age). The educational challenge, as Pelaccia and colleagues note, is “to optimize the functioning of System 1 by ensuring that practitioners acquire relevant knowledge in a way that can be retrieved as needed”.

But here is the critical gap: researchers have found that while physicians can gain knowledge of cognitive biases through training, there is much less evidence that they can recognise biases in practice. Recognising and mitigating biases remains a challenge precisely because they occur during decision-making at the subconscious level. You can teach someone about pattern-matching failures. You cannot, through information transfer alone, change how they pattern-match.

The problem compounds under cognitive load. Using granular electronic medical record and audit-log data, researchers have shown that under higher cognitive load, physicians substitute mental deliberation with more numerous but less precise diagnostic actions. They order more tests, but less targeted ones. Uncertainty in diagnostic beliefs increases. A physician in the highest cognitive load decile increases hospital admissions by 28% relative to the same physician in the lowest decile. The expert’s pattern recognition degrades predictably when the system overwhelms their cognitive capacity.

Context graphs give you retrieval. Retrieval is not recognition.

The expert doesn’t search a database for similar cases. The expert immediately perceives which features of the situation matter, often features that would seem irrelevant to a novice. A junior analyst can retrieve “last time revenue dipped, we cut HR spend.” But only an expert recognises that this dip has different characteristics (e.g., seasonal vs structural) and that cutting spend in the wrong category would be catastrophic.

The hard part isn’t finding similar cases. The hard part is knowing what “similar” means for the problem at hand. That knowledge doesn’t transfer via database. It transfers via feedback loops that reorganise perception.

When Klein asked firefighters how they knew to evacuate a building moments before a floor collapsed, they said “it felt wrong”. They couldn’t articulate the 47 micro-signals they had processed unconsciously: the slight temperature differential, the unusual sound quality, the subtle visual cues. The decision traces capture what was said, not what was known. The reasoning that gets logged is the post-hoc rationalisation, not the actual pattern recognition that happened in the moment.

What Cannot Be Captured

Our own research on advance care planning in aged care settings reveals this dynamic in a domain far from burning buildings, yet structurally identical. Effective ACP relies on both explicit knowledge (clinical guidelines, procedural requirements) and tacit knowledge (professional judgment, interpersonal skills, cultural sensitivity). Staff described ACP as a dynamic, iterative process requiring intuitive judgment to adapt to evolving patient needs, cultural contexts, and family dynamics. They emphasised gauging health literacy and willingness to engage in existential conversations, often using humour, reframing, or personal anecdotes to foster openness and build trust.

These micro-skills of knowing when to use humour, sensing reluctance, and reading family dynamics constitute precisely the tacit knowledge that doesn’t transfer via documentation. You cannot write a decision trace that captures “I sensed the daughter wasn’t ready to hear this, so I shifted the conversation”. The perception that triggered the shift, the judgment about timing, the selection of an alternative approach: all of this happens below the threshold of articulation.

Critically, our ACP findings showed that tacit knowledge, while essential, is insufficient without organisational support: training, reflective practice, and system-level changes to address time pressures and competing demands. The individual’s expertise operates within a system context. An agent that retrieves decision traces inherits neither the tacit knowledge nor the organisational context that made those decisions sensible.

If this tacit knowledge cannot be captured, can it be transferred at all?

What Can Be Transferred

A separate study demonstrates both the stakes of this question and its answer. We developed an embodied conversational agent, a virtual patient named Ted simulating dementia, to train aged care staff in communication skills. Ted responds to what caregivers say: use positive communication principles, and he cooperates; fail to slow down, listen, or acknowledge his personhood, and he becomes frustrated and withdrawn. Caregivers must navigate the conversation in real time, experiencing the consequences of their communication choices.

We compared this simulation-based training to traditional online training modules covering the same communication principles. The results were striking. Eight weeks after training, all participants in the simulation cohort still remembered the core principles and continued implementing them in practice. In the online training cohort, 80% had forgotten the material entirely. More telling: 11 of 12 simulation participants reported changing their actual care practice, versus only 2 of 11 who completed online training. Participants in the online cohort described the material as comprehensive but “boring” and complained that the examples didn’t depict reality. The simulation cohort described feeling challenged, engaged, and emotionally connected to Ted.

The learnings that stuck were not abstract principles but embodied skills: slowing down communication, listening to understand needs, not imposing choices, building relationships through personal engagement. These are precisely the tacit knowledge skills that the ACP research identified as essential yet resistant to documentation. The same domain, the same skills, radically different outcomes depending on method.

The principle generalises. Lia DiBello, a cognitive scientist who has spent decades studying expertise acquisition, shows that all great business experts share the same mental model structure: an integrated understanding of financial, operational, and market dynamics. The difference between expert and novice isn’t what they know but how it’s organised.

Give a novice every decision trace, every case study, every report, every piece of organisational data. They remain a novice. Why? Because expertise isn’t about having information. It’s about having the right cognitive structure to make sense of information. The sense-making architecture matters, and it has to be systematically developed.

DiBello developed methods to accelerate expertise acquisition, initially for the military and subsequently across industries including mining, pharmaceuticals, financial services, and manufacturing. Studies of over 7,000 people at all levels exposed to her methods indicate that learning normally shown to occur over five years can occur in two to four weeks, regardless of culture. Her approach: strategic rehearsals. Put the learner in a realistic scenario. Ask them to predict what happens next. Give immediate feedback. Show them the reality. When their prediction fails, surface what they missed. Repeat across varied scenarios.

This works because it forces predictions and shows consequences. The brain reorganises how it structures knowledge under the pressure of being wrong. The learner develops the expert’s perceptual sensitivity: the ability to see which features matter.
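For illustration only, the loop structure (scenario, forced prediction, revealed consequence, surfaced cue) can be sketched in a few lines. The scenarios and fields below are hypothetical and this is not DiBello's protocol; it only shows where the prediction-feedback pressure sits.

```python
# An illustrative scaffold of a prediction-feedback rehearsal loop.
# Scenarios, outcomes, and cues are made up for the sketch.
from dataclasses import dataclass

@dataclass
class Scenario:
    situation: str   # what the learner is shown
    outcome: str     # what actually happened
    cue: str         # the feature an expert would have weighted

scenarios = [
    Scenario("Revenue dips 2% in Q3",
             "Within normal seasonal variation; no action needed",
             "Q3 dips recur every year at similar magnitude"),
    Scenario("Key supplier misses two deliveries in a month",
             "Cascading production halt three weeks later",
             "Single-sourced component with no buffer stock"),
]

def rehearse(scenarios, predict):
    for s in scenarios:
        print(f"Situation: {s.situation}")
        guess = predict(s)                        # force a commitment before the reveal
        print(f"Predicted: {guess}")
        print(f"Actual:    {s.outcome}")
        print(f"Cue the expert weights: {s.cue}\n")  # surface what the novice missed

# Usage: plug in the learner (a human typing, or an agent under evaluation).
rehearse(scenarios, predict=lambda s: input("Your prediction: "))
```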

The aviation industry discovered this decades ago. Simulation-based training, grounded in methodologies that transformed aviation from “risky” in the late 1950s to “safer” within several years, now exceeds traditional didactic and apprenticeship models in speed of learning, information retained, and capability for deliberate practice. Healthcare has adopted these methods for anesthesiology, emergency medicine, surgery, and critical care. The principle is consistent: without timely and appropriate feedback, trainees cannot learn from mistakes and successes. Simulation provides the prediction-feedback loop that reorganises cognition.

The FAA’s scenario-based training explicitly connects to situated cognition: knowledge cannot be known and fully understood independent of its context. The goal is to train pilots through meaningful repetition to gather information and make informed, timely decisions. This is not information transfer. It is cognitive reorganisation through experience.

What makes these methods work is not openness but constraint. Ted doesn’t allow learners to wander through arbitrary conversations; the scenario structure determines which moments matter and delivers feedback at precisely those junctures. DiBello’s strategic rehearsals don’t present all possible business situations; they select scenarios that expose specific gaps between novice and expert perception. Flight simulators don’t recreate every possible flight; they compress critical decision points into structured sequences. The learning is effective because the scenario does the epistemic work of determining relevance that learners cannot yet do for themselves.

LLM fine-tuning doesn’t do this. Fine-tuning optimises for output similarity: make the model’s responses look more like the expert’s responses. But this produces mimicry, not understanding. The model learns to generate outputs that pattern-match to expert outputs without learning why those outputs were right in those situations.

You can train a system to sound like an expert without training it to see like an expert. The outputs match; the understanding doesn’t. And in novel situations, where expertise is most needed, mimicry fails.

But even if we could capture tacit knowledge, we would face a deeper problem: the decisions themselves may not represent the organisational intelligence we imagine.

VII. The Garbage Can Reality

The context-is-all-we-need thesis assumes that organisational decisions have a rational structure worth capturing. Herbert Simon and James March, two of the most influential organisational theorists of the twentieth century, spent their careers demonstrating that this assumption is largely false.

Simon won a Nobel Prize for his work on bounded rationality. Classical economics assumes humans are rational maximisers who know all alternatives, compute all consequences, and choose optimally. Simon demonstrated that humans can do none of this. We face limited information, limited computation (echoing Wolfram’s point about bounded observers), limited time, and conflicting goals.

His solution: instead of optimising, humans satisfice. We search until we find something “good enough” and then stop. The threshold for “good enough” is not objective but contextual ~ depending on who’s deciding, what pressures they face, what alternatives happen to be visible, how tired they are.

Every decision in the context graph was “good enough” for that person at that moment. The threshold was arbitrary. You were never capturing organisational intelligence. You were capturing what satisfied Bob’s threshold on Friday.

March extended this analysis with his Garbage Can Model of organisational decision-making. Studying universities, he recognised them as 'organised anarchies': goals are vague and discovered through action rather than guiding it, nobody really understands how things work or what causes what, and who shows up to any given decision changes constantly.

In the Garbage Can Model, organisations are containers where four independent streams float around: problems looking for decisions to attach to, solutions looking for problems to solve, participants looking for work to do, and choice opportunities (meetings, deadlines) when decisions “have to be made.”

The brutal conclusion: these streams operate independently and collide by temporal coincidence, not logic. As March and colleagues put it, a choice opportunity is best viewed as “a garbage can into which various kinds of problems and solutions are dumped by participants as they are generated. The mix of garbage in a single can depends on the mix of cans available, on the labels attached to the alternative cans, on what garbage is currently being produced, and on the speed with which garbage is collected and removed from the scene.”

Agent-based reconstructions of the garbage can model, using modern simulation techniques unavailable in 1972, have confirmed its core findings: most decisions are made without solving any problem, the few problems that are solved generally pertain to lower hierarchical levels, and consequently members of an organisation encounter the same old problems again and again.

The decision trace is not a record of reasoning. It’s a post-hoc story the organisation tells itself to make the garbage-can collision look rational. You’re not capturing organisational intelligence. You’re capturing organisational fan-fiction.

This breaks the “temporal validity” idea that context-is-enough enthusiasts worry about. We can’t reliably track when facts were true because the “facts” about organisational decision-making were never coherent. You can’t build an event clock for a sequence that never existed anywhere outside retrospective rationalisation.

Donald Wheeler, Deming’s student and associate for over twenty years, adds another layer. All data contains noise. Only some data contains signal. Wheeler distinguished common-cause variation (random fluctuation inherent to the system) from special-cause variation (something actually changed).

Wheeler identified two fundamental mistakes that organisations make:

Mistake 1: Treating an outcome as if it came from a special cause, when actually it came from common-cause variation.

Mistake 2: Treating an outcome as if it came from common causes, when actually it came from a special cause.

Both mistakes are costly. But here is the critical insight: by misinterpreting either type of cause as the other, and acting accordingly, organisations not only fail to improve matters—they literally make things worse. Reacting to noise introduces more variation. Ignoring signal allows problems to compound.

Most organisational decisions are reactions to common-cause variation. Revenue drops 2%, panic ensues. Revenue rises 2%, and the leadership are celebrated as causal heroes. But if both movements fall within normal variation, everyone is just reacting to noise. And by reacting to noise, they introduce more variation into the system.
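Wheeler's filter is mechanically simple. A sketch of an XmR (individuals) chart on a made-up series of monthly revenue changes shows why a 2% move is usually noise: the natural process limits (mean ± 2.66 × average moving range, Wheeler's standard scaling) comfortably contain it.

```python
# A minimal sketch of Wheeler's filter: an XmR (individuals) chart on monthly
# revenue changes. The 2.66 factor is Wheeler's scaling for natural process
# limits; the revenue series itself is made up for illustration.
import numpy as np

revenue_change_pct = np.array([1.2, -0.8, 2.1, -1.9, 0.4, 2.0, -2.2, 1.5, -0.6, 1.1, -1.4, 0.9])

mean = revenue_change_pct.mean()
avg_moving_range = np.abs(np.diff(revenue_change_pct)).mean()

upper = mean + 2.66 * avg_moving_range   # upper natural process limit
lower = mean - 2.66 * avg_moving_range   # lower natural process limit

for month, x in enumerate(revenue_change_pct, start=1):
    label = "SIGNAL (special cause)" if x > upper or x < lower else "noise (common cause)"
    print(f"month {month:2d}: {x:+.1f}%  -> {label}")

# A 2% dip sits comfortably inside these limits: reacting to it is Wheeler's
# Mistake 1, and the reaction itself injects new variation into the system.
```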

Without Wheeler’s filter, your context graph compounds the organisation’s flinches. You capture garbage decisions with perfect fidelity, index them for semantic retrieval, and serve them up to agents as organisational wisdom. If we react to all the noise in the data, overreacting and asking for explanations for every up and down, we waste time, frustrate everyone involved, and inject additional instability into the system.

The knowledge-graph doesn’t distinguish signal from noise. It captures both with equal fidelity. And an agent querying that graph will reproduce the organisation’s pattern of flinching at noise, now at scale, with confidence, and without the human judgment that might occasionally catch the error. Throw in a few hallucinations and you now have even more noise to contend with.


VIII. What Complementarity Requires

A Nobel laureate and a century of research converge on a single conclusion: the thing the context-is-all-we-need thesis is trying to capture doesn’t exist in the form imagined.

Not because organisations lack intelligence. They often have it, often distributed across experienced people, embedded in effective processes, instantiated in well-designed systems. But that intelligence is not extractable into a graph. It lives in:

  • Wheeler’s filter: The capacity to distinguish signal from noise

  • Deming’s outside view: The ability to see the system from a position it cannot see itself

  • Klein’s recognition: Knowing which features of a situation matter before analysis begins

  • DiBello’s structure: Cognitive organisation that makes information usable

  • Popperian conjecture: The capacity to generate theories that reach beyond available data

  • Observer position: A different computational structure that makes different distinctions

None of these transfer via database. None of these are “more context”, “better retrieval”, or “an improved capability of the substrate of intelligence”.

This is not a pessimistic conclusion. It is a clarifying one. It tells us what human-AI complementarity actually requires.

The human contribution to agentic systems is not labelling data, reviewing outputs, or providing feedback on responses. These are valuable but secondary. The primary human contribution is the epistemic work that makes the agent’s context meaningful: providing the theory that determines what to capture, the recognition that determines what’s relevant, the judgment that distinguishes signal from noise.

The agent is the instrument, not the intelligence. Like a telescope, it extends reach. Like a telescope, it doesn’t tell you where to point or what you’re looking at. The astronomer provides that. The domain expert provides that. The organisational understanding provides that.

Palantir seems to understand this; they have a fleet of Forward Deployed Engineers: specialists who embed at client organisations for months, untangling the messy human work of building ontologies, integrating data sources, and uncovering constraints. The engineers do the Deming/Wheeler/Klein/DiBello work manually. Then they document it in the platform. This has been a core observation of mine since I built my very first AI embedded in a manufacturing plant (all the way back in 1995) ~ I was doing much of what is outlined here; I just did not have names for the work. It was simply work that had to be done, and I (thankfully) did not have today’s LLMs, with their accompanying social media clamour, to confuse me about what is possible and what is not.

The platform isn’t the product. The humans are. The “context-for-LLM-calls” is just the receipt.

Selling AI infrastructure without an army of appropriately trained forward-deployed engineers is selling the map as if it were the territory ~ all it guarantees is confusion and anger soon enough, when it does not deliver on the promise with perfect fidelity (especially since in most cases the sales and marketing teams probably do not know exactly what they are selling, nor have they yet experienced the customers screaming at them).


IX. Implications for Agentic Design

If this analysis is correct, what follows for the design of agentic systems?

First, evaluation must shift from accuracy to coupling quality. Traditional AI evaluation focuses on prediction accuracy and benchmark performance. But a system that achieves high benchmarks while failing to couple effectively with human workflows is poorly adapted regardless of its technical sophistication.

Our Physics of AI framework identifies “human feedback is substrate” as a structural constraint: AI systems do not define their own objectives or correctness. Meaning originates from people. Evaluation must therefore assess workflow integration, mental model alignment, feedback loop closure, adaptation responsiveness, and error recovery, not just retrieval precision.

Second, design for requisite variety, not peak performance. Ashby’s Law of Requisite Variety states that a controller must have at least as much variety as the system it attempts to control. Agentic capability should be managed as a portfolio that collectively matches the variety of situations the system will encounter.

This means cultivating diverse agent capabilities rather than optimising a single capability, building graceful degradation paths when specific capabilities are unavailable, and monitoring for coverage gaps. The goal is not maximum performance on any single task but adequate performance across the space of tasks that matter.
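As a sketch of what managing capability as a portfolio could look like in practice: enumerate the situation types the system must handle, map them against agent capabilities, and flag coverage gaps and single points of failure. The situation types and the registry below are hypothetical placeholders.

```python
# Sketch of a requisite-variety check: does the agent portfolio cover the
# variety of situations the system must handle? Names are illustrative.
situation_types = {
    "invoice_dispute", "contract_renewal", "regulatory_query",
    "incident_triage", "customer_churn_signal",
}

agent_capabilities = {
    "billing_agent":    {"invoice_dispute"},
    "legal_agent":      {"contract_renewal", "regulatory_query"},
    "compliance_agent": {"regulatory_query"},
    "ops_agent":        {"incident_triage"},
}

covered = set().union(*agent_capabilities.values())
gaps = situation_types - covered
fragile = {s for s in covered
           if sum(s in caps for caps in agent_capabilities.values()) == 1}

print("coverage gaps (no agent can handle):", gaps)            # e.g. customer_churn_signal
print("single-agent dependencies (no graceful degradation):", fragile)
```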

Third, accept the stability paradox and design accordingly. Our empirical findings show that the most depended-upon components are the most likely to change. Components that achieve significant adoption will face continuous pressure for modification. Design for dynamic stability: the capacity to undergo change while maintaining organisational identity and coupling quality.

This suggests investing in robust versioning, extension mechanisms that accommodate change without breaking existing integrations, and monitoring systems that track distributional profiles rather than just component-level metrics.

Fourth, invest in human judgment infrastructure, not just data infrastructure. The bottleneck for effective agentic systems is not data or capability but the human capacity to provide theory, recognition, and the outside view. Organisations need:

  • Mechanisms to surface expert knowledge that can guide agent deployment

  • Training approaches (like DiBello’s strategic rehearsals) that accelerate expertise development

  • Organisational practices that distinguish signal from noise before it enters the data pipeline

  • Feedback loops that connect agent actions to consequences that humans can evaluate

Fifth, anticipate the bootstrapping paradox. Organisations that could benefit most from agentic AI often lack the discipline to implement it effectively. Those with sufficient discipline may not need it as badly. This suggests that agentic AI adoption will follow existing organisational capability rather than creating it.

The question is not “How do we build context graphs?” but “How do we build organisational discipline that can leverage context graphs?” The latter is a human problem requiring human solutions.


X. Testable Predictions

A framework’s value lies partly in its predictive capacity. Based on the analysis above, I offer five predictions that can be validated or falsified as agent ecosystems mature:

Prediction 1: Verification infrastructure will be destabilised by adoption. Components that become widely used for agent verification—evaluation harnesses, guardrails, validation systems—will face continuous pressure for modification. The stability paradox applies: success creates instability.

Prediction 2: Capability profiles will stabilise early. The distribution of agent capabilities will converge relatively quickly, even as individual capabilities continue to evolve. This mirrors our finding that software systems maintain distributional homeostasis while individual components churn.

Prediction 3: Growth will occur through agent proliferation, not agent expansion. As agent ecosystems mature, growth will come primarily from new specialised agents rather than from expanding the capabilities of existing agents. This follows from the principle that systems grow by addition rather than modification. (It mirrors findings from software evolution research showing that exponential growth happens when architectures allow plug-ins to supply the functional diversity needed.)

Prediction 4: Popular protocols will become sites of continuous change. Protocols and interfaces that achieve widespread adoption (like the Model Context Protocol) will require constant evolution. The most depended-upon specifications will be the least stable.

Prediction 5: Organisations with strong prior discipline will capture disproportionate value from agentic systems. The bootstrapping paradox implies uneven distribution of benefits. Organisations that already have Wheeler’s filter, Klein’s recognition, and DiBello’s structure will be able to leverage agents effectively. Those lacking these capabilities will struggle regardless of technical investment.

These predictions are testable in principle, though validating them requires longitudinal data on agent ecosystems that is only now beginning to accumulate.


XI. Closing Arc: The Conjecture Gap

We began with the seduction: give agents enough context, and competence emerges. We traced this assumption to its roots in inductivist epistemology. The belief that knowledge crystallises from sufficient observation. We showed why this belief fails: because conjecture precedes observation, because knowledge is causal power rather than correlation, because observers are bounded by their computational structure, because recognition differs fundamentally from retrieval, because organisational decisions are garbage-can collisions rationalised after the fact.

The conjecture gap names what cannot be bridged by context alone: the space between data and the theory that makes data meaningful, between observations and the explanations that give them reach, between retrieving similar cases and recognising what similarity means.

This gap is not a temporary limitation to be engineered away. It is a structural feature of how knowledge works. Popper, Deutsch, Wolfram, Deming, Klein, DiBello, Simon, March ~ theorists spanning philosophy of science, cognitive psychology, organisational theory, and systems evolution all converge on the same insight. Understanding requires conjecture, position, recognition, structure. These do not emerge from data. They are what make data interpretable.

The implications for agentic AI are not pessimistic but clarifying. Agents are powerful instruments. Like all instruments, their value depends on the skill of their users. The telescope did not replace the astronomer. The microscope did not replace the biologist. The agent will not replace the expert who knows what to look for and why it matters.

Context is abundant. Theory is scarce. The bottleneck for agentic systems is not what agents can access but what humans can understand. The organisation that cannot articulate its own functioning will not be saved by an agent that can query its records. The dysfunction merely scales.

A century of research, multiple case studies, and decades of software evolution data point the same direction. The path forward is not more context but better theory. Not larger graphs but sharper judgment. Not comprehensive capture but disciplined selection.

The question was never “How do we give agents enough context?”

The question was always “How do we develop the understanding that makes context useful?”

And that is a human question, requiring human answers, no matter how capable our instruments become.


The empirical foundations of this essay draw on research published in “The Physics of AI” (CAIN 2026) and “Viability Through Coupling: Evolutionary Dynamics of Agentic Systems.” Elements of the framing were inspired by the work of @parcadei.


References

Ashby, W. R. (1956). An Introduction to Cybernetics. Chapman & Hall.

Cohen, M. D., March, J. G., & Olsen, J. P. (1972). A garbage can model of organizational choice. Administrative Science Quarterly, 17(1), 1–25.

Curumsing, M. K., Rivera Villicana, J., Vouliotis, A., Burns, K., Babaei, M., Petrovich, T., Mouzakis, K., & Vasa, R. (2025). Talk with Ted: An embodied conversational agent for caregivers. Gerontology & Geriatrics Education, 46(2), 186–203. https://doi.org/10.1080/02701960.2024.2302584

Deming, W. E. (1986). Out of the Crisis. MIT Press.

Deutsch, D. (2011). The Beginning of Infinity: Explanations That Transform the World. Viking.

DiBello, L. (2023). How experience builds business expertise and what AI can’t replace. Cognitive Agility Lab Research Reports.

DiBello, L., Missildine, W., & Struttmann, M. (2009). Intuitive expertise and empowerment: The long-term impact of simulation training on changing accountabilities in a biotech firm. Mind, Culture, and Activity, 16(1), 11–31.

Felin, T., Kauffman, S., Koppl, R., & Longo, G. (2014). Economic opportunity and evolution: Beyond landscapes and bounded rationality. Strategic Entrepreneurship Journal, 8(4), 269–282.

Ippolito, A., Khawand, C., Chen, J. H., & Abaluck, J. (2023). Physician Decision Making Under Uncertainty. Working Paper, University of California Berkeley / UCSF.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

Kingdon, J. W. (1984). Agendas, Alternatives, and Public Policies. Little, Brown.

Klein, G. (1998). Sources of Power: How People Make Decisions. MIT Press.

Klein, G. (2009). Streetlights and Shadows: Searching for the Keys to Adaptive Decision Making. MIT Press.

Lomi, A., & Harrison, J. R. (2012). The Garbage Can Model of Organizational Choice: Looking Forward at Forty. Emerald Group Publishing.

March, J. G. (1991). Exploration and exploitation in organizational learning. Organization Science, 2(1), 71–87.

March, J. G., & Simon, H. A. (1958). Organizations. Wiley.

Padgett, J. F. (1980). Managing garbage can hierarchies. Administrative Science Quarterly, 25(4), 583–604.

Pelaccia, T., Tardif, J., Triby, E., & Charlin, B. (2024). How emergency physicians make decisions: A review of the decision-making literature. Academic Emergency Medicine, 31(1), 82–94.

Popper, K. (1959). The Logic of Scientific Discovery. Hutchinson.

Popper, K. (1963). Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge.

Sellars, M., Hutchinson, A., & Vasa, R. (2025). The role of tacit knowledge in advance care planning: Aged care staff perspectives. BMJ Supportive & Palliative Care, 15(Suppl 1), A20. https://doi.org/10.1136/spcare-2025-ACP.46

Silberzahn, K., Anderson, J., Arditi, E., Gautier, W., & Kranz, M. (2023). Simulation-based medical education: A review of its historical development and evidence base. Medical Teacher, 45(8), 831–841.

Simon, H. A. (1947). Administrative Behavior: A Study of Decision-Making Processes in Administrative Organizations. Macmillan.

Simon, H. A. (1955). A behavioral model of rational choice. Quarterly Journal of Economics, 69(1), 99–118.

Simon, H. A. (1969). The Sciences of the Artificial. MIT Press.

Takács, K., Squazzoni, F., Bravo, G., & Castellani, B. (2008). The garbage can model of organizational choice: An agent-based reconstruction. Sociological Methods & Research, 37(2), 179–210.

Vasa, R., et al. (2026). The Physics of AI: Structural constraints on generative AI in software engineering. In Proceedings of the International Conference on AI Engineering (CAIN 2026). IEEE/ACM.

Wheeler, D. J. (1993). Understanding Variation: The Key to Managing Chaos. SPC Press.

Wheeler, D. J. (2000). Understanding Statistical Process Control (3rd ed.). SPC Press.

Wheeler, D. J., & Chambers, D. S. (1992). Understanding Statistical Process Control. SPC Press.

Wolfram, S. (2002). A New Kind of Science. Wolfram Media.