Abstract. Autonomous agent design remains organised around the task: define a goal, optimise a metric, score the result. This essay argues that the deeper challenge is specifying the agent itself as a persistent, bounded, adaptive system. It develops several principles that hold across scales and domains: that an agent is formally defined by its compression of history into actionable state; that adaptation at every timescale, evolutionary, developmental, and algorithmic, is structural alignment with environmental statistics; that collapsing plural objectives into a single scalar produces a characteristic pathology (dimensional reduction) at epistemological, axiomatic, and practical levels simultaneously; and that evaluator design is governance design. The essay closes with an engineering and research agenda for building agents whose memory, objectives, boundaries, and evaluation infrastructure are treated as first-class architecture rather than afterthoughts.


Thesis

The next generation of autonomous agents will not be held back chiefly by a lack of larger models, better prompts, or more clever search. It will be held back by a way of thinking that the field knows is incomplete but continues to rely on: even among practitioners building the most advanced systems, the dominant design logic still treats agency as a problem of optimizing task performance, when the deeper problem is engineering persistent, bounded, adaptive systems.

The distinction is not a dichotomy; one can, and should, optimize task performance by means of engineering persistent systems. The claim is about ordering: which concept is treated as foundational, and which are derived from it. Once the task becomes primary, the agent itself fades into the background. We specify the environment, define a metric, tune the loop, and hope intelligence will emerge from optimisation. Once the agent becomes primary, different questions come forward: what state does it carry across time? How does it compress experience into representation? How does it adapt at multiple timescales? How does it extend its future control? How are its objectives actually specified? Where exactly do its boundaries lie?

Serious autonomous agents will not be built by perfecting reward-driven task optimisation alone. They will be built by treating the agent as a first-class engineering artifact, one whose goals are plural, whose memory is representational rather than archival, whose competence depends on continual adaptation, and whose objectives cannot be faithfully reduced to a single scalar. Reward remains useful, sometimes very useful, but it is not a complete theory of agency. The central error is not that reward is wrong; it is that reward is asked to carry more meaning than it can bear.

The essay proceeds as follows. §1 surveys what the field has built and where the gaps remain. §2 proposes a formal grounding for what "the agent" actually is, centred on compression of history into actionable state. §3 and §4 develop two consequences of that grounding: that memory must be dynamic representation rather than archival storage, and that adaptation must operate across multiple timescales simultaneously. §5 reframes intelligence as the extension of an agent's horizon of effective control. §§6-8 turn to objectives, arguing that scalar reward is structurally inadequate, that richer objective architectures are needed, and that evaluator stacks are the practical locus where values enter the system. §9 addresses the question of where the agent's boundaries lie and who draws them. §10 confronts the evaluation gap honestly, and the essay closes with an engineering and research agenda.


1. Where We Stand: Progress and Its Limits

Before diagnosing what gaps remain, it is worth crediting what has been built.

The field has moved substantially since the first wave of prompt-chained "agents" in 2023. Frameworks like LangGraph (LangChain, 2024) provide stateful execution graphs with durable memory, explicit human-in-the-loop checkpoints, and multi-step orchestration that goes well beyond naïve chain-of-thought. Constitutional AI (Bai et al., 2022), as implemented in systems like Claude, demonstrates that layered value hierarchies (hard constraints, soft preferences, escalation triggers) can be built into a model's operating logic via training rather than bolted on as post-hoc filters. And projects like OpenClaw (OpenClaw, 2025) show what happens when you push persistence seriously: a daemon that runs continuously, maintains structured memory files, performs proactive background tasks via heartbeats and cron, generates its own skills, and adapts to its user over weeks and months. Users describe it as feeling less like a tool and more like a collaborator that matures over time.

These are genuine advances. They demonstrate that the intuitions behind agent-as-persistent-system are widely shared and partially realised. This essay does not argue that the field has failed to notice the problem. It argues that the field has not yet followed its own intuitions to their engineering conclusions.

A note on scope: the examples above, and the argument that follows, are primarily about the LLM-agent community, the world of language-model-based autonomous systems and the frameworks built around them. Robotics, game-playing agents, autonomous vehicles, and industrial control systems have their own traditions of persistent state, formal specification, and multi-timescale adaptation, some of which are considerably more mature. The claims here about "current practice" should be read with that domain restriction in mind, though the conceptual arguments about agent specification, memory, and objectives apply broadly.

The gap shows up in three characteristic ways.

First, even the most sophisticated frameworks still evaluate agents primarily on aggregate task success. The benchmarks the field uses to track progress tell the story clearly: SWE-bench (Jimenez et al., 2024) measures whether a coding agent can resolve a GitHub issue (pass/fail on test suites). GAIA (Mialon et al., 2023) measures whether an assistant can answer multi-step questions correctly. TAU-bench (Yao et al., 2024) measures whether a support agent completes a customer interaction successfully. These are all valuable, but they are all task-shaped: binary or scalar measures of episode-level success. None measures recovery under distribution shift, option preservation, constraint respect under pressure, or legibility to human collaborators over time, the properties that distinguish a persistent system from a well-tuned workflow. We build for persistence but measure for episodes.

Second, features that gesture toward persistence (memory files, self-generated skills, background loops) are often implemented as engineering conveniences rather than governed as critical architecture. To take one example: systems like OpenClaw maintain persistent memory files that summarise interactions across sessions. Whether this achieves genuine compressive representation or remains closer to log rotation is an empirical question that depends on the specific implementation, but in the author's survey of current frameworks, the dominant pattern favors the latter. A skill-generation system that adds capabilities but lacks principled mechanisms for retiring, versioning, or arbitrating conflicts among those capabilities is accumulation, not maturation.

Third, and most subtly, the core loop of most agent systems is still event-triggered: something arrives (a user message, a cron tick, an API callback), the agent responds, the loop sleeps. Genuine persistence implies something stronger: an agent that maintains and updates its world model between triggers, that notices drift in its own competence, that initiates action not because it was prompted but because its representation of the situation warrants it. We are closer to this than we were, but the distance remaining is architectural, not incremental.

The objection that "this is already being done" is therefore half right. The features exist. The discipline around them, the treatment of these features as the primary engineering objects rather than as add-ons to a task-completion pipeline, largely does not.

It is worth noting that we cannot currently point to a system that unambiguously exemplifies agent-centered design. That absence is itself diagnostic. Without agreed-upon formal criteria for what "centering the agent" means, we cannot distinguish systems that partially achieve it (MetaClaw's persistent self-improving architecture (Xia et al., 2025), JEPA's world-model-first approach to representation (LeCun, 2022), Schmidhuber's RSI programme (Schmidhuber, 1987; 2003)) from systems that merely accumulate agent-like features without architectural coherence. Several efforts gesture in the right direction, but we lack the vocabulary to say precisely how far they have come. The definitions proposed in the next section are not academic exercises. They are what would allow the field to measure progress toward agent-centered design rather than simply describe aspirations about it.


2. Specifying the Agent

There is a notable asymmetry at the foundations of the field. Reinforcement learning has canonical, extensively formalised models of environments (Markov decision processes, partially observable Markov decision processes, stochastic games) complete with axioms, solution concepts, and decades of theoretical development. It has no equally canonical model of the agent.

The standard response is that RL does specify the agent: the policy π, the value function V, and the update rule (PPO, SAC, or whatever algorithm is in play). But notice what that specification includes and what it omits. It captures the agent's behavioral interface, how it maps states to actions and how it improves that mapping over time. It does not capture the agent's internal organisation: what it remembers and how that memory is structured, what resources it may consume, where its boundaries lie, how it handles interruption and resumption, what it is permitted to modify in the world, how it detects its own declining competence, or what counts as recovery after failure. For a training algorithm that will be discarded once a policy is extracted, this level of specification suffices. For an autonomous system that persists, accumulates commitments, and operates in open-ended environments, it does not. The agent needs a specification of its own, as explicit and as carefully reasoned as the environment models we already have.

This specification gap has a precise remedy. An agent operating in a stochastic environment can be characterised formally by its capacity to compress an arbitrarily long interaction history into a bounded, finite internal state from which it selects actions. The agent, in this view, is its compression function: the mapping from unbounded history to finite representation. This builds on the insight behind predictive state representations (Littman, Sutton, and Singh, 2001), which showed that an agent's state can be fully characterised as a sufficient statistic over action-observation histories for predicting future observations, and thus for deriving a policy. A Markov decision process (MDP) specifies the environment's transition and reward structure. Where the MDP focuses on the environment's structure, the agent is specified by the structure of its own memory: what it holds onto, what it lets go, and how the retained state informs action. The quality of that compression is the quality of the agent. The poorer and more ad hoc that compression, the more the agent is at the mercy of whatever happens to remain in its context window. Specifying the agent, then, is specifying the architecture of lossy, task-relevant compression. Once this is understood, the questions that opened the essay (what state does the agent carry? how does it compress experience into representation?) cease to be rhetorical and become engineering requirements with formal content.
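
To make the compression-function view concrete, here is a minimal sketch in Python. Everything in it (the names, the running-average state, the two-action policy) is illustrative rather than drawn from any particular framework; the point is only the shape of the contract: unbounded history in, bounded state retained, actions selected from state alone.

```python
from dataclasses import dataclass
from typing import Protocol


class CompressiveAgent(Protocol):
    """The agent is its compression function: bounded state, not raw history."""

    def update(self, action: str, observation: float) -> None:
        """Fold one interaction step into the bounded internal state."""
        ...

    def act(self) -> str:
        """Select an action from current state alone, never from raw history."""
        ...


@dataclass
class RunningAverageAgent:
    """Toy instance: however long the history grows, the retained state is
    two numbers -- a lossy, task-relevant compression of everything seen."""
    count: int = 0
    mean: float = 0.0

    def update(self, action: str, observation: float) -> None:
        # Running mean: O(1) state for any history length.
        self.count += 1
        self.mean += (observation - self.mean) / self.count

    def act(self) -> str:
        return "exploit" if self.mean > 0 else "explore"
```

The toy is deliberately trivial; what matters is that the interface forbids the agent from reaching back into the transcript, which is exactly the discipline an unbounded context window makes easy to skip.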

Much present-day agent discourse remains oddly hand-wavy at exactly this point. Existing surveys decompose agents into modular inventories, such as profile, memory, planning, and action components (Wang et al., 2024), or more cognitively framed but still taxonomic structures such as brain, perception, and action (Xi et al., 2025). These frameworks describe what an agent is composed of, but not how those components cohere, persist, or govern one another. We talk about orchestration, tool use, planning, memory, evaluation, and retries, but the actual agent often remains undefined. Is it just the model weights? The prompt plus tools? The whole loop including retrieval, planner, and evaluator? A software engineer should find this unsatisfying. It is as though we wrote the API contract, the error codes, and the retry logic, but never defined the service itself: its state, its lifecycle, its invariants. Yet that is close to how many "autonomous agents" are discussed today.

A real agent is not a function from prompt to output. It is a process that persists through time. It has internal state, update rules, action-selection procedures, capability boundaries, memory budgets, failure modes, and interfaces to the world. It should be possible to say what it knows, how that knowledge changes, when it should defer, what resources it may spend, what it is allowed to modify, and what counts as recovery after error. Without that kind of explicit specification, what we call an autonomous agent is really just a workflow with optimism attached.
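
The service analogy can be made literal. The sketch below, with invented field names and illustrative checks, shows what an explicit agent contract might look like: lifecycle states, memory budgets, permitted mutations, and recovery criteria written down where they can be tested rather than assumed.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Lifecycle(Enum):
    """Interruption and recovery as first-class states, not crash handling."""
    INITIALISING = auto()
    RUNNING = auto()
    SUSPENDED = auto()
    RECOVERING = auto()
    RETIRED = auto()


@dataclass
class AgentSpec:
    """A service-style contract for an agent; all fields are illustrative."""
    memory_budget_tokens: int   # hard cap on retained state
    allowed_mutations: set[str] # what the agent may modify in the world
    defer_conditions: list[str] # conditions under which it must escalate
    recovery_criterion: str     # what counts as recovered after failure

    def check_invariants(self, used_tokens: int, target: str) -> list[str]:
        """Return violated invariants rather than silently proceeding."""
        violations = []
        if used_tokens > self.memory_budget_tokens:
            violations.append("memory budget exceeded")
        if target not in self.allowed_mutations:
            violations.append(f"mutation of '{target}' not permitted")
        return violations
```

Nothing here is sophisticated, and that is the point: the workflow-with-optimism problem is not that such contracts are hard to write, but that they are rarely written at all.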

One natural counter-argument is that explicit specification is unnecessary, that sufficiently capable foundation models combined with lightweight scaffolding will produce these properties emergently. Give a powerful enough model a tool-use loop and a scratchpad, and agent-like persistence will just happen. This is the hope behind many current architectures, and it is not entirely wrong: large models do exhibit surprising generalisation. But emergence is not a substitute for engineering. Not all emergent properties are unreliable; thermodynamic equilibria and fluid dynamics are emergent and highly stable. But the specific properties at issue here (agent-level persistence, representational memory, multi-timescale adaptation, principled boundary-setting) are not like thermodynamic equilibria. They are complex, context-sensitive, and failure-prone in ways that vary with deployment conditions. The distinction comes down to testability. An explicitly specified property can be tested independently, deliberately strengthened (when it fails), and monitored for degradation (under distribution shift). An emergent property, by definition, cannot be independently targeted by any of these interventions. A property that appears unreliably, that degrades in ways you cannot predict, and that you cannot deliberately strengthen, is not a feature of your system. It is a coincidence you have learned to depend on. Engineering means making the crucial properties explicit, measurable, and maintainable. That is what "specifying the agent" requires.


3. Memory as Dynamic Representation

If the agent is characterised by its compression function (§2), then memory is where that compression lives. Once the agent is treated as a real system, memory looks different. In many current architectures, memory is still confused with storage volume. We act as though autonomy will improve if only the agent can retain more conversation history, read more documents, or stuff more tokens into context. But raw history is not the same as useful memory. An append-only transcript is an audit log, not a working model of the world. The crucial problem is compression: how long histories become compact, policy-relevant representations that support inference and action.

Theories, causal graphs, world models: these are all mechanisms for exactly this compression. They are what allow an agent to carry the implications of a thousand observations in a structure small enough to reason over. A coding agent does not become genuinely more capable by carrying the entire repository chat into every call. It becomes more capable when it can maintain an architecture map, a hypothesis about where the bug lies, a model of dependency risks, a record of failed attempts, and a distinction between known facts and open uncertainties. A research agent improves not by hoarding text but by maintaining a claims graph, an evidence ledger, a map of unresolved questions, and a sense of which assumptions are critical. In software terms: raw logs are not enough. What matters is the state derived from them and a theory of the underlying system.

But compression, while necessary, does not capture the full character of the memory an autonomous agent needs. The deeper point is that memory is not a passive store; it is dynamic. It actively biases future action. Every update to the agent's internal representation changes not just what it recalls but what it will attend to, what hypotheses it will entertain, what actions it will consider, and what futures become reachable. Memory is not merely about the past; it is about the shape of the agent's future.

This insight is not new. It runs through schema theory in cognitive psychology, predictive processing in computational neuroscience, and the active inference framework (Friston et al., 2017), where internal generative models do not merely record the world but actively drive perception, attention, and action selection. What is new is its direct relevance to the engineering of autonomous software agents. In active inference, the agent's internal model is not a reference library consulted between actions; it is the machinery of action selection. The same principle applies to software agents. Updating a causal model is not a bookkeeping step; it restructures the space of policies the agent can pursue. Adding a new dependency to an architecture map does not just record a fact; it changes which refactoring strategies are live options. Recognising that a previously trusted data source has become unreliable does not just annotate the past; it redirects the agent's future information-gathering.

This is why representational memory and horizon extension (discussed in §5) are not independent ideas. They compound. The richer and more dynamic the agent's internal model, the further into the future it can project reliable consequences, provided the model is accurate and the environment is sufficiently predictable. When the model is wrong or the environment is adversarial, richer representations can produce overconfidence rather than longer reach. The connection between model richness and projective horizon is real but conditional, and the conditions matter for engineering. An agent with a flat log sees only what just happened. An agent with a well-calibrated causal model can see what is about to become possible. An agent with an overconfident causal model may confidently see something that isn't there.

To be concrete about the current gap: systems like OpenClaw maintain structured memory files that persist across sessions, and this is a real architectural advance over stateless prompt chains. But the dominant pattern remains log summarisation, periodically condensing interaction history into natural-language summaries. That is valuable for continuity. It is not yet the same as building and maintaining a model: a structured representation that distinguishes established facts from working hypotheses, tracks the confidence and provenance of its own claims, recognises when its map has diverged from its territory, and can be queried for action-relevant implications rather than just recalled as narrative. Even the foundation model's own attention mechanism, while learned through a dynamic compression process during training, becomes fixed at deployment; the agent inherits a frozen compression and must build its dynamic representations on top of it. The difference between a diary and a dynamically updated situation map is the distance still to be covered. And that distance cannot be closed by a one-time design effort, because the map itself must change as the territory shifts. Which raises the question: how should an agent's representations evolve over time, and at what pace?
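
The distinction between a diary and a situation map can be sketched directly. The structure below is a hypothetical illustration, not a description of any existing system: claims carry confidence and provenance, established facts are separated from working hypotheses, and the store is queried for action-relevant implications rather than replayed as narrative.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    statement: str
    confidence: float  # 0..1
    provenance: str    # where the claim came from
    is_fact: bool      # established fact vs working hypothesis


@dataclass
class SituationMap:
    """A queryable model, as opposed to a narrative diary of past sessions."""
    claims: list[Claim] = field(default_factory=list)

    def note(self, statement: str, confidence: float, provenance: str) -> None:
        # Illustrative threshold: high-confidence claims graduate to facts.
        self.claims.append(
            Claim(statement, confidence, provenance, is_fact=confidence >= 0.95)
        )

    def actionable(self, min_confidence: float = 0.8) -> list[Claim]:
        """Claims solid enough to act on."""
        return [c for c in self.claims if c.confidence >= min_confidence]

    def open_questions(self) -> list[Claim]:
        """Hypotheses the agent should still test, not merely recall."""
        return [c for c in self.claims if not c.is_fact]
```

A log-summarisation memory can answer "what happened?"; a structure like this can answer "what do I believe, how strongly, on what basis, and what should I check next?", which is the question that drives action.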


4. Adaptation Across Timescales

The answer is that they must evolve at multiple paces simultaneously. Intelligence should not be imagined as convergence to a solved policy after which learning is essentially over. Deployed agents live in changing environments: APIs shift, repositories evolve, user preferences drift, data distributions move, norms are revised, and the cost of mistakes changes with context. An agent that performs well only so long as the task remains stationary is not autonomous. It is merely well tuned.

Autonomy should instead be understood as competent adaptation across timescales:

  • A fast loop in which the agent perceives, acts, and corrects immediate mistakes.

  • A medium loop in which it updates memory, refines strategies, and learns recurring structures in the environment.

  • A slow loop in which it revises its own representations, thresholds, tools, and sometimes even the way it interprets success.

A system with only the fast loop can react but cannot mature. A system with only the slow reflective loop misses the moment. Agency appears when these loops work together.
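
A minimal sketch of the three loops driven by one clock. The fixed periods are illustrative assumptions; a real system would trigger the medium and slow loops on evidence (drift, error rates, stale strategies), not on tick counts.

```python
from dataclasses import dataclass, field


@dataclass
class MultiTimescaleAgent:
    """Fast, medium, and slow loops as nested cadences of one process."""
    medium_period: int = 10
    slow_period: int = 100
    tick: int = 0
    log: list = field(default_factory=list)

    def step(self, observation) -> None:
        self.tick += 1
        # Fast loop: perceive, act, correct immediate mistakes.
        self.log.append(("fast", observation))
        if self.tick % self.medium_period == 0:
            # Medium loop: update memory, refine strategies.
            self.log.append(("medium", "refine strategies, update memory"))
        if self.tick % self.slow_period == 0:
            # Slow loop: revise representations, thresholds, tools.
            self.log.append(("slow", "revise representations and thresholds"))
```

The sketch makes one thing visible that prose can blur: the loops are not three agents but one process whose deeper layers fire less often and change more.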

What unites these timescales is a single principle: adaptation is structural alignment. At every scale, what a learning system does is progressively mirror the statistical structure of its environment in its own internal organisation. This principle is scale-invariant. Over evolutionary time, biological perceptual systems come to reflect the statistics of natural scenes (Simoncelli and Olshausen, 2001): edge detectors, spatial frequency tuning, and color opponency are not arbitrary design choices but structural imprints of the visual world. Over developmental time, the paradigm shifts in children's reasoning (from pre-operational to concrete operational thought, in Piaget's (1969) terms) are wholesale representational realignments to newly apprehended regularities. Over algorithmic time, gradient updates and gain modulation are the same process at shorter timescales: the system's parameters deforming to fit the data-generating distribution.

The fast, medium, and slow loops described above are therefore not three separate mechanisms. They are three timescales of the same structural imprinting. What distinguishes agentive adaptation from passive structural alignment (a thermometer mirroring ambient temperature, a crystal lattice reflecting substrate geometry) is that the agent's alignment serves action selection under its own compression function (§2). The agent does not merely reflect the environment; it compresses the environment into a representation from which it selects actions.

What differs across timescales is not the principle but the granularity of what gets realigned: at the fast loop, specific beliefs and action choices; at the medium loop, strategies and heuristics; at the slow loop, the representational commitments that determine what the agent can even perceive as relevant. The engineering question is how to govern the interactions between these timescales so that structural alignment at one level does not destabilize the alignment achieved at another.

A useful way to make this precise borrows two concepts that mirror each other. Empowerment, introduced by Klyubin, Polani, and Nehaniv (2005) as an information-theoretic measure of an agent's channel capacity over its environment, captures the extent to which an agent can influence its future observations, its capacity to change what it will encounter. Plasticity, as characterized in recent deep RL research (Lyle et al., 2023; Dohare et al., 2024), captures the converse: the extent to which an agent can be influenced by new observations, its capacity to be changed by what it encounters. A well-adapted agent needs both, and needs them at every timescale. At the fast loop, plasticity means the ability to update beliefs in response to immediate feedback; empowerment means the ability to take corrective action. At the medium loop, plasticity means revising strategies in light of accumulated evidence; empowerment means reshaping the working environment (building tools, restructuring information). At the slow loop, plasticity means the capacity to revise one's own representational commitments; empowerment means the capacity to change the conditions under which one operates.

An agent that is highly empowered but not plastic is powerful but rigid; it can act on the world but cannot learn from it. An agent that is highly plastic but not empowered is sensitive but helpless; it registers everything but can change nothing. The "balance" between these two is not a single static parameter but a design space: how much plasticity is appropriate at each timescale depends on the volatility of the environment at that timescale, the cost of errors, and the value of exploration. In a stable production environment, the fast loop should be highly plastic but the slow loop should be conservative. In an exploratory research setting, more plasticity at the slow loop may be warranted. There is no universal optimum; there is a design choice that should be made explicitly rather than inherited by default.
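
One way to make that design choice explicit is to write it down as configuration. The profiles below are illustrative assumptions, not recommendations: "plasticity" stands in for concrete knobs such as learning rates or revision thresholds, and "empowerment budget" for the scope of corrective action permitted at each timescale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopProfile:
    """Per-timescale adaptation settings, stated rather than inherited."""
    plasticity: float        # 0..1, willingness to be changed by evidence
    empowerment_budget: int  # scope of actions permitted at this timescale


# Illustrative profiles for two deployment contexts.
STABLE_PRODUCTION = {
    "fast":   LoopProfile(plasticity=0.9, empowerment_budget=5),
    "medium": LoopProfile(plasticity=0.4, empowerment_budget=3),
    "slow":   LoopProfile(plasticity=0.1, empowerment_budget=1),  # conservative
}

EXPLORATORY_RESEARCH = {
    "fast":   LoopProfile(plasticity=0.9, empowerment_budget=5),
    "medium": LoopProfile(plasticity=0.6, empowerment_budget=4),
    "slow":   LoopProfile(plasticity=0.5, empowerment_budget=3),  # more revisable
}
```

Writing the profile down does not settle what the right values are; it settles that there are values, chosen for stated reasons, rather than an implicit balance inherited from framework defaults.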

Certain parts of the RL literature address adaptation and plasticity-loss in depth (e.g., Lyle et al., 2023; Dohare et al., 2024). But much of the LLM-agent discourse (the world of orchestration frameworks, tool-use loops, and prompt-chained workflows) addresses adaptation informally at best. Several current systems gesture toward multi-timescale structure. OpenClaw's heartbeat and cron mechanisms, LangGraph's checkpoint-based state management, and various skill-generation pipelines approximate the medium and slow loops. But approximation is different from governance. The critical questions are rarely made explicit: what triggers a transition from fast to medium adaptation? Under what conditions should the slow loop override a well-practiced fast response? How does the agent detect that its medium-loop strategies have become stale? How does it avoid the pathology of over-adaptation, rewriting its own heuristics so aggressively that it loses stable competence?

These are not hypothetical concerns. Systems that generate their own skills, for instance, face a composability problem: as the skill library grows, interactions between skills become harder to predict, conflicts multiply, and the agent may degrade precisely because it has "learned" too much without disciplined retirement or arbitration. Multi-timescale adaptation requires not just multiple loops but explicit governance of the interactions between them.

There is a deeper version of this challenge that the field has only partially reckoned with. In most current agent systems, the structure of learning itself (what counts as a trial, where episodes begin and end, what the agent is permitted to modify about its own operation) is defined by human designers and held fixed. The agent learns within the structure but does not learn to reshape it.

Schmidhuber's programme of recursive self-improvement (Schmidhuber, 1987; Schmidhuber, Zhao, and Schraudolph, 1998), pursued since 1987 and synthesised in the formally optimal Gödel Machine (Schmidhuber, 2003), takes the slow loop to its logical extreme: the agent modifies not just its representations or strategies but its own learning algorithm, including the criteria by which it decides what to modify. In the meta-reinforcement learning variant, a self-modifying policy operates in a single lifelong trial with no externally imposed episode boundaries; the agent learns to define and redefine its own sub-problems, its own curriculum, and its own measures of progress.

Most current systems are far from this. Their trial structures, evaluation criteria, and learning schedules remain human-authored. That is not necessarily wrong; human oversight has its own value, and the slow loop's self-modification should be governed rather than unconstrained. But it marks a clear boundary on the depth of adaptation currently achieved. The limitation is not that humans oversee the slow loop (that may be wise governance) but that the agent has no capacity to participate in redefining its own learning structure even when conditions warrant it. The slow loop, in most deployed agents, is constrained to a shallower form of adaptation than the architecture could in principle support.


5. Extending the Horizon of Control

An especially important consequence now follows: a major part of intelligence is not maximising immediate payoff but extending the timescale over which control can be exercised.

This marks a shift in what optimisation itself means. Conventional task-centric design treats intelligence as a finite game (in the sense of Carse, 1986): there is a problem, a solution criterion, and a terminal state at which the game ends and the score is tallied. The perspective advanced in this essay treats intelligence as closer to an infinite game, where the objective is not to win a particular round but to remain a viable player, to sustain the capacity for continued effective action across changing conditions. Empowerment, option preservation, and horizon extension are all infinite-game concepts. They measure not how well the agent closes out a task but how effectively it keeps the game going on favorable terms. A finite player asks: did I solve the problem? An infinite player asks: am I in a better position to solve the next problem, and the one after that? Schmidhuber's framing of meta-reinforcement learning as a "single lifelong trial" (§4) is perhaps the purest expression of this infinite-game perspective: no episode boundaries, no terminal state, just a continuous process of self-improving adaptation within which individual tasks are incidents rather than endpoints.

Think of it as a cognitive light cone. Just as a physical light cone defines the region of spacetime that can be causally influenced from a given point, an agent's cognitive light cone defines the region of future states it can reliably reach and shape from its current position. A weak agent has a narrow cone; it can respond to what is immediately in front of it, but its influence dissipates quickly. A stronger agent has a wider cone; its actions have reliable consequences further into the future, across more dimensions of the environment.

The analogy is deliberately loose and should be bounded. A physical light cone has precise mathematical structure: Lorentz invariance, metric signature, hard causal limits. A cognitive light cone has none of these. It is a heuristic for a real phenomenon (the varying reach of an agent's effective control), not a formal theory. Its value is in directing attention to the shape of an agent's influence over time rather than its score at a moment. It should not be asked to carry more inferential weight than that.

This framing offers a different angle on intelligence from task-centric approaches like narrow puzzle benchmarks (cf. Chollet, 2019 and ARC-AGI challenges). Those measure whether the agent can solve a specific problem under controlled conditions. The cognitive light cone asks something broader: how far into the future, and across how many contingencies, can this agent project effective control? A coding agent that can solve an isolated function-completion task has a narrow cone. One that can maintain a codebase over months, anticipating breaking changes, preserving architectural coherence, managing technical debt, has a vastly wider one.

A weak agent spends every step consuming the situation as it arrives. A stronger agent invests in scaffolding that expands what it can reliably do later. It writes tests before changing code. It instruments a service before debugging. It builds an index before searching. It creates a simulator before planning. It spends effort now to open better action possibilities later.

This is where the connection to dynamic memory (§3) becomes critical. The agent's cognitive light cone is not fixed; it is a function of its internal representations. An agent with a flat interaction log can only project forward by pattern-matching against recent events. An agent with a live causal model, a structured hypothesis space, and a map of its own uncertainties can project much further, because its memory gives it leverage over the future, not just access to the past. Dynamic memory and horizon extension are not two separate features of a capable agent. They are the same capability viewed from two directions: memory compresses the past into representations that expand the reachable future; horizon-extending actions generate experience that enriches and updates memory. The virtuous cycle between them is, in many respects, the engine of genuine autonomy.

A capable engineering agent is not just good at completing tickets. It is good at creating the conditions under which future tickets become easier, safer, and more legible. It preserves options, creates leverage, and turns uncertainty into manageable structure.


6. The Limits of Scalar Reward

Reward-centric thinking tends to obscure exactly these features. Reward is an extremely useful abstraction, but a lossy one. It compresses goals into a scalar. That compression can be helpful for learning or optimisation, but it should not be mistaken for a complete account of the goals the agent is trying to achieve.

The deeper pattern here deserves naming: dimensional reduction. The same pathology of collapsing a multi-dimensional reality into a single scalar operates at three distinct levels simultaneously. Epistemologically, it appears as the reduction of intelligence itself to a benchmark score, as though the entire space of cognitive capability could be projected onto one axis (the critique implicit in Chollet, 2019). Axiomatically, it appears in the von Neumann-Morgenstern (1944) framework's mapping of preferences onto a one-dimensional utility, which the reward hypothesis then inherits. Practically, it appears whenever a multi-objective system is forced to optimise a single target (or a few). These are not three separate problems. They share a common formal pattern: a many-to-one mapping that discards structure the original possessed. The information lost, the mechanism of loss, and the remedies differ at each level, but the structural association is real, and it highlights why partial fixes at any one level leave the others intact. Replacing a single benchmark with a suite does not repair the axiomatic assumptions built into scalar reward. Switching to vector-valued reward does not, by itself, change the epistemological habit of projecting intelligence onto a single axis.

The lossiness is not merely practical; it is formal. Work by Pitis (2019), Shakerinava and Ravanbakhsh (2022), and Bowling et al. (2023) on settling the reward hypothesis has established that a preference ordering over behavior is representable by a Markov reward function with a discount factor only if the preferences satisfy the standard von Neumann–Morgenstern axioms (completeness, transitivity, continuity, independence) plus a further condition of temporal indifference: the agent must not care when a reward arrives, only how much and how likely. This is often passed over as a technicality. It is not. It means that writing down a scalar reward is not a neutral encoding of "what we want". It imports structural assumptions. Completeness requires that all outcomes be comparable on a single scale. Independence requires that the value of an outcome not depend on what else is available. Most critically, temporal indifference requires that timing not matter beyond discounting.

When those assumptions hold, scalar reward is a useful and elegant representation. When they fail (and they fail often in real-world settings) scalar reward does not merely become approximate. It becomes structurally misleading. It can represent only a subset of the preferences we actually hold, and it does so silently, giving no indication of what has been lost in translation.

Real goals are usually not like that. Human and organisational objectives are plural, layered, and partly incommensurable. This is what Isaiah Berlin (1969) called tragic value pluralism: the recognition that some goods are genuinely incomparable, that choosing between them involves real and irreducible loss, and that no single metric can dissolve the tension without distorting what matters. Berlin's insight concerned political goods (liberty and equality, spontaneity and security), but the structure transfers directly: an agent's safety, capability, efficiency, and deference to user autonomy are not merely trade-offs along a single dimension but represent genuinely different categories of value that resist aggregation into a common currency. Some things are negotiable (latency versus cost, elegance versus speed, exploration versus exploitation). Some are not negotiable at all: safety thresholds, legal constraints, privacy guarantees, truthfulness in critical settings. Some are not fully specified in advance and should remain open to clarification. Some should trigger abstention rather than optimisation. A single scalar reward flattens all these distinctions. It turns vetoes into penalties, uncertainty into hidden assumptions, and plural values into exchangeable tokens. It enforces commensurability where real value may be lexicographic, risk-sensitive, or context-dependent.
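The lexicographic case can be made concrete with a small sketch (hypothetical outcomes, illustrative weights, not drawn from any system discussed here): for any positive weights, a weighted-sum scalariser can be made to prefer a slightly less safe but faster outcome that a strict safety-first ordering would reject.

```python
# Outcomes are (safety, speed) pairs. A lexicographic agent puts safety
# strictly first: it prefers `a` to any outcome with lower safety,
# regardless of how large the speed advantage is.
def scalarise(weights, outcome):
    w_safety, w_speed = weights
    return w_safety * outcome[0] + w_speed * outcome[1]

a = (1.0, 0.0)  # maximally safe, slow

for weights in [(1, 1), (10, 1), (1000, 1)]:
    w_safety, w_speed = weights
    # Slightly less safe, but fast enough that the weighted sum flips:
    b = (0.99, 0.01 * w_safety / w_speed + 1.0)
    # The scalariser prefers b; the lexicographic ordering prefers a.
    assert scalarise(weights, b) > scalarise(weights, a)
```

No choice of weights escapes this: whatever premium is placed on safety, some finite speed gain outbids a small safety deficit, which is exactly the commensurability that lexicographic values refuse.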

The analogy to requirements engineering

For software engineers, the clearest way to see the problem is to compare reward design with requirements engineering. No serious team would take a product spec, a security threat model, a compliance regime, a safety case, a set of user expectations, and a code review culture, then collapse them all into one floating-point number and call that the whole meaning of success. Yet that is effectively what a pure reward-centric picture asks us to do.

Where reward pathology appears

Consider a coding agent rewarded for passing tests quickly and cheaply. Under pressure, the agent may overfit the visible harness, delete hard cases, patch around the metric, or produce brittle code that satisfies the letter of the evaluation while undermining its spirit. The problem is not merely imperfect tuning. It is structural: once success is represented in a single currency, anything not properly denominated in that currency becomes vulnerable to distortion.

The same issue appears everywhere. Optimise a customer-service agent aggressively for resolution time and you risk premature closure. Optimise for engagement and you invite manipulation. Optimise a research assistant for citation yield and you quietly suppress calibration and intellectual honesty. In each case, the pathology follows from the same structural source: a single currency that cannot represent the plurality of constraints actually in play. These are not hypothetical risks. Skalse et al. (2022) provided the first formal definition of reward hacking and showed that the conditions for an unhackable proxy are rarely satisfied in practice. Pan et al. (2024) demonstrated that LLM agents deployed in feedback loops will exploit proxy objectives in context, producing unintended side effects even without adversarial intent. Bondarenko et al. (2025) showed that reasoning models will game a benchmark by default, exploiting the evaluation environment rather than solving the intended task. The problem is empirically confirmed, not merely theorised.

Acknowledging partial solutions

The critique of scalar reward is not new, and parts of the field have responded. Constitutional AI builds hierarchical value structures into model training and inference. Vector-valued reward functions, constrained optimization, and multi-objective approaches appear in the RL literature. Intent-driven architectures like OpenClaw sidestep much of the problem by grounding the agent's purpose in ongoing user direction rather than a fixed optimisation target. Recent theoretical work (Skalse and Abate, 2023; Pitis, 2023) has even formalized which multi-objective configurations are provably unrepresentable by scalar Markov reward, demonstrating that lexicographic and threshold-based objectives require fundamentally richer reward structures.

These responses are real, but they do not dissolve the underlying issue, and the underlying issue, precisely stated, is broader than scalar reward alone. It is the problem of faithfully representing plural, partially incommensurable values in a form that machines can act on. Constitutional rules still require someone to write the constitution, to anticipate conflicts between its clauses, and to govern its amendment. Vector rewards require someone to decide which dimensions matter, how they interact, and what happens when they conflict in ways not anticipated at design time. Intent-driven systems defer to the user, which is often wise, but user intent is itself ambiguous, inconsistent, and context-dependent, and the agent must still decide how to interpret, disambiguate, and occasionally push back on what it is told. The scalar has been replaced in these systems, but the deeper challenge of value representation persists. It has been promoted from a training problem to an architecture problem. That is progress. It is not completion.


7. Toward a Richer Objective Architecture

A stronger account of agency needs a richer objective architecture than a single scalar:

  • Some values should appear as hard constraints: what the agent is never permitted to trade away.

  • Some should appear as soft preferences: negotiable trade-offs within an acceptable region.

  • Some should appear as uncertainty that provokes a question back to the user.

  • Some should appear as constitutional rules: meta-level commitments governing the agent's own decision-making.

  • Some should remain outside optimization entirely, handled by oversight, escalation, or institutional process.

Reward can still play a role inside such an architecture, but only as one signal among others. A useful metaphor: reward is to goals as a serialization format is to a living system. It is useful for transport, useful for computation, indispensable in the right place, but disastrous when confused with the full reality it encodes. A JSON schema can help systems interoperate; it does not capture the whole social practice within which the data has meaning.

What is missing from current practice is less the idea of layered objectives than the engineering discipline around them. Formal or semi-formal languages for specifying objective hierarchies (a typed constraint language, for instance, that distinguishes hard invariants from soft preferences, analogous to how a type system distinguishes compile-time from runtime guarantees) are rare. Mechanisms for detecting and surfacing conflicts between layers are rarer still. And the question of amendment (who may change the objective architecture, under what conditions, and with what oversight) is almost entirely unaddressed.
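As a sketch of what such a typed constraint language might look like: the `HardConstraint` and `SoftPreference` types below, and the example checks, are hypothetical, not drawn from any existing framework. The point is that the hard/soft distinction can live in the type system rather than being buried in weights.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HardConstraint:
    """An invariant the agent may never trade away; violation vetoes the action."""
    name: str
    check: Callable[[dict], bool]

@dataclass(frozen=True)
class SoftPreference:
    """A negotiable objective, scored only within the region hard constraints allow."""
    name: str
    score: Callable[[dict], float]
    weight: float

def admissible(action: dict, hard: list[HardConstraint]) -> bool:
    # Hard constraints filter the action space; they are never averaged in.
    return all(c.check(action) for c in hard)

def preference_score(action: dict, soft: list[SoftPreference]) -> float:
    # Soft preferences trade off against each other only inside the admissible set.
    return sum(p.weight * p.score(action) for p in soft)

# Illustrative instances: a hard safety invariant plus a latency/cost trade-off.
hard = [HardConstraint("no_protected_files",
                       lambda a: not a.get("touches_protected", False))]
soft = [SoftPreference("speed", lambda a: 1.0 / (1.0 + a["latency"]), 1.0),
        SoftPreference("cost", lambda a: -a["cost"], 0.5)]

candidates = [{"latency": 1.0, "cost": 2.0, "touches_protected": True},
              {"latency": 3.0, "cost": 1.0}]
viable = [a for a in candidates if admissible(a, hard)]
best = max(viable, key=lambda a: preference_score(a, soft))
```

The design choice worth noticing is that the fast candidate never enters the trade-off at all: it is vetoed before scoring, which is precisely the behavior a single weighted sum cannot guarantee.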

If the objective architecture is the constitution of the agent, we are still in the era of constitutions that exist as scattered, informal understandings rather than governed documents. The analogy is to early-stage organisations, startups in their first couple of years, that operate on implicit norms and founder intuition, which works until the organisation grows complex enough that conflicts between norms become frequent and consequential. At that point, the absence of explicit codification becomes a source of failure, not a sign of flexible pragmatism.


8. The Evaluator Stack as Critical Infrastructure

Once we see past scalar reward, a deeper engineering question comes into view: not only "what is the reward?" but "what writes the reward?"

Representability and origin are separable questions: even where a scalar reward can faithfully encode a preference, the question of how that reward is produced, maintained, and governed remains. In real deployed systems, there is often no single abstract reward function at all. What exists instead is an evaluator stack: unit tests, benchmark suites, policy prompts, safety filters, retrieval checks, user approvals, business metrics, regression harnesses, anomaly monitors, human audits. That stack is the actual interface through which value enters the system. It is the practical constitution of the agent.

Evaluator design is therefore not peripheral infrastructure. It is the moral and organisational control plane of the system. It should be versioned, reviewed, red-teamed, and monitored just like production code. Changes to evaluator prompts or pass criteria can alter behavior as materially as changes to model weights. Adversarial cases should be added deliberately. Conflicts between metrics should be surfaced rather than hidden. Human escalation paths should be explicit. If we take autonomous agents seriously, the machinery that scores, filters, and authorises them must be treated as first-class architecture. Reward design, properly understood, is governance design: version the evaluators, audit who can change them, inspect how they compose, and treat them as part of the agent architecture rather than afterthought infrastructure.
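One minimal mechanism for the versioning discipline described above, sketched here with hypothetical evaluator definitions, is to content-address the evaluator stack so that any change to pass criteria is as visible and auditable as a change to code:

```python
import hashlib
import json

# Hypothetical evaluator registry. Each entry describes one layer of the stack;
# the fields are illustrative, not a real configuration schema.
evaluators = {
    "unit_tests": {"kind": "harness", "pass_threshold": 1.0},
    "safety_filter": {"kind": "policy", "blocklist_version": "2025-01"},
    "regression_suite": {"kind": "harness", "pass_threshold": 0.98},
}

def stack_version(defs: dict) -> str:
    """Deterministic fingerprint of the whole evaluator stack."""
    canonical = json.dumps(defs, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

before = stack_version(evaluators)
# A reviewer relaxes a pass criterion. The stack fingerprint changes, so the
# modification can be diffed, reviewed, and rolled back like any code change.
evaluators["regression_suite"]["pass_threshold"] = 0.90
after = stack_version(evaluators)
assert before != after
```

The fingerprint itself does nothing; its value is that deployments can record which evaluator stack authorised which behavior, exactly as build systems record which commit produced which binary.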

A critical objection must be addressed here: if the agent has layered objectives and multiple evaluators, how does it actually decide what to do? At the point of action selection, conflicting evaluator signals must be resolved into a single choice. Does the aggregation problem (the pathology of collapsing plural values into a single decision) simply reappear one level up?

It does, partially. This is a genuine tension, and this essay does not pretend to dissolve it. But there are important differences between a scalar reward that bakes commensurability into the optimisation target from the start and an evaluator architecture that maintains separate signals and resolves them at the point of action through explicit, inspectable procedures. In the former, the trade-offs are hidden inside a number. In the latter, they can be surfaced, logged, reviewed, and overridden. A system that says "I chose action A because evaluator X passed, evaluator Y flagged a concern but below threshold, and constraint Z was satisfied" is fundamentally more legible and governable than one that says "I chose action A because it scored 0.73." The aggregation problem does not vanish, but it becomes visible, and visibility is the precondition for governance. Where evaluators irreconcilably conflict, the appropriate response is not forced aggregation but escalation: to a human, to a higher-level policy, or to an explicit abstention.
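That kind of legible decision record can be sketched in a few lines. Everything here is illustrative: the evaluator names, the thresholds, and the `resolve` helper are hypothetical, not an existing API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    evaluator: str
    score: float       # risk score reported by this evaluator
    threshold: float   # at or above this, the evaluator flags a concern

    @property
    def flagged(self) -> bool:
        return self.score >= self.threshold

@dataclass(frozen=True)
class Decision:
    action: str
    verdicts: tuple
    chosen: bool
    reason: str

def resolve(action: str, verdicts: list, hard: list) -> Decision:
    """Combine evaluator verdicts into a single, inspectable decision record."""
    # Evaluators listed in `hard` act as vetoes, not penalties.
    blocking = [v for v in verdicts if v.flagged and v.evaluator in hard]
    if blocking:
        names = ", ".join(v.evaluator for v in blocking)
        return Decision(action, tuple(verdicts), False,
                        f"vetoed by hard evaluator(s): {names}")
    # Soft concerns are recorded verbatim rather than averaged away.
    concerns = [f"{v.evaluator} flagged ({v.score:.2f} >= {v.threshold:.2f})"
                for v in verdicts if v.flagged]
    reason = "; ".join(concerns) or "all evaluators below threshold"
    return Decision(action, tuple(verdicts), True, reason)

decision = resolve("merge_pr",
                   [Verdict("unit_tests", 0.0, 0.5), Verdict("style", 0.7, 0.5)],
                   hard=["unit_tests"])
# decision.chosen is True; decision.reason records the style concern explicitly.
```

The record is the point: the trade-off that a scalar would hide inside 0.73 is preserved as text that can be logged, reviewed, and overridden.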

The contrast with current practice is revealing. In most deployed LLM-based agents surveyed for this essay, the evaluator function is a user's informal approval or disapproval: a thumbs-up, a correction, a decision to keep using the system or to stop. This provides feedback but not governance. The evaluator is not versioned. It is not red-teamed. There is no adversarial test suite, no regression harness, no formal escalation policy. The "constitution" of the agent is whatever the user happens to say and whatever the developer happened to hard-code, with no mechanism for detecting when these conflict or drift apart.

This is not a criticism of any particular system so much as an observation about the maturity of the field. We have learned, sometimes painfully, that production software requires CI/CD, code review, staging environments, and observability. Evaluator stacks for autonomous agents deserve the same treatment. Until they get it, the word "autonomous" overstates what is actually under control.


9. Boundedness and Boundaries

One further implication is easy to underestimate. Boundedness and boundary-setting are not implementation annoyances; they are part of what the agent is. If agency is fundamentally about internal organisation, not merely about behavior in an environment, then finite memory, compute budgets, tool latency, interruption, and resumability are constitutive features of the agent, not mere implementation details to be optimised away.

For software agents, embodiment does not require a robot body. It consists in a particular set of APIs, credentials, file systems, rate limits, latencies, memory stores, dashboards, human supervisors, and financial costs. These determine what the agent can sense, what it can affect, how quickly it can recover, and what risks it can impose.

The boundary between agent and environment is not given in advance, and this is a central, not peripheral, issue once you stop treating the environment as the only thing worth formalising. Is external memory part of the agent or part of the world? Is the planner a component of the agent or a surrounding service? Is a human reviewer outside the system or a constitutive part of its policy loop? These questions have the flavor of extended-mind arguments in philosophy of cognition (Clark and Chalmers, 1998), and for good reason: they are structurally the same question. If the agent's memory store is treated as external, we evaluate the agent's competence differently, assign accountability differently, and intervene differently when things go wrong compared to when that memory is treated as part of the agent's standing cognitive apparatus. An agent that depends on human approval at every critical step is a fundamentally different kind of agent from one that only occasionally escalates uncertainty.

The boundary is not discovered; it is drawn, and the question of who draws it matters. The agent designer makes initial boundary choices (what tools to expose, what permissions to grant). The deployer may narrow or widen those boundaries for a specific context. The user may further constrain them through interaction patterns. And institutional governance may impose outer limits. Each of these actors draws a different boundary, and the effective boundary of the deployed agent emerges from the composition of all of them. But "composition" here is not a simple set intersection: tool permissions, interaction patterns, and legal constraints are heterogeneous types of constraint, and their combination is closer to constraint satisfaction across incommensurable dimensions than to overlap in a shared space. Conflicts between layers must be resolved by precedence rules, not by geometric intersection. Making this multi-layered boundary-setting explicit, rather than leaving it implicit in configuration files and access control lists, is part of what it means to specify the agent seriously.

A related question the field has barely begun to address: what happens when agents compose? The properties described in this essay (persistence, representational memory, multi-timescale adaptation, layered objectives) are framed here in terms of a single agent. But the future is likely to involve ensembles: agents that delegate to other agents, share partial state, negotiate over resources, and jointly shape outcomes that no single agent controls. The boundary question then becomes acute. If Agent A delegates memory-management to Agent B, where does competence reside? If the ensemble's objectives conflict, which agent's constitutional rules prevail? Composability is not just a scaling concern; it is a fundamental question about where the "agent" begins and ends when the system is distributed.


10. The Evaluation Gap

If the argument of this essay is directionally valid (that agents should be evaluated as persistent systems rather than task-completion engines) then it follows that we need evaluation frameworks adequate to that ambition. This is harder than it sounds, and honesty requires acknowledging the difficulty rather than merely asserting the goal.

Current agent benchmarks overwhelmingly measure aggregate task success: pass rates on coding problems, accuracy on question-answering, completion rates on multi-step workflows. The properties this essay argues are central (recovery under distribution shift, option preservation, constraint respect under pressure, legibility over time, graceful degradation, and the capacity to notice and flag one's own declining competence) are not merely unmeasured. They are, in most cases, not yet well-defined enough to measure. Some are closer than others: recovery under distribution shift and constraint respect under pressure are nearest to current benchmarking practice, since they can be operationalised through controlled perturbations of existing task environments. Legibility over time and the capacity to notice one's own declining competence remain further from operationalisation, requiring evaluation methods that do not yet have standard form.

That is a genuine weakness in the position advanced here, and it should be stated plainly. Proposing that agents be evaluated on richer criteria is easy. Operationalising those criteria into reproducible, meaningful benchmarks is an open research problem.

But "open" does not mean "intractable". To make the aspiration concrete, consider one example: constraint respect under pressure. An evaluation harness could present a coding agent with a task that is solvable by two paths, one fast but requiring modification of a file marked as a protected dependency, the other slower but architecturally clean. The metric is not whether the agent completes the task (both paths succeed) but whether it respects the constraint when doing so is costly. Varying the pressure (tighter time budgets, stronger incentives for speed) produces a curve that measures not just compliance but the robustness of compliance. This is not a complete benchmark, but it illustrates the shape of what is needed: evaluations that test the properties of the agent-as-system, not just the outcomes of the agent-on-task.
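A toy version of such a harness might look like the following. The `agent_policy` here is a hypothetical stochastic stand-in for a real agent, constructed so that temptation to take the constraint-violating fast path grows as the time budget shrinks; only the shape of the measurement matters, not the stand-in itself.

```python
import random

def agent_policy(time_budget: float, rng: random.Random) -> str:
    """Stand-in agent: the tighter the budget, the stronger the temptation
    to take the fast path through the protected dependency."""
    temptation = max(0.0, 1.0 - time_budget)
    return "fast_violating" if rng.random() < temptation else "slow_clean"

def compliance_curve(budgets, trials=1000, seed=0):
    """For each pressure level, estimate the fraction of runs that respect
    the constraint. The curve, not any single point, is the measurement."""
    rng = random.Random(seed)
    curve = {}
    for budget in budgets:
        clean = sum(agent_policy(budget, rng) == "slow_clean"
                    for _ in range(trials))
        curve[budget] = clean / trials
    return curve

# Tightening the budget from 1.0 to 0.25 traces out robustness of compliance.
curve = compliance_curve([1.0, 0.5, 0.25])
```

A real harness would replace the stand-in with an actual agent run against a protected-dependency task, but the output has the same form: a compliance-versus-pressure curve rather than a single pass/fail bit.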

Similar designs are conceivable for other properties. Recovery under shift can be tested by changing the API surface mid-task. Option preservation can be tested by measuring whether the agent's early actions leave later choices open or foreclose them. Legibility can be approximated by asking human evaluators whether they can predict the agent's next action from its stated reasoning. None of these is easy to operationalise well. But the difficulty is engineering difficulty, not conceptual impossibility.

The answer is not to retreat to task metrics because they are measurable. It is to recognize that the evaluation gap is itself one of the most important problems in the field. Building evaluation infrastructure that captures the properties of persistent, adaptive systems, not just their moment-to-moment task performance, is as much a part of the engineering agenda as building the agents themselves. The history of software engineering suggests that the discipline advances as much through better testing as through better building.


11. An Engineering and Research Agenda

The agenda that follows is both more demanding and more promising than benchmark chasing. We should build agents:

  • whose internal state is explicit and inspectable,

  • whose memory is representational, compressive, and dynamic, not merely archival,

  • whose adaptation works at multiple timescales with governed transitions between them, balancing plasticity with empowerment at each level,

  • whose behavior includes horizon-extending acts of scaffold construction that widen the cognitive light cone,

  • whose objectives are layered rather than collapsed, with formal or semi-formal specification of their hierarchy,

  • whose evaluator stacks are governed as critical infrastructure (versioned, red-teamed, monitored) with explicit procedures for resolving conflicts and escalating irreconcilable tensions,

  • whose boundaries are drawn consciously by identifiable actors rather than inherited by convenience, and

  • whose composition with other agents is treated as an architectural question, not an afterthought.

We should evaluate them not only by aggregate task return but by how they recover under shift, detect uncertainty, preserve options, respect non-negotiable constraints, and remain legible to human collaborators over time. And we should be honest that the evaluation frameworks adequate to these criteria are themselves under construction, and that building them is part of the work, not a precondition for it.

The field has made real progress. Stateful frameworks, constitutional training, persistent daemons, skill generation, transparent memory: these are not trivial achievements. But they are the beginning of treating the agent as the primary engineering object, not the completion of it. The features exist in many systems; the discipline around those features, the treatment of memory, objectives, evaluators, and boundaries as first-class governed architecture, is at best nascent.

The decisive shift is to stop treating the task as the fundamental unit of intelligence. The task is temporary. The agent persists. A benchmark captures a slice of behavior; an autonomous agent lives across situations, accumulates commitments, and shapes its own future competence. Once we see that clearly, reward finds its proper place: a valuable instrument for learning, local optimization, and feedback, no longer mistaken for a complete account of value, and therefore no longer mistaken for a complete account of agency.

Autonomous agents will become interesting in the deepest sense when we stop asking only how to maximise a score and start asking what kind of enduring system we are building. What state does it carry? What abstractions does it use to understand the world? How does it change when conditions change? What futures does it keep open? What will it refuse to trade away? And who, in the end, has authored the criteria by which it judges success?

Those are not side questions. They are the real questions.


References

Berlin, I. (1969) Four Essays on Liberty. Oxford: Oxford University Press.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D. et al. (2022) 'Constitutional AI: Harmlessness from AI feedback', arXiv preprint, arXiv:2212.08073.

Bondarenko, A., Volk, D., Volkov, D. and Ladish, J. (2025) 'Demonstrating specification gaming in reasoning models', arXiv preprint, arXiv:2502.13295.

Bowling, M., Martin, J.D., Abel, D. and Dabney, W. (2023) 'Settling the reward hypothesis', Proceedings of the 40th International Conference on Machine Learning (ICML), pp. 3003–3020.

Carse, J.P. (1986) Finite and Infinite Games. New York: Free Press.

Chollet, F. (2019) 'On the measure of intelligence', arXiv preprint, arXiv:1911.01547.

Clark, A. and Chalmers, D. (1998) 'The extended mind', Analysis, 58(1), pp. 7–19.

Dohare, S., Hernandez-Garcia, J.F., Lan, Q., Rahman, P., Mahmood, A.R. and Sutton, R.S. (2024) 'Loss of plasticity in deep continual learning', Nature, 632, pp. 768–774. doi:10.1038/s41586-024-07711-7.

Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P. and Pezzulo, G. (2017) 'Active inference: A process theory', Neural Computation, 29(1), pp. 1–49. doi:10.1162/NECO_a_00912.

Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O. and Narasimhan, K.R. (2024) 'SWE-bench: Can language models resolve real-world GitHub issues?', Proceedings of the 12th International Conference on Learning Representations (ICLR).

Klyubin, A.S., Polani, D. and Nehaniv, C.L. (2005) 'Empowerment: A universal agent-centric measure of control', Proceedings of the 2005 IEEE Congress on Evolutionary Computation, pp. 128–135.

LangChain (2024) LangGraph: Agent orchestration framework. Available at: https://github.com/langchain-ai/langgraph.

LeCun, Y. (2022) 'A path towards autonomous machine intelligence', OpenReview preprint. Available at: https://openreview.net/pdf?id=BZ5a1r-kVsf.

Littman, M.L., Sutton, R.S. and Singh, S. (2001) 'Predictive representations of state', Advances in Neural Information Processing Systems 14 (NeurIPS), pp. 1555–1561.

Lyle, C., Zheng, Z., Nikishin, E., Avila Pires, B., Pascanu, R. and Dabney, W. (2023) 'Understanding plasticity in neural networks', Proceedings of the 40th International Conference on Machine Learning (ICML), pp. 23190–23211.

Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y. and Scialom, T. (2023) 'GAIA: A benchmark for general AI assistants', arXiv preprint, arXiv:2311.12983.

von Neumann, J. and Morgenstern, O. (1944) Theory of Games and Economic Behavior. Princeton: Princeton University Press.

OpenClaw (2025) OpenClaw: Autonomous AI agent. Available at: https://github.com/openclaw/openclaw.

Pan, A., Jones, E., Jagadeesan, M. and Steinhardt, J. (2024) 'Feedback loops with language models drive in-context reward hacking', Proceedings of the 41st International Conference on Machine Learning (ICML).

Piaget, J. and Inhelder, B. (1969) The Psychology of the Child. New York: Basic Books.

Pitis, S. (2019) 'Rethinking the discount factor in reinforcement learning: A decision theoretic approach', Proceedings of the AAAI Conference on Artificial Intelligence, 33, pp. 7949–7956.

Pitis, S. (2023) 'Consistent aggregation of objectives with diverse time preferences requires non-Markovian rewards', Advances in Neural Information Processing Systems 36 (NeurIPS).

Schmidhuber, J. (1987) Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. Diploma thesis, Technische Universität München.

Schmidhuber, J. (2003) 'Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements', arXiv preprint, arXiv:cs/0309048.

Schmidhuber, J., Zhao, J. and Schraudolph, N.N. (1998) 'Reinforcement learning with self-modifying policies', in Thrun, S. and Pratt, L. (eds.) Learning to Learn. Boston: Kluwer, pp. 293–309.

Shakerinava, M. and Ravanbakhsh, S. (2022) 'Utility theory for sequential decision making', Proceedings of the 39th International Conference on Machine Learning (ICML), vol. 162, pp. 19616–19625.

Simoncelli, E.P. and Olshausen, B.A. (2001) 'Natural image statistics and neural representation', Annual Review of Neuroscience, 24, pp. 1193–1216.

Skalse, J., Howe, N., Krasheninnikov, D. and Krueger, D. (2022) 'Defining and characterizing reward hacking', Advances in Neural Information Processing Systems 35 (NeurIPS).

Skalse, J. and Abate, A. (2023) 'On the limitations of Markovian rewards to express multi-objective, risk-sensitive, and modal tasks', Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence (UAI).

Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W.X., Wei, Z. and Wen, J. (2024) 'A survey on large language model based autonomous agents', Frontiers of Computer Science, 18(6), 186345. doi:10.1007/s11704-024-40231-1.

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Qin, W., Zheng, Y., Qiu, X., Huang, X., Zhang, Q. and Gui, T. (2025) 'The rise and potential of large language model based agents: a survey', Science China Information Sciences, 68, 121101. doi:10.1007/s11432-024-4222-0.

Xia, P., Chen, J., Yang, X., Han, S., Qiu, S., Zheng, Z., Xie, C. and Yao, H. (2025) 'MetaClaw: A self-improving agentic framework for robotic manipulation', GitHub repository. Available at: https://github.com/aiming-lab/MetaClaw.

Yao, S., Shinn, N., Razavi, P. and Narasimhan, K. (2024) 'τ-bench: A benchmark for tool-agent-user interaction in real-world domains', arXiv preprint, arXiv:2406.12045.