A Technical Analysis for AI/ML Practitioners

Something interesting happened on the path to AGI: we discovered that the most capable AI systems are not pure end-to-end neural networks—and that we can accomplish remarkable things with the architectures we already have.

Consider two data points.

First: Claude Sonnet 4.5 can now maintain focus on complex, multi-step coding tasks for over 30 hours, achieving 77.2% on SWE-bench Verified—solving real GitHub issues that would take human engineers days. Coding agents are shipping production features, passing code review, and operating autonomously across massive codebases. Claude Cowork, launched in January 2026, was itself largely built by Claude Code—almost all of its core code autonomously generated in just 1.5 weeks.

Second: On ARC-AGI-2, the latest benchmark designed to test genuine abstract reasoning, the best reasoning models score between 4% and 54%, depending on compute budget. Claude Opus 4.5 with extended thinking reaches 37.6%. The o3 model that achieved 75.7% on ARC-AGI-1 with massive compute? It scores around 4% on ARC-AGI-2. Humans average 60%, with 100% of tasks solvable by at least two people.

These aren't contradictory findings. They're the same finding, viewed from different angles.

The systems that work, as in really work in production for hours on end, are succeeding because they've been carefully engineered to avoid relying solely on LLM reasoning. Code execution. External verification. Symbolic critics. Test suites. Type checkers. The neural network proposes; something else verifies, accepting or rejecting.

This is the central insight reshaping AI engineering in 2026: reasoning is a system-level property, not a model-level one.


Part I: What's Working

The Single-Threaded Master Loop

Let's start with what's actually working in production.

Claude Code implements what Anthropic internally calls a "single-threaded master loop" (codenamed "nO"), deliberately rejecting the complex multi-agent swarm architectures popular in the field. The design philosophy prioritizes simplicity and debuggability over architectural sophistication:


while(tool_call) -> execute tool -> feed results -> repeat

The loop continues as long as the model's response includes tool usage; when Claude produces a plain text response without tool calls, the loop naturally terminates.
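
A minimal Python sketch of that loop, assuming hypothetical call_model and run_tool helpers; this shows the shape of the pattern, not Anthropic's implementation:

def master_loop(task: str, call_model, run_tool) -> str:
    """Single-threaded agent loop: the model proposes, tools execute, results feed back."""
    messages = [{"role": "user", "content": task}]
    while True:
        response = call_model(messages)            # neural step: propose text and/or tool calls
        messages.append({"role": "assistant", "content": response["content"]})
        if not response.get("tool_calls"):
            return response["content"]             # plain text, no tool calls: the loop terminates
        for call in response["tool_calls"]:
            result = run_tool(call["name"], call["args"])   # external step: deterministic execution
            messages.append({"role": "tool", "content": result})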

This architecture creates a clean separation between the neural component (Claude's reasoning and generation) and the symbolic/external verification components (tools that execute and return deterministic results). The neural system proposes; the external systems verify.

The operational pattern is: think → act → observe → correct.

Consider debugging: Claude reads an error message (perceive), hypothesizes what's wrong and decides to check a specific file (reason), opens the file and examines the code (act), sees that the function signature doesn't match the call (observe). Then it loops—with this new information, it reasons about a fix, makes an edit, runs the test again, and observes whether the error is resolved.

The critical insight: the agent does not try to solve the entire problem in one step. It takes a small action, sees what happens via external verification, and adjusts. Each loop adds information. According to Anthropic's engineering team, this verification loop improves the quality of the final result by 2-3x, perhaps the single most important factor in Claude Code's success.

The Lean Theorem Proving Revolution

The clearest demonstration of this architecture comes from formal mathematics, where the results are now impossible to dismiss.

AlphaProof, published in Nature in November 2025, achieved silver medal performance at the 2024 International Mathematical Olympiad by generating formal proofs in Lean. It solved three of six problems, including P6—the hardest problem of the contest, solved by only five of 609 human participants. Crucially, these weren't just answers. They were automatically verifiable proofs. Every logical step checked by Lean's trusted kernel.
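
To make "kernel-checked" concrete, here is a toy Lean 4 example (nothing like an IMO problem, but the guarantee is the same): the kernel either accepts the proof term or rejects it.

-- A trivial statement and its proof; Lean's kernel checks every step.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An unjustified claim does not type-check; the kernel rejects it rather than trusting it:
-- theorem bogus (a b : Nat) : a + b = a * b := Nat.add_comm a b   -- error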

The architecture is explicitly hybrid: a neural language model fine-tuned on Lean generates proof candidates. The Lean proof assistant verifies or rejects them. Reinforcement learning trains the generator based on verification feedback. The loop continues, with the model learning to generate proofs that survive the verifier's scrutiny.

Harmonic AI's Aristotle takes this further, achieving gold-medal performance on IMO 2025 problems—with formal proofs. Not natural language solutions that look right, but Lean-verified proofs that are right by construction. The company raised $100 million in 2025 specifically to build "hallucination-free" AI using Lean as its backbone.

What makes this work isn't that the neural networks have learned to reason. It's that the architecture routes around the need for the neural network to reason correctly. The LLM's job is to propose candidate proofs—to use its pattern-matching capabilities to suggest plausible next steps. Lean's job is to verify. The system reasons correctly because verification is externalized to a component designed for exactness.

Code Agents at Scale

The same architectural insight drives the most capable coding agents in production.

When Anthropic reports that Claude Sonnet 4.5 can work for 30+ hours on complex tasks, they're not describing a model that's learned to maintain long-term coherence through some emergent property. They're describing an architecture: the Claude Agent SDK with checkpoints, progress files, git commits as save points, browser automation for end-to-end testing, and explicit prompts for incremental progress.

The engineering post on effective harnesses is revealing:

"Out of the box, even a frontier coding model like Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows will fall short of building a production-quality web app if it's only given a high-level prompt."

The solution? An initializer agent that sets up environment scaffolding. Progress files that persist across context windows. Git history as a form of memory. Browser automation tools for verification. In other words: external systems that hold the continuity the model can't.
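
A sketch of what such external continuity can look like, assuming a git repository and a hypothetical progress-file convention:

import subprocess
from pathlib import Path

PROGRESS = Path("PROGRESS.md")   # hypothetical progress file read back at the start of each session

def checkpoint(note: str) -> None:
    """Persist a progress note outside the model's context and commit it as a save point."""
    with PROGRESS.open("a") as f:
        f.write(f"- {note}\n")
    subprocess.run(["git", "add", str(PROGRESS)], check=True)
    subprocess.run(["git", "commit", "-m", f"checkpoint: {note}"], check=True)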

ARC Prize's 2025 analysis identifies the same pattern: the top approaches all use refinement loops—evolutionary search harnesses that generate candidates, verify them against training examples, and iterate. The central theme driving progress, in their words, is "iteratively transforming one program into another, where the objective is to incrementally optimize a program towards a goal based on a feedback signal."
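
A compact sketch of a refinement loop of this kind, with hypothetical run and propose callables standing in for the program executor and the neural generator:

from typing import Callable, List, Tuple

def refine(
    program: str,
    examples: List[Tuple[object, object]],       # (input, expected output) training pairs
    run: Callable[[str, object], object],        # executes a candidate program on one input
    propose: Callable[[str, int], List[str]],    # neural generator: variants given current score
    max_rounds: int = 10,
) -> str:
    """Evolutionary refinement: keep whichever candidate passes the most training examples."""
    def score(p: str) -> int:
        return sum(1 for x, y in examples if run(p, x) == y)
    best = program
    for _ in range(max_rounds):
        best = max(propose(best, score(best)) + [best], key=score)
        if score(best) == len(examples):         # feedback signal saturates: stop early
            break
    return best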

This isn't reasoning as traditionally conceived. It's search guided by neural intuition and constrained by symbolic verification.


Part II: The Standard Explanation

Why Self-Verification Fails

In 1985, Peter Naur wrote a paper called "Programming as Theory Building" that's become quietly influential in software engineering circles. His argument: the real output of programming isn't code—it's a theory about how the problem domain maps to the solution. Code is just a lossy representation of that theory.

LLMs are extraordinary at generating plausible code. But plausible isn't the same as correct. And code isn't the same as theory. An LLM generating a solution doesn't understand the problem-domain mapping any more than a photocopier understands the document it's copying. It produces artifacts that look like the output of theory building without having built any theory.

This is why self-verification fails. When Kambhampati's team at Arizona State tested whether LLMs could verify their own plans, they found performance no better than random chance. The models can't distinguish good solutions from bad ones because distinguishing requires exactly the theory they don't have. Having an LLM critique its own output is like asking the photocopier to fact-check the document.

The LLM-Modulo Framework

Kambhampati's "LLM-Modulo" framework, presented at ICML 2024, provides the conceptual vocabulary:

"Auto-regressive LLMs cannot, by themselves, do planning or self-verification (which is after all a form of reasoning)… LLMs should be viewed as universal approximate knowledge sources that have much more meaningful roles to play in planning/reasoning tasks beyond simple front-end/back-end format translators."

The framework pairs LLMs with external model-based verifiers in a "generate-test-critique" loop. The LLM generates candidate plans. External critics—symbolic planners, constraint checkers, domain validators—evaluate them against hard constraints. Feedback flows back to the LLM for refinement.
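
In code, the shape of that loop is roughly the following; generate and the critics are hypothetical callables, not Kambhampati's implementation:

from typing import Callable, List, Optional, Tuple

Critic = Callable[[str], Tuple[bool, str]]   # returns (constraint satisfied?, critique text)

def llm_modulo(task: str, generate: Callable[[str, List[str]], str],
               critics: List[Critic], max_rounds: int = 15) -> Optional[str]:
    """Generate-test-critique: a plan is accepted only when every external critic is satisfied."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        plan = generate(task, feedback)                    # LLM proposes a candidate plan
        failures = [msg for ok, msg in (c(plan) for c in critics) if not ok]
        if not failures:
            return plan                                    # all hard constraints hold
        feedback = failures                                # route critiques back to the generator
    return None                                            # no verified plan within budget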

On travel planning benchmarks, this framework achieved 6x better performance than baseline LLM approaches. On blocks world planning, performance jumped to 82% within 15 feedback rounds with a model-based verifier. The LLM's autonomous planning success rate? About 12%.

The key point: the framework doesn't try to make LLMs into something they're not. It positions them where they're strong—as "approximate knowledge sources" trained on humanity's collective output—while routing verification to systems designed for correctness.

The Manifold Problem

None of this is surprising if you understand how these models are trained.

François Chollet's original critique remains precise: gradient descent works beautifully when the problem space is smooth. Vision, speech, perception—these domains align well with the manifold hypothesis. Small changes in input produce small, learnable changes in output.

But abstract reasoning lives in discrete spaces. These landscapes are riddled with cliffs, dead ends, and sparse rewards. A single wrong inference doesn't give you a gentle gradient—it just fails. There's no partial credit for almost-correct logic.

This explains the ARC-AGI-2 results. The benchmark is specifically designed to resist memorization. Each puzzle requires recognizing a pattern from a few examples and applying it to a new situation. The architectures haven't changed. The test got harder—harder in ways that expose what was always true about these models.

Verification as Theory Building

Here's the standard explanation for why hybrid architectures work: external verification systems supply the theory that generators lack.

When AlphaProof solves an IMO problem, it doesn't just produce an answer—it produces a formal proof in Lean that can be mechanically verified. The Lean proof assistant doesn't care whether the proof was generated by a neural network, a human, or a random search. It checks each logical step against type-theoretic axioms, and either the proof holds or it doesn't.

This is verification as theory building. The verifier enforces the problem-domain constraints that the generator lacks. The formal proof is a kind of theory—a rigorous mapping between premises and conclusions that can be inspected, validated, and trusted.

The pattern generalizes:

  • A test suite that passes is a kind of theory about what the code should do

  • A type system that accepts a program is a kind of theory about its structure

  • A symbolic planner that validates a sequence of actions is a kind of theory about how the world changes

Each of these systems can hold knowledge that the LLM itself cannot.
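
A trivial, hypothetical illustration of the first two bullets: the type signature and the test each encode a piece of domain theory (prices are integer cents; a discount may never raise a price) that any generated implementation must satisfy.

def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price in whole cents, rounding down."""
    return price_cents * (100 - percent) // 100

def test_discount_never_raises_price() -> None:
    # The 'theory' being enforced: discounting can never make an item more expensive.
    assert all(
        apply_discount(price, pct) <= price
        for price in (0, 1, 999, 10_000)
        for pct in (0, 5, 50, 100)
    )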


Part III: The Deeper Question

But AI's Choices Are Often Good

Here's where we need to be honest: the standard explanation is incomplete.

Claude Code's architectural choices are often good. Sometimes better than what a typical human engineer would produce. This isn't surprising: the model has absorbed the distilled wisdom of millions of engineering hours. It's seen every variation of "you asked for X but actually needed Y." It knows patterns, conventions, and tradeoffs at a scale no individual human can match.

So if verification supplies the "theory" that generators lack, and AI already has access to the collective knowledge of engineering, what do humans actually contribute?

The naive framing ("humans understand, AI doesn't") is increasingly untenable.

Deutsch on Knowledge and Theory

David Deutsch offers a framework worth preserving: knowledge is information that tends to cause itself to persist—it has causal power in the world. A good theory isn't merely a description but an explanatory structure that constrains and predicts across contexts. Good theories are "hard to vary"—you can't change them arbitrarily without losing their explanatory power.

By this definition, AI absolutely has knowledge. Its pattern-matching causes code to be generated, systems to be built, problems to be solved. This knowledge persists because it works.

But Deutsch's framework also suggests what's different about human knowledge: it's embedded in a purposeful agent encountering reality with stakes.

What Forms Theory in a Human Mind

If AI has the collective knowledge of engineering, what constitutes "theory" in a human mind that AI lacks?

Theory as Situated Purpose

The human isn't solving "a sorting problem"—they're solving their sorting problem, embedded in a business context (why does this need to be fast?), a temporal context (what's coming next?), a resource context (what can we afford?), and a political context (who needs to approve this?).

The AI has seen millions of sorting problems. But it hasn't lived any of them. It doesn't know what it feels like when the system goes down at 3am. It doesn't carry the scar tissue of the last architectural mistake. It doesn't have a boss, a deadline, a career, a reputation.

Theory, in the fullest sense, isn't just explanatory knowledge—it's knowledge that connects to purposes. The human's theory includes: why am I doing this at all? what am I actually trying to achieve? what would count as success?

Theory as Accountable Commitment

When a human makes an architectural choice, they're making a commitment—one they'll have to live with. This changes how knowledge functions:

  • The human asks "what could go wrong that I'd have to fix?"

  • The human considers "can I defend this choice to my team?"

  • The human weighs "if this fails, what's my fallback?"

The AI generates options. The human commits. The theory in a human mind is shaped by accountability—the knowledge that you'll face the consequences.

Theory as Temporal Narrative

The human understands the system as a story. It came from somewhere, it's going somewhere, and the current moment is a chapter in that narrative:

  • "We chose this database because of X, but X is becoming less true"

  • "This module exists because of a constraint that no longer applies"

  • "The original architects assumed Y, but the business has pivoted"

The AI sees the current state. The human sees the trajectory. Theory in the human mind includes why things are the way they are and where they're likely to go, not as patterns but as lived narrative.

Theory as Consequence-Grounded Taste

Taste isn't arbitrary preference. It's compressed consequence-awareness—the ability to recognize quality without fully articulating the criteria. The senior engineer who says "this feels wrong" is pattern-matching on their own experiences of things going wrong. Their taste is grounded in having been burned.

AI's "taste" (if we call it that) is grounded in the statistical structure of training data. It knows what code looks like when it's good. The human knows what it feels like when code fails. This is observational knowledge versus participatory knowledge.

Theory as the Capacity for Genuine Surprise

Perhaps most fundamentally: the human can be genuinely surprised. They can encounter something that violates their expectations in a way that forces them to revise their understanding. This is how theories grow: through the shock of contact with reality.

The AI can generate outputs that are "surprising" in a statistical sense. But it can't be wrong in the way that teaches. When the system fails, the AI doesn't experience the failure. It doesn't carry that failure forward as a scar that shapes future judgment.

The theory in a human mind is alive, constantly being tested against reality and revised.


Part IV: The Purpose Gap

The Human Role: Direction, Not Knowledge

This reframes the human contribution entirely. What humans actually provide is not knowledge but direction:

  • Feedback: Preference signals grounded in purposes AI can't have

  • Steering: Commitment among options, shaped by accountability

  • Shaping: Defining what counts as success, not just achieving it

  • Taste: Compressed consequence-awareness from lived experience

  • Surprise: Genuine contact with reality that revises understanding

The human role isn't "having knowledge" but being a locus of purpose encountering reality with stakes.

The AI has knowledge without purpose. The human has purpose that requires knowledge. The hybrid architecture works because it combines AI capability with human direction.

Feedback, Steering, and Shaping

Three concepts capture what humans actually do in the hybrid architecture:

Feedback as Substrate: The AI generates within a possibility space, but which possibility space? The human's feedback—explicit and implicit—shapes what counts as success. The test suite is one form of feedback, but so is "that feels off" or "this doesn't match our brand" or "my gut says this will confuse users."

This isn't knowledge the AI lacks. It's preference the AI can't have. The AI generates options; the human decides which options to pursue.

Steering: Even with infinite knowledge, direction requires choice. The AI can present "here are 47 valid architectures for this problem, each with different tradeoffs." The human says "we're optimizing for X, not Y" or "we're betting on this future, not that one."

This isn't about the human knowing more. It's about the human wanting something. The AI is a capability; the human is an intent.

Shaping at the System Level: The most interesting human role is shaping the verification architecture itself. Not just writing tests, but deciding what to test. Not just setting constraints, but deciding which constraints matter.

This is meta-level steering. The AI generates within the constraint space; the human defines the constraint space.

The Specification Problem

Here's where the knowledge/purpose distinction becomes concrete.

A human with purpose can recognize: "These tests pass, but they're testing the wrong thing. The real requirement is X, and these tests don't capture X."

An AI with knowledge but no purpose will optimize for the tests as given. If the tests are wrong, the AI will produce code that passes wrong tests. It has no way to step outside the verification framework and question whether the framework itself is adequate.

This is the specification problem: verification can only tell you that you've met the spec. It cannot tell you whether the spec was right.

The human's theory includes not just "how to solve the problem" but "whether this is the right problem." That's not a knowledge gap—it's a purpose gap.

Why the Hybrid Architecture Works

The hybrid architecture succeeds not because humans know more than AI (increasingly, they don't), but because:

  • AI has knowledge without stakes. It knows what good architecture looks like. It doesn't care if the architecture fails.

  • Humans have stakes without complete knowledge. They'll live with the consequences. This shapes what "theory" means—not just explanation, but commitment.

  • Purpose + knowledge > either alone. The human provides direction and accountability; the AI provides capability and pattern-matching.

  • "Theory" in human minds is purpose-laden. It's not just "how does sorting work" but "why am I sorting, what do I need, what happens if I'm wrong, and what am I willing to bet."

Naur was right that programming is theory building. But the theory that matters isn't just explanatory knowledge—it's knowledge embedded in a purposeful agent who will face the consequences of being wrong.

The AI can have Naur's theory in the explanatory sense. What it can't have is theory in the committed sense—knowledge that shapes action because the knower has something at stake.


Part V: Practical Implications

For AI Engineering

Build verification into your architecture, not your prompts. Chain-of-thought prompting, self-critique, and reflection don't reliably improve correctness. External verification does. Use test suites, type systems, linters, symbolic planners, and formal provers as your critics.

Treat code generation as the happy path. When an LLM generates Python instead of reasoning in natural language, you gain access to an entire ecosystem of verification: the interpreter catches syntax errors, the type checker catches structural errors, the test suite catches behavioral errors. Code is a forcing function for rigor.
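
One way to wire that ecosystem into a feedback signal, assuming mypy and pytest are installed and a test suite exists:

import subprocess
from typing import Dict, List

def verify(path: str) -> List[str]:
    """Run layered checks on generated code and return failure messages for the next iteration."""
    checks: Dict[str, List[str]] = {
        "syntax": ["python", "-m", "py_compile", path],
        "types": ["mypy", path],
        "tests": ["pytest", "-q"],
    }
    failures = []
    for name, cmd in checks.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{name} check failed:\n{result.stdout or result.stderr}")
    return failures   # an empty list means every layer of verification passed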

Design for refinement loops. The best-performing systems on hard reasoning tasks aren't single-shot. They're evolutionary—generating many candidates, filtering through verification, iterating. Budget for multiple attempts. Build harnesses that support search.

Invest in verification infrastructure. The systems winning right now—AlphaProof with Lean, coding agents with test suites, ARC solutions with program synthesis—all rely on massive investment in verification. Your verifiers are as important as your generators. Maybe more.

For Teams and Hiring

The shift is from knowledge-work to direction-work. AI increasingly handles the "how." Humans provide the "what" and "why."

Roles that AI amplifies (hire fewer, equip better):

  • Implementation-focused developers

  • Manual testers

  • Documentation writers

Roles that become more valuable (hire more, pay more):

  • Specification engineers who translate intent into verifiable constraints

  • Purpose-holders who decide what to build and why

  • Consequence-aware reviewers with scar tissue from past failures

  • Specification critics who recognize when constraints themselves are wrong

Development priorities for existing teams:

  • Train people to write precise tests and constraints—these become the interface between human intent and AI capability

  • Cultivate consequence-awareness by exposing people to downstream effects

  • Protect purpose-holding capacity—don't let AI handle decisions where "what to build" matters more than "how"


The Road Ahead

We're at an inflection point.

On one side: pure scaling. The hope that with enough parameters, enough data, enough compute, reasoning will emerge. This path has produced extraordinary systems—but the ARC-AGI-2 results suggest fundamental limits remain.

On the other side: hybrid architectures. The recognition that different cognitive tasks may require different mechanisms, and that the right composition of neural generators with symbolic verifiers can achieve what neither achieves alone.

The evidence increasingly favors the hybrid path. AlphaProof's Lean proofs. AlphaGeometry's symbolic deduction engine. Code agents with test suites and type checkers. Refinement loops that search over solution spaces. Claude Code's single-threaded master loop with hooks, sandboxing, and verification gates.

From an engineering perspective, this is good news. It means we don't have to wait for AGI to build useful systems. We can build them now, by composing the pieces we have: LLMs for generation, external systems for verification, human direction for purpose.

The gap between neural intuition and logical rigor isn't a bug to be fixed. It's a design constraint to be respected.

And the gap between AI capability and human purpose isn't a problem to be solved—it's the foundation on which hybrid systems are built.


Appendix: Implementation Details

MCP: The Universal Verification Interface

The Model Context Protocol (MCP), introduced by Anthropic in November 2024 and donated to the Linux Foundation's Agentic AI Foundation in December 2025, provides the standardized protocol for connecting neural models to external verification systems.

Built on JSON-RPC 2.0, MCP defines three core capabilities:

  • Tools: Executable functions that perform actions (run tests, execute commands, deploy code)

  • Resources: Access to data and information (read files, query databases)

  • Prompts: Templated workflows that guide AI behavior

MCP creates a universal interface where neural models can invoke symbolic verifiers. The 2026 updates added enterprise-grade security: MCP servers now function as OAuth 2.1 Resource Servers with PKCE authentication.
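
On the wire, a tool invocation is an ordinary JSON-RPC 2.0 request. The envelope below follows the protocol; the run_tests tool and its arguments are hypothetical:

# A JSON-RPC 2.0 request asking an MCP server to invoke one of its tools.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "run_tests",                 # hypothetical tool exposed by the server
        "arguments": {"path": "tests/"},
    },
}
# The server executes the tool and returns a result (or error) keyed to the same id.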

Controlled Parallelism via Subagents

Claude Code implements controlled parallelism through subagents rather than uncontrolled agent swarms.

Subagents are invoked via the Task tool with a description, prompt, and subagent type. Each subagent has independent configuration and maintains separate context from the main agent—preventing information overload. Critical constraint: subagents cannot spawn their own subagents (depth limited to prevent recursive proliferation).

Built-in subagent types include Explore (for codebase discovery), Plan (for task decomposition), and General-purpose (for delegated work items).

Multi-Layer Verification Infrastructure

Claude Code integrates multiple layers of symbolic verification:

Automated Quality Gates via Hooks: Hooks are shell commands that trigger automatically at specific events—run type checks after editing TypeScript, validate security rules before accessing auth files, auto-format after writing new files.

Sandboxed Execution: Claude Code uses OS-level sandboxing (Linux bubblewrap, macOS seatbelt) providing filesystem and network isolation. In Anthropic's internal usage, sandboxing safely reduces permission prompts by 84%.

Test-Driven Development Integration: Claude enters autonomous loops—writing code, running tests, analyzing failures, adjusting, repeating until all tests pass. The best verification criteria are binary: "All tests pass." "Build succeeds." "Zero linter errors."

Browser-Based UI Verification: Boris Cherny (Claude Code creator) revealed: "Claude tests every single change I land to claude.ai/code using the Claude Chrome extension. It opens a browser, tests the UI, and iterates until the code works and the UX feels good."


References