In April 2025, thirty practitioners gathered in Melbourne to wrestle with a question that had quietly become urgent: not “Will AI change how we build software?” but “How fast can we redesign the entire socio-technical system so humans, machines, and metrics co-evolve responsibly?”
Nine months later, we have insights ~ not from theory, but from practice. Teams have run hundreds of agents for weeks. Single agents have built entire programming languages overnight. I’ve spent many months using coding agents to build full systems, watching patterns emerge from daily use. The approaches that work have become clear. So have the failure modes.
This piece synthesises learnings from five distinct experiments in agentic coding, plus emerging research on what AI augmentation does to human cognition:
Browser Use’s minimal agents — why 99% of the work is in the model
The Ralph Loop — single-agent persistence until verifiable completion
Cursor’s multi-agent scaling — hundreds of agents, millions of lines, weeks of runtime
The Colony pattern — a bio-inspired direction for coordination without centralisation
The human layer — cognitive load, tacit knowledge, and when to stay in the loop
Together, they reveal a layered architecture that might actually scale, and a caution about what we lose when AI removes the "noise" from our work (yes, this is not a typo; I will get to that towards the end).
The Agent Is Just a For-Loop
Gregor Zunic of Browser Use makes the sharpest case for minimalism:
An agent is just a for-loop of messages. The only state an agent should have is: keep going until the model stops calling tools. You don’t need an agent framework.
His evidence: Browser Use threw away thousands of lines of abstractions. Their first agents had complex message managers, planning modules, and verification layers. They worked … until the team tried to change anything. Every experiment fought the framework.
The core insight:
“Agent frameworks fail not because models are weak, but because their action spaces are incomplete.”
Instead of defining every possible action up front, start from the opposite assumption: the model can do almost anything. Then restrict. Browser Use now gives the model raw Chrome DevTools Protocol access plus browser extension APIs. When one approach fails, the model routes around it.
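To make the minimalism concrete, here is a sketch of the for-loop idea in Python. It is my own illustration, not Browser Use's code: `call_model` and the `tools` dictionary are hypothetical placeholders for a tool-calling model client and whatever raw capabilities you choose to expose.

```python
# Minimal sketch of "an agent is just a for-loop".
# `call_model` and the contents of `tools` are hypothetical placeholders, not Browser Use's API.

def run_agent(task: str, tools: dict, call_model, max_steps: int = 100):
    """Loop until the model stops calling tools (or we hit a safety cap)."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages, tools=list(tools))   # one model call per step
        messages.append(reply)
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:                                # no tool calls -> the agent stops
            return reply.get("content")
        for call in tool_calls:                           # execute every requested tool
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return None                                           # safety cap reached
```

Everything else ~ planning, verification, retries ~ is left to the model rather than encoded in the loop.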
Every abstraction encodes assumptions about how intelligence should work. RL breaks those assumptions. These models were trained on millions of examples and have seen more patterns than you or I can anticipate. Your "smart" wrappers can very easily become the constraints.
One critical addition from their work: ephemeral messages. Browser state is massive ~ DOM snapshots, screenshots, and element indices easily hit 50KB+ per request. Without ephemeral handling, context explodes. Mark certain tool outputs to keep only the last N instances; the model only needs recent state, and old snapshots are noise. Several teams have independently discovered the need for this pattern. Personally, I think it is a temporary concern: as context management and agent memory techniques become robust, this will stop being a task for humans to handle (there was a time when we engineers used to load software components in and out of RAM because the entire program would not fit).
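A sketch of what that trimming might look like on top of the hypothetical message list from the earlier loop ~ the tool names and the keep-last-two rule are illustrative assumptions, not Browser Use's implementation:

```python
# Sketch: keep only the most recent N outputs of "ephemeral" tools (e.g. DOM snapshots).
# The message shape follows the hypothetical loop above; nothing here is Browser Use's actual code.

EPHEMERAL_TOOLS = {"browser_snapshot", "screenshot"}   # hypothetical tool names
KEEP_LAST_N = 2

def trim_ephemeral(messages: list[dict]) -> list[dict]:
    """Drop all but the last N outputs per ephemeral tool; old snapshots are noise."""
    seen: dict[str, int] = {}
    trimmed = []
    for msg in reversed(messages):                       # walk newest -> oldest
        name = msg.get("name")
        if msg.get("role") == "tool" and name in EPHEMERAL_TOOLS:
            seen[name] = seen.get(name, 0) + 1
            if seen[name] > KEEP_LAST_N:
                continue                                 # too old: drop it
        trimmed.append(msg)
    return list(reversed(trimmed))
```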
The Ralph Loop: Persistence Until Done
The naive for-loop has a problem: agents finish prematurely. They return no tool calls when they are stuck, not just when they are done.
Enter the Ralph Loop, a technique that emerged from Geoff Huntley's experiments and went viral in late 2025:
while :; do cat PROMPT.md | claude-code ; done
The core idea: continuously feed in the same prompt, allowing the AI to see its previous work in the file system and Git history. This isn't "output feedback as input"; rather, it is self-referential iteration through external state.
Why does this matter? Because the self-assessment mechanism of LLMs is unreliable. They exit when they think they’re done, not when they meet objectively verifiable standards.
The Ralph Loop fixes this with three elements: clear completion criteria (define machine-verifiable success), Stop Hook interception (forcibly continue if standards aren’t met), and max-iterations safety valve (prevent infinite loops).
The `done()` tool forces explicit completion instead of an implicit "I guess we're done?". Claude Code does this. Gemini CLI does this. Now you know why.
Each iteration is essentially a fresh session. The agent doesn’t read from bloated history; it scans the current project structure and log files directly. State management shifts from the LLM’s memory (token sequence) to disk (file system).
The key files: progress.txt (an appended log of attempts, pitfalls, and confirmed patterns), prd.json (a structured task list with a passes: true/false status per task), and the Git history (each successful step is committed, giving the next iteration clear change differentials).
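Putting the pieces together, here is a sketch of what the outer loop could look like with the three elements from above wired in ~ verifiable completion, forced continuation, and a max-iterations safety valve. The prd.json schema and the way claude-code is invoked are my assumptions, not Geoff's actual tooling:

```python
# Sketch of a Ralph-style outer loop: machine-verifiable completion, forced continuation,
# and a max-iterations safety valve. File names follow the article; the prd.json schema
# ({"tasks": [{"id": ..., "passes": true/false}]}) is a hypothetical example.

import json
import os
import subprocess

MAX_ITERATIONS = 50

def all_tasks_pass(path: str = "prd.json") -> bool:
    """Machine-verifiable completion: every task in prd.json has passes == true."""
    if not os.path.exists(path):
        return False             # first iteration: the agent has not created the task list yet
    with open(path) as f:
        prd = json.load(f)
    return all(task.get("passes") for task in prd["tasks"])

for iteration in range(MAX_ITERATIONS):      # max-iterations safety valve
    if all_tasks_pass():
        break                                # objectively done, not "I guess we're done?"
    # Same prompt every time; the agent re-reads progress.txt, prd.json and the Git history itself.
    with open("PROMPT.md") as f:
        prompt = f.read()
    subprocess.run(["claude-code"], input=prompt, text=True)
```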
This is why Ralph can support continuous development for hours or days. Git history becomes cumulative memory.
The results are striking: a complete new programming language built AFK overnight, a $50K USD contract delivered for $297 in compute costs, multi-week refactoring projects running unattended.
Geoff’s framing captures the philosophy:
“Building software with Ralph requires a great deal of faith and a belief in eventual consistency. Each time Ralph does something bad, Ralph gets tuned like a guitar.”
I was at the April 2025 workshop with Geoff where we discussed these feedback loop ideas. What’s emerged since validates the core intuition: persistence beats perfection when you have verifiable completion criteria.
When Single Agents Just Are Not Enough
Ralph Loops are brilliant for single-agent persistence. But what happens when you need twenty agents on the same codebase? Building larger systems needs agents running in parallel.
Cursor’s research team ran this specific idea as an experiment at scale: hundreds of concurrent agents, weeks of runtime, millions of lines of code.
They built a web browser from scratch (1M+ LoC; it kinda-sorta renders a webpage, which shows promise),
migrated Solid to React in their own codebase (+266K/-193K lines over 3 weeks), and
still have agents running on a Windows 7 emulator (1.2M LoC, 14.6K commits; no demos to play with as yet!).
Their first attempts failed instructively.
The initial approach was agents that self-coordinate through a shared file with locking. Twenty agents slowed to the throughput of two or three ~ guess what? Most of the time was spent waiting. Agents held locks too long or forgot to release them. The system was brittle (Steve Yegge & others are working on ideas to resolve this specific problem; they are worth keeping an eye on if you are interested in this area).
Cursor's team then tried optimistic concurrency control. It was simpler and more robust, but it revealed a deeper problem:
“With no hierarchy, agents became risk-averse. They avoided difficult tasks and made small, safe changes instead. No agent took responsibility for hard problems.”
The solution that worked: Planners continuously explore the codebase and create tasks (they can spawn sub-planners, making planning recursive). Workers pick up tasks and grind until done—they don’t coordinate with other workers. Judges determine at cycle end whether to continue.
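A skeleton of that cycle, reconstructed from the description rather than from Cursor's code ~ the `plan`, `work_on`, and `judge` functions are hypothetical stand-ins for model-backed roles:

```python
# Skeleton of the planner / worker / judge cycle described above.
# A reconstruction from the prose, not Cursor's implementation; the three role
# functions are hypothetical stand-ins for model-backed agents.

from concurrent.futures import ThreadPoolExecutor

def run_cycle(repo_path: str, plan, work_on, judge, num_workers: int = 20) -> bool:
    """One cycle: planners decompose, workers grind independently, a judge decides whether to continue."""
    tasks = plan(repo_path)                              # planners explore the codebase and create tasks
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Workers pick up tasks and grind until done; they do not coordinate with each other.
        results = list(pool.map(work_on, tasks))
    return judge(repo_path, results)                     # judge decides: keep going or stop?

def run(repo_path: str, plan, work_on, judge, max_cycles: int = 100):
    for _ in range(max_cycles):
        if not run_cycle(repo_path, plan, work_on, judge):
            break                                        # periodic fresh starts happen outside this loop
```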
This is some structure. But notably, they removed more than they added:
“We initially built an integrator role for quality control & conflict resolution, but found it created more bottlenecks than it solved. Workers were already capable of handling conflicts themselves.”
Model choice matters for long-running tasks. Different models excel at different roles. Prompts matter more than architecture. The right amount of structure is somewhere in the middle: too little creates conflicts and drift, too much creates fragility. Even well-architected systems need periodic fresh starts to combat drift.
The Colony Pattern: A Direction for Decentralised Coordination
Cursor’s planner/worker hierarchy works. But it still has integration points that can become bottlenecks at scale. And when humans need to merge the output, they face cognitive overload.
Consider what happens with twenty agents generating diffs at machine speed: merge conflicts explode across hundreds of files, semantic mismatches emerge that no diff tool catches, one agent optimises for performance while another optimises for readability ~ both “correct”, but incompatible.
Ralph Loops are linear persistence. Swarms are parallel power with centralised merges. We're building Ferraris and asking a pedestrian to direct traffic moving at 300 km/h (with no traffic lights).
Consider how ant colonies work. No ant understands the full nest design. Each follows simple local rules. Yet the colony builds bridges, optimises paths, adapts to disasters—without a central planner. Intelligence emerges from organisation, not centralisation.
This is still a work-in-progress ~ hence what is outlined below is more of a direction-setting postulate.
After many months of hands-on work with coding agents, building full end-to-end systems, I believe this is where we need to head. The patterns that keep failing are centralised: integrators that bottleneck, coordination files that become contention points, humans who become the merge resolution layer for machine-speed output.
The patterns that keep working are distributed: workers handling their own conflicts, planners that decompose recursively, stress that gets resolved locally before propagating.
What would a colony architecture look like? There are some brilliant ideas we can directly borrow from biology.
Agents signal “stress levels” when their goals are threatened. Test agent spikes stress when a proposed change breaks tests ~ before commit (via hooks). Coherence agent flags rising stress when agents drift apart semantically. Signals are local (no central bottleneck), ephemeral (old signals decay), and continuous (always flowing, not checkpoint-based).
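As a thought experiment, a stress signal with those three properties might look something like this ~ entirely hypothetical, since the colony pattern is a postulate rather than a shipped system:

```python
# Sketch of a local, ephemeral, continuously-readable stress signal. Hypothetical:
# each agent keeps its own board (no central bottleneck), signals decay over time
# (old signals fade), and levels are read continuously rather than at checkpoints.

import time
from dataclasses import dataclass, field

HALF_LIFE_SECONDS = 300.0   # assumed decay rate

@dataclass
class StressBoard:
    signals: dict[str, tuple[float, float]] = field(default_factory=dict)  # topic -> (level, timestamp)

    def raise_stress(self, topic: str, level: float) -> None:
        """E.g. a test agent spiking stress when a proposed change breaks tests."""
        self.signals[topic] = (level, time.time())

    def current(self, topic: str) -> float:
        """Old signals decay exponentially; stale stress fades to zero on its own."""
        if topic not in self.signals:
            return 0.0
        level, t0 = self.signals[topic]
        return level * 0.5 ** ((time.time() - t0) / HALF_LIFE_SECONDS)
```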
Don't just handle errors; actively defend desired states: the codebase compiles, tests pass, invariants hold. When a contribution threatens those, try recovery locally before escalating. This addresses Cursor's finding that integrators become bottlenecks. Defense would be distributed.
Each agent has full affordances but only cares about its domain. The Test Defender can do anything but attends to test integrity. Hierarchy emerges from scope, not authority.
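Sketching the Test Defender as a homeostatic loop ~ again hypothetical, and it assumes a pytest-based suite; `attempt_local_fix` and `escalate` stand in for a model-backed repair step and a stress signal to the rest of the colony:

```python
# Sketch of a homeostatic "Test Defender": defend one invariant (tests pass), attempt
# local recovery before escalating. Hypothetical; assumes a pytest-based test suite.

import subprocess

def tests_pass() -> bool:
    """The defended state: the test suite is green."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def defend(attempt_local_fix, escalate, max_local_attempts: int = 3) -> None:
    if tests_pass():
        return                                  # homeostasis holds; nothing to do
    for _ in range(max_local_attempts):
        attempt_local_fix()                     # try to restore the invariant locally
        if tests_pass():
            return
    escalate("tests_broken")                    # only now does the problem propagate
```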
LLM communication is lossy. Force agents to state understanding + confidence + ambiguities. This catches the “confident but wrong” divergences that require periodic fresh starts.
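One possible shape for that forced statement is a structured handoff ~ a hypothetical schema, not an existing protocol:

```python
# Sketch of a structured handoff: the agent must state its understanding, confidence,
# and open ambiguities before acting. Hypothetical schema; low confidence or open
# ambiguities trigger a clarification round instead of silent divergence.

from dataclasses import dataclass, field

@dataclass
class Handoff:
    task_id: str
    understanding: str                 # the agent's restatement of the task in its own words
    confidence: float                  # 0.0 - 1.0 self-estimate
    ambiguities: list[str] = field(default_factory=list)

def needs_clarification(h: Handoff, threshold: float = 0.7) -> bool:
    """Catch 'confident but wrong' drift early instead of at the next fresh start."""
    return h.confidence < threshold or bool(h.ambiguities)
```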
The obvious objection: isn’t this more hand-crafted structure that better models will obsolete?
The answer depends on what kind of structure we’re building.
Intelligence structures? (how to think, what to decide) ~ yes, models will obsolete these (obliterate if you prefer a more intense term). This is already in the realm of the hyper-scalers anyway & most of us will never build them.
Infrastructure structures? (how to communicate, when to signal) ~ no, these are protocols, not intelligence, and most certainly an area to which all of us can contribute without needing a million GPUs.
TCP/IP doesn't encode knowledge about what to communicate; it provides protocols for communication to happen. The colony architecture would be infrastructure, not intelligence. Stress signals wouldn't tell agents what to do. They would provide a channel for coordination to emerge.
Here’s what we often forget: the question isn’t just “can agents work together?” It’s “what happens to the humans in the loop?”
The April 2025 Melbourne unconference surfaced this tension sharply. Every time we outsource a cognitive micro-struggle to a model, we withdraw a tiny deposit from the bank of hard-won tacit knowledge. Accumulate enough withdrawals and the organisation becomes fragile.
Ralph Loop practitioners have arrived at a simple framework:
HITL (Human-in-the-Loop ~ run once, observe, intervene): use for learning, prompt optimisation/refinement, and high-risk tasks ~ while accepting that the human is now going to be the bottleneck.
AFK (Away From Keyboard ~ run in a loop, set max iterations): use for batch work, mechanical tasks, and low-risk execution.
Start HITL. Develop sufficient understanding of how the system works, optimise prompts, build confidence. HITL is pair programming ~ you work with the AI, reviewing during code creation. Graduate to AFK once prompts are stable and the foundation is validated. Set max iterations. Create notification mechanisms. Review commits when you return.
There is a massive risk in pure AFK. If this is all we ever do, we do not develop an intuition for the underlying system's internals ~ just its high-level behaviour. Traditionally, we developed engineers through exposure to debugging loops with stack traces, the resilience gained from seg-faults, and the re-architecting of multiple layers to create new cross-cutting features. If all of this is done by a machine, the theory of the system in the engineer's mind diverges from the actual code. Now, we do not know if this is still needed. It feels like it is needed ~ but that is my gut feel, shaped prior to AI-enabled coding, and it may very well be inaccurate for the world now. In time, we will get evidence showing the limits at which humans have to operate with these coding agents.
My research on human-AI augmentation has surfaced something counter-intuitive. When we rolled out LitQuest, a tool to assist with literature review classification, we found it added to cognitive load (subjective experience) while improving productivity (objective speed). The researchers felt more exhausted even as they worked faster. This is not the promised land.
The working hypothesis: in the old workflow, researchers used System 1 (fast, automatic thinking) most of the time to arrive at obvious “No” classifications, occasionally switching to System 2 (slow, effortful thinking) for ambiguous cases. The AI tool, by filtering out easy cases, forced them into System 2 much faster and kept them there.
Drawing from Kahneman's work on "noise" (if you made it this far, thank you ~ I hinted at this word in the opening), traditional workflows contain significant random variability ~ as in, they are noisy (a term coined by someone who got a Nobel Prize, so I am not going to debate it ... though personally I don't quite like it).
Back to the story.
While this seems like a flaw to eliminate, in cognitive workflows this “noise” often manifests as easy decisions ~ cognitive rest intervals. By successfully eliminating the noise, AI inadvertently eliminates the recovery periods.
The user is subjected to a “purified” stream of difficult decisions, keeping them in chronic System 2 activation. The overload comes not because the volume of information increased, but because the average entropy per decision was artificially maximised. The workflow lacks cognitive padding.
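A toy calculation makes the point. The numbers below are illustrative only ~ they are not from the LitQuest study ~ but they show how filtering out the easy cases roughly doubles the average entropy of each decision the human still has to make:

```python
# Illustrative numbers only (not from the LitQuest study): filtering out easy cases
# raises the average Shannon entropy per decision the human still has to make.

from math import log2

def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no decision with probability p of the likely answer."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))

easy = binary_entropy(0.95)       # ~0.29 bits: an "obvious No"
hard = binary_entropy(0.55)       # ~0.99 bits: a genuinely ambiguous case

before = 0.8 * easy + 0.2 * hard  # old workflow: 80% easy, 20% hard  -> ~0.43 bits/decision
after = hard                      # AI filters out the easy 80%       -> ~0.99 bits/decision
print(f"before: {before:.2f} bits/decision, after: {after:.2f} bits/decision")
```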
Human cognition operates as a scale-free, fractal system. At the neural level, neurons require refractory periods to reset. At the individual level, we oscillate between automatic execution and methodical analysis. At the team level, effective groups oscillate between heads-down work and social context-building.
When AI systems maximise information density, they force humans into a monostable state. The natural fractal topology of work gets flattened. The workflow becomes “smooth” (consistent high difficulty) rather than “rough” (oscillating difficulty).
The result is epistemic brittleness. Without low-entropy valleys in the workload, humans can’t recover from fatigue accumulation. They decouple from the loop ~ not from lack of skill, but from collapse of the cognitive duty cycle (more on that in a different article).
This has direct implications for how we design human-agent systems:
HITL isn’t just for oversight; it’s for cognitive sustainability. The human-in-the-loop pattern serves two functions: catching agent errors, and maintaining the human’s engagement and judgment capability. Remove the second and you get “looks good to me” rubber-stamping (very very quickly).
Intentional friction is necessary. Optimal human-AI interaction requires the AI to re-introduce low-stakes interactions ~ not as error, but as pacing to match the biological operator’s requirements. This might mean showing some easy classifications even when the AI could handle them, or interleaving quick wins with complex decisions.
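One way such pacing could be implemented is to deliberately interleave easy, AI-resolvable items into the human's queue. The sketch below is hypothetical, and the one-in-four ratio is an arbitrary illustration, not a research-backed number:

```python
# Sketch of "cognitive padding": interleave a few easy, AI-resolvable items into the
# human's queue so the workload oscillates instead of being uniformly hard.
# The one-in-four ratio is an arbitrary illustration, not a research-backed number.

def paced_queue(hard_items: list, easy_items: list, easy_every: int = 4) -> list:
    """Insert one easy item after every `easy_every` hard ones (while easy items last)."""
    queue, easy_iter = [], iter(easy_items)
    for i, item in enumerate(hard_items, start=1):
        queue.append(item)
        if i % easy_every == 0:
            easy = next(easy_iter, None)
            if easy is not None:
                queue.append(easy)
    return queue
```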
The Ralph Loop’s HITL/AFK spectrum matters cognitively. Starting in Human-in-the-Loop mode isn’t just about learning prompts ~ it’s about maintaining the cognitive oscillation that keeps humans effective. Graduate to AFK only for mechanical work where human judgment isn’t the bottleneck.
If five PRs an hour are shallow reformats, speed metrics are lying. AI can inflate any vanity metric. As organisations scale agent usage, they'll inadvertently create metrics around tasks allocated, tokens consumed, or agent utilisation rates. Leaderboards will emerge from IDE logs or Claude Code/Codex/Gemini-CLI/OpenCode usage logs ~ now nicely curated and watched near-continuously by a vibe-coded overseer.
The cognitive load research points to what actually matters: metrics that track whether the human-AI system is producing durable value while remaining cognitively sustainable.
Verification depth is more important than verification speed. How thoroughly did the human actually engage with the output? Surface-level approval after cognitive exhaustion isn't verification ~ it's liability (without an insurance policy, which is yet to be developed).
Recovery integration is needed & should be tracked. Are there natural valleys in the cognitive workload? Systems that eliminate all easy work maximise short-term throughput while destroying long-term human effectiveness.
Tacit knowledge accumulation remains valuable. Every time we outsource a cognitive micro-struggle to a model, we withdraw from the bank of hard-won tacit knowledge. Are junior engineers still building intuition, or are they becoming prompt-writers who can't debug without AI assistance? What exact intuition are they building if they use coding agents all day?
Epistemic resilience is still needed. When assumptions need revision, how costly is the change? Systems with high AI dependency and low human understanding become brittle to distributional shift as the AI’s frozen training distribution diverges from evolving reality, and humans lack the deep knowledge to compensate.
These metrics share a common thread: they focus on human cognitive sustainability and the preservation of human capability, not just system throughput. Every unit of this human capability has to sit within a framework that permits and generates human flourishing ~ as in, there is no point in having a highly capable, value-generating human who is miserable.
What emerges from synthesising all these inputs is a layered architecture where each layer addresses a different problem:
At the agent level, Browser Use is showing a viable direction: minimal internals, maximal action space. A for-loop of tool calls. Let the model do the work. Don’t encode assumptions about how intelligence should operate.
For single-agent persistence, the Ralph Loop pattern solves premature termination. Keep going until verifiably done. Use the done() tool for explicit completion. Maintain state through files and Git, not token sequences. Ephemeral messages prevent context explosion.
For multi-agent coordination, Cursor's planner/worker hierarchy provides necessary structure without over-engineering. Planners decompose; workers execute without coordinating with each other. Remove integrator roles, as they become bottlenecks (ironically aligned with what management research shows happens when you have too many managers). Use different models for different roles. Accept that periodic fresh starts are necessary.
For scale beyond human mediation, the colony direction suggests stress signaling and homeostatic defense instead of centralised coordination. This remains a postulate ~ the pattern that might work when current approaches hit their limits. The principle: coordination should emerge from local interactions, not flow through central bottlenecks.
For the human layer, recognise the augmentation paradox. HITL serves cognitive sustainability, not just oversight. Intentional friction preserves the duty cycle. Metrics should track verification depth, tacit knowledge accumulation, and epistemic resilience ~ not just throughput.
What Remains Unsolved
Cursor is honest:
“Multi-agent coordination remains a hard problem. Our current system works, but we’re nowhere near optimal. Planners should wake up when their tasks complete. Agents occasionally run for far too long. We still need periodic fresh starts to combat drift.”
Open questions for the colony pattern: What’s the right stress signal vocabulary? How do you bootstrap effective coordination norms quickly? When does the colony genuinely need human input versus when can it self-recover?
And the deeper question raised by the cognitive load research: Can we design AI systems that enhance human capability rather than replacing human struggle with cognitive exhaustion? The goal isn't just faster output; rather, it's sustainable augmentation that preserves the tacit knowledge and judgment that make humans valuable.
The Synthesis
In April 2025 (nine months ago), we debated whether AI would change software engineering. Now we're learning how to redesign the socio-technical system.
The architecture is layered: minimal agent internals with maximal affordances; Ralph Loop persistence until verifiable completion; planner/worker hierarchy for multi-agent structure; colony-style coordination as the direction for true scale; and careful attention to human cognitive sustainability throughout.
The Bitter Lesson applies to intelligence, not infrastructure. We’re not encoding how to think ~ we’re building protocols for coordination to emerge.
But there’s a deeper lesson from cognitive load research: AI augmentation isn’t just a technical problem. When we eliminate the “noise” from human workflows, we eliminate cognitive rest. When we maximise information density, we collapse the fractal structure that makes human cognition sustainable. The path forward requires designing systems where humans and machines co-evolve—not systems where machines do everything and humans rubber-stamp the output.
The tools are increasingly open. Browser Use is releasing their agent-sdk. Ralph Loop patterns are documented across frameworks. Cursor’s learnings are public. What matters now is running experiments and paying attention to what happens to the humans in the loop, not just the code that gets produced.
Sources:
Gregor Zunic, Browser Use, "An Agent Is Just a For-Loop", 16 Jan 2026
Geoff Huntley, "Ralph Wiggum as a Software Engineer", July 2025
DanKun, Alibaba, "From ReAct to Ralph Loop", 15 Jan 2026
Wilson Lin, Cursor, "Scaling Long-Running Autonomous Coding", Jan 2026
Rajesh Vasa, Melbourne Unconference reflections, April 2025
If you're building agentic tools, wrestling with scaling, thinking about sustainable human-AI collaboration, or putting AI to work in real-world systems ~ let's connect.
@rajesh-vasa.bsky.social | @rvasa on X.com | linkedin.com/in/rvasa