
The 43-Point Gap

In early 2025, METR ran a randomised controlled trial on experienced open-source developers. Sixteen engineers. Two hundred and forty-six issues. Proper randomisation — the kind of study design that's still rare in this space.

The result: developers using AI tools were 19% slower than those working without them.

That alone would be worth paying attention to. But here's the part that stuck with me: before starting, the same developers forecast that AI would speed them up by 24%, and even after finishing they estimated they had been about 20% faster. A 43-point gap between forecast and reality. They felt like they were flying. They were actually dragging.

That gap explains more about the current state of AI-assisted development than any benchmark or product launch.

The hard numbers

The METR study isn't an outlier. It's one data point in a convergence of independent research that all says the same thing.

A longitudinal study from CMU tracked 807 repositories over 19 months. AI-assisted developers saw an initial productivity spike — then returned to baseline by month three. The novelty wore off. The habits didn't change. A separate CMU study using Cursor found 30% more static analysis warnings and 40% higher code complexity in AI-assisted output. Speed up front, debt downstream.

Google's 2025 DORA report found that increased AI coding adoption correlated with 9% more bugs, 91% longer code review times, and 154% larger pull requests. More output, more problems.

CodeRabbit analysed 470 repositories and found AI-generated code had 1.7x more issues and up to 2.74x more security vulnerabilities than human-written code.

GitClear examined 211 million lines of code and found copy-paste patterns rising from 8.3% to 12.3%, while refactoring dropped from 25% to less than 10%. More duplication, less structural improvement.

And then there's Mo Bitar, who documented two years of "vibe coding" and concluded the ROI was negative — hand-writing the code would have been faster. That post earned 865 points on Hacker News. Not because it was contrarian. Because it matched what a lot of people were quietly discovering.

None of these studies exist in isolation. They're independent teams, different methodologies, different sample sizes — all arriving at the same conclusion from different angles. That convergence is what makes the signal hard to dismiss.

This doesn't mean AI is useless. It means the way most people use it is counterproductive.

I built the experiment

Before I'd read most of these studies, I was running my own version.

I built Dark Factory — a system where AI agents autonomously design, build, test, deploy, and maintain software. Ten thousand lines of TypeScript. Temporal workflows for durable execution. Podman containers for sandboxed agent runs. A train/test split — borrowed from machine learning — ensuring the agent that writes code can never see the scenarios that evaluate it.
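The train/test idea can be sketched in a few lines of TypeScript. This is a minimal illustration of the principle, not Dark Factory's actual code; `Scenario`, `splitScenarios`, and `contextForBuilder` are hypothetical names.

```typescript
// Sketch of a train/test scenario split: the builder agent only ever
// sees "train" scenarios, while held-out "test" scenarios are reserved
// for the evaluator, so the builder cannot overfit to the quality gate.
interface Scenario {
  id: string;
  request: string;  // e.g. an HTTP request the built service must handle
  expected: string; // expected response, used only at evaluation time
}

// Deterministic split keyed on scenario id (FNV-1a hash), so repeated
// runs partition the scenarios identically.
function splitScenarios(
  scenarios: Scenario[],
  testFraction = 0.5,
): { train: Scenario[]; test: Scenario[] } {
  const train: Scenario[] = [];
  const test: Scenario[] = [];
  for (const s of scenarios) {
    let h = 0x811c9dc5;
    for (const ch of s.id) {
      h ^= ch.charCodeAt(0);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    (h / 0xffffffff < testFraction ? test : train).push(s);
  }
  return { train, test };
}

// The builder's context is assembled from train scenarios only, and it
// never includes the expected outputs.
function contextForBuilder(scenarios: Scenario[]): string {
  const { train } = splitScenarios(scenarios);
  return train.map((s) => s.request).join("\n");
}
```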

It wasn't a product. It was a deliberate stress test of how far autonomous agents could go when you give them structure, isolation, and a quality gate they can't game.

I wrote about the early lessons in Is This the Product, or Is This the Machine? — the addictive quality of watching agents work, the 20-to-40-agent parallelism that collapsed under its own coordination overhead, the painful multi-day cleanup when generated code drifted from the architecture.

Dark Factory taught me things the research later confirmed. And the process of building it — not the system itself — turned out to be the real education.

The audit

I ran a 40-agent architecture review — each agent an independent expert examining a different dimension of the system. They found 28 blockers. Date.now() calls violating Temporal's determinism model. Path traversal vulnerabilities in unvalidated runId parameters. An iteration parameter that was silently passing scenario count instead of iteration number. Race conditions in the human approval signal across iteration boundaries.
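The runId class of bug has a well-known shape of fix, sketched here under assumptions: `RUNS_ROOT` and `resolveRunDir` are illustrative names, not Dark Factory's actual code.

```typescript
import path from "node:path";

// Hardening against path traversal in an id-like parameter: allow only
// a strict id alphabet, then verify the resolved path still sits inside
// the runs root before using it.
const RUNS_ROOT = "/var/dark-factory/runs"; // illustrative root

function resolveRunDir(runId: string): string {
  // Reject anything that isn't a plain id: no slashes, no dots, no NULs.
  if (!/^[A-Za-z0-9_-]{1,64}$/.test(runId)) {
    throw new Error(`invalid runId: ${JSON.stringify(runId)}`);
  }
  const resolved = path.resolve(RUNS_ROOT, runId);
  // Belt and braces: the resolved path must remain under the root.
  if (!resolved.startsWith(RUNS_ROOT + path.sep)) {
    throw new Error("path escapes runs root");
  }
  return resolved;
}
```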

You can't just point agents at code and hope. Structure matters. Verification matters. And the agents will confidently produce code that looks correct while carrying subtle, compounding bugs.

The real bugs

The bugs that taught me the most were the ones that only surface when you actually build.

A bucket versus conclusion field mismatch in the CI polling activity — the agent read GitHub's API response using the wrong field name, and the type system didn't catch it because both fields existed. The code worked in tests. It broke in production.
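One way to make that class of bug a compile error rather than a production surprise is to brand the two string fields so they're mutually incompatible. The sketch below is illustrative, not GitHub's exact schema or the actual Dark Factory fix.

```typescript
// "Branded" string types: both fields are strings at runtime, but the
// compiler treats them as distinct, so reading the wrong field no
// longer type-checks.
type Bucket = string & { readonly __brand: "bucket" };
type Conclusion = string & { readonly __brand: "conclusion" };

interface CheckRunResponse {
  bucket: Bucket;         // illustrative grouping field
  conclusion: Conclusion; // the pass/fail field the poller should read
}

// Single choke point where raw API strings acquire their brands.
function parseCheckRun(raw: { bucket: string; conclusion: string }): CheckRunResponse {
  return { bucket: raw.bucket as Bucket, conclusion: raw.conclusion as Conclusion };
}

function isTerminal(c: Conclusion): boolean {
  const s: string = c; // widen for the literal comparison
  return s === "success" || s === "failure" || s === "cancelled";
}

// Accepting only `Conclusion` means passing `run.bucket` here is a
// compile-time error instead of a silent runtime bug.
function isFinished(run: CheckRunResponse): boolean {
  return isTerminal(run.conclusion);
}
```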

macOS resolves /tmp through a symlink to /private/tmp. Podman mounts don't follow that symlink. The container sees an empty directory. Nothing in the logs suggests why. You only discover this by watching a container silently fail to read its task file.
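The workaround is to resolve symlinks before handing a host path to the container runtime. A hedged sketch, assuming a helper that builds the `-v` argument for `podman run` (`mountArg` is an illustrative name):

```typescript
import fs from "node:fs";

// Resolve symlinks on the host path before building a Podman volume
// mount. On macOS, /tmp is a symlink to /private/tmp, and the mount
// uses the literal path you pass, so the unresolved path can surface
// as an empty directory inside the container.
function mountArg(hostDir: string, containerDir: string): string {
  const resolved = fs.realpathSync(hostDir); // /tmp/run-1 -> /private/tmp/run-1 on macOS
  return `${resolved}:${containerDir}`;
}

// Usage (illustrative):
//   spawn("podman", ["run", "-v", mountArg(taskDir, "/task"), image]);
```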

The CLAUDECODE environment variable, when set, prevents nested Claude CLI sessions from launching. My agent containers inherited it from the host. Every agent run that tried to spawn a sub-agent silently failed. The fix was one line. Finding it took hours.
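The one-line fix can be sketched as a small helper (`agentEnv` is an illustrative name; the variable name is as described above):

```typescript
// Build the child environment for an agent container without the
// host's CLAUDECODE flag, which blocks nested Claude CLI sessions
// when inherited.
function agentEnv(
  base: Record<string, string | undefined> = process.env,
): Record<string, string | undefined> {
  const env = { ...base };
  delete env.CLAUDECODE; // the fix: stop the container inheriting it
  return env;
}

// Usage (illustrative):
//   spawn("podman", ["run", ...args], { env: agentEnv() });
```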

These aren't failures. They're the kind of edge cases that teach you what abstraction layers actually hide.

The measurement

After five days of Phase 1 implementation — real git operations, real container orchestration, real HTTP scenario evaluation — Dark Factory produced 420 lines of working code at a 29% success rate.

Most people would call that a failure. I'd call it a measurement. Because here's the thing the METR study made visceral: most teams never measure at all. They feel productive. They believe the agents are helping. They never check.

Twenty-five bugs across four sessions. None of them showed up in planning. All of them showed up in practice. The agents confidently wrote code that looked correct at every stage — and each fix revealed another assumption that hadn't survived contact with reality.

I checked. And the numbers told me something I wouldn't have accepted from a blog post alone: for a solo developer, building orchestration infrastructure around AI agents is negative-value work. The orchestration already exists in the tools. My job was to structure the inputs well — not to rebuild the pipeline.

Meanwhile, Indragie Karunaratne shipped a 20,000-line application by giving Claude Code detailed specifications and tight feedback loops. No orchestration infrastructure at all. That isn't "he did it better." It's confirmation of the same principle Dark Factory was converging on — structured inputs beat orchestration infrastructure. The experiment predicted it. The synthesis data confirmed it. The Indragie comparison proved you didn't need 10K lines of TypeScript to get there.

What actually works

Here's what's interesting. Across thirty-plus sources — academic papers, practitioner reports, company retrospectives, open-source tooling — the approaches that produce real results independently converge on the same principles.

Structure inputs, don't build frameworks

Every successful practitioner I've studied structures their inputs rather than building infrastructure around the agent.

Jake Van Clief's Interpreted Context Methodology uses file trees as architecture — 2 to 8K tokens per stage instead of 30 to 50K monolithic dumps. Anthropic's own research concludes with "start by using LLM APIs directly" and identifies six composable patterns, explicitly warning against unnecessary frameworks. The FIC framework (Research → Plan → Implement) maintains 40 to 60% context utilisation — never exhausting the window.

The pattern is consistent: curate aggressively. Every token of irrelevant context is a token of diluted attention. The agent doesn't need to know everything. It needs to know the right things.

Review plans, not code

The FIC framework articulates a leverage hierarchy that matches what I've observed:

  • Bad code produces one bad line.
  • A bad plan produces hundreds of bad lines.
  • Bad research produces thousands of bad lines.

Therefore: concentrate human effort on reviewing research and plans. Let the agent write code from a good plan. Code review is low-leverage. Plan review is high-leverage.

This maps to Indragie's detailed specs before Claude writes anything. It maps to Addy Osmani's "70% effort on problem definition, 30% execution." It maps to what Dark Factory's train/test split was trying to enforce structurally — the quality of the input determines the quality of the output.

The instinct is to review the code. The data says review the plan. The difference in leverage is orders of magnitude.

Less context beats more context

This one was counterintuitive until the evidence became overwhelming.

A study from ETH Zurich found that repository-level context files — the .cursorrules, AGENTS.md, comprehensive guides that everyone recommends — actually reduce task success rates while increasing inference costs by over 20%.

Vercel compressed 40KB of documentation to an 8KB index and went from inconsistent results to a 100% pass rate.

A SWE-Skills-Bench evaluation from March 2026 found that 80% of agent skills (39 out of 49) produce zero improvement. The average skill gain was 1.2%. Three skills actively degraded performance by up to 10%. Token overhead increased up to 451% — for nothing.

Curation beats quantity. For context, for tools, for skills, for everything.

Domain-specific beats general-purpose

The systems that actually work aren't general-purpose coding agents. They're tightly scoped to a domain. AlphaEvolve discovers algorithms. TradingAgents makes financial decisions through multi-agent debate. OpenMAIC builds interactive classrooms.

General-purpose coding agents — including Dark Factory — produce fragile results. Mo Bitar called it "local perfection, systemic incoherence." Each agent produces something that looks right in isolation. The system as a whole drifts. The more you scope the problem, the better the output. This isn't a temporary limitation. It's a reflection of how well you can define success for a given domain.

Where I am with it

My workflow now looks nothing like it did three months ago.

Research → Plan → Implement → Review. That's the loop. Human effort is concentrated on research and planning — the highest-leverage review points. The agent writes code from a plan that's already been validated.

I use 2 to 8K tokens of context per task, not 30 to 50K. I maintain explicit load and don't-load tables — telling the agent what to ignore is as important as telling it what to read. I compact aggressively. I treat context utilisation like a budget, not a buffer.
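The load and don't-load tables plus the token budget can be sketched as two small functions. Everything here is illustrative: the names, the table shape, and the rough 4-characters-per-token estimate.

```typescript
interface ContextTable {
  load: string[];     // files the agent must read
  dontLoad: string[]; // files the agent must ignore, even if a glob matches
}

// Keep only allow-listed files, with the deny list winning on conflict.
function selectContext(candidates: string[], table: ContextTable): string[] {
  const allow = new Set(table.load);
  const deny = new Set(table.dontLoad);
  return candidates.filter((f) => allow.has(f) && !deny.has(f));
}

// Enforce a hard per-task token budget: stop adding files once the
// estimate would exceed it. A budget, not a buffer.
function withinBudget(
  contents: Map<string, string>,
  selected: string[],
  maxTokens = 8000,
): string[] {
  let used = 0;
  const kept: string[] = [];
  for (const file of selected) {
    const estimate = Math.ceil((contents.get(file) ?? "").length / 4);
    if (used + estimate > maxTokens) break;
    used += estimate;
    kept.push(file);
  }
  return kept;
}
```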

Human checkpoints go on research and plans — never on generated code. If the plan is right and the tests pass, the code is almost certainly fine. If the plan is wrong, no amount of code review will save you.

The agent is the orchestrator. My job is to give it the right inputs.

I'm honest about the uncertainty. Will model improvements make this workflow obsolete? Maybe. Amodei talks about two to three years to human-level coding. If that happens, the careful context engineering we're all learning may become unnecessary. But "maybe in two years" is not a strategy. It's a hope. And the research on multi-agent systems suggests that as single models improve, the benefits of complex orchestration diminish rather than compound.

We operate in the present. And in the present, the data is unambiguous.

The gap

Forty-three points. That's the distance between the speedup those developers expected and what actually happened.

I've sat on both sides of that gap. I've watched agents produce output that felt like progress while the system underneath accumulated complexity I'd have to unwind later. I've felt the pull of "it's working" when what I really meant was "it's producing output." Those are not the same thing.

And if you think that gap only applies to the sixteen people in a research study — that your setup is different, your prompts are better, your workflow is more disciplined — consider that every single developer in that study believed the same thing.

The gap doesn't close by feeling more confident. It closes by measuring.

*Sources: METR RCT — AI developer speed study · Google DORA Report 2025 · CMU longitudinal study (arXiv 2511.04427) · Anthropic — Building Effective Agents · GitClear — AI Code Quality 2025 · CodeRabbit — AI vs Human Code Generation Report · ETH Zurich — Evaluating AGENTS.md (arXiv 2602.11988) · SWE-Skills-Bench (arXiv 2602.12670)*