Demos are seductive. You watch an agentic AI system take a prompt, break it into subtasks, write the code, run the tests, and ship. It looks like the future arriving on schedule. Then you try to put it in a loop that runs overnight without supervision.
That's where the cliff is.
The 90% reliability problem
The thing about 90% reliability is that it feels like 100% when you're watching. You run the agent ten times, it works ten times, and you start to trust it. But 90% means ten failures in every hundred runs; you just haven't seen them yet. Chain five agents at 90% each and system reliability drops to 59%. The maths isn't the problem. The problem is that 90% feels like it's almost there.
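The compounding is just multiplication, but a two-line sketch makes the cliff concrete:

```python
# A pipeline of chained agents succeeds only if every stage succeeds,
# so per-stage success rates multiply.
def chain_reliability(per_stage: float, stages: int) -> float:
    return per_stage ** stages

print(round(chain_reliability(0.90, 5), 2))  # → 0.59
```

At ten stages, the same 90% per step leaves you at roughly 35%.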
Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027. METR's research found that experienced developers felt 20% faster using AI tools while actually being 19% slower. Confidence and capability diverged — and the developers themselves didn't know.
In a pipeline of agents, this divergence compounds. A wrong assumption in step two becomes a confident incorrect implementation in step five. And because each agent is fluent and coherent, the output looks right at every stage. The cascading error is invisible until something breaks downstream.
The confidence problem in multi-agent systems
I had 20 to 40 agents working in parallel at times: planners, coders, testers, reviewers, each with their own context and configuration.
The same over-engineering instinct I wrote about in Clever Is Not a Feature got amplified by the agents. We were splitting code generation into 80% deterministic output using Nx generators and 20% probabilistic output through Claude for the business logic. Reasonable in theory. But when I started researching compiler theory patterns to make the deterministic layer more robust, the agents were right there with me — pulling papers, suggesting approaches, producing implementations. Every rabbit hole felt productive because it had coherent output.
Then coordination became the actual problem. Twenty agents with optimistic concurrency slowed to the throughput of two or three. So we started building a heartbeat and orchestration layer — session hooks, agent health checks, structured handoffs. We were building infrastructure to manage agents.
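Why optimistic concurrency collapses under many writers can be shown with a toy model. This is an illustration, not our orchestration code, and it deliberately models the worst case where every agent touches the same shared state:

```python
# Toy model of optimistic concurrency: every agent snapshots the same
# shared version, then races to commit. Only a commit made against a
# fresh snapshot lands; a stale snapshot means the work is discarded
# and the agent has to redo it.
def committed_per_round(agents: int, rounds: int) -> int:
    version = 0
    commits = 0
    for _ in range(rounds):
        snapshots = [version] * agents      # everyone reads the same state
        for snap in snapshots:
            if snap == version:             # snapshot still fresh: commit lands
                version += 1
                commits += 1
            # stale snapshot: this agent's round of work is wasted
    return commits

# One commit lands per round no matter how many agents race:
print(committed_per_round(2, 100), committed_per_round(20, 100))  # → 100 100
```

In the toy model, twenty agents produce ten times the wasted work of two while landing exactly the same number of commits. In practice the overlap was only partial, which is how twenty agents ended up with the throughput of two or three.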
The wake-up call was a painful two-to-three-day cleanup. Documentation had drifted from the codebase. Generated code had accumulated patterns that no single agent had chosen but that emerged from forty agents making locally reasonable decisions.
A paper from February 2026 studying whether repository-level context files improve agent performance found something that matched what we'd already seen: context files tend to reduce task success rates and increase inference costs by over 20%. The recommendation was to describe only minimal requirements. We stripped out the comprehensive guides, the detailed patterns docs, the exhaustive rules. Things started improving substantially.
Three agentic AI projects, three lessons
I'm running three projects at different levels of agent autonomy right now, and the comparison is the clearest signal I have.
Project one is a replica of Twenty from scratch — maximum orchestration, seven agent types, eleven hooks, six skills, Nx generators. Maximum drift. The cleanup was brutal and the lesson was expensive.
Project two is an Android TV launcher, NebulaTV: a single module, 7,500 lines of Kotlin. Minimal agent context. I steer based on viewable output: does the launcher look right? Does navigation work? Beyond the initial spec and foundation work, it runs mostly on agents. It works.
Project three has no Claude initialisation at all. No CLAUDE.md, no agent configuration, no skills. It's operating at what I'd call level-two agentic development — the agent follows instructions, writes code, and I review. It's the most reliable of the three.
The pattern is uncomfortable: the project with the most agent infrastructure had the most problems. The project with the least had the fewest. Anthropic's research on long-running agents arrived at a similar conclusion — the key to reliability is structured state management and incremental progress, not comprehensive context. Less infrastructure, more verification. Less context, more focus.
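A minimal sketch of what structured state and incremental progress can look like, assuming nothing beyond the source's description. The names here (run_pipeline, run_step, verify) are illustrative, not any real agent API: each increment is verified before the next one starts, so a failure surfaces at the step that caused it rather than five stages downstream.

```python
# Sketch of incremental progress with verification between steps.
# Hypothetical names; not a real agent framework.
from dataclasses import dataclass, field

@dataclass
class State:
    completed: list = field(default_factory=list)   # verified (step, result) pairs

def run_pipeline(steps, run_step, verify):
    state = State()
    for step in steps:
        result = run_step(step)
        if not verify(step, result):    # catch the failure at its source,
            return state, step          # not downstream
        state.completed.append((step, result))
    return state, None

# Simulated run where verification fails at the "code" step:
state, failed = run_pipeline(
    ["plan", "code", "test"],
    run_step=lambda s: s.upper(),
    verify=lambda s, r: s != "code",
)
print(failed, [s for s, _ in state.completed])  # → code ['plan']
```

The point of the pattern is that verified state is the only state that advances; everything an agent produces is provisional until a check outside the agent says otherwise.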
Where humans belong in agentic pipelines
In The Mirror Problem, I wrote about keeping humans in the loop "at the points that matter, not just at the points that are convenient." I now have a more honest version of that.
Here's something I haven't seen in the discourse about agentic development: it's addictive. Not metaphorically. I was running 20 to 40 agents, watching results stream in. Ideas generating ideas generating implementations. For several nights running I slept no more than three hours, not from stress but from euphoria. You stop directing the agents and start feeding them. The pipeline becomes the product, and the actual product stalls.
That's the failure mode nobody talks about. The human in the loop can become the human consumed by the loop. The point that actually matters is the output — does the feature work? Does anyone need it? The TV launcher runs on agents with almost no infrastructure, and it works, because I spend my time looking at the screen instead of tuning the machine.
Why the agentic AI gap isn't closing
The common assumption is that agents will get better, and the cliff will flatten. Models will improve. Context windows will grow. Tool use will get more reliable.
All of that is true and none of it addresses the core issue. The cliff isn't a capability problem — it's a verification problem. Making agents more capable doesn't help if you can't verify their output at the speed they produce it.
Google's 2025 DORA Report found that increased AI coding adoption correlates with a 9% climb in bug rates and a 91% increase in code review time. LinearB's benchmarks show 67.3% of AI-generated pull requests get rejected, compared to 15.6% for manual code. Only 11% of organisations are actively using agentic AI in production, while 42% are still developing their strategy. The gap between excitement and deployment is the cliff in aggregate.
The real progress will come from better ways to know when an agent is wrong. Not better agents that are wrong less often — though that helps — but better systems for catching the failures that remain.
Where I am with it
I'm in a better place than I was a month ago. The cleanup forced a reset. But the paradox I'm sitting with is this: the projects where I've built the most infrastructure to make agents trustworthy are the ones I trust least. The CRM needed three days of cleanup. The TV launcher mostly runs on its own. The difference wasn't the model. It was how much I tried to control.
Every piece of agent infrastructure feels like it serves the product — until you step back and realise you've spent two weeks serving the pipeline. The question I'm trying to hold onto: is this the product, or is this the machine?
Sources: Gartner — agentic AI project cancellation · Google DORA Report 2025 · METR — AI developer speed study · Evaluating AGENTS.md — arXiv 2602.11988 · Anthropic — Effective harnesses for long-running agents · Deloitte 2026 Tech Trends · LinearB Software Engineering Benchmarks