Key Takeaways:
  • AI agents fail mostly because small errors multiply over many steps. A 95 percent per-step success rate collapses to about 36 percent over 20 steps.
  • The longer the task, the worse it gets. Success drops sharply past about half an hour, and doubling the task length roughly quadruples the failure rate.
  • That is why agents demo beautifully and break in production, and why reliability is the hard problem of 2026.

The Setup

2026 is the year of the AI agent. Every lab is shipping systems that act on their own: book the trip, write and run the code, work a task from start to finish. The demos are stunning. Then you put one on a real, long job and it falls apart. There is a clean mathematical reason why.

What an Agent Actually Is

An AI agent is a model wrapped in a loop. Instead of one question and one answer, it takes a goal, breaks it into steps, acts, looks at the result, and decides what to do next, again and again until the job is done. Each step is a small decision the model makes. A simple task might be three steps. A real one, like fixing a bug across a codebase or planning a multi-leg trip, can be dozens. That length is the whole problem.
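
The loop described above can be sketched in a few lines. This is a minimal illustration and not any lab's actual agent: run_agent, toy_decide, and toy_act are hypothetical names, and the "model" here is a stub that counts steps.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]


def run_agent(
    goal: int,
    decide: Callable[[int, History], Optional[str]],
    act: Callable[[str], str],
    max_steps: int = 10,
) -> History:
    """Decide, act, observe, repeat, until decide() says done or the budget runs out."""
    history: History = []
    for _ in range(max_steps):
        action = decide(goal, history)      # the model picks the next step
        if action is None:                  # None means "the job is done"
            break
        observation = act(action)           # carry out the step, look at the result
        history.append((action, observation))
    return history


# Hypothetical stand-ins for a real model and real tools.
def toy_decide(goal: int, history: History) -> Optional[str]:
    done = len(history)
    return None if done >= goal else f"step-{done + 1}"


def toy_act(action: str) -> str:
    return f"{action} ok"


trace = run_agent(goal=3, decide=toy_decide, act=toy_act)
print(len(trace), "steps taken")
```

Every pass through the loop is one more decision that has to go right, which is why the number of passes matters so much.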

Why Small Errors Become Big Failures

Here is the math nobody puts on the slide. Reliability multiplies across steps. If every step has to go right and each one fails independently of the others, an agent that is right 95 percent of the time per step finishes a 20-step task only about 36 percent of the time, because 0.95 to the power of 20 is roughly 0.36. Drop to 85 percent per step over 10 steps and you fall to around 20 percent. One slip early, with no way to notice or recover, and the whole chain quietly goes wrong. The model did not get dumber. The task just had too many places to fail.
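
The arithmetic is easy to check. The sketch below assumes the simple model in the paragraph above: every step must succeed, and steps fail independently. The function name chain_success is just an illustrative label.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step ** steps


for per_step, steps in [(0.95, 20), (0.85, 10)]:
    rate = chain_success(per_step, steps)
    print(f"{per_step:.2f} per step over {steps} steps -> {rate:.2f}")
# 0.95 per step over 20 steps -> 0.36
# 0.85 per step over 10 steps -> 0.20
```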

The Time Wall

The pattern shows up as a time limit. Research finds agent success rates fall off sharply once a task runs past roughly half an hour, and that doubling a task's length tends to quadruple the failure rate. There is even a measured 50 percent reliability horizon, the task length at which an agent succeeds half the time. For frontier agents it sat around an hour in early 2025 and is projected to stretch toward several hours by 2027. Useful, and still short of a full workday.
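
To get a feel for what a 50 percent horizon means, the same toy model can be run backwards: given a per-step success rate, how many steps until the odds of finishing drop to a coin flip? This is a sketch under the independent-step assumption and counts steps, not minutes. It is not the method behind the measured figures above, and it does not reproduce the quadrupling pattern. The name horizon_steps is hypothetical.

```python
import math


def horizon_steps(per_step: float, target: float = 0.5) -> float:
    """Steps n at which per_step ** n falls to `target` (independent-step model)."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    return math.log(target) / math.log(per_step)


for per_step in (0.95, 0.99, 0.999):
    print(f"{per_step} per step -> 50% horizon at about {horizon_steps(per_step):.0f} steps")
```

In this model, going from 95 to 99 percent per step moves the horizon from roughly 14 steps to roughly 69, so small gains in per-step accuracy buy a lot of task length.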

What It Means For Investors

This reframes the agent hype. The bottleneck is whether the system can run twenty steps without derailing, which is a different problem from being smart on one step. So the real value sits in the unglamorous layer: checkpoints, error recovery, verification, and tools that let an agent catch and fix its own mistakes. Companies selling reliability and orchestration may matter more than the next model release. And be skeptical of agent demos and one-shot benchmark scores, since a single successful run hides how often the same task fails on a repeat attempt.

How They Are Trying to Fix It

The fixes are mostly about structure rather than bigger brains. Break a long job into smaller, checkpointed pieces. Add verification steps so the agent checks its own work. Let it save progress and resume after a failure instead of starting over. Use several agents to cross-check each other. None of this makes the model smarter. It makes the system survivable, which for long tasks is what counts.
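
Three of those ideas (small pieces, verification, and resume after failure) fit in one short sketch. Everything here is an assumption made for illustration: run_with_checkpoints is a hypothetical helper, the checkpoint is a plain in-memory dict where a real system would persist it, and each step carries its own verify function.

```python
from typing import Any, Callable, Dict, List, Tuple

# (name, action, verify): hypothetical shape for one checkpointed piece of work.
Step = Tuple[str, Callable[[], Any], Callable[[Any], bool]]


def run_with_checkpoints(
    steps: List[Step], checkpoint: Dict[str, Any], max_attempts: int = 3
) -> bool:
    """Run steps in order, skipping finished ones and retrying failed checks."""
    for name, action, verify in steps:
        if name in checkpoint:
            continue                      # done in an earlier run: resume past it
        for _ in range(max_attempts):
            result = action()
            if verify(result):            # the agent checks its own work
                checkpoint[name] = result
                break
        else:
            return False                  # gave up; checkpoint keeps earlier progress
    return True


# Toy plan: the second step returns garbage on its first call only.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    return "ok" if calls["n"] > 1 else "garbage"


saved: Dict[str, Any] = {}
plan: List[Step] = [
    ("fetch", lambda: "data", lambda r: r == "data"),
    ("transform", flaky, lambda r: r == "ok"),
]
finished = run_with_checkpoints(plan, saved)
print(finished, saved)
```

Calling run_with_checkpoints again with the same saved dict does no new work, which is the resume behavior: a failure late in the job costs one step, not the whole run.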

FAQ

If models keep getting smarter, won't this just solve itself?
Partly. Higher per-step accuracy helps a lot, since that is the number that compounds. But even 99 percent per step still fails meaningfully over hundreds of steps, so structure and recovery stay essential. Smarter helps, but it does not erase the math.
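
A rough sketch of why recovery matters even at 99 percent, under optimistic assumptions that should be read as such: steps are independent, a verifier catches every bad step, and a retry is as likely to succeed as the first try. The names chain_success and with_retries are illustrative.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step ** steps


def with_retries(per_step: float, retries: int) -> float:
    """Effective per-step success when each failed step can be retried.

    Assumes a perfect verifier and independent attempts (both optimistic).
    """
    return 1 - (1 - per_step) ** (retries + 1)


plain = chain_success(0.99, 200)
recovered = chain_success(with_retries(0.99, retries=1), 200)
print(f"200 steps at 99%, no recovery: {plain:.2f}")
print(f"200 steps at 99%, one verified retry per step: {recovered:.2f}")
# 200 steps at 99%, no recovery: 0.13
# 200 steps at 99%, one verified retry per step: 0.98
```

Real verifiers miss things, so the second number is a ceiling, but the gap shows why recovery is worth building.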

Why do agent demos look so good then?
Demos are short, curated, and usually run once. Real work is long, messy, and unforgiving, and a single run hides how often the same task fails on a retry. On one benchmark, top agents pass a task on a single attempt less than half the time and pass it eight times in a row less than a quarter of the time.
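
The difference between those two numbers is easy to compute once the same tasks are run more than once. The sketch below is one simple way to do it and is not the scoring code of any particular benchmark. The run results are made up purely for illustration, and pass_first and pass_all are hypothetical names.

```python
from typing import Dict, List


def pass_first(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that passed on their first run."""
    return sum(outcomes[0] for outcomes in runs.values()) / len(runs)


def pass_all(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks that passed on every one of their first k runs."""
    if any(len(outcomes) < k for outcomes in runs.values()):
        raise ValueError("every task needs at least k recorded runs")
    return sum(all(outcomes[:k]) for outcomes in runs.values()) / len(runs)


# Made-up outcomes for four tasks, four runs each (illustration only).
runs = {
    "task-a": [True, True, True, True],
    "task-b": [True, False, True, True],
    "task-c": [False, True, False, True],
    "task-d": [True, True, True, False],
}
print(f"passed first run: {pass_first(runs):.2f}")
print(f"passed 4 runs in a row: {pass_all(runs, 4):.2f}")
```

On this toy data a one-shot score reports 75 percent, while only a quarter of the tasks pass every time.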

What should I actually take away?
Judge an agent by long-task reliability rather than a one-shot benchmark score. Ask how it handles a step going wrong, whether it can checkpoint and resume, and how consistent it is across repeated runs. Consistency is the real metric.