Most AI vendors don't benchmark for reliability. A new benchmark from Princeton researchers does.
Tuesday, March 24, 2026
AI agents are getting more capable, but reliability is lagging—and that’s a problem


Hello and welcome to Eye on AI. In this edition…AI’s reliability problem…Trump sends an AI legislation blueprint to Congress…OpenAI consolidates products into a super app and hires up…AI agents that can improve how they improve…and does your AI model experience emotional distress?

Like many of you, I’ve started playing around with AI agents. I often use them for research, where they work pretty well and save me substantial amounts of time. But so-called “deep research” agents have been available for over a year now, which makes them a relatively mature product in the AI world. I’ve also started trying the new crop of computer-using agents for other tasks. And here, my experience so far is that these agents are highly inconsistent.

For instance, Perplexity’s Computer, which is an agentic harness that works in a virtual machine with access to lots of tools, did a great job booking me a drop-off slot at my local recycling center. (It used Anthropic’s Claude Sonnet 4.6 as the underlying reasoning engine.) But when I asked it to investigate flight options for an upcoming business trip, it failed to complete the task—even though travel booking is one of those canonical use cases that the AI companies are always talking about. What the agent did do is eat up a lot of tokens over the course of 45 minutes of trying.

Last week, at an AI agent demo event Anthropic hosted for government and tech policy folks in London, I watched Claude Cowork initially struggle to run a fairly simple data-sorting exercise in an Excel spreadsheet, even as it later created a sophisticated budget forecasting model with seemingly no problems. I also watched Claude Code spin up a simple, text-based business strategy game I asked it to create that looked great on the surface, but whose underlying game logic didn’t make any sense.

Assessing AI agents’ reliability
Unreliability is a major drawback of current AI agents. It's a point that Princeton University's Sayash Kapoor and Arvind Narayanan, who cowrote the book AI Snake Oil and now cowrite the "AI as Normal Technology" blog, frequently make. And a few weeks ago they published a research paper, co-authored with four other computer scientists, that tries to think systematically about AI agent reliability and to benchmark leading AI models on it.

The paper, titled "Towards a Science of AI Agent Reliability," notes that most AI models are benchmarked on their average accuracy on tasks, a metric that allows for wildly unreliable performance. Instead, the researchers look at reliability across four dimensions: consistency (if asked to perform the same task in the same way, do agents always behave the same way?); robustness (can they function even when conditions aren't ideal?); calibration (do they give users an accurate sense of their certainty?); and safety (when they do mess up, how catastrophic are those mistakes likely to be?).
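To make the consistency and calibration ideas concrete, here is a minimal sketch of how they can be estimated from repeated trials. This is in the spirit of the paper's approach, not its exact formulas; the agent callable, its success flag, and its self-reported confidence are hypothetical stand-ins.

import statistics

def run_trials(agent, task, k=10):
    # Run the identical task k times; each hypothetical call returns
    # (succeeded: bool, stated_confidence: float between 0 and 1).
    return [agent(task) for _ in range(k)]

def consistency(results):
    # Fraction of runs that land on the majority outcome; 1.0 means the
    # agent always does the same thing when given the same task.
    outcomes = [succeeded for succeeded, _ in results]
    return max(outcomes.count(True), outcomes.count(False)) / len(outcomes)

def calibration_gap(results):
    # Gap between average stated confidence and actual success rate;
    # 0.0 means the agent's certainty matches its real performance.
    avg_confidence = statistics.mean(c for _, c in results)
    success_rate = statistics.mean(s for s, _ in results)
    return abs(avg_confidence - success_rate)

A well-calibrated agent that succeeds on 70% of attempts should, on average, report roughly 70% confidence.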

They further broke these four areas into 14 specific metrics and tested a number of models released in the 18 months prior to late November 2025 (so OpenAI's GPT-5.2, Anthropic's Claude Opus 4.5, and Google's Gemini 3 Pro were the most advanced models tested). They ran the models through two benchmarks, one a general test of agentic tasks and the other a simulation of customer-support queries and tasks. They found that while reliability improved with each successive model release, it did not improve nearly as much as average accuracy. In fact, on the general agentic benchmark the rate of improvement in reliability was half that of accuracy, while on the customer-support benchmark it was one-seventh!

Reliability metrics depend on the task at hand
Across the four areas of reliability the paper examined, Claude Opus 4.5 and Gemini 3 Pro scored the best, both with an overall reliability of 85%. But if you look at the 14 sub-metrics, there was still plenty of reason for concern. Gemini 3 Pro, for example, was poor at judging when its answers were likely to be accurate, scoring just 52%, and terrible at avoiding potentially catastrophic mistakes, scoring just 25%. Claude Opus 4.5 was the most consistent in its outcomes, but even it scored only 73% on consistency. (I would urge you to check out and play around with the dashboard the researchers created to show the results across all the different metrics.)

Kapoor, Narayanan, and their co-authors are also sophisticated enough to know that reliability is not a one-size-fits-all metric. They note that if AI is being used to augment humans, as opposed to fully automating tasks, it might be okay for the AI to be less consistent and robust, since the human can act as a backstop. But “for automation, reliability is a hard prerequisite for deployment: an agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system,” they write. They also note that different kinds of consistency matter in different settings. “Trajectory consistency matters more in domains that demand auditability or process reproducibility, where stakeholders must verify not just what the agent concluded but how it got there,” they write. “It matters less in open-ended or creative tasks where diverse solution paths are desirable.”

Either way, Kapoor, Narayanan, and their co-authors are right to call for benchmarking of reliability and not just accuracy, and for AI model vendors to build their systems for reliability and not just capability. Another study that came out this week shows the potential real-world consequences when that doesn't happen. AI researcher Kwansub Yun and health consultant Claire Hast looked at what happens when three different AI medical tools are chained together in a system, as might happen in a real health care setting. An AI imaging tool that analyzed mammograms had an accuracy of 90%, a transcription tool that turned an audio recording of a doctor's examination of a patient into medical notes had an accuracy of 85%, and their outputs were then fed to a diagnostic tool that had a reported accuracy of 97%. And yet when chained together, the system's overall reliability score was just 74%. That means one in four patients might be misdiagnosed!
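The arithmetic behind that 74% is worth spelling out. Under the simplifying assumption that each tool's errors are independent, per-stage accuracies multiply, and a quick back-of-the-envelope calculation lands almost exactly on the study's figure:

# Back-of-the-envelope only: assumes each stage's errors are independent,
# which real clinical pipelines won't strictly satisfy.
imaging, transcription, diagnosis = 0.90, 0.85, 0.97
end_to_end = imaging * transcription * diagnosis
print(f"{end_to_end:.1%}")  # 74.2%, i.e. roughly one patient in four affected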

A foolish consistency may be the hobgoblin of little minds, as Ralph Waldo Emerson famously said. But, honestly, I think I’d prefer that hobgoblin to the chaotic gremlins that currently plague our ostensibly big AI brains. 

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

Before we get to the news, I want to encourage everyone to read my Fortune colleague Allie Garfinkle’s awesome feature story about Cursor. Cursor is the AI coding startup that as recently as four months ago was a Silicon Valley darling, but which many people now think may be facing an existential threat because of new coding agents, such as Anthropic’s Claude Code, that seemingly obviate the need to use Cursor. Allie’s story lays bare all the contradictions around this company—how it has continued to see record revenue growth, even as many in Silicon Valley now harbor doubts about its survival; how it is racing to train its own coding agents, pivoting from the developer-centric coding interface that made it so popular with programmers in the first place; how its impossibly young CEO Michael Truell works under a portrait of Robert Caro, the biographer whose projects often lasted decades, while Cursor needs to operate in an industry in which a year can feel like a century. Allie’s story is definitely worth the time.

AI IN THE NEWS

Trump sends AI legislation blueprint to Congress. The White House has released a light-touch AI policy blueprint that it wants Congress to turn into federal law. The recommended framework places an emphasis on preempting state AI rules that the administration says hinder innovation. The proposal would block states from regulating how models are developed and from penalizing companies for downstream uses of their AI. It also urges Congress not to create any new federal AI regulator. At the same time, it recommends some regulation, such as preserving state laws protecting children, requiring age-gating for models likely to be used by minors, promoting AI skills training, and tracking AI-related job disruption. The plan also seeks to codify Trump's pledge that tech companies should cover the electricity costs of their data centers. Winning bipartisan support for the blueprint in Congress remains doubtful; Republican leaders say some of their members have concerns about trampling on states' rights, and it is uncertain whether the child-protection measures will be enough to garner support from Democrats. You can read more from Politico here.

OpenAI looks to consolidate products into a super app. That's according to a story in the Wall Street Journal. OpenAI plans to roll ChatGPT, its Codex coding tool, and its browser into a single desktop super app as it tries to simplify its product lineup and sharpen its focus on engineering and business users. The move, led by applications chief Fidji Simo with support from president Greg Brockman, reflects a retreat from last year's more sprawling strategy of launching multiple standalone products that often failed to gain traction.

OpenAI also plans to double its workforce to 8,000. That's according to a report in the Financial Times citing two sources familiar with OpenAI's plans. The hiring is set to take place by year-end, the sources said, across product, engineering, research, sales, and customer-facing technical roles. The spree comes as the company shifts more aggressively toward enterprise sales, tries to regain momentum against Anthropic and Google, and eyes a possible IPO within the next 12 months.

And OpenAI hires a veteran Meta ad exec, even as early customers are skeptical of ad effectiveness. Meta advertising executive Dave Dugan is joining OpenAI to lead ad sales, the Wall Street Journal reports. The hire shows OpenAI is getting serious about advertising as it looks to find more revenue. But it also comes as The Information reports that some early customers of OpenAI's in-chat advertising are unsure how effective those ads have been. Clearly, Dugan has his work cut out for him.