AgentWebBench Paper with CMU
- Apr 21
The web is being rebuilt around AI agents, and we wanted to understand exactly how well that's working in practice.
Together with Carnegie Mellon University, we've released AgentWebBench — the first benchmark to rigorously test how AI agents coordinate with each other to answer questions online. We built this because we care about a healthy ecosystem, not just single-model demos, and the findings challenge some core assumptions about where the agentic web is headed.
The shift is already underway: instead of browsing freely, user-facing AI agents now negotiate with website-specific AI agents to access information through MCP-style protocols, API-walled content, and provider-controlled access.
AgentWebBench stress-tests this model across 100 websites and 18.4 million documents to understand what actually works.
The counterintuitive finding is that decentralized, agent-to-agent coordination currently underperforms traditional search, except on factual Q&A, where it wins outright. And as frontier models scale up, the gap closes surprisingly fast.
Beyond the headline result, the research surfaces design principles we think matter deeply for anyone building in this space. AI agents naturally concentrate traffic on a small set of "safe" sources, which raises real concerns about discoverability and ecosystem health. Planning and reasoning close the performance gap more than raw model power does. And for the first time, teams building agent systems have a proper debugging lens, one that separates user-agent planning failures from content-agent retrieval failures.
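To make the debugging lens concrete, here is a minimal sketch of how a team might attribute a wrong answer to one side or the other. It is an illustration under our own assumptions, not the paper's evaluation code: the `Trace` fields, the label strings, and the catch-all `"other"` bucket are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    """One question's run. All field names are hypothetical."""
    gold_site: str            # site whose content agent holds the answer
    sites_queried: List[str]  # sites the user agent decided to contact
    gold_doc_returned: bool   # did that site's content agent return the answer document?
    answer_correct: bool      # was the final answer right?


def attribute_failure(trace: Trace) -> Optional[str]:
    """Return a coarse failure label, or None if the answer was correct."""
    if trace.answer_correct:
        return None
    # The user agent never contacted the site that had the answer.
    if trace.gold_site not in trace.sites_queried:
        return "user_agent_planning"
    # The right site was contacted, but its agent did not surface the document.
    if not trace.gold_doc_returned:
        return "content_agent_retrieval"
    # Evidence was available and the answer was still wrong (assumed catch-all).
    return "other"


# Example: the user agent skipped the site that held the answer.
label = attribute_failure(
    Trace(gold_site="a.example", sites_queried=["b.example"],
          gold_doc_returned=False, answer_correct=False)
)
```

Counting these labels over a set of questions gives a rough picture of whether effort should go into the user agent's planning or into the content agents' retrieval.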

This research matters for Anaxi Labs because it shows that the next wave of AI will be driven by coordinated ecosystems rather than isolated models. We are building a global data supply chain for AI and robotics and a programmable marketplace for datasets, prompts, agents, and workflows, and these results validate the need for infrastructure that makes those components reusable, interoperable, and economically measurable.
In robotics specifically, agentic AI increases the importance of high-quality training data, evaluation pipelines, and specialized agents that can support planning, orchestration, and adaptation across complex physical environments. Jensen Huang’s recent public comment (“the idea that an OpenClaw will be running inside a robot is fairly obvious”) and the view of Nvidia’s robotics chief that AI agents will bring about a ChatGPT moment for robotics both reinforce our thesis and make this research strategic for us.
Agents will increasingly coordinate with other agents via controlled interfaces, and we built AgentWebBench so the industry can measure what works, diagnose what fails, and improve coordination responsibly.

