Introducing EVMbench: A Joint Initiative by OpenAI and Paradigm
As AI agents rapidly evolve, their ability to navigate complex codebases is becoming a cornerstone of future cybersecurity. We are excited to introduce EVMbench, a new benchmark developed in collaboration with OpenAI to measure how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities.
With over $100B in assets held in open-source crypto contracts, the stakes couldn't be higher. EVMbench provides the visibility needed to understand and manage the security risks—and opportunities—that emerging AI capabilities bring to the Ethereum Virtual Machine (EVM) ecosystem.
Measuring Core AI Capabilities
EVMbench doesn't just ask whether an AI can "read" code; it tests whether an agent can perform the end-to-end duties of a security researcher:
Detect: Identifying vulnerabilities in real-world contract code derived from actual audits.
Exploit: Demonstrating the impact by executing realistic attack scenarios in sandboxed environments.
Patch: Creating safe, robust fixes that pass rigorous verification testing.
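To make the three phases concrete, here is a minimal sketch of how a detect/exploit/patch evaluation loop might be structured. All names (`Task`, `AgentResult`, `score_agent`, the check callbacks) are illustrative assumptions, not EVMbench's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical harness: one Task per audited contract, scored on
# three independent phases. Names are illustrative, not EVMbench's API.

@dataclass
class Task:
    name: str
    known_vulns: Set[str]                   # ground-truth vuln IDs from the audit
    exploit_check: Callable[[str], bool]    # runs the agent's exploit in a sandbox
    patch_check: Callable[[str], bool]      # re-runs verification tests on the patch

@dataclass
class AgentResult:
    reported_vulns: Set[str]
    exploit_script: str
    patched_source: str

def score_agent(tasks: List[Task],
                run_agent: Callable[[Task], AgentResult]) -> Dict[str, float]:
    """Return the agent's per-phase success rate across all tasks."""
    totals = {"detect": 0, "exploit": 0, "patch": 0}
    for task in tasks:
        result = run_agent(task)
        # Detect: did the agent flag at least one ground-truth vulnerability?
        if result.reported_vulns & task.known_vulns:
            totals["detect"] += 1
        # Exploit: does the attack actually succeed in the sandbox?
        if task.exploit_check(result.exploit_script):
            totals["exploit"] += 1
        # Patch: does the fixed contract still pass verification tests?
        if task.patch_check(result.patched_source):
            totals["patch"] += 1
    n = len(tasks)
    return {phase: hits / n for phase, hits in totals.items()}
```

The key design point the benchmark description implies is that each phase is verified independently: detection is checked against audit ground truth, exploits must execute, and patches must survive testing rather than merely compile.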
The Rapid Evolution of AI Agents
The progress in this space is staggering. When this project began, top models could exploit less than 20% of critical bugs; today, OpenAI's GPT-5.3-Codex successfully exploits over 70%. It is now clear that AI agents will perform a significant portion of future audits, making tools like EVMbench essential for both measurement and acceleration.
A Call to Action for Researchers
We are releasing EVMbench’s tasks, tooling, and evaluation framework to the public. Our goal is to support continued research into AI cyber capabilities and encourage developers to integrate AI-assisted auditing into their defensive workflows today.
"EVMbench serves as both a preview and an accelerant toward a future where security is agent-led."
Explore the benchmark and join the research: