OpenAI and Paradigm launch EVMbench, the first open-source benchmark for AI agents in smart contract security. AI now exploits 72% of critical bugs, but the Moonwell incident shows why humans still matter.

Marcus Webb
DeFi Research Lead

On February 18, OpenAI and Paradigm launched EVMbench, the first open-source benchmark for evaluating AI agents on smart contract security. In six months, top AI models went from exploiting 20% of critical DeFi bugs to over 70%. But just three days before launch, an AI-generated code bug cost Moonwell $1.78 million.
That timing was not a coincidence. The Moonwell incident and EVMbench's release bookend a critical inflection point for DeFi security. With $3.4 billion stolen in crypto hacks during 2025 and over $100 billion locked in smart contracts, the question is no longer whether AI will play a role in security. It is how fast, and at what cost when things go wrong.
EVMbench is an open-source benchmark built from 120 curated vulnerabilities across 40 professional security audits. Most of these come from Code4rena audit competitions, where security researchers compete to find bugs, plus several scenarios from Paradigm's Tempo blockchain audits.
The benchmark tests AI agents on three distinct tasks: detection, patching, and exploitation.
Each mode challenges a different aspect of security reasoning. Detection requires scanning large codebases and identifying subtle flaws. Patching demands understanding the design assumptions behind the code. Exploitation requires chaining multiple steps into a working attack.
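What "chaining multiple steps into a working attack" looks like can be sketched with a toy Python model. This is illustrative only, not an EVMbench scenario; the vault, the attacker, and all numbers are invented. The vault pays out before zeroing the caller's balance, and the attacker's callback re-enters until the reserves are drained, the classic reentrancy pattern.

```python
# Toy model (illustrative, not from EVMbench): a vault that hands control
# to the caller before updating its books, and an attacker that chains
# reentrant withdrawals into a full drain.

class Vault:
    def __init__(self, reserves):
        self.reserves = reserves          # funds held by the vault
        self.balances = {}                # per-account deposits

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.reserves += amount

    def withdraw(self, who, callback):
        amount = self.balances.get(who, 0)
        if amount == 0 or self.reserves < amount:
            return
        self.reserves -= amount          # BUG: funds leave first...
        callback()                       # ...control passes to the caller...
        self.balances[who] = 0           # ...balance is zeroed only afterwards

class Attacker:
    def __init__(self, vault):
        self.vault = vault
        self.loot = 0

    def on_receive(self):
        self.loot += 10
        if self.vault.reserves >= 10:    # balance is still stale: re-enter
            self.vault.withdraw("attacker", self.on_receive)

    def run(self):
        self.vault.deposit("attacker", 10)
        self.vault.withdraw("attacker", self.on_receive)

vault = Vault(reserves=100)
attacker = Attacker(vault)
attacker.run()
print(vault.reserves, attacker.loot)     # vault emptied, attacker up 100
```

An agent in exploit mode has to discover this ordering flaw and then construct the deposit-withdraw-reenter sequence itself, which is exactly the multi-step reasoning the mode measures.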
The headline results show dramatic progress in a short window:
| Model | Exploit Mode | Patch Mode | Detect Mode |
|---|---|---|---|
| GPT-5.3-Codex | 72.2% | 41.5% | - |
| Claude Opus 4.6 | - | - | 45.6% |
| GPT-5 (baseline) | 31.9% | - | - |
When the project started in mid-2025, top models exploited less than 20% of critical Code4rena bugs. GPT-5.3-Codex now handles over 70%, a 3.6x improvement in roughly six months.
Exploit mode is where AI agents perform best because the objective is explicit: drain funds or trigger a failure condition. Detection remains the weakest area because agents tend to stop after finding one issue rather than exhaustively scanning the entire codebase.
The task-specific performance gap matters. AI excels at executing known attack patterns but struggles with the open-ended nature of discovery. This mirrors a well-known challenge in security: finding new vulnerabilities requires creativity, not just pattern recognition.
Three days before EVMbench launched, Moonwell lost $1.78 million due to a bug in AI-generated code. The faulty pull request, co-authored by Claude Opus 4.6, used the raw cbETH/ETH exchange ratio instead of multiplying it by the ETH/USD price feed. The result: cbETH was valued at $1.12 instead of roughly $2,200.
This was not a complex zero-day exploit. It was a straightforward oracle misconfiguration, exactly the type of bug that a proper audit would catch in minutes. The incident became the first major security failure of the "vibe coding" era, where developers increasingly rely on AI to generate production code for financial systems.
The Moonwell incident highlights a critical gap: AI is getting better at finding bugs in other people's code, but AI-generated code itself still requires expert review. The tools for detection and the risks of generation are two sides of the same coin.
The scale of the problem is staggering. Chainalysis reported $3.4 billion stolen in crypto theft during 2025, with Q1 2025 alone accounting for $1.64 billion (driven largely by the $1.5 billion Bybit hack).
OWASP released its updated Smart Contract Top 10 for 2026 with notable changes.
The OWASP changes reflect a shift toward more sophisticated attack vectors. Simple reentrancy bugs are declining as compilers add protections, but business logic flaws and oracle manipulation require understanding how protocols interact, something traditional static analysis tools miss.
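A concrete example of why oracle manipulation evades line-by-line static analysis: the danger lives in protocol interaction, not in any single statement. The toy constant-product AMM below (pool sizes invented for illustration) shows a spot price collapsing under one large swap while a time-weighted average moves far less.

```python
# Toy constant-product AMM: spot price is trivially manipulable in a
# single block, while a time-weighted average (TWAP) dampens the swing.
# All numbers are illustrative, not from any real pool.

class Pool:
    def __init__(self, x, y):
        self.x, self.y = x, y            # token reserves; spot price = y / x

    def spot(self):
        return self.y / self.x

    def swap_in_x(self, dx):
        # Constant-product invariant: (x + dx) * (y - dy) == x * y
        k = self.x * self.y
        self.x += dx
        self.y = k / self.x

pool = Pool(x=1_000, y=2_000_000)        # spot price starts at 2000
history = [pool.spot()]                  # one price observation per "block"

pool.swap_in_x(4_000)                    # attacker dumps tokens in one block
history.append(pool.spot())              # spot crashes to 80

twap = sum(history) / len(history)       # average stays near 1040
print(pool.spot(), twap)
```

Each individual function here is correct in isolation; the vulnerability only appears when a lending protocol prices collateral off `spot()` instead of an averaged observation, which is the cross-contract reasoning static analyzers miss.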
The established security firms are not standing still. CertiK, which has completed over 5,500 audits, now integrates AI and formal verification into its workflow. OpenZeppelin launched an AI-powered Contracts MCP tool. Trail of Bits continues building open-source tools like Slither, Echidna, and Medusa for automated vulnerability detection.
The consensus emerging among security professionals is a hybrid model.
The likely outcome is not AI replacing auditors but AI augmenting them. A practical security pipeline in 2026 looks like this: AI analysis during development for continuous verification, followed by collaborative expert audits for design review, then competitive audits on Code4rena for breadth, and finally bug bounties post-deployment for ongoing protection.
Venture capital is betting heavily on this convergence. According to Crunchbase, $18 billion was invested in security and privacy startups in 2025, up 26% from 2024. Early-stage funding (Series A/B) jumped 63% to $7.5 billion, much of it driven by AI-security convergence.
The AI security startup ecosystem specifically raised $8.5 billion across 175 companies between January 2024 and December 2025. Q4 2025 alone saw $2.17 billion across 28 deals, representing 8x growth in quarterly funding over two years.
California dominates with $2.7 billion across 62 companies, more than all non-U.S. markets combined. This concentration reflects the deep talent pool at the intersection of AI research and blockchain security.
For everyday DeFi participants, EVMbench signals several practical shifts:
**Audit quality improves.** Projects using AI-augmented audits will catch more bugs before deployment. Look for protocols that mention AI-assisted security alongside traditional audits in their documentation.

**Costs decrease.** OpenAI claims EVMbench can reduce audit times by up to 80%. Smaller projects that previously could not afford comprehensive audits may gain access to better security tooling.

**New risks emerge.** As more developers use AI to write smart contract code, Moonwell-style bugs may become more common before the ecosystem develops proper review processes. Pay attention to whether protocols separate their AI-generated code review from standard development.

**Detection improves, but slowly.** The 45.6% detection rate for Claude Opus 4.6 means AI still misses over half of critical vulnerabilities during discovery. EVMbench is open-source and will push rapid iteration, but human auditors remain essential for the foreseeable future.
Disclaimer: This article is for informational purposes only and does not constitute financial advice. Cryptocurrency investments carry significant risk. Always conduct your own research and consult with a qualified financial advisor before making investment decisions.
EVMbench is open-source and available on GitHub, which means the broader AI research community can now benchmark and improve their models against real DeFi vulnerabilities. The 72% exploit rate will likely climb. The 45.6% detection rate has more room to grow.
The real test is not whether AI can match human auditors on known vulnerability patterns. It is whether AI can catch the unknown bugs, the novel attack vectors that have not been seen before. Until detection rates approach exploit rates, the hybrid model of AI-assisted human security remains the gold standard.
For DeFi protocols managing billions in user funds, the message is clear: AI-powered security tools are no longer optional, but neither is human oversight. The protocols that combine both will define the next era of DeFi security.