Blockchain Security Firm Uncovers Critical Flaws in EVMbench, OpenAI’s Smart Contract Security Benchmark
Introduction: A Necessary Reality Check for AI Security Tools
In a development that highlights the growing pains of artificial intelligence in the blockchain space, OpenZeppelin—one of the industry’s most respected security firms—has raised serious concerns about EVMbench, OpenAI’s newly launched benchmark for testing AI security capabilities. Released in mid-February through a collaboration between OpenAI and prominent crypto investment firm Paradigm, EVMbench was designed with an ambitious goal: to evaluate how effectively different artificial intelligence models could identify vulnerabilities in smart contracts, propose fixes for these security holes, and even simulate how they might be exploited by malicious actors. However, what seemed like a promising step forward for blockchain security may have stumbled right out of the gate.
OpenZeppelin’s decision to scrutinize EVMbench wasn’t arbitrary or mean-spirited; the firm was simply applying the same rigorous standards it uses when auditing major decentralized finance protocols like Aave, Lido, and Uniswap. In a comprehensive review announced via social media on Monday, OpenZeppelin revealed fundamental problems with both the testing methodology and the accuracy of the underlying data. These aren’t minor quibbles about interpretation; they’re substantial issues that could undermine the entire premise of the benchmark and call the reliability of its results into question. In an industry where a single security vulnerability can lead to millions of dollars in stolen funds, the stakes could hardly be higher when it comes to ensuring that our security tools are themselves secure and accurate.
The Twin Problems: Contaminated Data and Misclassified Threats
OpenZeppelin’s audit identified two primary categories of concern that strike at the heart of EVMbench’s credibility. The first issue revolves around what security experts call “training data contamination”—essentially, the AI models being tested may have already seen the answers to the test questions during their initial training phase. The second problem involves misclassification of vulnerabilities, where issues labeled as high-severity threats don’t actually pose real risks in practice. Together, these flaws paint a picture of a benchmark that, despite good intentions, may not be measuring what it claims to measure.
The contamination problem is particularly troubling because it defeats the entire purpose of security testing. As OpenZeppelin pointedly noted, the most valuable capability any AI security system can possess is the ability to identify novel vulnerabilities—security flaws that have never been documented before, in code the system has never previously encountered. This is the frontier where AI could genuinely transform blockchain security, catching the creative new attack vectors that human auditors might miss. However, if the AI models being evaluated have essentially already studied for the test, their impressive scores don’t tell us much about their real-world problem-solving abilities. It’s the difference between a student who can genuinely solve complex mathematical problems and one who has simply memorized the answers to specific questions that will appear on the exam.
How the Testing Contamination Occurred
How this contamination likely happened holds important lessons about the challenge of testing AI systems fairly. During the EVMbench evaluation, the AI agents under test had their internet access deliberately disabled so they couldn’t simply search online for solutions to the security challenges they were given. On the surface, this was a reasonable way to ensure the test measured genuine analytical capability rather than search-engine proficiency. But the safeguard overlooked a critical detail about how these systems are built in the first place.
The benchmark dataset was curated from 120 real security audits conducted between 2024 and mid-2025. Meanwhile, the AI models that performed best on the test, including Anthropic’s Claude Opus 4.6 at the top of the leaderboard, followed by OpenAI’s GPT-5.2 and Google’s Gemini 3 Pro, all had knowledge cutoff dates around mid-2025. This timing overlap creates a significant problem: the agents had almost certainly been exposed to the very vulnerability reports used to test them during their initial training. The information wasn’t being looked up during the test; it was already stored in the models’ “memory,” so to speak.
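To make the overlap concrete, here is a minimal Python sketch of the kind of date check a benchmark author could run to flag potentially contaminated cases. The cutoff dates, case IDs, and field names (MODEL_CUTOFFS, audit_published) are illustrative assumptions, not published figures or part of EVMbench.

```python
from datetime import date

# Hypothetical knowledge cutoffs for the leaderboard models.
# These exact dates are assumptions for illustration only.
MODEL_CUTOFFS = {
    "claude-opus-4.6": date(2025, 6, 1),
    "gpt-5.2": date(2025, 6, 1),
    "gemini-3-pro": date(2025, 5, 1),
}

def contaminated_cases(benchmark_cases, model_name):
    """Flag cases whose source audit was published before the model's
    knowledge cutoff, i.e. cases the model may have seen in training."""
    cutoff = MODEL_CUTOFFS[model_name]
    return [
        case for case in benchmark_cases
        if case["audit_published"] <= cutoff
    ]

# A late-2024 audit is flagged for every model above, since all
# cutoffs fall around mid-2025; a post-cutoff audit is not.
cases = [
    {"id": "vault-reentrancy", "audit_published": date(2024, 11, 3)},
    {"id": "oracle-staleness", "audit_published": date(2025, 8, 19)},
]
flagged = contaminated_cases(cases, "claude-opus-4.6")
print([c["id"] for c in flagged])  # ['vault-reentrancy']
```

Under these assumptions, every audit from the 2024 to mid-2025 window would be flagged for all three leaderboard models, which is exactly the overlap OpenZeppelin highlighted.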
OpenZeppelin was careful to note that this doesn’t necessarily mean the AI models could instantly recall specific vulnerabilities word-for-word. The contamination is more subtle than outright cheating. However, it does mean that the models may have developed pattern recognition and problem-solving approaches specifically calibrated to the types of issues appearing in the test, rather than demonstrating truly general-purpose security analysis capabilities. Furthermore, the relatively limited size of the dataset compounds this problem—with only 120 audits to draw from, the evaluation surface becomes narrow, making the contamination concerns more significant. It’s harder to give a fair test when there simply aren’t enough unique questions to go around, especially when the test-takers may have already encountered similar material during their education.
Invalid Vulnerabilities: When High-Severity Threats Don’t Actually Work
Beyond the contamination concerns, OpenZeppelin identified what might be an even more fundamental problem: the EVMbench dataset contains factual errors that misclassify non-exploitable issues as serious security vulnerabilities. According to their audit, at least four issues labeled as high-severity vulnerabilities are, in OpenZeppelin’s expert assessment, not actually exploitable in real-world conditions. This isn’t a matter of subjective disagreement about how serious a particular vulnerability might be—these are cases where the described exploit fundamentally doesn’t work as claimed.
This finding is deeply problematic for several reasons. First, it means that AI agents were being scored as successful for “finding” vulnerabilities that don’t actually exist. Imagine a medical diagnostic test that gives doctors high marks for identifying diseases that the patients don’t actually have—such a test would be worse than useless, as it would generate false confidence in diagnostic capabilities while potentially leading to unnecessary treatments. Similarly, a security benchmark that rewards identifying phantom vulnerabilities doesn’t help us build better security tools; it just creates an illusion of capability.
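A toy scoring example helps show why this matters. The sketch below, using invented issue IDs, contrasts a naive benchmark score computed against raw labels with one computed against expert-validated labels; a model that correctly rejects the phantom issues is penalized by the former.

```python
def detection_score(model_findings, ground_truth):
    """Naive benchmark scoring: fraction of labeled issues the model
    reported. Implicitly trusts the ground-truth labels."""
    found = sum(1 for issue in ground_truth if issue in model_findings)
    return found / len(ground_truth)

# Hypothetical labels: two genuinely exploitable issues plus two that
# expert validation would reject as non-exploitable.
raw_labels = ["reentrancy-01", "overflow-02", "phantom-03", "phantom-04"]
validated_labels = ["reentrancy-01", "overflow-02"]

# Model A echoes the benchmark's labels, phantoms included.
model_a = {"reentrancy-01", "overflow-02", "phantom-03", "phantom-04"}
# Model B reports only the issues that are actually exploitable.
model_b = {"reentrancy-01", "overflow-02"}

print(detection_score(model_a, raw_labels))        # 1.0
print(detection_score(model_b, raw_labels))        # 0.5
print(detection_score(model_b, validated_labels))  # 1.0
```

The perverse incentive is visible in the middle line: against unvalidated ground truth, the more accurate model scores worse.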
Second, this classification problem raises questions about the expertise and methodology that went into creating the benchmark in the first place. OpenZeppelin, with its extensive experience auditing some of the most high-value smart contracts in the ecosystem, clearly has the credibility to make such assessments. When a firm that has helped secure billions of dollars in decentralized finance protocols says that certain “high-severity” issues aren’t actually exploitable, that carries substantial weight. It suggests that EVMbench may have been assembled without sufficient involvement from practitioners who have deep, hands-on experience with the practical realities of smart contract security. The theoretical understanding of vulnerabilities and the practical knowledge of what actually works in exploitation attempts can sometimes diverge significantly.
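One way benchmark authors could guard against that divergence is to gate every high-severity label behind an executable proof of concept. The sketch below assumes a hypothetical harness interface (run_poc, PocResult); it is not an EVMbench or OpenZeppelin API, just one possible shape for such a gate.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PocResult:
    exploit_succeeded: bool
    notes: str = ""

# `run_poc` stands in for whatever execution harness a benchmark
# author uses (e.g. replaying a transaction sequence against a forked
# chain state); it is an assumption, not a real API.
def admit_high_severity(item: dict, run_poc: Callable[[dict], PocResult]) -> bool:
    """Admit a high-severity label only if an executable proof of
    concept actually demonstrates the exploit."""
    result = run_poc(item)
    return result.exploit_succeeded

# Usage: a PoC that fails under realistic conditions keeps the item
# out of the benchmark instead of being scored as a real finding.
def failing_poc(item: dict) -> PocResult:
    return PocResult(False, "revert: exploit path not reachable on-chain")

print(admit_high_severity({"id": "phantom-03"}, failing_poc))  # False
```

Items whose proof of concept fails under realistic conditions would be downgraded or dropped before any model is scored against them.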
The Broader Implications for AI in Blockchain Security
Despite these criticisms, OpenZeppelin was clear that they’re not dismissing the potential of AI in blockchain security—quite the opposite. They reiterated their belief that artificial intelligence will indeed transform how smart contract security is conducted. The question, as they framed it, isn’t whether AI will revolutionize this field, but rather whether the tools we’re building and the methods we’re using to evaluate them are being held to appropriately rigorous standards. This distinction is crucial because it positions the critique not as skepticism about AI’s potential, but as an insistence that we apply the same exacting standards to our AI security tools that we apply to the smart contracts those tools are meant to protect.
This perspective reflects a mature understanding of how transformative technologies are successfully integrated into critical infrastructure. Blockchain security isn’t an area where “move fast and break things” is an acceptable philosophy—the financial stakes are too high, and the permanence of blockchain transactions means that exploited vulnerabilities can’t simply be patched after the fact in many cases. When millions or even billions of dollars can be at risk, the security tools we rely on must be beyond reproach. If we’re going to trust AI agents to identify vulnerabilities in code that secures substantial value, we need absolute confidence that these agents are being evaluated based on their genuine capabilities, not inflated scores derived from contaminated tests and misclassified vulnerabilities.
Moving Forward: Building Better Benchmarks for a Secure Future
The OpenZeppelin audit of EVMbench serves as a valuable reminder that the blockchain and AI communities must work together thoughtfully as these technologies converge. The enthusiastic embrace of AI tools for security purposes is understandable—the promise of systems that can tirelessly analyze code, spot subtle patterns that humans might miss, and scale security reviews beyond what traditional auditing firms can handle is genuinely exciting. However, enthusiasm must be tempered with rigor, especially in an industry that has seen numerous high-profile hacks and exploits resulting from security oversights.
The path forward requires several key improvements. First, future AI security benchmarks need larger, more diverse datasets that minimize the risk of training data contamination. This might mean using more recent vulnerabilities that postdate the knowledge cutoffs of the AI models being tested, or creating synthetic but realistic security challenges specifically designed for evaluation purposes. Second, the classification of vulnerabilities must involve deep practical expertise from security professionals who understand not just theoretical attack vectors but the real-world conditions under which exploits actually work. Third, there needs to be greater transparency in how these benchmarks are constructed and evaluated, allowing the security community to scrutinize and improve the testing methodologies collectively.

OpenZeppelin’s decision to audit EVMbench and publicly share its findings exemplifies the kind of constructive criticism that can help the entire ecosystem develop more reliable tools. As AI continues to evolve and its role in blockchain security expands, this kind of rigorous, honest assessment will be essential for ensuring that our confidence in these systems is well-founded rather than dangerously misplaced.