I recently came across a benchmark called susVibes. It asks: when you hand a coding agent a real feature request, is the code it writes secure? The answer, mostly, was no. Across 200 tasks drawn from real open-source projects, the strongest setup - SWE-Agent with Claude Sonnet - got about 61% of solutions functionally correct but only 10.5% of them secure.

What separates susVibes from a “does this code look vulnerable” benchmark is that it never asks for an opinion. It is execution-based: for each task it builds the project in a container, runs a security test, and keeps the task only if that test fails on the vulnerable version and passes on the fix. The verdict comes from running the code, not from a model’s judgment.

susVibes is Python-only. The authors list other languages as future work. I spent the last few weeks extending the method to Java, and the result is 40 validated Java tasks. Pulling vulnerabilities from ReposVul was the easy part. The hard part was getting real Java test suites to build and run reproducibly in a container, and then proving that the test I measured was the security test.

That contract - accept a task only if it executes cleanly in 2 states, vulnerable and fixed - is what makes the benchmark trustworthy, and what makes assembling it expensive. Each task carries 5 diffs against a fixed base commit: the security patch (the upstream CVE fix), the test patch (the tests that fix added, the oracle), a mask patch that removes the feature so the agent has something to build, a task patch that sets up the starting state, and a golden patch, the reference secure solution. The validator applies these in combinations and watches which tests fail.
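The two-state contract can be written down as a small predicate. This is a minimal sketch, not the susVibes validator itself: `Verdict`, `TwoStateResult`, and `accept` are hypothetical names, and the `NOT_RUN` state is an assumption added here to make explicit that a test which never executed is neither a pass nor a failure.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASSED = "passed"
    FAILED = "failed"    # the test ran and an assertion failed
    NOT_RUN = "not_run"  # assumption: build broke, timed out, or test was skipped


@dataclass(frozen=True)
class TwoStateResult:
    on_vulnerable: Verdict  # security test against code with the fix rolled back
    on_fixed: Verdict       # same test against code with the fix applied


def accept(result: TwoStateResult) -> bool:
    """Keep a task only if the test fails on vulnerable code and passes on the fix."""
    return (
        result.on_vulnerable is Verdict.FAILED
        and result.on_fixed is Verdict.PASSED
    )
```

Under this sketch, a run that never started is rejected rather than credited as a failure.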

The funnel

I started from the Java commits in ReposVul, a dataset of CVE-fixing commits, and ran them through the 4 pipeline stages susVibes uses, then a manual audit at the end. The attrition is steep at every step.

888 raw commits go in. Collection keeps the ones that are recent enough (CVE year 2014 or later), from a repo that still exists, that touch Java source and ship a co-located test, and whose patches apply cleanly. 229 survive. Adaptive generation turns each fixing commit into a task: an agent masks out the feature and writes a problem statement, and a verifier checks that the mask covers the security-fix lines. 169 survive. Then the expensive stage: build a per-task Docker image and run the 5-way validation. 43 survive. Finally I had Claude read the logs of every survivor, the subject of a later section, and 3 more fall out. 40 remain.
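The collection criteria read naturally as a single filter. The sketch below is illustrative only: the `Candidate` fields and the path heuristic in `is_test_path` are assumptions made for the example, not the pipeline's actual rules. Only the 2014 cutoff and the four criteria come from the description above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, Tuple

MIN_CVE_YEAR = 2014  # cutoff stated in the text


@dataclass(frozen=True)
class Candidate:
    cve_year: int
    repo_exists: bool
    changed_paths: Tuple[str, ...]
    patches_apply_cleanly: bool


def is_test_path(path: str) -> bool:
    # Assumption: a test is a .java file under a "test" directory,
    # or one whose class name ends in Test/Tests.
    p = PurePosixPath(path)
    return p.suffix == ".java" and (
        "test" in p.parts[:-1] or p.stem.endswith(("Test", "Tests"))
    )


def touches_source_and_test(paths: Iterable[str]) -> bool:
    java = [p for p in paths if p.endswith(".java")]
    has_test = any(is_test_path(p) for p in java)
    has_source = any(not is_test_path(p) for p in java)
    return has_test and has_source


def keep(c: Candidate) -> bool:
    """Apply the four collection criteria in one place."""
    return (
        c.cve_year >= MIN_CVE_YEAR
        and c.repo_exists
        and touches_source_and_test(c.changed_paths)
        and c.patches_apply_cleanly
    )
```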

That is a 4.5% yield end to end (40 of 888). The losses at the build-and-run stage break down like this: of the roughly 126 records that entered it and did not survive, only about a quarter were dropped because they were not valid tasks. About 73% were lost because the pipeline could not get a clean measurement: the build would not complete, or the test suite could not be measured. (The rest were miscellaneous.) Vulnerability discovery was solved upstream. Reproducible execution of real-world Java was not.
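For readers who want to check the funnel arithmetic, the stage counts quoted above are enough to recompute it. The stage labels are shorthand chosen for this sketch.

```python
# Survivor counts after each stage, as quoted in the text.
STAGES = [
    ("raw commits", 888),
    ("collection", 229),
    ("adaptive generation", 169),
    ("build and 5-way validation", 43),
    ("log audit", 40),
]


def stage_yields(stages):
    """Return (stage, kept, dropped, yield) for every stage after the first."""
    rows = []
    for (_, previous), (name, kept) in zip(stages, stages[1:]):
        rows.append((name, kept, previous - kept, kept / previous))
    return rows


for name, kept, dropped, ratio in stage_yields(STAGES):
    print(f"{name:28s} kept {kept:4d}  dropped {dropped:4d}  yield {ratio:6.1%}")

end_to_end = STAGES[-1][1] / STAGES[0][1]
print(f"end to end: {end_to_end:.1%}")
```

The build-and-validation row is where the 126 dropped records come from (169 in, 43 out).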

Why Java is harder than Python

I expected Java to be more work than Python. I did not expect it to be a different kind of problem. Its compile-first, statically typed nature changed what “a test failed” even means.

Python imports lazily. A broken module only fails at the moment something imports it, so a security test can run while unrelated code in the repo is broken. Java compiles the whole unit first: if anything in it does not compile, nothing runs.

That has a sharp consequence here. The validator’s core move is to take the fixed code, roll back the security patch, and check that the security test now fails. But when the fix changed the production API the new test binds to - a fresh method, or a new parameter on an existing one - the rolled-back code does not fail an assertion. It fails to compile. The test never links and never runs, and the container emits an error that a naive, count-based parser reads as “the test failed on vulnerable code.” The check passes for the wrong reason: the vulnerability was never exercised.

A real example: the Elasticsearch CVE-2015-5377 fix adds a serialize method that its new security test calls. Roll the fix back, and that test no longer compiles.

// the fix ADDS a method the new security test binds to:
+ public static <T extends Serializable> T serialize(T t) { … }   // ThrowableObjectOutputStream

// ExceptionsSerializationTests.java calls it:
RecoveryFailedException out = serialize(ex);

// roll the fix back for the "vulnerable" build, and that test no longer compiles:
error: cannot find symbol
    symbol: method serialize(RecoveryFailedException)
// never linked, never ran - yet a count-based parser scores it "failed on vulnerable code"

Python has the same problem, just in a milder form. Make the same signature change there and the rolled-back code still cannot satisfy the new test: the call raises a TypeError, or the import fails during collection, before the vulnerable body ever runs. The vulnerability goes unexercised either way. What differs is the scope and the visibility. Java fails the whole compilation unit at once - nothing in it runs, and often there is no failure summary at all - whereas Python fails only that one test and lets the rest of the suite report normally. It is the same wrong-reason pass, and the same guard applies in both languages: binding errors - cannot find symbol and NoSuchMethodError in Java, TypeError and ImportError in Python - are flagged rather than counted as test failures.
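A guard of that kind could look roughly like the following. It is a sketch under assumptions: the function names are hypothetical, and the patterns cover only the four signatures named above, matched as plain text in a failure message.

```python
import re

# Only the signatures named in the text; a real list would likely be longer.
BINDING_ERROR_PATTERNS = {
    "java": [r"cannot find symbol", r"\bNoSuchMethodError\b"],
    "python": [r"\bTypeError\b", r"\bImportError\b"],
}


def is_binding_error(failure_text: str, language: str) -> bool:
    """True when the failure looks like a link/binding problem, not an assertion."""
    return any(
        re.search(pattern, failure_text)
        for pattern in BINDING_ERROR_PATTERNS[language]
    )


def classify_failure(failure_text: str, language: str) -> str:
    # Flag binding errors instead of counting them as test failures.
    if is_binding_error(failure_text, language):
        return "binding_error"
    return "test_failure"
```

A text match like this is deliberately coarse: a TypeError raised inside the vulnerable body would be flagged too, which is why the result is a flag for review rather than an automatic verdict.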

2 more traps live in the same family. Maven and Gradle routinely run checks - spotless, license headers, checkstyle, forbidden-apis - before the test phase. If one aborts the build, zero tests run, no summary is printed, and the parser reads that silence as “everything passed.” A multi-module Maven reactor does the same across modules: it stops at the first module that errors, so if an unrelated flaky module fails first, the module holding the security test is skipped - never compiled, never run - while the aggregate failure count still moves. The compile-failure case is the same assumption again: a build that dies before the test phase leaves no summary, and no summary reads as success.
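One way to stop silence from reading as success is to make "no summary" a distinct result instead of zero failures. The sketch below assumes Surefire-style aggregate lines ("Tests run: N, Failures: F, Errors: E, Skipped: S"); other build tools print different formats, and the names `Counts` and `parse_summary` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Counts(NamedTuple):
    run: int
    failures: int
    errors: int
    skipped: int


# Aggregate summary lines end after "Skipped: N". Per-class lines continue
# with ", Time elapsed: ..." and are intentionally not matched.
SUMMARY = re.compile(
    r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)\s*$",
    re.MULTILINE,
)


def parse_summary(log: str) -> Optional[Counts]:
    """Sum the aggregate summaries in a log, or return None if there are none."""
    matches = SUMMARY.findall(log)
    if not matches:
        return None  # the suite never reported: unmeasured, not "all passed"
    totals = [sum(int(m[i]) for m in matches) for i in range(4)]
    return Counts(*totals)
```

A caller then has to handle `None` explicitly, so a build aborted by a pre-test check cannot be mistaken for a clean run.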

Early on these traps fooled the validator outright, and the records passed. I only caught them by reading the env-setup trajectories, where a run that had supposedly “failed on vulnerable code” turned out to have died at compilation without running a single test. The fix was to teach the log classifier the difference: a Maven [ERROR] COMPILATION ERROR, a Gradle compile task marked FAILED, Ant’s [javac] N errors - these mean the suite never started, not that a test failed, and a run that never started cannot be counted as a break. Most of the work was finding those signatures one repo at a time. That is what the startup-error list is, and why I trust the survivors.
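The three signatures mentioned above can be expressed as a small pattern table. This is a sketch rather than the actual startup-error list: the regexes are written for this example from the log lines named in the text, and `startup_error` and `counts_as_break` are hypothetical names.

```python
import re
from typing import Optional

# Signatures that mean "the suite never started", one per build tool.
STARTUP_ERROR_PATTERNS = (
    ("maven-compilation", re.compile(r"^\[ERROR\] COMPILATION ERROR", re.MULTILINE)),
    ("gradle-compile-task", re.compile(r"^> Task \S*:compile\w* FAILED", re.MULTILINE)),
    ("ant-javac", re.compile(r"\[javac\] \d+ errors?\b")),
)


def startup_error(log: str) -> Optional[str]:
    """Return the name of the first startup-error signature found, else None."""
    for name, pattern in STARTUP_ERROR_PATTERNS:
        if pattern.search(log):
            return name
    return None


def counts_as_break(log: str, failing_tests: int) -> bool:
    # A run that never started cannot be counted as a break,
    # whatever the failure count says.
    return startup_error(log) is None and failing_tests > 0
```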

The fixes it can’t measure

Handling that first trap - the signature change - comes at a cost worth naming. The pipeline rejects a “break” that is really a link error: a test that fails because the rolled-back code no longer compiles against it, not because it caught a vulnerability. That guard is correct. A binding error proves nothing about behavior. But the records it drops are not broken. Many are real fixes, and often the better kind: a method that now demands a mandatory context parameter, an API made safe by construction.

They are dropped for a structural reason. The benchmark certifies behavior: hold the security test fixed, swap the implementation between vulnerable and fixed, and attribute the difference to what the code does. A test bound to the interface cannot be rolled back cleanly, and because the agent is free to re-implement the API from the problem statement, a signature-locked oracle would confuse “fixed the vulnerability” with “guessed the exact method signature.”

The guard is also blunt by necessity. It works from failure counts, so it cannot separate a deliberate architectural fix from incidental noise - a renamed helper, a moved utility - that happens to break the rolled-back compile in the same way. Both surface as a binding error, so it drops both, choosing zero false positives over completeness.

So the set is cleanest on behavioral fixes - validation, escaping, bounds, authorization - and misses the architectural, secure-by-construction ones, which are arguably the more interesting class. That is a property of the benchmark, not a flaw in the records it sets aside.

To be fair, I am quite impressed by how far failure counts carry this benchmark. But if I had to strengthen one piece of it, it would be this one (the subject of the next section), given my experience with the Java dataset.

The validator isn’t enough

The 5-way validation is automated, and it has a blind spot: it trusts the failing-test counts without checking that the run completed, or which test failed. So once the pipeline had produced 43 validated records, Claude read the 5-run logs of all of them. 3 were wrong, a 7% false-validation rate, and each was wrong in a different way.

In xwiki-commons, a multi-module build hit an unrelated flaky failure and fail-fast skipped the module holding the security test. The test never ran, the aggregate count still moved, and the break check passed.

In jenkins, the vulnerable run hit the 60-minute timeout while running the full suite, so the validator marked the security count as unmeasured and let the break check pass trivially. But the security test had already run early in that same suite and passed - it does not detect the vulnerability. The break the validator credited was a timeout artifact plus flaky-suite churn, not a real signal.

In neo4j-apoc, the baseline runs died at Gradle’s spotlessJavaCheck before any test executed, giving a false zero baseline. The fix itself was incomplete, leaving security tests still failing on the fixed code.

Fundamentally the validator trusts the count of failing tests, never the identity of the test that should fail. All 3 bugs are that one mistake - a count that moved for the wrong reason, or did not move when it should have, with nothing checking whether the security test itself had flipped.
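The difference between trusting a count and trusting an identity fits in a few lines. The sketch below is a simplification under assumptions: the real break check compares more than two runs, and the test names and the dictionary-of-verdicts representation are hypothetical.

```python
from typing import Dict


def break_by_count(failures_fixed: int, failures_vulnerable: int) -> bool:
    """Count-based check: more failures on vulnerable code means a break."""
    return failures_vulnerable > failures_fixed


def break_by_identity(
    test_id: str,
    verdicts_fixed: Dict[str, str],
    verdicts_vulnerable: Dict[str, str],
) -> bool:
    """Identity-based check: the named test itself must flip."""
    return (
        verdicts_fixed.get(test_id) == "passed"
        and verdicts_vulnerable.get(test_id) == "failed"
    )


# A scenario shaped like the multi-module case: on the vulnerable run an
# unrelated flaky test fails and the security test never runs at all.
fixed = {"SecurityTest": "passed", "FlakyTest": "passed"}
vulnerable = {"FlakyTest": "failed"}  # SecurityTest is absent: it was skipped

count_says = break_by_count(
    failures_fixed=sum(v == "failed" for v in fixed.values()),
    failures_vulnerable=sum(v == "failed" for v in vulnerable.values()),
)
identity_says = break_by_identity("SecurityTest", fixed, vulnerable)
```

Here the count moved, so `count_says` is true, while `identity_says` is false because the security test has no verdict on the vulnerable run.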

The same lesson runs one level deeper, in the log parser. The failing-test counts the validator trusts are produced by a log parser that is itself written by a model: shown sample test output, it generates the regexes that count passes and failures, because every Java repo prints its results differently. When the parser it produces is inconsistent - reporting impossible counts - the pipeline asks again, with the same prompt, and resamples. It never tells the model what was wrong. The obvious upgrade is a feedback loop, or an agent that runs its own regexes against the logs and iterates. The env-setup stage already works that way.

But the more important lesson is not that. What the parser is checked against - non-zero counts, sensible ordering - is necessary but not sufficient: it never confirms that the named security test flipped. Make the generator smarter against that loose check and you do not get fewer bad parsers - you get more convincing ones. A model optimizing against a weak oracle will find the regex that makes the numbers look right. The 3 records I had to throw out were that failure: counts that satisfied the check while measuring the wrong thing. So the priority is the reverse of the intuition. Strengthen the oracle first (the named-test flip) and only then make the loop that chases it smarter.
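To make the loose check and the feedback loop concrete, here is a sketch of both. Everything in it is an assumption for illustration: the text does not say which counts the pipeline treats as impossible, so the rules in `inconsistencies` are guesses at a reasonable set, and `feedback_prompt` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParsedCounts:
    run: int
    passed: int
    failed: int
    skipped: int


def inconsistencies(c: ParsedCounts) -> List[str]:
    """Describe what is impossible about a parser's output (empty if nothing)."""
    problems = []
    if min(c.run, c.passed, c.failed, c.skipped) < 0:
        problems.append("a count is negative")
    if c.run == 0:
        problems.append("no tests were counted")
    if c.passed + c.failed + c.skipped != c.run:
        problems.append(
            f"passed + failed + skipped = {c.passed + c.failed + c.skipped}, "
            f"but run = {c.run}"
        )
    return problems


def feedback_prompt(base_prompt: str, problems: List[str]) -> str:
    """Tell the model what was wrong instead of resampling the same prompt."""
    if not problems:
        return base_prompt
    bullets = "\n".join(f"- {p}" for p in problems)
    return f"{base_prompt}\n\nThe previous parser was rejected because:\n{bullets}"
```

An empty list from `inconsistencies` says only that the numbers are self-consistent. It says nothing about whether the named security test flipped, which is exactly the gap described above.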

The test patch already names the security tests the fix added, and the per-test results across the 5 runs already say which of them flip from failing to passing. So the oracle can be pinned, not inferred: each task gets an explicit manifest of the security tests that must go fail-to-pass and a chosen set of functional tests that must stay passing, derived from what ran rather than from a model’s reading of it. Evaluation then runs only those tests and reads their verdicts with regexes fixed at creation time. It is the same shape SWE-bench uses, and it does not need the count-delta mechanism used here. Because each run becomes a handful of named tests instead of a multi-module suite, the log to parse is smaller and there is no room for the timeout that faked the jenkins break. The catch is that a test run in isolation does not always behave as it did inside its suite, so each pinned test has to be confirmed to keep its verdict when narrowed, and the flip has to be a real assertion rather than a compile error. That confirmation is the cost of the precision, and one I would pay.
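A pinned manifest and its creation-time check might look like this. It is a sketch of the proposal, not existing code: `Manifest` and `check_manifest` are hypothetical, verdicts are assumed to be one of "passed", "failed" (an assertion failure), or "error" (a compile or binding error), and a test missing from a run is treated as not run.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Manifest:
    fail_to_pass: Tuple[str, ...]  # security tests that must flip
    pass_to_pass: Tuple[str, ...]  # functional tests that must stay passing


def check_manifest(
    manifest: Manifest,
    on_vulnerable: Dict[str, str],
    on_fixed: Dict[str, str],
) -> List[str]:
    """Return every way the named tests deviate from the manifest."""
    problems = []
    for test in manifest.fail_to_pass:
        before = on_vulnerable.get(test, "not_run")
        after = on_fixed.get(test, "not_run")
        if before != "failed":  # "error" and "not_run" do not count as a flip
            problems.append(f"{test}: expected failed on vulnerable, got {before}")
        if after != "passed":
            problems.append(f"{test}: expected passed on fixed, got {after}")
    for test in manifest.pass_to_pass:
        for label, run in (("vulnerable", on_vulnerable), ("fixed", on_fixed)):
            verdict = run.get(test, "not_run")
            if verdict != "passed":
                problems.append(f"{test}: expected passed on {label}, got {verdict}")
    return problems
```

Because the check is keyed on test names, a compile error or a skipped module surfaces as a named problem instead of disappearing into an aggregate count.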

In short

On this benchmark, finding vulnerabilities is the solved part. The hard part is reproducible execution of real-world code and an honest measurement on top of it. Java made that concrete, because its compile-first semantics turn a missed check into a wrong answer instead of an obvious failure.

2 other changes really helped. The first was writing one language profile - a single object holding the base image, the agent prompts, and the log-classifier patterns - so the runtime stages never branch on language. The second was checking every record twice, once by the validator and once by Claude reading its logs, which is the only reason I trust the final count.
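The language-profile idea can be sketched as a frozen dataclass that the stages receive as data. The field names, image names, prompt text, and patterns below are placeholders invented for the example, not the values the project uses.

```python
import re
from dataclasses import dataclass
from typing import Dict, Pattern, Tuple


@dataclass(frozen=True)
class LanguageProfile:
    name: str
    base_image: str                      # Docker base image for task builds
    agent_prompts: Dict[str, str]        # prompt text keyed by pipeline stage
    startup_error_patterns: Tuple[Pattern[str], ...]


# Placeholder values: every string here is illustrative.
JAVA = LanguageProfile(
    name="java",
    base_image="example.invalid/java-build:placeholder",
    agent_prompts={"env_setup": "Build the project and run its Java tests."},
    startup_error_patterns=(
        re.compile(r"^\[ERROR\] COMPILATION ERROR", re.MULTILINE),
        re.compile(r"\[javac\] \d+ errors?\b"),
    ),
)

PYTHON = LanguageProfile(
    name="python",
    base_image="example.invalid/python-build:placeholder",
    agent_prompts={"env_setup": "Install the project and run its Python tests."},
    startup_error_patterns=(re.compile(r"ImportError while loading conftest"),),
)


def has_startup_error(profile: LanguageProfile, log: str) -> bool:
    """Stage code reads patterns from the profile and never checks the language."""
    return any(p.search(log) for p in profile.startup_error_patterns)
```

Adding a language then means adding a profile, with no `if language == ...` branches in the stages themselves.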

The set is 40 validated Java tasks across 29 repositories (25 organizations) and 23 CWEs, covering real web-security bugs - cross-site scripting, path traversal, command and SQL injection, SSRF - from CVEs between 2014 and 2023. It builds on the original susVibes, and is released under MIT, with the per-task Docker images public, at github.com/lakhand7/susVibes-java.