CVE-Factory is a Multi-Agent system for fully automated, end-to-end CVE reproduction. Given CVE records, the system automatically researches details, generates test cases, builds Docker environments, and validates that each vulnerability can be both exploited and patched. The pipeline transforms CVE metadata into reproducible, testable vulnerability environments without manual intervention.
⚠️ Security Warning: This system builds and runs Docker containers containing vulnerable software. You MUST use the Docker-in-Docker (DinD) environment to isolate CVE containers from your host system. Never run CVE-Factory directly on your host Docker daemon.
Input CVE records, get a complete CVE reproduction environment. Following the Terminal Bench standard, each generated task package includes:
- Environment Setup: `Dockerfile` and `docker-compose.yaml` hosting the vulnerable application
- Task Config: `task.yaml` containing structured instruction descriptions (CVE-identity-free)
- Reference Fix: `solution.sh` to patch the vulnerability
- Evaluation Entry: `run-tests.sh` to start the evaluation
Specifically designed for security tasks, our testing logic is split into:
- test_func.py: Functionality tests ensuring basic features work both before and after the fix
- test_vuln.py: Exploit tests verifying the vulnerability exists before patching and is resolved afterward
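As a purely illustrative sketch of this split, the two files for a hypothetical path-traversal web CVE might look roughly like the following; the port, the `/download` endpoint, and the payload are invented for illustration and do not come from any generated task:

```python
# Hypothetical sketch only: endpoint, port, and payload are invented.
# In a real task these live in separate files (test/test_func.py, test/test_vuln.py).
import urllib.error
import urllib.request

BASE = "http://localhost:8080"  # assumed address of the vulnerable service


def test_basic_download_works():
    """Functionality check: a legitimate request must succeed before AND after the fix."""
    with urllib.request.urlopen(f"{BASE}/download?file=readme.txt", timeout=10) as resp:
        assert resp.status == 200


def test_path_traversal_blocked():
    """Exploit check: fails while the CVE is present, passes once solution.sh is applied."""
    try:
        with urllib.request.urlopen(
            f"{BASE}/download?file=../../../../etc/passwd", timeout=10
        ) as resp:
            # The fixed app must not leak system files.
            assert "root:" not in resp.read().decode(errors="ignore")
    except urllib.error.HTTPError as exc:
        # A rejected request (4xx) also counts as the vulnerability being fixed.
        assert 400 <= exc.code < 500
```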
No manual research, no manual coding - fully automated from raw CVE metadata to validated reproduction.
Generated Artifact Structure:
```
CVE-2025-XXXX/
├── task.yaml               # Structured Task Metadata
├── Dockerfile              # Vulnerable Environment Setup
├── docker-compose.yaml     # Service Orchestration
├── task-deps/
│   └── solution.sh         # Verified Patch
├── test/
│   ├── test_func.py        # Functionality Check
│   └── test_vuln.py        # Vulnerability Exploit Check
└── run-tests.sh            # One-click Evaluation Script
```
In a large-scale evaluation of 554 CVEs from 2025, CVE-Factory successfully reproduced 499 cases, achieving a 90.1% success rate. Furthermore, a rigorous expert review of 471 successful cases confirmed that 312 tasks (66.2%) were completely and accurately reproduced!
When compared against security experts using identical initial information, our system achieved a ~95% verification pass rate on environment and solution construction, demonstrating expert-level capability in automated vulnerability reproduction.
Open Dataset: We release 1,000+ CVE task environments in the `cve_tasks/` directory:

- `trainset/` (887 tasks): Used for training Abacus-cve. The 4,000+ distilled agent traces on Hugging Face are generated from these tasks using Claude Opus 4.5 with a Mini SWE-Agent harness.
- `trainset-2/`: Additional tasks of relatively lower difficulty, not included in the training data.
Fine-tuning on CVE-Factory traces yields dramatic improvements across security benchmarks. Qwen3-32B achieves a ~6.8× improvement on LiveCVEBench (5.29% → 35.79%), ~4.2× on PatchEval (5.66% → 23.58%), and even shows significant gains on Terminal-Bench (12.50% → 28.75%), demonstrating strong cross-task generalization.
| Model | LiveCVEBench | PatchEval | Terminal-Bench | Avg |
|---|---|---|---|---|
| Qwen3-32B (base) | 5.29 | 5.66 | 12.50 | 7.82 |
| Abacus-cve (Ours) | 35.79 | 23.58 | 28.75 | 29.37 |
| Qwen3-Coder-30B | 10.58 | 9.91 | 13.75 | 11.41 |
| Qwen3-Coder-480B | 19.58 | 19.34 | 36.25 | 25.06 |
| MiniMax-M2 | 24.87 | 19.34 | 37.50 | 27.24 |
| Claude Sonnet 4 | 20.11 | 22.64 | 33.75 | 25.50 |
| Claude Sonnet 4.5 | 34.39 | 28.77 | 45.00 | 36.05 |
| Claude Opus 4.5 | 41.27 | 32.08 | 48.75 | 40.70 |
With just 4k traces, Abacus-cve (32B) outperforms Qwen3-Coder-480B, MiniMax-M2, and Claude Sonnet 4, approaching Claude Sonnet 4.5 level on security tasks.
Unlike rigid retrieval workflows or simple tool-use loops, each agent operates as a full Claude Code session. We do not hard-code steps; instead, we define each agent by its Role (e.g., Analyzer), Goal (e.g., "Build a vulnerable environment"), Resources (e.g., Access to specific docs), and Verification Method (e.g., "Must pass check_env_ready"). Agents act like human developers: they autonomously explore files, debug errors, read logs, and iterate on solutions within their designated workspace.
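Conceptually, each agent definition can be thought of as a small declarative spec rather than a scripted procedure. The sketch below only illustrates that idea; the `AgentSpec` class and its field names are invented and are not CVE-Factory's actual schema:

```python
# Hypothetical illustration of a declarative agent definition.
# Field names and the example values are assumptions, not the project's real config.
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    role: str                                           # e.g. "Analyzer" or "Builder"
    goal: str                                           # what the agent must achieve
    resources: list[str] = field(default_factory=list)  # docs/paths the agent may access
    verification: str = ""                              # gate the agent must pass


builder = AgentSpec(
    role="Builder",
    goal="Build a vulnerable environment from Dockerfile and docker-compose.yaml",
    resources=["docker-reqs.md"],
    verification="check_env_ready",
)
print(builder)
```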
CVE-Factory is designed to handle multiple CVEs simultaneously. Each CVE pipeline executes asynchronously, so faster tasks proceed to subsequent stages without waiting for slower ones. The asynchronous architecture also lets you set separate concurrency limits per agent type: for example, a higher limit for lightweight research tasks (Analyzer) and a lower limit for resource-intensive Docker tasks (Builder). This flexibility prevents system overload while maximizing throughput, and stage-level timeouts ensure that hung processes don't block the processing queue.
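The following is a rough sketch of how per-agent concurrency caps and stage-level timeouts could be combined with asyncio; it is illustrative only and is not the orchestrator's actual code (the agent names, limits, and placeholder stage body are assumptions):

```python
# Illustrative sketch of per-agent concurrency control; not the real orchestrator.
import asyncio

# Separate caps per agent type: generous for lightweight research,
# tight for resource-hungry Docker builds.
LIMITS = {"analyzer": 8, "builder": 2}
SEMAPHORES = {name: asyncio.Semaphore(n) for name, n in LIMITS.items()}


async def run_stage(agent: str, cve_id: str, timeout_s: float = 1800) -> None:
    async with SEMAPHORES[agent]:                # per-agent concurrency cap
        async with asyncio.timeout(timeout_s):   # stage-level timeout (Python 3.11+)
            await asyncio.sleep(0.1)             # placeholder for the real agent session
            print(f"{agent} finished {cve_id}")


async def reproduce(cve_id: str) -> None:
    # Stages run in order for one CVE, but each CVE pipeline is independent.
    await run_stage("analyzer", cve_id)
    await run_stage("builder", cve_id)


async def main() -> None:
    cves = ["CVE-2025-0001", "CVE-2025-0002", "CVE-2025-0003"]
    await asyncio.gather(*(reproduce(c) for c in cves))


asyncio.run(main())
```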
The pipeline consists of 6 independent stages that can be run separately or combined.
- Phase 1 (Analyzer → Generator) performs CVE research and generates artifacts without requiring Docker.
  Tooling Requirement: The Analyzer agent relies on the `web_search` and `web_fetch` tools. If you use a third-party API provider, you must ensure it supports these specific tool capabilities.
- Phase 2 (Builder → Validator → Solver → Checker) handles Docker environment construction and validation. From Environment Construction to Holistic Validation, no web-related tools are required, as the agents interact solely with the local filesystem and the Docker daemon.
Each stage can also be invoked individually, enabling fine-grained control over the reproduction process and easy debugging of specific stages.
The system consists of 6 stages:
| Stage | Purpose |
|---|---|
| Information Collection | Analyzer gathers details into public.md and role-specific docs (for_generator.md, etc.). Terminates if information is insufficient. |
| File Generation | Generator creates logical components: task.yaml, tests (test_func.py, test_vuln.py), solution.sh, run-tests.sh, and docker-reqs.md guidance. |
| Environment Construction | Builder produces Dockerfile and docker-compose.yaml, operating under "blind building" (no access to tests/solution) to ensure rigor. |
| Vulnerability Verification | Orchestrator verifies test_vuln FAIL + test_func PASS via check_env_ready. If failed, Validator agent fixes environment (max 3 retries). |
| Solution Verification | Orchestrator verifies fix via check_fix_ready. Requires both tests PASS. If failed, Solver agent adjusts the solution or environment. |
| Holistic Validation | Checker agent handles errors or performs QA (cleanup mock code/data) regardless of check_cve_ready outcome. Final E2E check confirms success. |
```bash
# Start the isolated DinD environment (required for security)
cd dev-env
docker compose up -d

# Enter the development container
docker compose exec cve-factory bash
```

See `dev-env/README.md` for detailed DinD configuration and troubleshooting.
Place the CVEs you want to reproduce in the `original_cves_md/` directory. Each file must be named in the format `CVE-YYYY-NNNNN.md` and contain the relevant information. We recommend using the cve-sampler from LiveCVEBench-Preview to prepare these inputs.
```bash
# Inside the DinD development container
cd /workspace
pip install -r requirements.txt

# Verify CVE input files are ready
ls original_cves_md/
```

```bash
# Set API key or use Claude subscription
export ANTHROPIC_API_KEY="your-key"
export ANTHROPIC_BASE_URL="your-url"

# Process a specific CVE
python -m orchestrator.run --cve CVE-2025-XXXXX

# Or process all CVEs in the input directory
python -m orchestrator.run

# Run phases separately
python -m orchestrator.run --phase1 --cve CVE-2025-XXXXX  # Analyzer + Generator only (no Docker needed)
python -m orchestrator.run --phase2 --cve CVE-2025-XXXXX  # Builder → Checker (requires Docker)
```

A CVE reproduction is considered successful when:
- Vulnerable state: test_func.py PASS, test_vuln.py FAIL (app works, vulnerability exploitable)
- Fixed state: test_func.py PASS, test_vuln.py PASS (app works, vulnerability patched)
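For intuition, this decision can be expressed roughly as the sketch below; the compose service name (`app`), the test paths, and the way `solution.sh` is invoked are assumptions for illustration rather than the project's actual `run-tests.sh` logic:

```python
# Hypothetical sketch of the success criteria; service name and paths are assumptions.
import subprocess


def passes(test_file: str) -> bool:
    """True if the given pytest file passes inside the running task container."""
    res = subprocess.run(
        ["docker", "compose", "exec", "-T", "app", "pytest", "-q", test_file],
        capture_output=True,
    )
    return res.returncode == 0


def cve_reproduction_successful() -> bool:
    # Vulnerable state: the app works, but the exploit test fails.
    vulnerable_ok = passes("test/test_func.py") and not passes("test/test_vuln.py")

    # Apply the reference fix inside the container.
    subprocess.run(
        ["docker", "compose", "exec", "-T", "app", "bash", "task-deps/solution.sh"],
        check=True,
    )

    # Fixed state: the app still works and the exploit no longer succeeds.
    fixed_ok = passes("test/test_func.py") and passes("test/test_vuln.py")
    return vulnerable_ok and fixed_ok


if __name__ == "__main__":
    print("reproduction OK" if cve_reproduction_successful() else "reproduction failed")
```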
Key settings in config.yaml to optimize your run:
| Section | Setting | Description |
|---|---|---|
| Orchestrator | `max_concurrent_cves` | Control how many CVEs are processed in parallel. Reduce this if you hit API rate limits. |
| Agents | `limits` | Set concurrency caps for specific stages (e.g., limit `builder` to save disk/CPU). |
| Models | `models.default` | Switch underlying LLMs (e.g., Claude 4.5 Sonnet vs Opus). |
```yaml
# Example config.yaml tweak
orchestrator:
  max_concurrent_cves: 3   # Lower concurrency for stability

agents:
  limits:
    builder: 2             # Prevent Docker from consuming all resources
```

- DinD Environment - Docker-in-Docker setup guide (start here)
- Scripts - Manual debugging and verification scripts
- Architecture - Detailed system design and data flow
- Agent Management - Orchestration and resource control
- Communication - Inter-agent message protocols
- Future Roadmap - Planned improvements and features
We are actively developing OneFactory, a Unified Synthetic Framework that integrates Terminal, SWE, and Security (CVE) capabilities into a comprehensive 3-in-1 agentic data pipeline.
Based on CVE-Factory, we have developed LiveCVEBench and released the first version of the benchmark, training data, and Abacus-cve model. We will continue to expand the benchmark and optimize our SFT & RL training recipes. Stay tuned for more updates!
We are continuously expanding and updating this project. If you have any suggestions or would like to join/contribute to this project, please contact xzluo@ir.hit.edu.cn!
MIT License
```bibtex
@software{cve-factory,
  author = {Luo, Xianzhen and Zhang, Jingyuan and Zhou, Shiqi and Huang, Rain and Xiao, Chuan and Zhu, Qingfu and Ma, Zhiyuan and Xing, Yue and Yue, Yang and Zeng, Wencong and Che, Wanxiang},
  title = {CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability},
  year = {2025},
  url = {https://github.com/livecvebench/CVE-Factory}
}
```