CVE-Factory is a Multi-Agent system for fully automated, end-to-end CVE reproduction. Given CVE records, the system automatically researches details, generates test cases, builds Docker environments, and validates that each vulnerability can be both exploited and patched. The pipeline transforms CVE metadata into reproducible, testable vulnerability environments without manual intervention.
⚠️ Security Warning: This system builds and runs Docker containers containing vulnerable software. You MUST use the Docker-in-Docker (DinD) environment to isolate CVE containers from your host system. Never run CVE-Factory directly on your host Docker daemon.
Input CVE records, get a complete CVE reproduction environment. Following the Terminal Bench standard, each generated task package includes:
- Environment Setup: `Dockerfile` and `docker-compose.yaml` hosting the vulnerable application
- Task Config: `task.yaml` containing structured instruction descriptions (CVE-identity-free)
- Reference Fix: `solution.sh` to patch the vulnerability
- Evaluation Entry: `run-tests.sh` to start the evaluation
Specifically designed for security tasks, our testing logic is split into:
- test_func.py: Functionality tests ensuring basic features work both before and after the fix
- test_vuln.py: Exploit tests verifying the vulnerability exists before patching and is resolved afterward
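As a purely illustrative sketch of this split, the two files for a hypothetical path-traversal web CVE might look roughly like the following; the port, the `/download` endpoint, and the payload are invented for illustration and do not come from any generated task:

```python
# Hypothetical sketch only: endpoint, port, and payload are invented.
# In a real task these live in separate files (test/test_func.py, test/test_vuln.py).
import urllib.error
import urllib.request

BASE = "http://localhost:8080"  # assumed address of the vulnerable service


def test_basic_download_works():
    """Functionality check: a legitimate request must succeed before AND after the fix."""
    with urllib.request.urlopen(f"{BASE}/download?file=readme.txt", timeout=10) as resp:
        assert resp.status == 200


def test_path_traversal_blocked():
    """Exploit check: fails while the CVE is present, passes once solution.sh is applied."""
    try:
        with urllib.request.urlopen(
            f"{BASE}/download?file=../../../../etc/passwd", timeout=10
        ) as resp:
            # The fixed app must not leak system files.
            assert "root:" not in resp.read().decode(errors="ignore")
    except urllib.error.HTTPError as exc:
        # A rejected request (4xx) also counts as the vulnerability being fixed.
        assert 400 <= exc.code < 500
```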
No manual research, no manual coding - fully automated from raw CVE metadata to validated reproduction.
Generated Artifact Structure:
```
CVE-2025-XXXX/
├── task.yaml               # Structured Task Metadata
├── Dockerfile              # Vulnerable Environment Setup
├── docker-compose.yaml     # Service Orchestration
├── task-deps/
│   └── solution.sh         # Verified Patch
├── test/
│   ├── test_func.py        # Functionality Check
│   └── test_vuln.py        # Vulnerability Exploit Check
└── run-tests.sh            # One-click Evaluation Script
```
In a large-scale evaluation of 554 CVEs from 2025, CVE-Factory successfully reproduced 499 cases, achieving a 90.1% success rate. Furthermore, a rigorous expert review of 471 successful cases confirmed that 312 tasks (66.2%) were completely and accurately reproduced!
When compared against security experts using identical initial information, our system achieved a ~95% verification pass rate on environment and solution construction, demonstrating expert-level capability in automated vulnerability reproduction.
Open Dataset: We release 1,000+ CVE task environments in the `cve_tasks/` directory:

- `trainset/` (887 tasks): Used for training Abacus-cve. The 4,000+ distilled agent traces on Hugging Face are generated from these tasks using Claude Opus 4.5 with a Mini SWE-Agent harness.
- `trainset-2/`: Additional tasks of relatively lower difficulty, not included in the training data.
Fine-tuning on CVE-Factory traces yields dramatic improvements across security benchmarks. Qwen3-32B achieves a ~6.8× improvement on LiveCVEBench (5.29% → 35.79%), ~4.2× on PatchEval (5.66% → 23.58%), and even shows significant gains on Terminal-Bench (12.50% → 28.75%), demonstrating strong cross-task generalization.
| Model | LiveCVEBench | PatchEval | Terminal-Bench | Avg |
|---|---|---|---|---|
| Qwen3-32B (base) | 5.29 | 5.66 | 12.50 | 7.82 |
| Abacus-cve (Ours) | 35.79 | 23.58 | 28.75 | 29.37 |
| Qwen3-Coder-30B | 10.58 | 9.91 | 13.75 | 11.41 |
| Qwen3-Coder-480B | 19.58 | 19.34 | 36.25 | 25.06 |
| MiniMax-M2 | 24.87 | 19.34 | 37.50 | 27.24 |
| Claude Sonnet 4 | 20.11 | 22.64 | 33.75 | 25.50 |
| Claude Sonnet 4.5 | 34.39 | 28.77 | 45.00 | 36.05 |
| Claude Opus 4.5 | 41.27 | 32.08 | 48.75 | 40.70 |
With just 4k traces, Abacus-cve (32B) outperforms Qwen3-Coder-480B, MiniMax-M2, and Claude Sonnet 4, approaching Claude Sonnet 4.5 level on security tasks.
Unlike rigid retrieval workflows or simple tool-use loops, each agent operates as a full Claude Code session. We do not hard-code steps; instead, we define each agent by its Role (e.g., Analyzer), Goal (e.g., "Build a vulnerable environment"), Resources (e.g., Access to specific docs), and Verification Method (e.g., "Must pass check_env_ready"). Agents act like human developers: they autonomously explore files, debug errors, read logs, and iterate on solutions within their designated workspace.
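Conceptually, each agent definition can be thought of as a small declarative spec rather than a scripted procedure. The sketch below only illustrates that idea; the `AgentSpec` class and its field names are invented and are not CVE-Factory's actual schema:

```python
# Hypothetical illustration of a declarative agent definition.
# Field names and the example values are assumptions, not the project's real config.
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    role: str                                           # e.g. "Analyzer" or "Builder"
    goal: str                                           # what the agent must achieve
    resources: list[str] = field(default_factory=list)  # docs/paths the agent may access
    verification: str = ""                              # gate the agent must pass


builder = AgentSpec(
    role="Builder",
    goal="Build a vulnerable environment from Dockerfile and docker-compose.yaml",
    resources=["docker-reqs.md"],
    verification="check_env_ready",
)
print(builder)
```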
CVE-Factory is designed to handle multiple CVEs simultaneously. Each CVE pipeline executes asynchronously, so faster tasks proceed to subsequent stages without waiting for slower ones. The asynchronous architecture also lets you set separate concurrency limits per agent type: for example, a higher limit for lightweight research tasks (Analyzer) and a lower limit for resource-intensive Docker tasks (Builder). This flexibility prevents system overload while maximizing throughput, and stage-level timeouts ensure that hung processes don't block the processing queue.
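The following is a rough sketch of how per-agent concurrency caps and stage-level timeouts could be combined with asyncio; it is illustrative only and is not the orchestrator's actual code (the agent names, limits, and placeholder stage body are assumptions):

```python
# Illustrative sketch of per-agent concurrency control; not the real orchestrator.
import asyncio

# Separate caps per agent type: generous for lightweight research,
# tight for resource-hungry Docker builds.
LIMITS = {"analyzer": 8, "builder": 2}
SEMAPHORES = {name: asyncio.Semaphore(n) for name, n in LIMITS.items()}


async def run_stage(agent: str, cve_id: str, timeout_s: float = 1800) -> None:
    async with SEMAPHORES[agent]:                # per-agent concurrency cap
        async with asyncio.timeout(timeout_s):   # stage-level timeout (Python 3.11+)
            await asyncio.sleep(0.1)             # placeholder for the real agent session
            print(f"{agent} finished {cve_id}")


async def reproduce(cve_id: str) -> None:
    # Stages run in order for one CVE, but each CVE pipeline is independent.
    await run_stage("analyzer", cve_id)
    await run_stage("builder", cve_id)


async def main() -> None:
    cves = ["CVE-2025-0001", "CVE-2025-0002", "CVE-2025-0003"]
    await asyncio.gather(*(reproduce(c) for c in cves))


asyncio.run(main())
```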
The pipeline consists of 6 independent stages that can be run separately or combined.
- Phase 1 (Analyzer → Generator) performs CVE research and generates artifacts without requiring Docker.
  Tooling Requirement: The Analyzer agent relies on the `web_search` and `web_fetch` tools. If you use a third-party API provider, you must ensure it supports these specific tool capabilities.
- Phase 2 (Builder → Validator → Solver → Checker) handles Docker environment construction and validation. From Environment Construction to Holistic Validation, no web-related tools are required, as the agents interact solely with the local filesystem and the Docker daemon.
Each stage can also be invoked individually, enabling fine-grained control over the reproduction process and easy debugging of specific stages.
The system consists of 6 stages:
| Stage | Purpose |
|---|---|
| Information Collection | Analyzer gathers details into public.md and role-specific docs (for_generator.md, etc.). Terminates if information is insufficient. |
| File Generation | Generator creates logical components: task.yaml, tests (test_func.py, test_vuln.py), solution.sh, run-tests.sh, and docker-reqs.md guidance. |
| Environment Construction | Builder produces Dockerfile and docker-compose.yaml, operating under "blind building" (no access to tests/solution) to ensure rigor. |
| Vulnerability Verification | Orchestrator verifies test_vuln FAIL + test_func PASS via check_env_ready. If failed, Validator agent fixes environment (max 3 retries). |
| Solution Verification | Orchestrator verifies fix via check_fix_ready. Requires both tests PASS. If failed, Solver agent adjusts the solution or environment. |
| Holistic Validation | Checker agent handles errors or performs QA (cleanup mock code/data) regardless of check_cve_ready outcome. Final E2E check confirms success. |
```bash
# Start the isolated DinD environment (required for security)
cd dev-env
docker compose up -d

# Enter the development container
docker compose exec cve-factory bash
```

See `dev-env/README.md` for detailed DinD configuration and troubleshooting.
Place the CVEs you want to reproduce in the `original_cves_md/` directory. Each file must be named in the format `CVE-YYYY-NNNNN.md` and contain the relevant information. We recommend using the cve-sampler from LiveCVEBench-Preview to prepare these inputs.
```bash
# Inside the DinD development container
cd /workspace
pip install -r requirements.txt

# Verify CVE input files are ready
ls original_cves_md/
```

```bash
# Set API key or use Claude subscription
export ANTHROPIC_API_KEY="your-key"
export ANTHROPIC_BASE_URL="your-url"

# Process a specific CVE
python -m orchestrator.run --cve CVE-2025-XXXXX

# Or process all CVEs in the input directory
python -m orchestrator.run

# Run phases separately
python -m orchestrator.run --phase1 --cve CVE-2025-XXXXX  # Analyzer + Generator only (no Docker needed)
python -m orchestrator.run --phase2 --cve CVE-2025-XXXXX  # Builder → Checker (requires Docker)
```

A CVE reproduction is considered successful when:
- Vulnerable state: test_func.py PASS, test_vuln.py FAIL (app works, vulnerability exploitable)
- Fixed state: test_func.py PASS, test_vuln.py PASS (app works, vulnerability patched)
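For intuition, this decision can be expressed roughly as the sketch below; the compose service name (`app`), the test paths, and the way `solution.sh` is invoked are assumptions for illustration rather than the project's actual `run-tests.sh` logic:

```python
# Hypothetical sketch of the success criteria; service name and paths are assumptions.
import subprocess


def passes(test_file: str) -> bool:
    """True if the given pytest file passes inside the running task container."""
    res = subprocess.run(
        ["docker", "compose", "exec", "-T", "app", "pytest", "-q", test_file],
        capture_output=True,
    )
    return res.returncode == 0


def cve_reproduction_successful() -> bool:
    # Vulnerable state: the app works, but the exploit test fails.
    vulnerable_ok = passes("test/test_func.py") and not passes("test/test_vuln.py")

    # Apply the reference fix inside the container.
    subprocess.run(
        ["docker", "compose", "exec", "-T", "app", "bash", "task-deps/solution.sh"],
        check=True,
    )

    # Fixed state: the app still works and the exploit no longer succeeds.
    fixed_ok = passes("test/test_func.py") and passes("test/test_vuln.py")
    return vulnerable_ok and fixed_ok


if __name__ == "__main__":
    print("reproduction OK" if cve_reproduction_successful() else "reproduction failed")
```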
Key settings in config.yaml to optimize your run:
| Section | Setting | Description |
|---|---|---|
| Orchestrator | `max_concurrent_cves` | Control how many CVEs are processed in parallel. Reduce this if you hit API rate limits. |
| Agents | `limits` | Set concurrency caps for specific stages (e.g., limit `builder` to save disk/CPU). |
| Models | `models.default` | Switch underlying LLMs (e.g., Claude 4.5 Sonnet vs Opus). |
```yaml
# Example config.yaml tweak
orchestrator:
  max_concurrent_cves: 3   # Lower concurrency for stability

agents:
  limits:
    builder: 2             # Prevent Docker from consuming all resources
```

- DinD Environment - Docker-in-Docker setup guide (start here)
- Scripts - Manual debugging and verification scripts
- Architecture - Detailed system design and data flow
- Agent Management - Orchestration and resource control
- Communication - Inter-agent message protocols
- Future Roadmap - Planned improvements and features
We are actively developing OneFactory, a Unified Synthetic Framework that integrates Terminal, SWE, and Security (CVE) capabilities into a comprehensive 3-in-1 agentic data pipeline.
Based on CVE-Factory, we have developed LiveCVEBench and released the first version of the benchmark, training data, and Abacus-cve model. We will continue to expand the benchmark and optimize our SFT & RL training recipes. Stay tuned for more updates!
We are continuously expanding and updating this project. If you have any suggestions or would like to join/contribute to this project, please contact xzluo@ir.hit.edu.cn!
MIT License
```bibtex
@software{cve-factory,
  author = {Luo, Xianzhen and Zhang, Jingyuan and Zhou, Shiqi and Huang, Rain and Xiao, Chuan and Zhu, Qingfu and Ma, Zhiyuan and Xing, Yue and Yue, Yang and Zeng, Wencong and Che, Wanxiang},
  title = {CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability},
  year = {2025},
  url = {https://github.com/livecvebench/CVE-Factory}
}
```