📄 Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

PaperCoder Overview

📄 Read the paper on arXiv

PaperCoder is a multi-agent LLM system that transforms a machine learning paper into a code repository. It follows a three-stage pipeline (planning, analysis, and code generation), with each stage handled by specialized agents.
Our method outperforms strong baselines on both the Paper2Code and PaperBench benchmarks and produces faithful, high-quality implementations.


πŸ—ΊοΈ Table of Contents


⚡ Quick Start

🔑 API Keys Setup

First, configure your API keys by creating a .env file:

# Copy the example file and edit with your keys
cp .env.example .env

# Edit .env file with your actual API keys:
# OPENAI_API_KEY=sk-proj-your-openai-key
# ANTHROPIC_API_KEY=sk-ant-api03-your-anthropic-key  
# GEMINI_API_KEY=your-gemini-key

Using OpenAI API

  • 💵 Estimated cost for using o3-mini: $0.50–$0.70
pip install openai

export OPENAI_API_KEY="<OPENAI_API_KEY>"

cd scripts
bash run.sh

🔀 LLM Router

The router configuration lives in llm_router/config.yaml.

| Task Pattern | Primary Model | Fallback |
|---|---|---|
| `chat\|faq\|rag` | gemini_flash_25 | claude_sonnet_35 |
| `code\|unit_tests` | claude_sonnet_37 | o4mini |
| `long_doc>300k` | gpt41 | claude_sonnet_35 |
| `tool_reasoning` | o4mini | gemini_flash_25 |

Override the config by setting LLM_CFG:

export LLM_CFG=/path/to/custom.yaml
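For illustration, the routing rules above can be sketched in Python. The route table below mirrors this README's table, but the function name and the fallback mechanics are assumptions about router behavior, not the repo's actual API:

```python
# Hypothetical sketch of pattern-based model routing; ROUTES mirrors the
# table from llm_router/config.yaml, pick_model() is illustrative.
import re

ROUTES = [
    (r"chat|faq|rag",    "gemini_flash_25",  "claude_sonnet_35"),
    (r"code|unit_tests", "claude_sonnet_37", "o4mini"),
    (r"long_doc",        "gpt41",            "claude_sonnet_35"),
    (r"tool_reasoning",  "o4mini",           "gemini_flash_25"),
]

def pick_model(task: str, primary_available: bool = True) -> str:
    """Return the model for a task label, falling back if the primary is unavailable."""
    for pattern, primary, fallback in ROUTES:
        if re.search(pattern, task):
            return primary if primary_available else fallback
    raise ValueError(f"no route for task: {task}")
```

For example, pick_model("unit_tests") routes to claude_sonnet_37 and falls back to o4mini when the primary is down.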

Using Open Source Models with vLLM

  • If you encounter any issues installing vLLM, please refer to the official vLLM repository.
  • The default model is deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct.
pip install vllm

cd scripts
bash run_llm.sh

Output Folder Structure (Only Important Files)

outputs
β”œβ”€β”€ Transformer
β”‚   β”œβ”€β”€ analyzing_artifacts
β”‚   β”œβ”€β”€ coding_artifacts
β”‚   └── planning_artifacts
└── Transformer_repo # Final output repository

📚 Detailed Setup Instructions

🛠️ Environment Setup

  • 💡 To use the o3-mini version, make sure you have the latest openai package installed.
  • 📦 Install only what you need:
    • For OpenAI API: openai
    • For open-source models: vllm
pip install openai 
pip install vllm 
  • Or install all dependencies at once from the requirements file:
pip install -r requirements.txt

📄 (Option) Convert PDF to JSON

The following process describes how to convert a paper PDF into JSON format. If you have access to the LaTeX source and plan to use it with PaperCoder, you may skip this step and proceed to 🚀 Running PaperCoder.

Note: In our experiments, we converted all paper PDFs to JSON format. The original workflow relied on the s2orc-doc2json repository; as of 2025, more capable open-source libraries exist, so we provide multiple approaches below.

Option 1: Modern Vision-based Approach (Recommended)

We now provide a modern PDF to JSON converter that uses vision models (Gemini 2.5 Flash) instead of the legacy GROBID approach. This method is:

  • 95% cheaper than traditional approaches
  • Faster (no Java services required)
  • More accurate for complex layouts, formulas, and tables
# Install dependencies
pip install pdf2image pytesseract aiohttp tqdm

# With Gemini API (best quality)
export GEMINI_API_KEY="your-api-key"
python codes/pdf_to_json_modern.py -i paper.pdf -o output.json

# Or use the convenience script
cd scripts
./run_modern_pdf2json.sh ../examples/Transformer.pdf

For more details, see Modern PDF to JSON Documentation.

Option 2: Legacy GROBID Approach

If you prefer the traditional method, you can still use the s2orc-doc2json repository:

  1. Clone s2orc-doc2json and run its processing service:
git clone https://github.com/allenai/s2orc-doc2json.git
cd ./s2orc-doc2json/grobid-0.7.3
./gradlew run
  2. Convert the PDF into JSON format using the bundled script:
mkdir -p ./s2orc-doc2json/output_dir/paper_coder
python ./s2orc-doc2json/doc2json/grobid2json/process_pdf.py \
    -i ${PDF_PATH} \
    -t ./s2orc-doc2json/temp_dir/ \
    -o ./s2orc-doc2json/output_dir/paper_coder

Hybrid approach (recommended for 2025)

  1. Install modern PDF processing libraries.
pip install PyMuPDF pdfplumber layoutparser
  2. Ensure the latest grobid server (v0.8 or later) is running.

  3. Use the script codes/pdf_to_json_hybrid.py to combine page-level text extraction with metadata from grobid and produce a single JSON file:

python codes/pdf_to_json_hybrid.py \
    --pdf_path ${PDF_PATH} \
    --output_json ./paper_coder_output/paper.json \
    --grobid_url http://localhost:8070

This hybrid pipeline leverages modern layout analysis tools for accurate page content while still using grobid for reliable metadata extraction.

Simple approach (no grobid)

  1. Install lightweight dependencies.
pip install PyMuPDF pdf2image pytesseract camelot-py
  2. Run the script codes/pdf_to_json_simple.py:
python codes/pdf_to_json_simple.py \
    --pdf_path ${PDF_PATH} \
    --output_json ./paper_coder_output/paper.json

This method relies solely on PyMuPDF and OCR, optionally using camelot to extract tables.
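As a rough illustration of such a converter, here is a minimal sketch assuming PyMuPDF is installed; the output schema (title plus per-page text) is illustrative and simpler than what pdf_to_json_simple.py actually emits:

```python
# Minimal sketch of a PyMuPDF-based PDF -> JSON converter (no grobid, no OCR).
# Assumes `pip install PyMuPDF`; function and field names are illustrative.

def pages_to_paper_json(title, pages):
    """Assemble extracted page texts into a simple paper JSON structure."""
    return {
        "title": title,
        "pages": [
            {"page": i + 1, "text": text.strip()}
            for i, text in enumerate(pages)
        ],
    }

def pdf_to_json(pdf_path):
    import fitz  # PyMuPDF; imported lazily so the helper above stays dependency-free
    doc = fitz.open(pdf_path)
    pages = [page.get_text() for page in doc]
    return pages_to_paper_json(doc.metadata.get("title") or pdf_path, pages)
```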

🚀 Running PaperCoder

  • Note: The following command runs on the example paper (Attention Is All You Need).
    To run PaperCoder on your own paper, modify the environment variables accordingly.

Using OpenAI API

  • 💵 Estimated cost for using o3-mini: $0.50–$0.70
# Using the PDF-based JSON format of the paper
export OPENAI_API_KEY="<OPENAI_API_KEY>"

cd scripts
bash run.sh
# Using the LaTeX source of the paper
export OPENAI_API_KEY="<OPENAI_API_KEY>"

cd scripts
bash run_latex.sh

Using Open Source Models with vLLM

  • The default model is deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct.
# Using the PDF-based JSON format of the paper
cd scripts
bash run_llm.sh
# Using the LaTeX source of the paper
cd scripts
bash run_latex_llm.sh

📦 Paper2Code Benchmark Datasets

  • Huggingface dataset: paper2code

  • You can find the description of the Paper2Code benchmark dataset in data/paper2code.

  • For more details, refer to Section 4.1 "Paper2Code Benchmark" in the paper.


πŸ–ΌοΈ Enhanced Pipeline with Image Analysis

We've extended the original Paper2Code pipeline with advanced image analysis capabilities, using o4-mini-2025-04-16 for image processing and o3-2025-04-16 for code generation.

Complete Pipeline Steps

  1. Copy and Setup PDF
# Copy your paper to the working directory
cp /path/to/your/paper.pdf ./custom_paper/paper.pdf
  2. Start GROBID in a separate terminal
cd $HOME/grobid-0.7.3 && ./gradlew run

GROBID is required for extracting structured text from scientific PDFs.

  3. Convert PDF to JSON using GROBID
python s2orc-doc2json/doc2json/grobid2json/process_pdf.py -i "custom_paper/paper.pdf" -t custom_paper/temp_dir/ -o custom_paper/

This transforms the PDF into structured JSON with sections, paragraphs, and references.

  4. Preprocess JSON
python codes/0_pdf_process.py --input_json_path custom_paper/paper.json --output_json_path custom_paper/paper_cleaned.json

Cleans and enhances the JSON for better analysis.
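As a rough illustration of this kind of cleanup (the exact fields 0_pdf_process.py touches are an assumption based on the s2orc JSON shape), dropping empty paragraphs and collapsing whitespace might look like:

```python
# Illustrative sketch of paper-JSON cleanup; the "body_text" field name is
# an assumption based on the s2orc-doc2json output format.
def clean_paper_json(paper):
    """Drop empty paragraphs and collapse whitespace in a parsed-paper dict."""
    cleaned = dict(paper)
    cleaned["body_text"] = [
        {**p, "text": " ".join(p["text"].split())}  # collapse runs of whitespace
        for p in paper.get("body_text", [])
        if p.get("text", "").strip()                 # skip empty paragraphs
    ]
    return cleaned
```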

  5. Extract and Analyze Images with o4-mini-2025-04-16
python codes/extract_figures.py --pdf_path custom_paper/paper.pdf --json_path custom_paper/paper_cleaned.json --output_dir custom_paper --gpt_version o4-mini-2025-04-16

This step:

  • Extracts all images from the PDF
  • Uses o4-mini-2025-04-16 to create detailed descriptions of each image
  • Adds these descriptions to the JSON, creating enhanced_paper.json
  6. Planning with o3-2025-04-16
python codes/1_planning.py --paper_name YourPaperName --gpt_version o3-2025-04-16 --pdf_json_path custom_paper/enhanced_paper.json --output_dir outputs/YourPaperName_enhanced

Creates a detailed implementation plan using the enriched JSON with image descriptions.

  7. Configuration Extraction
python codes/1.1_extract_config.py --paper_name YourPaperName --output_dir outputs/YourPaperName_enhanced

Extracts configuration parameters from the plan for use in subsequent steps.

  8. Analysis with o3-2025-04-16
python codes/2_analyzing.py --paper_name YourPaperName --gpt_version o3-2025-04-16 --pdf_json_path custom_paper/enhanced_paper.json --output_dir outputs/YourPaperName_enhanced

Performs detailed analysis of system components, creating logical schemas for each module.

  9. Code Generation with o3-2025-04-16
python codes/3_coding.py --paper_name YourPaperName --gpt_version o3-2025-04-16 --pdf_json_path custom_paper/enhanced_paper.json --output_dir outputs/YourPaperName_enhanced --output_repo_dir outputs/YourPaperName_repo_enhanced

Generates the actual code implementing all system components based on planning and analysis results.
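The image-analysis step above folds model-written figure descriptions back into the paper JSON to produce enhanced_paper.json. A minimal sketch of that merge, with assumed field names ("figures", "description") rather than the script's exact schema:

```python
# Hypothetical sketch of enriching the cleaned paper JSON with figure
# descriptions produced by the vision model; field names are illustrative.
def enrich_with_figures(paper, descriptions):
    """descriptions: mapping of figure id -> description text from the vision model."""
    enhanced = dict(paper)
    enhanced["figures"] = [
        {"id": fig_id, "description": text}
        for fig_id, text in sorted(descriptions.items())
    ]
    return enhanced
```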

One-Step Execution

For convenience, you can use the enhanced script:

./scripts/run_custom_enhanced.sh

This script runs the entire pipeline with the appropriate configuration.

Key Pipeline Features

1. Two-Stage Processing

  • o4-mini-2025-04-16 for image analysis
  • o3-2025-04-16 for planning, analysis, and code generation

2. Cost Optimization via Prompt Caching

  • Static content (text + image descriptions) is placed at the beginning
  • Token caching between consecutive API calls
  • Cost reduction of approximately 50% for cached content
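The prompt layout this implies can be sketched as follows, under standard chat-message assumptions; actual cache behavior is provider-specific:

```python
# Keep the static paper context first and byte-identical across calls so the
# provider can cache the shared prefix; only the per-stage instruction varies.
def build_messages(static_paper_context, stage_instruction):
    return [
        {"role": "system", "content": static_paper_context},  # identical prefix -> cacheable
        {"role": "user", "content": stage_instruction},       # varies per pipeline stage
    ]
```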

3. Enhanced Image Processing

  • Automatic extraction of all figures from PDF
  • Image analysis using o4-mini-2025-04-16
  • Integration of descriptions into JSON for use by o3-2025-04-16

4. Modular Approach

  • Logical division into stages: planning, analysis, coding
  • Saving intermediate results
  • Ability to restart individual stages

5. Result

  • Structured implementation of the entire system
  • Complete reproduction of the paper methodology
  • Ready-to-use code in output_repo_dir

📊 Model-based Evaluation of Repositories Generated by PaperCoder

  • We evaluate repository quality using a model-based approach, supporting both reference-based and reference-free settings.
    The model critiques key implementation components, assigns severity levels, and generates a 1–5 correctness score averaged over 8 samples using o3-mini-high.

  • For more details, please refer to Section 4.3.1 (Paper2Code Benchmark) of the paper.

  • Note: The following examples evaluate the sample repository (Transformer_repo).
    Please modify the relevant paths and arguments if you wish to evaluate a different repository.
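The aggregation described above (a 1–5 correctness score averaged over the valid subset of 8 samples, matching the "Score" and "Valid" lines in the example output below) can be sketched as follows; names are illustrative:

```python
# Average a 1-5 correctness score over n sampled judgments, counting only
# valid ones; mirrors the "Score: 4.5000 / Valid: 8/8" style of output.
def aggregate_scores(samples):
    """samples: list of ints in 1..5, or None for invalid generations."""
    valid = [s for s in samples if s is not None and 1 <= s <= 5]
    score = sum(valid) / len(valid) if valid else 0.0
    return score, len(valid), len(samples)
```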

πŸ› οΈ Environment Setup

pip install tiktoken
export OPENAI_API_KEY="<OPENAI_API_KEY>"

πŸ“ Reference-free Evaluation

  • target_repo_dir is the generated repository.
cd codes/
python eval.py \
    --paper_name Transformer \
    --pdf_json_path ../examples/Transformer_cleaned.json \
    --data_dir ../data \
    --output_dir ../outputs/Transformer \
    --target_repo_dir ../outputs/Transformer_repo \
    --eval_result_dir ../results \
    --eval_type ref_free \
    --generated_n 8 \
    --papercoder

πŸ“ Reference-based Evaluation

  • target_repo_dir is the generated repository.
  • gold_repo_dir should point to the official repository (e.g., author-released code).
cd codes/
python eval.py \
    --paper_name Transformer \
    --pdf_json_path ../examples/Transformer_cleaned.json \
    --data_dir ../data \
    --output_dir ../outputs/Transformer \
    --target_repo_dir ../outputs/Transformer_repo \
    --gold_repo_dir ../examples/Transformer_gold_repo \
    --eval_result_dir ../results \
    --eval_type ref_based \
    --generated_n 8 \
    --papercoder

📄 Example Output

========================================
🌟 Evaluation Summary 🌟
📄 Paper name: Transformer
🧪 Evaluation type: ref_based
📝 Target repo directory: ../outputs/Transformer_repo
📊 Evaluation result:
        📈 Score: 4.5000
        ✅ Valid: 8/8
========================================
🌟 Usage Summary 🌟
[Evaluation] Transformer - ref_based
🛠️ Model: o3-mini
📥 Input tokens: 44318 (Cost: $0.04874980)
📦 Cached input tokens: 0 (Cost: $0.00000000)
📤 Output tokens: 26310 (Cost: $0.11576400)
💵 Current total cost: $0.16451380
🪙 Accumulated total cost so far: $0.16451380
========================================
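The printed costs can be recomputed from per-million-token rates; the rates that reproduce them exactly are $1.10 input and $4.40 output per million tokens (the o4-mini rates in the pricing section of this README):

```python
# Recompute the costs in the usage summary above from per-token rates.
def api_cost(input_tokens, output_tokens, in_per_m, out_per_m):
    """Return (input_cost, output_cost) in dollars for the given token counts."""
    return input_tokens * in_per_m / 1e6, output_tokens * out_per_m / 1e6

inp, out = api_cost(44318, 26310, 1.10, 4.40)
# inp ~ $0.0487498, out ~ $0.1157640, total ~ $0.1645138 -- matching the summary
```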


💵 Official AI Model API Pricing (May 2025)

The following prices were collected from official documentation in May 2025. All values are shown per million tokens.

OpenAI Models

  • o4-mini-2025-04-16: Input $1.10, Output $4.40 – fast, cost‑efficient reasoning with multimodal support.
  • gpt-4.1-2025-04-14: Input $2.00, Output $8.00 – improved coding and instruction following with a 1M token context window.
  • o3-2025-04-16: Input $10.00 (cached input $2.50), Output $40.00 – OpenAI's most powerful reasoning model with a 200K token context window.

Google Gemini Models

  • Gemini 2.5 Flash (preview):
    • Input: Text/Image/Video $0.15, Audio $1.00
    • Output: Non-thinking mode $0.60, Thinking mode $3.50
    • First Flash model with thinking capabilities (preview).
  • Gemini 2.5 Pro (preview):
    • Input ≀ 200k tokens $1.25, > 200k tokens $2.50
    • Output ≀ 200k tokens $10.00, > 200k tokens $15.00
    • Most advanced Gemini reasoning model with a 1M token context window.

Prices may change as these models move from preview to general availability. Consult the respective provider pages for the latest information.
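As a worked example of the tiered Gemini 2.5 Pro input pricing above, assuming the higher rate applies to the whole request once it crosses 200k tokens (providers differ on marginal vs. whole-request tiering, so verify against the official docs):

```python
# Illustrative tiered input-cost calculator for Gemini 2.5 Pro (preview) pricing.
def gemini_pro_input_cost(tokens):
    """Dollar cost for a prompt: $1.25/M up to 200k tokens, $2.50/M beyond."""
    rate = 1.25 if tokens <= 200_000 else 2.50
    return tokens * rate / 1e6
```

Under this assumption, a 150k-token prompt costs $0.1875 and a 300k-token prompt costs $0.75.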
