PaperCoder is an AI-driven research automation system designed to assist researchers in planning, coding, and evaluating machine learning experiments. It integrates Large Language Models (LLMs) with iterative feedback loops to refine the code generation pipeline and improve the quality of code passed to downstream evaluation.
This work was part of a research project at the National University of Singapore exploring how feedback loops and prompt engineering can optimize AI-driven research assistants.
PaperCoder classifies system critiques into three stages (a short sketch in code follows this block):
- 🧠 Planning: Checks for missing or incorrect high-level research designs (e.g., wrong algorithm selection).
- 🧪 Analyzing: Detects issues in initial code generation, data processing logic, and module implementation.
- 📊 Evaluating: Identifies faults in evaluation metrics, test cases, and reporting.
✅ Why it matters:
This allows PaperCoder to pinpoint failure sources and optimize stage-specific processes for better overall code quality.
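As a rough illustration of this step, the sketch below asks an LLM to label a critique with one of the three stages, assuming the openai>=1.0 Python SDK. The `Stage` values, prompt wording, and model name are assumptions made for this example, not PaperCoder's actual identifiers.

```python
# A minimal sketch of the classification step, assuming the openai>=1.0 Python SDK.
# The Stage labels, prompt wording, and model name are illustrative assumptions,
# not PaperCoder's actual identifiers.
from dataclasses import dataclass
from enum import Enum

from openai import OpenAI


class Stage(Enum):
    PLANNING = "planning"      # high-level research design issues
    ANALYZING = "analyzing"    # code generation and data processing issues
    EVALUATING = "evaluating"  # metrics, test cases, and reporting issues


@dataclass
class Critique:
    text: str
    stage: Stage


def classify_critique(client: OpenAI, critique_text: str) -> Critique:
    """Ask the LLM to assign a critique to one of the three pipeline stages."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the critique into exactly one stage: planning, "
                    "analyzing, or evaluating. Reply with the stage name only."
                ),
            },
            {"role": "user", "content": critique_text},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return Critique(text=critique_text, stage=Stage(label))
```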
PaperCoder introduces an iterative refinement mechanism in the analyzing phase (sketched below):
- Collects critiques from the first code generation pass.
- Feeds them back into the LLM with tailored prompts.
- Generates improved code for downstream evaluation.
✅ Why it matters:
Mimics human researchers refining their code after peer review, enabling PaperCoder to "learn" from its mistakes within one generation cycle.
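The sketch below shows what one such generate-critique-regenerate cycle could look like, again assuming the openai>=1.0 Python SDK. The `_ask` helper, prompt text, and model name are hypothetical stand-ins for PaperCoder's internals.

```python
# A minimal sketch of one generate -> critique -> regenerate cycle, assuming the
# openai>=1.0 Python SDK. The prompts and the `_ask` helper are hypothetical
# stand-ins for PaperCoder's internals.
from openai import OpenAI

MODEL = "gpt-4o-mini"  # placeholder model name


def _ask(client: OpenAI, system: str, user: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content


def refine_once(client: OpenAI, paper_context: str) -> str:
    """Run a single refinement cycle within one generation pass."""
    # First pass: generate code from the paper's methodology.
    code = _ask(client, "Implement the described methodology in Python.", paper_context)
    # Collect critiques of the first pass.
    critiques = _ask(client, "List concrete defects in this code, one per line.", code)
    # Feed the critiques back with a tailored prompt and regenerate.
    feedback = (
        f"Paper context:\n{paper_context}\n\n"
        f"Previous code:\n{code}\n\n"
        f"Critiques to address:\n{critiques}"
    )
    return _ask(client, "Rewrite the code so that every critique is resolved.", feedback)
```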
PaperCoder customizes the prompts sent to the LLM for each stage (see the sketch after this list):
- In the analyzing stage, prompts request modular, maintainable code with explicit comments and alignment with the paper's methodology.
- Critiques carry severity tags (High, Medium, Low) so that critical fixes are prioritized.
✅ Why it matters:
Improves first-try correctness and reduces cascading errors into later stages.
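The snippet below sketches how severity-tagged critiques might be ordered into an analyzing-stage prompt. The `SEVERITY_ORDER` mapping, template text, and example critiques are illustrative assumptions, not the project's exact scheme.

```python
# A sketch of severity-tagged prompting for the analyzing stage. The tag names,
# ordering heuristic, and template text are illustrative assumptions.
SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}

ANALYZING_PROMPT = (
    "Write modular, maintainable Python with explicit comments, and keep the "
    "implementation aligned with the paper's methodology.\n"
    "Address these critiques in priority order:\n{critiques}"
)


def format_critiques(critiques: list[tuple[str, str]]) -> str:
    """Order (severity, text) pairs so High-severity fixes come first."""
    ordered = sorted(critiques, key=lambda c: SEVERITY_ORDER[c[0]])
    return "\n".join(f"[{severity}] {text}" for severity, text in ordered)


prompt = ANALYZING_PROMPT.format(critiques=format_critiques([
    ("Low", "Variable names shadow Python builtins."),
    ("High", "Loss function does not match the paper's definition."),
]))
```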
PaperCoder benchmarks generated code against gold-standard repositories (a reporting sketch follows):
- Compares implementation details.
- Tracks severity levels of critiques.
- Outputs structured JSON and LaTeX reports summarizing results.
✅ Why it matters:
Provides an objective measure of code fidelity and highlights areas for improvement.
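As an illustration of the structured-report step, the sketch below aggregates severity-tagged critiques into a JSON summary. The field names and the `write_report` helper are hypothetical, and the LaTeX export is omitted.

```python
# A sketch of the JSON half of the reporting step: aggregate severity-tagged
# critiques into a structured summary. Field names and the `write_report`
# helper are hypothetical; the LaTeX export is omitted.
import json
from collections import Counter


def write_report(critiques: list[tuple[str, str]], path: str) -> None:
    """critiques: (severity, text) pairs from comparing generated vs. gold code."""
    counts = Counter(severity for severity, _ in critiques)
    report = {
        "total_critiques": len(critiques),
        "by_severity": dict(counts),  # e.g. {"High": 2, "Low": 5}
        "critiques": [
            {"severity": severity, "text": text} for severity, text in critiques
        ],
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
```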
- Integrated and deployed as part of a research project on AI-driven research automation using iterative refinement methods.
- Currently being drafted into a research paper for publication.
Built with:
- Python: Core backend logic
- OpenAI API: LLM-based code generation
- tqdm, argparse: CLI tooling and progress tracking (see the entry-point sketch below)
- AWS S3: Optional artifact storage and retrieval
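A hypothetical entry point showing how this CLI tooling might fit together; the flags, defaults, and stage names are illustrative, not the repo's actual interface.

```python
# A hypothetical entry point showing how the CLI tooling above might fit
# together; the flags, defaults, and stage names are illustrative, not the
# repo's actual interface.
import argparse

from tqdm import tqdm


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the PaperCoder pipeline.")
    parser.add_argument("--paper", required=True, help="Path to the paper text")
    parser.add_argument("--out", default="outputs/generated_code", help="Output directory")
    args = parser.parse_args()

    # Each stage's logic lives in its own module; tqdm reports progress
    # while the pipeline consumes args.paper and writes to args.out.
    for stage in tqdm(["planning", "analyzing", "evaluating"], desc="pipeline"):
        pass  # dispatch to the stage's module here


if __name__ == "__main__":
    main()
```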
Project structure:

```
PaperCoder/
├── analyzing/
│   ├── analyzing.py
│   └── feedback_loop.py
├── utils/
│   ├── extract_planning.py
│   └── evaluation.py
├── outputs/
│   └── generated_code/
└── README.md
```
Mihir Shah
📧 mihirsunilshah@gmail.com
🔗 LinkedIn • GitHub