keyuchen21/auto-research

Auto-Research

An automated research paper generation and management system that leverages AI to create comprehensive survey papers and manage academic literature.

🚀 Overview

Auto-Research is a comprehensive toolkit for automating academic research workflows, featuring:

  • Automated Survey Paper Generation: Generate survey papers on a given topic using LLMs served through providers such as Ollama or OpenAI
  • ArXiv Paper Synchronization: Download and manage papers from ArXiv with intelligent categorization
  • Vector-Based Paper Search: Find relevant papers using semantic similarity search
  • IEEE-Style LaTeX Formatting: Generate publication-ready papers in IEEE format

πŸ“ Project Structure

auto-research/
├── sync_paper/          # ArXiv paper synchronization and database management
│   ├── src/            # Core synchronization modules
│   └── test/           # Test suites
├── write_paper/        # Automated paper generation system
│   ├── src/            # Paper generation pipeline
│   │   ├── models/     # Data models
│   │   ├── nodes/      # Processing nodes
│   │   └── providers/  # LLM providers (Ollama, OpenAI, etc.)
│   └── IEEE_Conference_Template/  # IEEE LaTeX templates
└── Cline/              # MCP server implementations
    └── MCP/
        ├── Ollama-mcp/        # Ollama MCP server
        └── mcp-server-firecrawl/  # Firecrawl MCP server

🎯 Key Features

1. Paper Synchronization (sync_paper)

  • Automated ArXiv Download: Fetches papers from ArXiv dataset via Kaggle API
  • Smart Categorization: Filters papers by ML/AI categories (cs.AI, cs.CL, cs.CV, cs.LG, stat.ML)
  • PostgreSQL Integration: Stores papers with pgvector for efficient similarity search
  • Duplicate Detection: Prevents re-uploading existing papers
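
The category filter and duplicate check above could look roughly like the following. This is a minimal sketch, not the code in sync_paper/src: it assumes each metadata record is a JSON line with an `id` and a space-separated `categories` string, and the names `TARGET_CATEGORIES` and `select_new_papers` are hypothetical.

```python
import json

# Categories listed in this README; the constant name is hypothetical.
TARGET_CATEGORIES = {"cs.AI", "cs.CL", "cs.CV", "cs.LG", "stat.ML"}


def select_new_papers(json_lines, known_ids):
    """Yield records in a target category whose id has not been seen yet.

    json_lines: iterable of JSON strings, one paper per line (assumed layout).
    known_ids: set of ids already stored; it is updated as papers are yielded.
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        categories = set(record.get("categories", "").split())
        if not categories & TARGET_CATEGORIES:
            continue  # no overlap with the ML/AI categories
        paper_id = record["id"]
        if paper_id in known_ids:
            continue  # duplicate: already stored or seen earlier in this run
        known_ids.add(paper_id)
        yield record


sample = [
    '{"id": "0001", "title": "A", "categories": "cs.LG stat.ML"}',
    '{"id": "0002", "title": "B", "categories": "math.AG"}',
    '{"id": "0001", "title": "A again", "categories": "cs.LG"}',
]
selected = list(select_new_papers(sample, known_ids=set()))
print([paper["id"] for paper in selected])
```

In a real run, `known_ids` would presumably be seeded from the ids already in the database so that repeated syncs skip existing papers.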

2. Survey Paper Generation (write_paper)

Standard Pipeline

  • Topic analysis and outline generation
  • Vector similarity search for relevant papers
  • Content synthesis using LLMs
  • IEEE LaTeX formatting
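
To make the vector similarity step concrete, here is a small in-memory sketch of ranking papers by cosine similarity. It is illustrative only: the project stores embeddings in PostgreSQL with pgvector, and the helper names `cosine_similarity` and `top_k_papers` as well as the toy three-dimensional embeddings are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_papers(query_embedding, papers, k):
    """Return the k (title, embedding) pairs most similar to the query."""
    ranked = sorted(
        papers,
        key=lambda paper: cosine_similarity(query_embedding, paper[1]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings for illustration; real embeddings have many more dimensions.
papers = [
    ("Paper on vision", [1.0, 0.0, 0.0]),
    ("Paper on language", [0.0, 1.0, 0.0]),
    ("Paper on both", [0.7, 0.7, 0.0]),
]
best = top_k_papers([0.1, 0.9, 0.0], papers, k=2)
print([title for title, _ in best])
```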

AutoSurvey Pipeline (Advanced)

A four-stage pipeline aimed at higher-quality surveys:

  1. Initial Retrieval & Outline Generation: Creates structured hierarchical outline
  2. Subsection Drafting: Targeted retrieval and drafting for each section
  3. Integration & Refinement: Refines and integrates sections cohesively
  4. Rigorous Evaluation: Iterative improvement based on quality metrics
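
The four stages above can be sketched as a single function. This is an assumption-laden outline rather than the project's implementation: `run_autosurvey`, its `retrieve`, `generate`, and `score` callables, the prompts, and the `min_score` and `max_rounds` defaults are all hypothetical stand-ins for the vector search, the LLM provider, and the quality metrics.

```python
from typing import Callable, Dict, List


def run_autosurvey(
    topic: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    score: Callable[[str], float],
    min_score: float = 0.8,
    max_rounds: int = 3,
) -> str:
    # Stage 1: initial retrieval and outline generation (one section per line).
    abstracts = retrieve(topic)
    outline = generate(f"Outline a survey on {topic} using:\n" + "\n".join(abstracts))
    sections = [line.strip() for line in outline.splitlines() if line.strip()]

    # Stage 2: targeted retrieval and drafting for each section.
    drafts: Dict[str, str] = {}
    for title in sections:
        context = retrieve(f"{topic}: {title}")
        drafts[title] = generate(f"Draft the section '{title}' using:\n" + "\n".join(context))

    # Stage 3: integrate the drafts into one document.
    survey = generate("Merge into one coherent survey:\n" + "\n\n".join(drafts.values()))

    # Stage 4: evaluate and revise until the score is good enough or rounds run out.
    for _ in range(max_rounds):
        if score(survey) >= min_score:
            break
        survey = generate("Improve this survey:\n" + survey)
    return survey


# Stub components so the sketch runs without a database or a model.
def fake_retrieve(query):
    return [f"abstract about {query}"]


def fake_generate(prompt):
    if prompt.startswith("Outline"):
        return "Introduction\nMethods"
    return "generated text"


survey = run_autosurvey("Multimodal Learning", fake_retrieve, fake_generate, score=lambda text: 1.0)
print(survey)
```

Passing the components in as callables keeps the control flow testable without Ollama or PostgreSQL running.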

3. MCP Servers

  • Ollama MCP: Integration with Ollama for local LLM inference
  • Firecrawl MCP: Web scraping and content extraction capabilities

🛠️ Installation

Prerequisites

  • Python 3.8+ (3.12+ recommended)
  • Docker and Docker Compose
  • PostgreSQL with pgvector extension
  • Ollama (for local LLM inference)
  • Node.js (for MCP servers)

Quick Start

  1. Clone the repository
git clone https://github.com/keyuchen21/auto-research.git
cd auto-research
  2. Set up PostgreSQL with pgvector
docker compose up -d
  3. Install Python dependencies

For paper synchronization:

cd sync_paper
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

For paper generation:

cd write_paper
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
  4. Configure Kaggle API (for ArXiv sync)
  5. Start Ollama (for paper generation)
ollama run llama2  # or your preferred model

📖 Usage

Sync ArXiv Papers

cd sync_paper
uv run src/upload_paper.py

Generate Survey Paper

Standard pipeline:

cd write_paper
python -m src.main --topic "Large Language Models" \
                   --output output_directory \
                   --model llama2 \
                   --reference-num 1500

AutoSurvey pipeline (recommended):

python -m src.main --topic "Multimodal Learning" \
                   --output output_directory \
                   --model llama2 \
                   --reference-num 1500 \
                   --autosurvey

Parameters

  • --topic: Research topic for the survey (required)
  • --output: Output directory (default: "output")
  • --model: Ollama model to use (default: "llama2")
  • --reference-num: Number of papers to consider (default: 1500)
  • --autosurvey: Enable advanced AutoSurvey pipeline
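
For reference, the flags and defaults listed above map onto `argparse` roughly as follows. This is a sketch of an equivalent parser, not a copy of src/main.py; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags and defaults documented in this README."""
    parser = argparse.ArgumentParser(description="Generate a survey paper.")
    parser.add_argument("--topic", required=True, help="Research topic for the survey")
    parser.add_argument("--output", default="output", help="Output directory")
    parser.add_argument("--model", default="llama2", help="Ollama model to use")
    parser.add_argument("--reference-num", type=int, default=1500,
                        help="Number of papers to consider")
    parser.add_argument("--autosurvey", action="store_true",
                        help="Enable the AutoSurvey pipeline")
    return parser


# Only --topic is required; everything else falls back to its default.
args = build_parser().parse_args(["--topic", "Multimodal Learning", "--autosurvey"])
print(args.topic, args.output, args.model, args.reference_num, args.autosurvey)
```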

📊 Database Schema

The system uses PostgreSQL with pgvector for efficient similarity search:

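-- pgvector must be enabled once per database before VECTOR columns can be used
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE papers (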
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    authors TEXT[],
    abstract TEXT,
    categories TEXT[],
    url TEXT,
    published_date DATE,
    embedding VECTOR(1536)  -- For similarity search
);
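
A similarity lookup against this table might be built as shown below. This is a hedged sketch: it relies on pgvector's `<=>` cosine-distance operator and its `[x,y,...]` text format, uses `%s` placeholders in the style of psycopg, and the helper names `to_pgvector_literal` and `build_similarity_query` are hypothetical. It only constructs the query, so it runs without a database.

```python
def to_pgvector_literal(embedding):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_similarity_query(embedding, limit=10):
    """Return (sql, params) for a nearest-neighbour search by cosine distance."""
    sql = (
        "SELECT id, title, url, embedding <=> %s::vector AS distance "
        "FROM papers "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (to_pgvector_literal(embedding), limit)


# A 3-dimensional vector is used only to keep the printed output short;
# a real query vector must have 1536 dimensions to match the schema.
sql, params = build_similarity_query([0.1, 1, 2.5], limit=5)
print(sql)
print(params)
```

The `(sql, params)` pair is shaped so it could be passed to a DB-API cursor's `execute` method, which keeps the embedding out of the SQL string itself.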

🔧 Configuration

Environment Variables

Create a .env file in the project root:

# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=research_db
DB_USER=postgres
DB_PASSWORD=postgres

# Ollama
OLLAMA_HOST=http://localhost:11434

# OpenAI (optional)
OPENAI_API_KEY=your_api_key_here
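
One way these variables could be consumed from Python is sketched below. It assumes the values are already present in the process environment (for example, exported by the shell or loaded from .env by a helper; how the project itself loads the file is not stated here), and the names `load_db_settings` and `to_dsn` are hypothetical.

```python
import os


def load_db_settings(env=None):
    """Read the DB_* variables, falling back to the defaults shown in this README."""
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "research_db"),
        "user": env.get("DB_USER", "postgres"),
        "password": env.get("DB_PASSWORD", "postgres"),
    }


def to_dsn(settings):
    """Build a PostgreSQL connection URI from the settings dictionary."""
    return "postgresql://{user}:{password}@{host}:{port}/{dbname}".format(**settings)


# Passing an explicit mapping makes the example independent of the real environment.
settings = load_db_settings({"DB_HOST": "db.internal", "DB_PORT": "6543"})
print(to_dsn(settings))
```

Passwords containing characters such as `@` or `/` would need URL-encoding before being placed in a connection URI.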

πŸ“ Output Format

Generated papers include:

  • LaTeX Source: IEEE-formatted .tex file
  • Structured Content:
    • Title and abstract
    • Introduction with background
    • Methodology sections
    • Results and discussion
    • Conclusions
    • References in IEEE style
  • Metadata: Generation parameters and statistics
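
As a rough illustration of the LaTeX output, the sketch below renders a title, abstract, and sections into a skeleton that uses the IEEEtran conference class. It is not the project's formatter: `render_ieee_tex` is a hypothetical helper, it omits authors and references, and it performs no escaping of LaTeX special characters.

```python
def render_ieee_tex(title, abstract, sections):
    """Return a minimal IEEEtran conference document as a string.

    sections: list of (heading, body) pairs, emitted in order.
    """
    lines = [
        r"\documentclass[conference]{IEEEtran}",
        r"\begin{document}",
        r"\title{" + title + "}",
        r"\maketitle",
        r"\begin{abstract}",
        abstract,
        r"\end{abstract}",
    ]
    for heading, body in sections:
        lines.append(r"\section{" + heading + "}")
        lines.append(body)
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


tex = render_ieee_tex(
    "A Survey of Multimodal Learning",
    "This survey reviews recent work.",
    [("Introduction", "Background goes here."), ("Conclusions", "Summary goes here.")],
)
print(tex)
```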

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • ArXiv for providing open access to research papers
  • Ollama team for local LLM inference
  • PostgreSQL and pgvector for efficient vector storage
  • IEEE for LaTeX templates

📞 Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing documentation in subdirectory READMEs
