pat-walther/LoL

PDF to Markdown Extractor

This project extracts content from the 2025-CPG-DIGITAL_Final_Secure.pdf file and converts it into both markdown and structured JSON formats.

Features

  • Page-by-page extraction: Processes each page individually
  • Table detection: Automatically detects and preserves table structures
  • Markdown output: Clean, readable markdown format
  • JSON database: Structured data for programmatic access
  • Section detection: Identifies headers and sections

Setup

  1. Install dependencies:

    npm install

  2. Upload your PDF: Place the 2025-CPG-DIGITAL_Final_Secure.pdf file in the root directory of this project.

Usage

Run the extraction:

npm run extract

Output

The script generates three files in the output/ directory:

  1. extracted-content.md: Full markdown document with all pages, sections, and tables
  2. extracted-content.json: Complete page-by-page JSON structure with raw text
  3. database.json: Structured database format with separated sections, tables, and content

Database Structure

The database.json file provides a structured format:

{
  "document": {
    "id": "cpg-digital-2025",
    "title": "2025 CPG Digital",
    "totalPages": 100,
    "extractedAt": "2025-11-13T..."
  },
  "sections": [
    {
      "page": 1,
      "title": "SECTION TITLE"
    }
  ],
  "tables": [
    {
      "id": "page5_table1",
      "page": 5,
      "headers": ["Column 1", "Column 2"],
      "data": [["value1", "value2"]]
    }
  ],
  "content": [
    {
      "page": 1,
      "text": ["paragraph 1", "paragraph 2"]
    }
  ]
}
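
As an illustration of programmatic access, the following is a minimal sketch of how the tables array could be queried. The helper names tablesOnPage and tableToRecords are hypothetical and not part of this project; the field names are taken from the example structure above, and the inline sampleDb stands in for a real database.json.

```javascript
// Hypothetical helpers (not part of this project); field names follow
// the database.json example shown above.

// Return every table that was extracted from the given page.
function tablesOnPage(db, page) {
  return db.tables.filter((table) => table.page === page);
}

// Convert a table's rows into objects keyed by header name.
function tableToRecords(table) {
  return table.data.map((row) =>
    Object.fromEntries(table.headers.map((header, i) => [header, row[i]]))
  );
}

// Inline sample mirroring the documented structure (assumed data).
const sampleDb = {
  tables: [
    {
      id: "page5_table1",
      page: 5,
      headers: ["Column 1", "Column 2"],
      data: [["value1", "value2"]],
    },
  ],
};

for (const table of tablesOnPage(sampleDb, 5)) {
  console.log(table.id, tableToRecords(table));
}
```

To run this against real output, sampleDb could be replaced with the parsed file, for example JSON.parse(fs.readFileSync("output/database.json", "utf8")) after importing Node's fs module.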

Table Extraction

The script uses heuristic table detection that:

  • Identifies rows with multiple columns (separated by 2+ spaces or tabs)
  • Validates that consecutive rows have a consistent structure
  • Renders detected tables as Markdown tables
  • Preserves table data in JSON for easy querying
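
The heuristic above can be sketched roughly as follows. This is not the project's actual implementation: the function names splitRow, detectTables, and toMarkdown are hypothetical, and the sketch assumes that a table is a run of at least two consecutive lines with the same column count, with the first line treated as the header row.

```javascript
// Hypothetical sketch of the detection heuristic (not the project's code).

// Split a line into cells on tabs or on runs of two or more spaces.
function splitRow(line) {
  return line
    .trim()
    .split(/\s*\t\s*| {2,}/)
    .filter((cell) => cell.length > 0);
}

// Group consecutive multi-column lines with a consistent column count.
// Assumption: the first row of each group is the header row.
function detectTables(lines) {
  const tables = [];
  let current = [];
  const flush = () => {
    if (current.length >= 2) {
      tables.push({ headers: current[0], data: current.slice(1) });
    }
    current = [];
  };
  for (const line of lines) {
    const cells = splitRow(line);
    const isRow = cells.length >= 2;
    const fits = current.length === 0 || cells.length === current[0].length;
    if (isRow && fits) {
      current.push(cells);
    } else {
      flush();
      if (isRow) current.push(cells);
    }
  }
  flush();
  return tables;
}

// Render a detected table as a Markdown table.
function toMarkdown(table) {
  const row = (cells) => `| ${cells.join(" | ")} |`;
  const divider = row(table.headers.map(() => "---"));
  return [row(table.headers), divider, ...table.data.map(row)].join("\n");
}

const sampleLines = [
  "Intro paragraph with single spaces.",
  "Name   Qty\tPrice",
  "Apple  3    1.20",
  "Pear   5    0.80",
  "",
  "Closing text.",
];
const tables = detectTables(sampleLines);
console.log(toMarkdown(tables[0]));
```

The headers and data fields match the shape of the tables entries in database.json, so a detected table could be stored as JSON without further conversion. Cells that contain a pipe character would need escaping before Markdown output, which this sketch does not handle.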

Next Steps

Once the PDF file is in the root directory, run npm run extract. The script will:

  1. Read the PDF page by page
  2. Detect tables and sections
  3. Generate all three output files
  4. Display progress and completion status

Requirements

  • Node.js 14+ (for ES modules support)
  • The PDF file in the root directory
