PDF to Markdown Extractor

This project extracts content from the 2025-CPG-DIGITAL_Final_Secure.pdf file and converts it into two formats: Markdown and structured JSON.

Features

✅ Page-by-page extraction: Processes each page individually
✅ Table detection: Automatically detects and preserves table structures
✅ Markdown output: Clean, readable Markdown
✅ JSON database: Structured data for programmatic access
✅ Section detection: Identifies headers and sections

Setup

Install dependencies:
```
npm install
```
Upload your PDF: Place the 2025-CPG-DIGITAL_Final_Secure.pdf file in the root directory of this project.

Usage

Run the extraction:

npm run extract

Output

The script generates three files in the output/ directory:

extracted-content.md: Full Markdown document with all pages, sections, and tables
extracted-content.json: Complete page-by-page JSON structure with raw text
database.json: Structured database format with separated sections, tables, and content

Database Structure

The database.json file provides a structured format:

{
  "document": {
    "id": "cpg-digital-2025",
    "title": "2025 CPG Digital",
    "totalPages": 100,
    "extractedAt": "2025-11-13T..."
  },
  "sections": [
    {
      "page": 1,
      "title": "SECTION TITLE"
    }
  ],
  "tables": [
    {
      "id": "page5_table1",
      "page": 5,
      "headers": ["Column 1", "Column 2"],
      "data": [["value1", "value2"]]
    }
  ],
  "content": [
    {
      "page": 1,
      "text": ["paragraph 1", "paragraph 2"]
    }
  ]
}
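
As a minimal sketch of how this structure could be queried, the snippet below looks up the tables on a given page and turns each row into an object keyed by its headers. The helper names tablesOnPage and tableToRecords are hypothetical and not part of extract-pdf.js, and the sample object only mirrors the example above rather than real extracted data.

```javascript
// Hypothetical helpers (not part of extract-pdf.js); field names follow
// the database.json example shown above.

// Return every table recorded for a given page number.
function tablesOnPage(db, page) {
  return db.tables.filter((table) => table.page === page);
}

// Convert one table into an array of row objects keyed by header.
function tableToRecords(table) {
  return table.data.map((row) =>
    Object.fromEntries(table.headers.map((header, i) => [header, row[i]]))
  );
}

// In-memory sample with the same shape as the documented structure.
const sampleDb = {
  document: { id: "cpg-digital-2025", title: "2025 CPG Digital", totalPages: 100 },
  sections: [{ page: 1, title: "SECTION TITLE" }],
  tables: [
    {
      id: "page5_table1",
      page: 5,
      headers: ["Column 1", "Column 2"],
      data: [["value1", "value2"]],
    },
  ],
  content: [{ page: 1, text: ["paragraph 1", "paragraph 2"] }],
};

// Collect all rows from the tables on page 5 as keyed records.
const records = tablesOnPage(sampleDb, 5).flatMap(tableToRecords);
console.log(records);
```

To run this against real output, the sample object could be replaced with the parsed contents of output/database.json, for example via JSON.parse on the file text.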

Table Extraction

The script detects tables with a whitespace-based heuristic that:

Identifies rows with multiple columns (separated by 2+ spaces or tabs)
Validates table structure consistency
Formats tables as Markdown
Preserves table data in JSON for easy querying
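
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extract-pdf.js: the function names, the minimum of two rows per table, and the reading of "consistency" as "same number of columns in every row" are all assumptions, and the sample lines are made up.

```javascript
// Hypothetical sketch of whitespace-based table detection.

// Split a line into cells wherever there is a tab or a run of 2+ whitespace characters.
function splitColumns(line) {
  return line
    .split(/\t|\s{2,}/)
    .map((cell) => cell.trim())
    .filter((cell) => cell.length > 0);
}

// Group consecutive multi-column lines with the same column count into tables.
// Assumption: a candidate needs at least `minRows` rows to count as a table.
function detectTables(lines, minRows = 2) {
  const tables = [];
  let current = [];
  const flush = () => {
    if (current.length >= minRows) tables.push(current);
    current = [];
  };
  for (const line of lines) {
    const cells = splitColumns(line);
    const isRow = cells.length >= 2;
    const consistent = current.length === 0 || cells.length === current[0].length;
    if (isRow && consistent) {
      current.push(cells);
    } else {
      flush();
      if (isRow) current.push(cells); // start a new candidate table
    }
  }
  flush();
  return tables;
}

// Render a table as a Markdown pipe table, treating the first row as the header.
function toMarkdown(rows) {
  const [headers, ...data] = rows;
  const line = (cells) => `| ${cells.join(" | ")} |`;
  return [line(headers), line(headers.map(() => "---")), ...data.map(line)].join("\n");
}

// Made-up page text for illustration only.
const pageLines = [
  "Quarterly summary",
  "Region    Units    Revenue",
  "North     120      4500",
  "South\t95\t3100",
  "Totals are unaudited.",
];

const [table] = detectTables(pageLines);
console.log(toMarkdown(table));
```

The same rows array can be stored in JSON as headers (the first row) plus data (the remaining rows), which matches the shape of the tables entries in database.json.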

Next Steps

Once the PDF file is in the root directory, run npm run extract. The script will:

Read the PDF page by page
Detect tables and sections
Generate all three output files
Display progress and completion status

Requirements

Node.js 14+ (for ES modules support)
The PDF file in the root directory
