This project extracts content from the 2025-CPG-DIGITAL_Final_Secure.pdf file and converts it into both markdown and structured JSON formats.
- ✅ Page-by-page extraction: Processes each page individually
- ✅ Table detection: Automatically detects and preserves table structures
- ✅ Markdown output: Clean, readable markdown format
- ✅ JSON database: Structured data for programmatic access
- ✅ Section detection: Identifies headers and sections
- Install dependencies: `npm install`
- Upload your PDF: Place the 2025-CPG-DIGITAL_Final_Secure.pdf file in the root directory of this project.
- Run the extraction: `npm run extract`

The script generates three files in the output/ directory:
- extracted-content.md: Full markdown document with all pages, sections, and tables
- extracted-content.json: Complete page-by-page JSON structure with raw text
- database.json: Structured database format with separated sections, tables, and content
The database.json file provides a structured format:
{
  "document": {
    "id": "cpg-digital-2025",
    "title": "2025 CPG Digital",
    "totalPages": 100,
    "extractedAt": "2025-11-13T..."
  },
  "sections": [
    {
      "page": 1,
      "title": "SECTION TITLE"
    }
  ],
  "tables": [
    {
      "id": "page5_table1",
      "page": 5,
      "headers": ["Column 1", "Column 2"],
      "data": [["value1", "value2"]]
    }
  ],
  "content": [
    {
      "page": 1,
      "text": ["paragraph 1", "paragraph 2"]
    }
  ]
}

The script uses intelligent table detection that:
- Identifies rows with multiple columns (separated by 2+ spaces or tabs)
- Validates table structure consistency
- Renders detected tables as markdown tables
- Preserves table data in JSON for easy querying
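
The detection steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names (`splitColumns`, `detectTables`, `toMarkdown`), the `minRows` threshold, and the choice to treat the first row as the header are all assumptions made for the example.

```javascript
// Hypothetical helpers illustrating the heuristic; names are not from the project.

// Split a line into cells on tabs or on runs of 2+ spaces/tabs.
function splitColumns(line) {
  return line
    .trim()
    .split(/[ \t]{2,}|\t/)
    .map((cell) => cell.trim())
    .filter((cell) => cell.length > 0);
}

// Group consecutive multi-column lines that share a column count into tables.
// Assumption: the first row of each group is treated as the header row.
function detectTables(lines, minRows = 2) {
  const tables = [];
  let current = [];

  const flush = () => {
    if (current.length >= minRows) {
      const [headers, ...data] = current;
      tables.push({ headers, data });
    }
    current = [];
  };

  for (const line of lines) {
    const cells = splitColumns(line);
    const isRow = cells.length >= 2;
    if (isRow && (current.length === 0 || cells.length === current[0].length)) {
      current.push(cells);
    } else {
      // Column count changed or the line is plain text: close the current group.
      flush();
      if (isRow) current.push(cells);
    }
  }
  flush();
  return tables;
}

// Render a detected table as a markdown table (cell contents are not escaped).
function toMarkdown({ headers, data }) {
  const row = (cells) => `| ${cells.join(" | ")} |`;
  const divider = `| ${headers.map(() => "---").join(" | ")} |`;
  return [row(headers), divider, ...data.map(row)].join("\n");
}

const sample = [
  "Product pricing overview",
  "Item    Price    Stock",
  "Widget  4.50     12",
  "Gadget\t9.99\t3",
  "End of table",
];
const tables = detectTables(sample);
console.log(toMarkdown(tables[0]));
```

A column-count check like this is a simple way to validate structure, but it will miss tables with empty cells and may misread prose that happens to contain double spaces, so a real extractor would likely need extra safeguards.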
Once the PDF file is in place, run `npm run extract`, and the script will:
- Read the PDF page by page
- Detect tables and sections
- Generate all three output files
- Display progress and completion status
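
After a run completes, database.json can be queried with a few lines of JavaScript. The sketch below assumes the structure shown earlier in this README. The helper names (`tablesOnPage`, `tableToRecords`, `pageText`) are hypothetical, and the inline `db` object stands in for the parsed file so the example runs on its own.

```javascript
// Sample object mirroring the documented database.json structure.
// In practice it would be loaded from output/database.json, for example with
// JSON.parse(fs.readFileSync("output/database.json", "utf8")).
const db = {
  document: { id: "cpg-digital-2025", title: "2025 CPG Digital", totalPages: 100 },
  sections: [{ page: 1, title: "SECTION TITLE" }],
  tables: [
    {
      id: "page5_table1",
      page: 5,
      headers: ["Column 1", "Column 2"],
      data: [["value1", "value2"]],
    },
  ],
  content: [{ page: 1, text: ["paragraph 1", "paragraph 2"] }],
};

// Return every table extracted from the given page.
function tablesOnPage(database, page) {
  return database.tables.filter((table) => table.page === page);
}

// Turn a table's rows into objects keyed by header name.
// Missing cells become null.
function tableToRecords(table) {
  return table.data.map((row) =>
    Object.fromEntries(table.headers.map((header, i) => [header, row[i] ?? null]))
  );
}

// Join the paragraphs of one page into a single string.
function pageText(database, page) {
  const entry = database.content.find((item) => item.page === page);
  return entry ? entry.text.join("\n\n") : "";
}

const records = tablesOnPage(db, 5).flatMap(tableToRecords);
console.log(records);
console.log(pageText(db, 1));
```

Keying rows by header makes lookups such as `records[0]["Column 1"]` straightforward, at the cost of losing duplicate header names if a table has any.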

Requirements:
- Node.js 14+ (for ES modules support)
- The PDF file in the root directory