pdf-ai

pdf-ai is a simple PHP library that makes it easy to extract data from PDFs for use with large language models. It has a single dependency, the Symfony Process Component, which it uses to call the Poppler command-line tools (Poppler is a fork of xpdf).

Installation

Install the library using Composer:

composer require 1tomany/pdf-ai

Installing Poppler

Before you begin, make sure the pdfinfo, pdftoppm, and pdftotext binaries are installed and available on your $PATH.

macOS

brew install poppler

Debian and Ubuntu

apt-get install poppler-utils
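
To confirm that the three binaries are reachable after installation, you can search each directory on the PATH for them. The script below is a minimal sketch and is not part of pdf-ai; the findBinary() helper is a hypothetical name introduced only for this example.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of pdf-ai): returns the full path of an
 * executable found in the given PATH string, or null if it is missing.
 */
function findBinary(string $name, string $path): ?string
{
    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }

        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;

        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// getenv() returns false when PATH is unset; the cast turns that into ''
$path = (string) getenv('PATH');

foreach (['pdfinfo', 'pdftoppm', 'pdftotext'] as $binary) {
    printf("%-10s %s\n", $binary, findBinary($binary, $path) ?? 'NOT FOUND');
}
```

If any line reports NOT FOUND, install Poppler with one of the commands above or adjust your PATH.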

Usage

This library has three main features:

Read PDF metadata such as the number of pages
Rasterize one or more pages to JPEG or PNG images
Extract text from one or more pages

Extracted data is held in memory and can be written to the filesystem or converted to a data: URI. Because the data for each page is held in memory, the extraction methods return a \Generator that yields one result for each page that is extracted or rasterized.
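
The following sketch illustrates the two ideas in that paragraph, a generator that yields one page at a time and the conversion of raw bytes to a data: URI. It does not use pdf-ai at all; extractPages() and toDataUri() are hypothetical stand-ins written for this example, and the library's own implementation may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: encodes raw bytes as a base64 data: URI.
 */
function toDataUri(string $bytes, string $mimeType): string
{
    return sprintf('data:%s;base64,%s', $mimeType, base64_encode($bytes));
}

/**
 * Hypothetical stand-in for an extraction call. It yields one page at a
 * time, so the caller handles each page before the next one is produced.
 *
 * @param list<string> $pages
 *
 * @return \Generator<int, string>
 */
function extractPages(array $pages): \Generator
{
    foreach ($pages as $index => $data) {
        yield $index + 1 => $data; // keys are 1-based page numbers
    }
}

foreach (extractPages(['first page', 'second page']) as $number => $data) {
    // Either write the page to the filesystem...
    $file = sprintf('%s/page-%d.txt', sys_get_temp_dir(), $number);
    file_put_contents($file, $data);

    // ...or embed it as a data: URI
    printf("Page %d: %s\n", $number, toDataUri($data, 'text/plain'));
}
```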

There are two ways to use the library:

Direct: Instantiate the OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient class and call its methods directly. This approach is simpler, but it makes your application less flexible and harder to test.
Actions: Create a container of OneToMany\PDFAI\Contract\Client\ExtractorClientInterface objects, and use the OneToMany\PDFAI\Factory\ExtractorClientFactory class to instantiate them.
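
The testability trade-off between the two approaches comes down to whether your code depends on a concrete client or on an interface. The sketch below shows the general pattern only. Every name in it (PageCounterInterface, FakePageCounter, UploadValidator) is hypothetical and is not the library's ExtractorClientInterface or ExtractorClientFactory API.

```php
<?php

declare(strict_types=1);

// Hypothetical interface standing in for a client contract.
// It is NOT the library's ExtractorClientInterface.
interface PageCounterInterface
{
    public function countPages(string $filePath): int;
}

// A fake for tests: no Poppler binaries or real PDF are needed.
final class FakePageCounter implements PageCounterInterface
{
    public function __construct(private int $pages)
    {
    }

    public function countPages(string $filePath): int
    {
        return $this->pages;
    }
}

// Application code depends on the interface, not on a concrete client,
// so a real or fake implementation can be injected interchangeably.
final class UploadValidator
{
    public function __construct(
        private PageCounterInterface $counter,
        private int $maxPages,
    ) {
    }

    public function isAcceptable(string $filePath): bool
    {
        return $this->counter->countPages($filePath) <= $this->maxPages;
    }
}

// A 12-page document exceeds the 10-page limit
$validator = new UploadValidator(new FakePageCounter(12), 10);
var_dump($validator->isAcceptable('/path/to/file.pdf')); // bool(false)
```

With direct instantiation, UploadValidator would have to create the Poppler client itself, and testing it would require the Poppler binaries and a real PDF.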

Note: A Symfony bundle is available if you wish to integrate this library into your Symfony applications with autowiring and configuration support.

Direct usage

<?php

require_once __DIR__ . '/vendor/autoload.php';

use OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient;
use OneToMany\PDFAI\Contract\Enum\OutputType;
use OneToMany\PDFAI\Request\ExtractDataRequest;
use OneToMany\PDFAI\Request\ExtractTextRequest;
use OneToMany\PDFAI\Request\ReadMetadataRequest;

$filePath = '/path/to/file.pdf';

// Construct the Poppler wrapper
$client = new PopplerExtractorClient();

// Construct and execute a request to read the PDF metadata
$metadata = $client->readMetadata(new ReadMetadataRequest($filePath));

vprintf("The PDF '%s' has %d page(s).\n", [
    $filePath, $metadata->getPages(),
]);

// Construct a request to rasterize all pages as 150 DPI JPEGs
$request = new ExtractDataRequest($filePath, 1, null, OutputType::Jpg, 150);

foreach ($client->extractData($request) as $image) {
    // $image->getData() or $image->toDataUri()
    printf("MD5: %s\n", md5($image->getData()));
}

// Extract text from pages 3 and 4
$request = new ExtractTextRequest($filePath, 3, 4);

foreach ($client->extractData($request) as $text) {
    // $text->getData()
    printf("Length: %d\n", strlen($text->getData()));
}

Test suite

Run the test suite with PHPUnit:

./vendor/bin/phpunit

Static analysis

Run static analysis with PHPStan:

./vendor/bin/phpstan

Credits

Vic Cherubini, 1:N Labs, LLC

License

The MIT License
