taoo0316/plain_text_from_PDF

plain_text_from_PDF

This guide describes how to set up ScienceBeam using Docker, process PDF files to extract text, and convert ALTO XML to plain text using Python.

🚀 Step 1: Install Docker

ScienceBeam runs inside a Docker container. If you haven’t installed Docker yet:

  1. Download Docker Desktop from https://www.docker.com/products/docker-desktop.
  2. Install and start Docker.
  3. Verify Docker is installed by running:
    docker --version

🏃 Step 2: Run ScienceBeam (PdfAlto Mode)

Run the following command to pull and start the ScienceBeam container:

docker run -p 8070:8070 --rm elifesciences/sciencebeam-parser

This will start the ScienceBeam server at http://localhost:8070/.
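The container can take a moment to start accepting requests. A minimal sketch for checking that something is listening on port 8070 before sending a PDF is shown here; `is_port_open` is a hypothetical helper (not part of ScienceBeam or this repository), and it only tests that the TCP port accepts connections, not that the parser has finished loading.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8070, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("ScienceBeam port reachable:", is_port_open())
```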

📂 Step 3: Process a PDF File

Use curl to send a PDF file for processing:

curl -X POST "http://localhost:8070/api/pdfalto" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
     -o /Users/zwt2000/Desktop/output.json

Input: A PDF file. Output: ALTO XML. Note that the file is saved as output.json, but despite the .json extension its content is XML.
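The same upload can be sketched in Python with only the standard library, which may be convenient when processing many PDFs in a loop. The endpoint URL and the `file` form field are taken from the curl command above; `build_multipart` and `pdf_to_alto` are hypothetical helper names introduced for illustration, and the paths in the usage comment are placeholders.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def pdf_to_alto(pdf_path: str, out_path: str,
                url: str = "http://localhost:8070/api/pdfalto") -> None:
    """POST a PDF to the running container and save the response body to out_path."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart("file", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())


# Example (requires the container from Step 2 to be running):
# pdf_to_alto("yourfile.pdf", "output.json")
```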

🔄 Step 4: Convert ALTO XML to Plain Text

Since ScienceBeam returns ALTO XML, we extract the text using Python.

python3 extract_from_json.py

This will generate a plain text file (extracted_text_json.txt) on your Desktop.
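The contents of extract_from_json.py are not shown here, so the following is only a sketch of how such a conversion might work, not the script itself. It assumes the usual ALTO layout, in which each word is a `String` element with a `CONTENT` attribute inside a `TextLine`, and lines are grouped into `TextBlock` elements. Tags are matched by local name so that the ALTO namespace version does not matter. `alto_to_text` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(root: ET.Element) -> str:
    """Join words into lines, lines into blocks, and blocks with blank lines."""
    blocks = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Only String children carry text; SP (space) elements are skipped.
            words = [el.get("CONTENT", "") for el in line if local_name(el.tag) == "String"]
            lines.append(" ".join(w for w in words if w))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Example (the file name matches the curl output from Step 3):
# text = alto_to_text(ET.parse("output.json").getroot())
```

This sketch does not rejoin words hyphenated across line breaks; the real script may handle that differently.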

Alternatively: Process a PDF File as TEI XML (with sections):

curl -X POST "http://localhost:8070/api/processFulltextDocument" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
     -F "output=tei" \
     -o /Users/zwt2000/Desktop/output.xml

The text can then be extracted from the TEI XML using Python.

python3 extract_from_xml.py

This will generate a plain text file (extracted_text_xml.txt) on your Desktop.
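As with the ALTO case, extract_from_xml.py itself is not reproduced here. The sketch below shows one plausible way to pull section headings and paragraphs out of TEI XML, assuming the standard TEI namespace and a body made of `div` elements that contain a `head` and one or more `p` children. `tei_to_text` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace in ElementTree's '{uri}' notation.
TEI = "{http://www.tei-c.org/ns/1.0}"


def tei_to_text(root: ET.Element) -> str:
    """Return section headings and paragraphs from the TEI body, separated by blank lines."""
    body = root.find(f".//{TEI}body")
    if body is None:
        return ""
    parts = []
    for div in body.iter(f"{TEI}div"):
        head = div.find(f"{TEI}head")
        if head is not None:
            title = "".join(head.itertext()).strip()
            if title:
                parts.append(title)
        # Direct-child paragraphs only, so nested divs are not counted twice.
        for p in div.findall(f"{TEI}p"):
            # itertext() keeps text inside inline elements such as citations.
            text = " ".join("".join(p.itertext()).split())
            if text:
                parts.append(text)
    return "\n\n".join(parts)


# Example (the file name matches the curl output above):
# text = tei_to_text(ET.parse("output.xml").getroot())
```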

About

generating plain text from PDF files using the ScienceBeam parser
