Introduction

pdf2htmlEX renders PDF files in HTML, accurately.

Why HTML?

- Read while downloading
- Open and flexible
  - Theme
  - Layout
  - Behaviour
  - CSS / JavaScript
- Plugin-free
- Beautiful
- Secure

PDF and HTML are quite different formats.

PDF focuses on representing and printing all kinds of documents accurately.
HTML focuses on multimedia, user interaction, and network optimization.

What would you do if you wanted to show PDF-quality content online?

[Lazy] Let users download the PDF or read it with a PDF/Flash/other plugin. This often leads to a terrible user experience, inconsistent themes, etc.
[Easy] Convert PDF pages into images and embed them into HTML pages. This produces either blurred text or monster-sized files and network cost. The text is also logically lost: readers cannot search or copy.
[Clever] Still convert PDF into images, but with a hidden text layer and images at different resolutions. This is actually a pretty good solution, but only when you have a powerful server and a high-capacity network.
[Good#1] Use JavaScript to parse and render the PDF. The server is relieved, but all the burden moves to the client side, if you care.
[Good#2] Convert to HTML. One-time conversion, file size similar to the original PDF, preserved text, lightweight for both server and client. Happy ending!

This page compares pdf2htmlEX with other similar approaches.

So you say "convert to HTML" is best. Why are there still PDF files?

No, I didn't. I only said "good". What I didn't say is that all of the above except the last one can support the full set of PDF features. HTML and PDF are different formats after all, so there is no lossless conversion between them.
However, HTML is capable enough most of the time, because most PDF files are simple. Think about the things that you have seen in PDF but never in HTML.

Finally, with HTML + CSS + JavaScript nowadays, almost nothing is impossible.

For providers: full control
- Documents are embedded with a theme and behavior consistent with your own site.
- Cross references to other documents are as simple as hyperlinks in HTML.
- Access statistics, dynamic content...
For readers: better experience
- Read while downloading: just like old times (with HTML)
- Plugin-free: No more ugly plugins!
- Better UI, better integration = better website.

(Much) easier and better than with PDF, huh?

What's special about pdf2htmlEX?

- Accuracy
- Text preservation

If you are worried about misplaced text and ugly fonts, you must have tried, desperately, to convert PDF files into HTML. Don't worry: take a look at the demo pages on the pdf2htmlEX home page, which might change your mind.

pdf2htmlEX was first designed especially for scientific papers, which contain different fonts and complicated formulas and figures.

Also, JavaScript is optional: the converted HTML file should work without JavaScript.

When page background images are stored as base64-encoded WebP instead of PNG, the size of the converted output is significantly reduced. If the WebP images are referenced as external files instead of being embedded as base64, the size is reduced by approximately 30% more. Below, I'm sharing an example Bash script that converts the PNGs to WebP and embeds the base64-encoded WebP images into all pages.

# Directory containing the bg*.png images and the .page files
folder_path="/path/to/your/directory"

# Convert every background image (bg*.png) to WebP with quality 75,
# saving it in the same directory. Requires ImageMagick's convert command.
for img in "$folder_path"/bg*.png; do
    # Skip the unexpanded pattern when no PNG files match
    [[ -f "$img" ]] || continue

    # Extract the image filename without the extension (.png)
    img_name=$(basename "$img" .png)

    convert "$img" -quality 75 "$folder_path/$img_name.webp"
done

# Loop through all .page files in the folder
for file in "$folder_path"/*.page; do
  # Check if the file is a regular file (not a directory)
  if [[ -f "$file" ]]; then
    # Take the first PNG image src in the .page file and replace the .png extension with .webp
    x=$(grep -oP 'src="\K[^"]+\.png(?=")' "$file" | head -n 1 | sed 's/\.png$/.webp/')

    # Skip pages that do not reference a PNG image
    [[ -n "$x" ]] || continue

    # Encode the .webp image to base64 as a single line (newlines removed)
    base64 "$folder_path/$x" | tr -d '\n' > "$folder_path/temp_base64.txt"

    # Update the .page file to use the .webp extension instead of .png
    sed -i 's/\(src="[^"]*\)\.png"/\1.webp"/g' "$file"

    # Replace the image src in the .page file with the base64-encoded data URI.
    # Note: x is used as a regular expression, so a "." in the filename matches any character.
    awk -v x="$x" 'NR==FNR{base64=$0; next} {gsub(x, "data:image/webp;base64," base64)}1' \
        "$folder_path/temp_base64.txt" "$file" \
        > "$folder_path/temp.page" \
        && mv "$folder_path/temp.page" "$file"
  fi
done

# Clean up the temporary base64 file
rm -f "$folder_path/temp_base64.txt"
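
For the other variant mentioned above, where the WebP images stay as external files instead of being embedded as base64, a rough sketch could look like the following. The function names `convert_backgrounds` and `link_external_webp` are made up for illustration, and the sketch assumes ImageMagick's `convert`, GNU `sed` (for `-i`), and one background image per .page file. It only rewrites a page when the matching .webp file actually exists, so pages are never left pointing at a missing image.

```shell
#!/bin/sh
# Hypothetical helpers, not part of pdf2htmlEX.

# Step 1: convert every bg*.png in a directory to WebP (quality 75).
# Assumes ImageMagick's convert command is installed.
convert_backgrounds() {
  local dir img
  dir="$1"
  for img in "$dir"/bg*.png; do
    [ -f "$img" ] || continue          # no PNG files matched the pattern
    convert "$img" -quality 75 "${img%.png}.webp"
  done
}

# Step 2: make each .page file reference the external .webp file
# instead of the .png, without embedding anything.
link_external_webp() {
  local dir file base
  dir="$1"
  for file in "$dir"/*.page; do
    [ -f "$file" ] || continue
    # Path of the referenced PNG image, without the .png extension
    base=$(sed -n 's/.*src="\([^"]*\)\.png".*/\1/p' "$file" | head -n 1)
    [ -n "$base" ] || continue         # page has no PNG image
    [ -f "$dir/$base.webp" ] || continue   # WebP not available, keep the PNG
    # GNU sed in-place edit: swap the extension inside src="..."
    sed -i 's/\(src="[^"]*\)\.png"/\1.webp"/g' "$file"
  done
}

# Run both steps only when a directory is given on the command line.
if [ "$#" -ge 1 ]; then
  convert_backgrounds "$1"
  link_external_webp "$1"
fi
```

Saved as, say, `external_webp.sh` (a hypothetical name), it would be run as `sh external_webp.sh /path/to/your/directory`. The original PNG files are left in place, so they can be deleted afterwards once the pages have been checked in a browser.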
