The goal of this project is to create an end-to-end system for parsing historical documents pertaining to the Emma B. Andrews Diary Project. The tools provided here can greatly speed up the processing of historical documents by automating much of the work while still allowing for human oversight.
The system has three components: OCR, named entity recognition and normalization, and XML generation. Additionally, there are two user interfaces. The first is to aid the user in running OCR and correcting any mistakes. The second is to aid in named entity recognition and XML markup.
Although this tool was developed specifically for use by interns working on the Emma B. Andrews Diary Project, it should also be extensible to other historical datasets written in English.
This project is licensed under the GNU General Public License v3.0 (see LICENSE.txt for full details).
This component transforms pdf files into text.
All code for this component is located under ocr/
All data generated by this component is located under data/
Experience with Linux and knowledge of programming are strongly recommended in order to run batch OCR processing yourself. This knowledge is assumed in this section of the user guide.
This process is broken into two parts:
- pdfToTxt.sh Run an open-source OCR tool to transform pdfs to txt files.
- ocrPostProcess.sh Do some post-processing on the txt files to automatically correct some mistakes.
You need to have python3 and pip installed.
Dependencies for pdfToTxt.sh:
- pdfseparate
- imagemagick
- tesserocr (python OCR library. To install: "pip install tesserocr")
- tesseract-ocr libtesseract-dev libleptonica-dev (dependencies of tesserocr. To install: "apt-get install tesseract-ocr libtesseract-dev libleptonica-dev"). For more info, see tesserocr's README
Dependencies for ocrPostProcess.sh:
- nltk
Before running ocrPostProcess.sh, you also need to unzip this file: WikipediaTitles/titleWordCount.zip
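For example (assuming the contents should be extracted next to the zip file; adjust the target directory if the scripts expect it elsewhere):

unzip WikipediaTitles/titleWordCount.zip -d WikipediaTitles/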
sh pdfToTxt.sh pathToInputPdf outputDataDirectory
By convention, the outputDataDirectory is a subdirectory of data/
sh ocrPostProcess.sh outputDataDirectory
Be sure to use the same outputDataDirectory in both scripts!
If you want to add more raw OCR data to use in the post-processing, you can do so by creating a subdirectory under outputDataDirectory/raw_ocr/. Be sure to use filenames identical to those under outputDataDirectory/raw_ocr/200 and outputDataDirectory/raw_ocr/500. We've already done this by adding data from our colleague's project (an example of his data is included in our project).
In this case, your workflow would be:
sh pdfToTxt.sh pathToInputPdf outputDataDirectory
# manually add the additional raw_ocr files...
sh ocrPostProcess.sh outputDataDirectory
We have a website for editing the output of the batch OCR processing. After OCR, there are many mistakes in the texts that human proofreaders need to correct. Please go here for documentation on how to use it.
The tool can be accessed here: http://www.pentimento.dreamhosters.com/teigenerator/
The web interface is divided into 4 sections. The top left section is for user input: you can type any text, or upload a file from local disk by clicking "choose file", picking a .txt file, and then clicking "upload" (for example, a text file from Emma's diary).
After that, click "Generate Markup" and wait a couple of seconds until the output appears in the output section at the top right of the page. The output can be saved to local disk as an .xml file by clicking "Export".
In the lower left section you can find the TEI header section. There is a default header that you can modify, or you can upload your own header .xml file from local disk by clicking "choose file", picking an .xml file, and then clicking "upload". This header will be included in the output XML.
In the lower right section you can find the location name variations database, which you can "export" as an XML file. This database is used to populate the ref attribute of the placeName element in the generated markup; it tries to unify the ref attribute of different spelling variations of a location found in the text (e.g. Assouan and Aswân both refer to Aswan). More details about the schema of this database are in the technical documentation below.
For developers: in order to run the tool locally (i.e. on localhost):
1. Download the free trial of PyCharm Professional.
2. Open this project from the IDE.
3. From the terminal, run "python manage.py runserver".
4. The tool should now be available at http://localhost:8000/teigenerator/
Deployments: in order to make changes to the tool running online, you will need access to this repo: https://github.com/Eslam-Elsawy/teigeneratortool. We decided to keep a separate, small repo to deploy from in order to save storage on the hosting server.
The tool is deployed to Dreamhost. To get access, run "ssh newbook@william-few.dreamhost.com" from your terminal (contact Sarah for the password). The tool can be found under "/home/newbook/pentimento.dreamhosters.com/".
Batch processing for NER is not compatible with the user interface; however, it can be done via the command line. Store the documents you want to process in the ner_input directory. These documents must be plain text files; if they are image files or pdfs, they will need to go through the OCR component as described above.
From the ner directory, run the following command:
./ner.sh
The marked-up files will be available in the ner_output directory. Markup will include the following named entity XML labels:
- persName
- placeName
- orgName
- name type="hotel"
- name type="vessel"
The persName tags will have a ref attribute with a normalized version of the name.
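For example, a marked-up sentence might look like the following (the names and the exact ref value are made up for illustration; the ref is produced by the normalization step described below):

<persName ref="Emma B. Andrews">Mrs. Andrews</persName> visited <placeName>Luxor</placeName>.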
This process is broken into two parts:
- pdfToTxt.sh Run an open-source OCR tool to transform pdfs to txt files.
- ocrPostProcess.sh Do some post-processing on the txt files to automatically correct some mistakes.
This script performs the following steps (conceptual commands for these steps are sketched after the list):
- Separate the pdf into pages using the command 'pdfseparate'
- Convert each page into a .tiff file. This is the input for the OCR engine.
- Run do_ocr.py, which is a thin wrapper around the tesserocr library.
- Remove the .tiff files, since they are very large and no longer needed.
- Also convert the pdfs to png files, since those are used by the interface.
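Conceptually, the per-page commands look roughly like the following (the exact arguments are defined in pdfToTxt.sh; the 200/500 density values are an assumption based on the raw_ocr/200 and raw_ocr/500 output directories):

pdfseparate book.pdf page-%d.pdf
convert -density 200 page-1.pdf page-1-200.tiff
convert -density 500 page-1.pdf page-1-500.tiff
python3 do_ocr.py  # wrapper around tesserocr; see the script for its arguments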
This has two components: normalize.py and merge.py
The input should be the two alternative text outputs of pdfToTxt.sh, or some collection of folders in the same format. There must be at least two input sets; otherwise, merge.py will not run.
normalize.py corrects spacing errors by tokenizing and then detokenizing the text. It also replaces nonstandard characters with their standard equivalents (e.g. curly quotes are changed to ASCII quotes).
Usage is
./normalize.py < inputFile > outputFile
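A minimal sketch of what the normalize step does, assuming nltk's tokenizer and detokenizer (the real logic is in normalize.py and may differ):

```python
import sys
from nltk.tokenize import word_tokenize                 # needs nltk.download('punkt')
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Nonstandard characters and their ASCII equivalents (illustrative subset).
CHAR_MAP = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def normalize_line(line):
    for curly, ascii_char in CHAR_MAP.items():
        line = line.replace(curly, ascii_char)
    tokens = word_tokenize(line)                         # splits mis-spaced punctuation
    return TreebankWordDetokenizer().detokenize(tokens)  # rejoins with standard spacing

for line in sys.stdin:
    print(normalize_line(line.rstrip("\n")))
```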
merge.py takes the two alternative OCR results and combines them via an algorithm that chooses the best alternative wherever the two OCR results disagree.
Here is an overview of the rules applied, with their corresponding functions (a sketch of the spelling rule appears after the list):
- doSpellingMerge(): Is only one alternative a valid English word? If so, choose the valid word ("the" vs. "tlo").
- doCapitalizationMerge(): Is only one alternative capitalized correctly? If so, choose the correct capitalization ("I know what to do" vs. "I know What to do").
- doUnigramFrequncyMerge(): Is one alternative a much more common word than the other? If so, choose the more common word (e.g. "must" vs. "mash").
- doPunctuationGarbageMerge(): Is one alternative blank while the other contains only punctuation characters? If so, choose the blank one (e.g. "" vs. ".").
- doExtraPunctuationMerge(): Do both alternatives contain non-punctuation, and does one alternative have fewer punctuation characters than the other? If so, choose the one with fewer punctuation characters (e.g. "the" vs. "'the.").
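As an illustration, the spelling rule alone could be sketched like this (the real implementation is in merge.py; the dictionary check here is an assumption):

```python
from nltk.corpus import words  # needs nltk.download('words')

ENGLISH = set(w.lower() for w in words.words())

def spelling_merge(alt1, alt2):
    # Prefer the alternative that is a valid English word, e.g. "the" over "tlo".
    valid1 = alt1.lower() in ENGLISH
    valid2 = alt2.lower() in ENGLISH
    if valid1 and not valid2:
        return alt1
    if valid2 and not valid1:
        return alt2
    return None  # rule does not apply; fall through to the next rule
```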
After the two alternatives are merged, there is some post-processing: via the markNonWords() method, we highlight any words that are probably misspelled.
merge.py outputs either text or html. The text can be color-coded if desired: blue marks places where we think we chose the correct alternative; red marks places where we either guessed or the word appears to be misspelled.
Usage is
./merge.py inputDirectory1 inputDirectory2 outputDirectory [options]
Options must not be separated by spaces, e.g. 'hc', not 'h c'. The options are as follows (an example invocation is shown after the list):
- h: output html instead of text
- c: output color-coded
- d: output debug view (same as color coded, except shows the rejected alternative crossed out in yellow; only supported for plain text)
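For example, to produce color-coded html from the two raw OCR directories (the directory names here are illustrative):

./merge.py data/myBook/raw_ocr/200 data/myBook/raw_ocr/500 data/myBook/merged_ocr hc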
See README of pentimenti.github.io
The output of the merge step can be used to add new books to this project. The required pentimenti book directories can be copied from the following locations (example commands follow the list):
- pentimenti.github.io/bookTitle/htmls/: data/outputDataDir/merged_ocr/step[n]-html/
- pentimenti.github.io/bookTitle/txts/: data/outputDataDir/merged_ocr/step[n]-txt/
- pentimenti.github.io/bookTitle/pngs/: data/outputDataDir/pngs/
(where n is the largest step number available)
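For example, if step3 were the largest step present (substitute the actual largest step number and your own book and data directory names), the copies could look like this:

cp -r data/outputDataDir/merged_ocr/step3-html/. pentimenti.github.io/bookTitle/htmls/
cp -r data/outputDataDir/merged_ocr/step3-txt/. pentimenti.github.io/bookTitle/txts/
cp -r data/outputDataDir/pngs/. pentimenti.github.io/bookTitle/pngs/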
The NER module uses the Stanford CoreNLP library to generate persName, placeName, and orgName tags. The Stanford CoreNLP output goes through a post-processing step to ensure that all named entities are on the same line (for compatibility with the XML generation component). There is also a rule-based post-processing step to include personal titles in the named entities (e.g. Mr., Mrs., Lord, etc.). Finally, there is another rule-based post-processing step that will reclassify named entities as vessels or hotels where applicable.
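As an illustration of the title rule only (the title list and handling of attributes in the project's actual post-processing step may differ):

```python
import re

# Hypothetical, simplified version of the title post-processing rule:
# pull a preceding personal title inside the persName tag.
TITLES = r"(?:Mr\.|Mrs\.|Miss|Dr\.|Lord|Lady|Sir)"

def include_titles(line):
    # "Mrs. <persName>Andrews</persName>" -> "<persName>Mrs. Andrews</persName>"
    return re.sub(rf"({TITLES}) <persName(.*?)>", r"<persName\2>\1 ", line)
```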
Named entity normalization for people combines two measures: lexical similarity and semantic similarity. The lexical similarity of named entities is computed using Levenshtein distance.
The semantic similarity is computed using cosine distance between the contexts in which any two named entities appear. Context is defined to be the paragraph in which the named entity appears. The paragraphs are vectorized using a tf-idf vectorizer and then reduced to a 200-dimensional vector via SVD.
The lexical similarity and semantic similarity are added and then clustered using affinity propagation. The most frequently occurring named entity in the cluster is used as the ref attribute.
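A sketch of this pipeline, assuming scikit-learn and substituting difflib's similarity ratio for true Levenshtein distance (all names and parameters here are illustrative, not the project's actual code):

```python
from collections import Counter
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def normalize_person_names(names, paragraphs):
    """names[i] appears in paragraphs[i]; returns one ref string per name."""
    # Semantic similarity: tf-idf over the containing paragraphs, reduced via SVD.
    tfidf = TfidfVectorizer().fit_transform(paragraphs)
    n_components = min(200, tfidf.shape[1] - 1)  # reduce to at most 200 dimensions
    vectors = TruncatedSVD(n_components=n_components).fit_transform(tfidf)
    semantic = cosine_similarity(vectors)

    # Lexical similarity between the surface forms of the names.
    lexical = np.array([[SequenceMatcher(None, a, b).ratio() for b in names]
                        for a in names])

    # Sum the two similarities and cluster with affinity propagation.
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(semantic + lexical)

    # The most frequent surface form in each cluster becomes the ref value.
    refs = {}
    for label in set(labels):
        members = [names[i] for i in range(len(names)) if labels[i] == label]
        refs[label] = Counter(members).most_common(1)[0][0]
    return [refs[label] for label in labels]
```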
Place name normalization is done using the place variations database.
The database file is in .xml format. Under the root element, each location entry must contain one element giving the name of the Wikipedia page that corresponds to that location.
Each entry has zero or more elements holding spelling variations that were automatically extracted by parsing Wikipedia dumps.
Each entry can have zero or more additional variation elements, which can be added manually by the annotator.
Each entry can also have one element that is added manually by the annotator.
When generating the XML markup, we look up each place name in the variations database. If one of an entry's variations (automatic or manual) matches the place name, we set the ref attribute of the placeName element to the manually added value if it exists, and otherwise to the Wikipedia page name.
See resolveVariations function in teigenerator.py script
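Conceptually, the lookup behaves like the following (the data structure and function name are invented for illustration; the real element names and logic are in resolveVariations):

```python
# Variation -> canonical name mapping, built from the variations database.
# "Assouan"/"Aswân" -> "Aswan" follows the example given above.
VARIATIONS = {"assouan": "Aswan", "aswân": "Aswan"}

def resolve_place_ref(surface_form):
    # Fall back to the surface form itself if no variation matches
    # (an assumption; the real code may leave the ref attribute unset).
    return VARIATIONS.get(surface_form.strip().lower(), surface_form)
```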
The tei generator script can be found here
The script takes as input: (1) the input text lines, (2) the header file, and (3) the variations database.
The output is a valid XML document containing teiHeader, div, p, title, date, placeName, orgName, and persName elements.
We use the dateutil library for parsing dates.
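For example (fuzzy parsing lets dateutil pull a date out of surrounding diary text; the exact call used in teigenerator.py may differ):

```python
from dateutil import parser

# "Monday, January 2nd, 1899" -> datetime(1899, 1, 2, 0, 0)
date = parser.parse("Monday, January 2nd, 1899", fuzzy=True)
print(date.date())  # 1899-01-02
```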