The goal of this project is to create an end-to-end system for parsing historical documents pertaining to the Emma B. Andrews Diary Project. The tools provided here can greatly speed up the processing of historical documents by automating much of the work while still allowing for human oversight.
The system has three components: OCR, named entity recognition and normalization, and XML generation. Additionally, there are two user interfaces. The first is to aid the user in running OCR and correcting any mistakes. The second is to aid in named entity recognition and XML markup.
Although this tool was developed specifically for use by interns working on the Emma B. Andrews Diary Project, it should also be extensible to other historical datasets written in English.
This project is licensed under the GNU General Public License v3.0 (see LICENSE.txt for full details).
This component transforms pdf files into text.
All code for this component is located under ocr/
All data generated by this component is located under data/
Experience with Linux and knowledge of programming are strongly recommended in order to run batch OCR processing yourself. This knowledge is assumed in this section of the user guide.
This process is broken into two parts:
- pdfToTxt.sh Run an open-source OCR tool to transform pdfs to txt files.
- ocrPostProcess.sh Do some post-processing on the txt files to automatically correct some mistakes.
You need to have python3 and pip installed.
Dependencies for pdfToTxt.sh:
- pdfseparate
- imagemagick
- tesserocr (python OCR library. To install: "pip install tesserocr")
- tesseract-ocr libtesseract-dev libleptonica-dev (dependencies of tesserocr. To install: "apt-get install tesseract-ocr libtesseract-dev libleptonica-dev"). For more info, see tesserocr's README
Dependencies for ocrPostProcess.sh:
- nltk
Before running ocrPostProcess.sh, you also need to unzip this file: WikipediaTitles/titleWordCount.zip
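For example (assuming the contents should be extracted next to the zip file; adjust the target directory if the scripts expect it elsewhere):

unzip WikipediaTitles/titleWordCount.zip -d WikipediaTitles/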
sh pdfToTxt.sh pathToInputPdf outputDataDirectory
By convention, the outputDataDirectory is a subdirectory of data/
sh ocrPostProcess.sh outputDataDirectory
Be sure to use the same outputDataDirectory in both scripts!
If you want to add more raw OCR data to use in the post-processing, you can do so by creating a subdirectory under outputDataDirectory/raw_ocr/. Be sure to use filenames identical to those under outputDataDirectory/raw_ocr/200 and outputDataDirectory/raw_ocr/500. We've already done this by adding data from our colleague's project (an example of his data is included in our project).
In this case, your workflow would be:
sh pdfToTxt.sh pathToInputPdf outputDataDirectory
# manually add the additional raw_ocr files...
sh ocrPostProcess.sh outputDataDirectory
We have a website for editing the output of the batch OCR processing. After OCR, there are many mistakes in the texts that human proofreaders need to correct. Please go here for documentation on how to use it.
The tool can be accessed here: http://www.pentimento.dreamhosters.com/teigenerator/
The web interface is divided into 4 sections. The top left section is for user input: you can type any text, or upload a file from local disk by clicking "choose file", picking a .txt file, and then clicking "upload" (for example, a text file from Emma's diary).
After that, click "Generate Markup" and wait a couple of seconds until the output appears in the output section at the top right of the page. The output can be saved to local disk as an .xml file by clicking "Export".
In the lower left section you can find the TEI header section. There is a default header that you can modify, or you can upload your own header .xml file from local disk by clicking "choose file", picking an .xml file, and then clicking "upload". This header will be included in the output XML.
In the lower right section you can find the location name variations database, which you can "export" as an XML file. This database is used to populate the ref attribute of the placeName element in the generated markup; it tries to unify the ref attribute of different spelling variations of a location found in the text (e.g. Assouan and Aswân both refer to Aswan). More details about the schema of this database are in the technical documentation below.
For developers: in order to run the tool locally (i.e. on localhost):
1. Download the free trial of PyCharm Professional.
2. Open this project from the IDE.
3. From the terminal, run "python manage.py runserver".
4. The tool should now be available at http://localhost:8000/teigenerator/
Deployments: in order to make changes to the tool running online, you will need access to this repo: https://github.com/Eslam-Elsawy/teigeneratortool. We decided to keep a separate, small repo to deploy from in order to save storage on the hosting server.
The tool is deployed to Dreamhost. To get access, run "ssh newbook@william-few.dreamhost.com" from your terminal (contact Sarah for the password). The tool can be found under "/home/newbook/pentimento.dreamhosters.com/".
Batch processing for NER is not compatible with the user interface; however, it can be done via the command line. Store the documents you want to process in the ner_input directory. These documents must be plain text files; if they are image files or pdfs, they will need to go through the OCR component as described above.
From the ner directory, run the following command:
./ner.sh
The marked-up files will be available in the ner_output directory. Markup will include the following named entity XML labels:
- persName
- placeName
- orgName
- name type="hotel"
- name type="vessel"
The persName tags will have a ref attribute with a normalized version of the name.
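For example, a marked-up sentence might look like the following (the names and the exact ref value are made up for illustration; the ref is produced by the normalization step described below):

<persName ref="Emma B. Andrews">Mrs. Andrews</persName> visited <placeName>Luxor</placeName>.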
This process is broken into two parts:
- pdfToTxt.sh Run an open-source OCR tool to transform pdfs to txt files.
- ocrPostProcess.sh Do some post-processing on the txt files to automatically correct some mistakes.
This script performs the following steps (conceptual commands for these steps are sketched after the list):
- Separate the pdf into pages using the command 'pdfseparate'
- Convert each page into a .tiff file. This is the input for the OCR engine.
- Run do_ocr.py, which is a thin wrapper around the tesserocr library.
- Remove the .tiff files, since they are very large and no longer needed.
- Also convert the pdfs to png files, since those are used by the interface.
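Conceptually, the per-page commands look roughly like the following (the exact arguments are defined in pdfToTxt.sh; the 200/500 density values are an assumption based on the raw_ocr/200 and raw_ocr/500 output directories):

pdfseparate book.pdf page-%d.pdf
convert -density 200 page-1.pdf page-1-200.tiff
convert -density 500 page-1.pdf page-1-500.tiff
python3 do_ocr.py  # wrapper around tesserocr; see the script for its arguments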
This has two components: normalize.py and merge.py
The input should be the two alternative text outputs of pdfToTxt.sh, or some collection of folders in the same format. There must be at least two input sets; otherwise, merge.py will not run.
normalize.py corrects spacing errors by tokenizing and then detokenizing the text. It also replaces nonstandard characters with their standard equivalents (e.g. curly quotes are changed to ASCII quotes).
Usage is
./normalize.py < inputFile > outputFile
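A minimal sketch of what the normalize step does, assuming nltk's tokenizer and detokenizer (the real logic is in normalize.py and may differ):

```python
import sys
from nltk.tokenize import word_tokenize                 # needs nltk.download('punkt')
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Nonstandard characters and their ASCII equivalents (illustrative subset).
CHAR_MAP = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def normalize_line(line):
    for curly, ascii_char in CHAR_MAP.items():
        line = line.replace(curly, ascii_char)
    tokens = word_tokenize(line)                         # splits mis-spaced punctuation
    return TreebankWordDetokenizer().detokenize(tokens)  # rejoins with standard spacing

for line in sys.stdin:
    print(normalize_line(line.rstrip("\n")))
```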
merge.py takes the two alternative OCR results and combines them via an algorithm that chooses the best alternative wherever the two OCR results disagree.
Here is an overview of the rules applied, with their corresponding functions (a sketch of the spelling rule appears after the list):
- doSpellingMerge(): Is only one alternative a valid English word? If so, choose the valid word ("the" vs. "tlo").
- doCapitalizationMerge(): Is only one alternative capitalized correctly? If so, choose the correct capitalization ("I know what to do" vs. "I know What to do").
- doUnigramFrequncyMerge(): Is one alternative a much more common word than the other? If so, choose the more common word (e.g. "must" vs. "mash").
- doPunctuationGarbageMerge(): Is one alternative blank while the other contains only punctuation characters? If so, choose the blank one (e.g. "" vs. ".").
- doExtraPunctuationMerge(): Do both alternatives contain non-punctuation, and does one alternative have fewer punctuation characters than the other? If so, choose the one with fewer punctuation characters (e.g. "the" vs. "'the.").
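As an illustration, the spelling rule alone could be sketched like this (the real implementation is in merge.py; the dictionary check here is an assumption):

```python
from nltk.corpus import words  # needs nltk.download('words')

ENGLISH = set(w.lower() for w in words.words())

def spelling_merge(alt1, alt2):
    # Prefer the alternative that is a valid English word, e.g. "the" over "tlo".
    valid1 = alt1.lower() in ENGLISH
    valid2 = alt2.lower() in ENGLISH
    if valid1 and not valid2:
        return alt1
    if valid2 and not valid1:
        return alt2
    return None  # rule does not apply; fall through to the next rule
```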
After the two alternatives are merged, there is some post-processing: via the markNonWords() method, we highlight any words that are probably misspelled.
merge.py outputs either text or html. The text can be color-coded if desired: blue marks places where we think we chose the correct alternative; red marks places where we either guessed or the word appears to be misspelled.
Usage is
./merge.py inputDirectory1 inputDirectory2 outputDirectory [options]
Options must not be separated by spaces, e.g. 'hc', not 'h c'. The options are as follows (an example invocation is shown after the list):
- h: output html instead of text
- c: output color-coded
- d: output debug view (same as color coded, except shows the rejected alternative crossed out in yellow; only supported for plain text)
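For example, to produce color-coded html from the two raw OCR directories (the directory names here are illustrative):

./merge.py data/myBook/raw_ocr/200 data/myBook/raw_ocr/500 data/myBook/merged_ocr hc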
See README of pentimenti.github.io
The output of the merge step can be used to add new books to this project. The required pentimenti book directories can be copied from the following locations (example commands follow the list):
- pentimenti.github.io/bookTitle/htmls/: data/outputDataDir/merged_ocr/step[n]-html/
- pentimenti.github.io/bookTitle/txts/: data/outputDataDir/merged_ocr/step[n]-txt/
- pentimenti.github.io/bookTitle/pngs/: data/outputDataDir/pngs/
(where n is the largest step number available)
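For example, if step3 were the largest step present (substitute the actual largest step number and your own book and data directory names), the copies could look like this:

cp -r data/outputDataDir/merged_ocr/step3-html/. pentimenti.github.io/bookTitle/htmls/
cp -r data/outputDataDir/merged_ocr/step3-txt/. pentimenti.github.io/bookTitle/txts/
cp -r data/outputDataDir/pngs/. pentimenti.github.io/bookTitle/pngs/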
The NER module uses the Stanford CoreNLP library to generate persName, placeName, and orgName tags. The Stanford CoreNLP output goes through a post-processing step to ensure that all named entities are on the same line (for compatibility with the XML generation component). There is also a rule-based post-processing step to include personal titles in the named entities (e.g. Mr., Mrs., Lord, etc.). Finally, there is another rule-based post-processing step that will reclassify named entities as vessels or hotels where applicable.
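As an illustration of the title rule only (the title list and handling of attributes in the project's actual post-processing step may differ):

```python
import re

# Hypothetical, simplified version of the title post-processing rule:
# pull a preceding personal title inside the persName tag.
TITLES = r"(?:Mr\.|Mrs\.|Miss|Dr\.|Lord|Lady|Sir)"

def include_titles(line):
    # "Mrs. <persName>Andrews</persName>" -> "<persName>Mrs. Andrews</persName>"
    return re.sub(rf"({TITLES}) <persName(.*?)>", r"<persName\2>\1 ", line)
```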
Named entity normalization for people combines two measures: lexical similarity and semantic similarity. The lexical similarity of named entities is computed using Levenshtein distance.
The semantic similarity is computed using cosine distance between the contexts in which any two named entities appear. Context is defined to be the paragraph in which the named entity appears. The paragraphs are vectorized using a tf-idf vectorizer and then reduced to a 200-dimensional vector via SVD.
The lexical similarity and semantic similarity are added and then clustered using affinity propagation. The most frequently occurring named entity in the cluster is used as the ref attribute.
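A sketch of this pipeline, assuming scikit-learn and substituting difflib's similarity ratio for true Levenshtein distance (all names and parameters here are illustrative, not the project's actual code):

```python
from collections import Counter
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def normalize_person_names(names, paragraphs):
    """names[i] appears in paragraphs[i]; returns one ref string per name."""
    # Semantic similarity: tf-idf over the containing paragraphs, reduced via SVD.
    tfidf = TfidfVectorizer().fit_transform(paragraphs)
    n_components = min(200, tfidf.shape[1] - 1)  # reduce to at most 200 dimensions
    vectors = TruncatedSVD(n_components=n_components).fit_transform(tfidf)
    semantic = cosine_similarity(vectors)

    # Lexical similarity between the surface forms of the names.
    lexical = np.array([[SequenceMatcher(None, a, b).ratio() for b in names]
                        for a in names])

    # Sum the two similarities and cluster with affinity propagation.
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(semantic + lexical)

    # The most frequent surface form in each cluster becomes the ref value.
    refs = {}
    for label in set(labels):
        members = [names[i] for i in range(len(names)) if labels[i] == label]
        refs[label] = Counter(members).most_common(1)[0][0]
    return [refs[label] for label in labels]
```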
Place name normalization is done using the place variations database.
The database file is in .xml format. Under the root element, each location entry must contain one element giving the name of the Wikipedia page that corresponds to that location.
Each entry has zero or more elements holding spelling variations that were automatically extracted by parsing Wikipedia dumps.
Each entry can have zero or more additional variation elements, which can be added manually by the annotator.
Each entry can also have one element that is added manually by the annotator.
When generating the XML markup, we look up each place name in the variations database. If one of an entry's variations (automatic or manual) matches the place name, we set the ref attribute of the placeName element to the manually added value if it exists, and otherwise to the Wikipedia page name.
See resolveVariations function in teigenerator.py script
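Conceptually, the lookup behaves like the following (the data structure and function name are invented for illustration; the real element names and logic are in resolveVariations):

```python
# Variation -> canonical name mapping, built from the variations database.
# "Assouan"/"Aswân" -> "Aswan" follows the example given above.
VARIATIONS = {"assouan": "Aswan", "aswân": "Aswan"}

def resolve_place_ref(surface_form):
    # Fall back to the surface form itself if no variation matches
    # (an assumption; the real code may leave the ref attribute unset).
    return VARIATIONS.get(surface_form.strip().lower(), surface_form)
```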
The tei generator script can be found here
The script takes as input: (1) the input text lines, (2) the header file, and (3) the variations database.
The output is a valid XML document containing teiHeader, div, p, title, date, placeName, orgName, and persName elements.
We use the dateutil library for parsing dates.
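For example (fuzzy parsing lets dateutil pull a date out of surrounding diary text; the exact call used in teigenerator.py may differ):

```python
from dateutil import parser

# "Monday, January 2nd, 1899" -> datetime(1899, 1, 2, 0, 0)
date = parser.parse("Monday, January 2nd, 1899", fuzzy=True)
print(date.date())  # 1899-01-02
```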