This project is in development; all of this can change.
Parapipeline is a pipeline for POS tagging of texts in multiple languages, sentence alignment, word alignment, and transliteration.
Make sure you have the following programs installed:
- Python 3.8 or later (not tested on Python 3.7 and earlier)
- `wget`
- All prerequisites for Hunalign
- Polyglot for transliteration requires `python-numpy` and `libicu-dev` (`apt-get install python-numpy libicu-dev`)
- `git-lfs`
Run git lfs install.
Run make to install necessary packages, compile taggers, aligners, download models, ...
The installation process has not been thoroughly tested on various systems (it should work on Ubuntu 18.04). If you encounter an error, it is most likely caused by a missing prerequisite.
There are scripts tag, transliterate, align, wordalign and run.
All scripts have the same arguments as run.
All scripts expect one sentence per line in UTF-8 encoded files.
The name of these files is NAME_LANG[_ID][.ext], where NAME is arbitrary text not containing _, LANG is an ISO 639-3 language code,
the optional ID distinguishes between multiple variants of the same text (e.g. different translations), and .ext is also optional.
File names must be distinct regardless of case (if you have files NAME and Name, it will cause errors).
See files in examples folder for some example input files.
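The naming convention can be checked before running the pipeline. The following is a minimal sketch, not the pipeline's own parser: the names `InputName`, `parse_input_name` and `find_case_collisions` are hypothetical, and the regular expression assumes that LANG is written as three lowercase letters.

```python
import re
from pathlib import Path
from typing import List, NamedTuple, Optional, Tuple


class InputName(NamedTuple):
    name: str               # NAME: arbitrary text without "_"
    lang: str               # LANG: ISO 639-3 code
    variant: Optional[str]  # ID: optional variant identifier


# Assumption: LANG is three lowercase letters; ID contains neither "_" nor ".".
_PATTERN = re.compile(
    r"^(?P<name>[^_]+)_(?P<lang>[a-z]{3})(?:_(?P<id>[^_.]+))?(?:\.[^_]*)?$"
)


def parse_input_name(path: str) -> InputName:
    """Split a file name of the form NAME_LANG[_ID][.ext] into its parts."""
    match = _PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"not in NAME_LANG[_ID][.ext] form: {path}")
    return InputName(match["name"], match["lang"], match["id"])


def find_case_collisions(paths: List[str]) -> List[Tuple[str, str]]:
    """Return pairs of paths whose file names differ only by case."""
    seen = {}
    collisions = []
    for path in paths:
        key = Path(path).name.lower()
        if key in seen:
            collisions.append((seen[key], path))
        else:
            seen[key] = path
    return collisions
```

For example, `parse_input_name("Hobbit_eng.txt")` yields the parts `Hobbit` and `eng` with no variant, while a name without a language code raises `ValueError`.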
The run script outputs tagged texts in XML files and, when possible, also sentence and word alignment files in XML.
See examples/outputs for example output of this pipeline.
Aligned sentences are represented by the link tag.
- Attribute `type` denotes the number of sentences from source and target text.
- Attribute `xtargets` is the alignment itself: `6 7;5` means that sentences 6 and 7 from the source are aligned with sentence 5 from the target.
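A sentence-level xtargets value can be split into its source and target sides with a few lines of Python. This is a minimal sketch based only on the `6 7;5` example above; the function name `parse_sentence_xtargets` is hypothetical, and identifiers are kept as strings because the exact identifier format in the output files is not specified here.

```python
from typing import List, Tuple


def parse_sentence_xtargets(xtargets: str) -> Tuple[List[str], List[str]]:
    """Split 'SRC SRC;TGT TGT' into (source sentence ids, target sentence ids)."""
    # The part before ";" lists source sentences, the part after lists target sentences.
    source_part, _, target_part = xtargets.partition(";")
    return source_part.split(), target_part.split()


source_ids, target_ids = parse_sentence_xtargets("6 7;5")
# Number of sentences on each side of the aligned block.
block_size = (len(source_ids), len(target_ids))
```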
Each link represents an "aligned block" - aligned sentences.
Attribute `xtargets` is a space-separated list of word alignments: `1:2;3:4` means that word 2 from sentence 1 in the source text is aligned with word 4 in sentence 3 in the target text.
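The word-level format can be decoded in the same spirit. The sketch below assumes only what the `1:2;3:4` example states (each item is `SENTENCE:WORD;SENTENCE:WORD`, items are separated by spaces); the function name `parse_word_xtargets` is hypothetical and values are kept as strings.

```python
from typing import List, Tuple

WordRef = Tuple[str, str]  # (sentence id, word id)


def parse_word_xtargets(xtargets: str) -> List[Tuple[WordRef, WordRef]]:
    """Turn '1:2;3:4 ...' into a list of ((src_sent, src_word), (tgt_sent, tgt_word))."""
    pairs = []
    for item in xtargets.split():
        source, target = item.split(";")
        src_sentence, src_word = source.split(":")
        tgt_sentence, tgt_word = target.split(":")
        pairs.append(((src_sentence, src_word), (tgt_sentence, tgt_word)))
    return pairs
```

Calling `parse_word_xtargets("1:2;3:4")` gives one pair linking word 2 of source sentence 1 to word 4 of target sentence 3.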

    usage: run.py [-h] [-o OUTPUT_DIR] N [N ...]

    Run pipeline.

    positional arguments:
      N                     List of files to be processed. Format: NAME_LANG[_ID][.ext], for example Hobbit_eng.txt

    optional arguments:
      -h, --help            show this help message and exit
      -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                            Directory to which to write the output files.

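
To make the argument structure concrete, here is a minimal argparse sketch that accepts the same command line as the usage message above. It is a reconstruction from that message only, not the actual source of run.py; the function name `build_parser` and the destination name `files` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same interface as the usage message (a sketch)."""
    parser = argparse.ArgumentParser(prog="run.py", description="Run pipeline.")
    # One or more input files, shown as N in the usage line.
    parser.add_argument(
        "files",
        metavar="N",
        nargs="+",
        help="List of files to be processed. "
             "Format: NAME_LANG[_ID][.ext], for example Hobbit_eng.txt",
    )
    parser.add_argument(
        "-o",
        "--output_dir",
        help="Directory to which to write the output files.",
    )
    return parser


# Equivalent to: run.py -o out Hobbit_eng.txt Hobbit_ces.txt
args = build_parser().parse_args(["-o", "out", "Hobbit_eng.txt", "Hobbit_ces.txt"])
```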
| Done | Language | Done | Language | Done | Language |
|---|---|---|---|---|---|
| ✅ | Afrikaans | ✅ | French | ✅ | Norwegian |
| ✅ | Albanian | ✅ | Georgian | ✅ | Polish |
| ✅ | Armenian | ✅ | German | ✅ | Portuguese |
| ✅ | Belarusian | ✅ | Hebrew | ✅ | Romanian |
| ❌ | Bosnian | ✅ | Hungarian | ✅ | Russian |
| ✅ | Bulgarian | ✅ | Italian | ✅ | Serbian |
| ✅ | Catalan | ✅ | Japanese | ✅ | Slovak |
| ✅ | Chinese | ❌ | Kashubian | ✅ | Slovenian |
| ✅ | Croatian | ✅ | Korean | ✅ | Spanish |
| ✅ | Czech | ✅ | Latvian | ✅ | Swedish |
| ✅ | Danish | ✅ | Lithuanian | ✅ | Turkish |
| ✅ | Dutch | ❌ | Lower Sorbian | ✅ | Ukrainian |
| ✅ | English | ✅ | Macedonian | ✅ | Upper Sorbian |
| ✅ | Estonian | ✅ | Modern Greek | ❌ | Yiddish |
| ✅ | Finnish | ❌ | Molise Slavic | | |
- Edit `.config/config.json`, follow the structure of the other languages in the file to add a new one.
  - For treetagger, the `.par` file has to be in `./pipeline/taggers/treetagger/lib/`.
  - For UDPipe, the `.udpipe` file has to be in `./pipeline/taggers/udpipe/models/`. Note that this uses UDPipe version 1; UDPipe version 2 models will not work.
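Before editing the configuration, it can help to confirm that a model file has the right extension and to compute where it is expected to live. The sketch below uses only the directories and extensions named above; the names `MODEL_LOCATIONS` and `expected_model_path` are hypothetical, and the structure of the configuration file itself is deliberately not modelled here.

```python
from pathlib import Path

# Directory and file extension per tagger, as stated in the list above.
MODEL_LOCATIONS = {
    "treetagger": (Path("pipeline/taggers/treetagger/lib"), ".par"),
    "udpipe": (Path("pipeline/taggers/udpipe/models"), ".udpipe"),
}


def expected_model_path(tagger: str, filename: str, repo_root: str = ".") -> Path:
    """Return where a model file is expected, checking its extension first."""
    directory, suffix = MODEL_LOCATIONS[tagger.lower()]
    if not filename.endswith(suffix):
        raise ValueError(f"{tagger} models must end with {suffix}: {filename}")
    return Path(repo_root) / directory / filename
```

A call such as `expected_model_path("udpipe", "model.udpipe").exists()` then tells you whether the file is already in place; `model.udpipe` is a placeholder file name.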
This section is about upgrading models.
To update UDPipe models, change the models section of ./pipeline/taggers/Makefile to download the desired models (and extract them ...).
Then change config/config.json so that each language that uses UDPipe points to the correct filename.
Change ./pipeline/taggers/treetagger/Makefile to download the version of treetagger you wish to use.
You can also add scripts to download more models and so on.
- UDPipe: Straka Milan, Hajič Jan, Straková Jana. UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 2016
- Treetagger: Helmut Schmid (1994): Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK
- Hunalign: D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005). Parallel corpora for medium density languages. In Proceedings of the RANLP 2005, pages 590-596.
- BTagger: https://github.com/agesmundo/BTagger
- Georgian Treetagger model comes from here: http://corpus.leeds.ac.uk/serge/mocky/ka.par
- Eflomal: https://github.com/robertostling/eflomal
CC BY-NC-SA
This work mainly depends on trained UDPipe models, which are licensed under CC BY-NC-SA.