Releases · bitextor/bitextor

30 May 10:31

lpla

v8.3

ee2c7ca

Bitextor 8.3: Snake Runner, the Sentence Retirer

I've seen things you people wouldn't believe. Roy Batty, The Preverticant

What's Changed

Neural tools (Vecalign and Neural Document Aligner) integration by @cgr71ii in #235
CI and tests updates and fixes by @cgr71ii in #238
Range of paragraphs count using option paragraphIdentification by @lpla in #241
Document pair output file by @aarongaliano in #242
Update Bicleaner(-AI) submodules given new Bicleaner Hardrules by @lpla in #244
Remove Linguacrawl from Bitextor by @aarongaliano in #248
- It is still compatible with Bitextor regarding the WARC format, but crawling management should be performed manually
Metadata code refactorization by @cgr71ii in #245
Now you can use compatible documents (like PDFs, TXTs, HTMLs) in the Bitextor input without encapsulating them into WARC or Prevertical formats! Check the directories and directoriesFile documentation, by @aarongaliano in #247
PDFprocessing option (previously PDFextract). Now it is a list that allows you to choose whether to use pdf2html, pdfextract or Apache Tika (new PDF processor), by @aarongaliano in #247
Now you can use warc2html (e.g. to process PDFs) with warc2text, by @aarongaliano in #247
New Bitextor multilang option (if activated, warc2text will extract content in different languages from the same document), by @aarongaliano in #247
New Bitextor argument bicleanerExtraArgs to pass extra arguments to Bicleaner(-AI) by @lpla in #250
Add fastspell apt dependencies to Dockerfile by @aliciannz in #249
Updated the scikit-learn base dependency to 1.1.3, including new models for the dictionary-based document aligner, by @aarongaliano in #243
New L2 normalization in TF-IDF translation-based document aligner by @lpla in #252
Updated Python requirements, submodules, and documentation.
Minor bug fixes and changes (including #253)
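As a rough illustration of the L2 normalization item above: scaling each TF-IDF vector to unit Euclidean length makes the dot product of two vectors equal to their cosine similarity, so a document does not score higher simply because its weights are larger. The sketch below is not Bitextor's document aligner code; the helpers `l2_normalize` and `cosine_score` and the toy vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a sparse TF-IDF vector (term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # an empty or all-zero vector is left unchanged
    return {term: w / norm for term, w in vec.items()}

def cosine_score(a, b):
    """Dot product of two L2-normalized sparse vectors (cosine similarity)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(w * b[t] for t, w in a.items() if t in b)

# Same term proportions, very different magnitudes (toy data).
short_doc = {"house": 1.0, "red": 2.0}
long_doc = {"house": 10.0, "red": 20.0}
score = cosine_score(short_doc, long_doc)
```

After normalization the two toy documents are treated as identical in direction, regardless of how large their raw weights are.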

New Contributors

@aarongaliano made their first contribution in #242
@aliciannz made their first contribution in #249

Full Changelog: v8.2...v8.3

Notes

The bitextor-v8.3.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.3.zip package or cloning the repository at the v8.3 tag.

We will support Bitextor 8.x branch until the next major version is released.

Contributors

lpla, cgr71ii, and 2 other contributors


12 Apr 16:35

lpla

v8.2

1b6e770

Bitextor 8.2: Snow White and the Hunspell

"I told you to run.", The Huntsman

What's Changed

Prevertical2text integration by @cgr71ii in #223
Paragraph identification by @cgr71ii in #225
Change default sentence splitter (now it is Loomchild's Segment) and Bicleaner AI integration by @cgr71ii in #226
Use headers for descriptive column names in TSV input/output files by @cgr71ii in #227
Add pip optional dependencies by @cgr71ii in #229

Full Changelog: v8.1.1...v8.2

Notes

The bitextor-v8.2.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.2.zip package or cloning the repository at the v8.2 tag.

We will support Bitextor 8.x branch until the next major version is released.

Contributors

cgr71ii


18 Oct 07:44

lpla

v8.1.1

987c0df

v8.1.1

Added support for Fedora installation. Check INSTALL.md for dnf commands.
Fixed tests/run-tests.sh so the tests can run either sequentially (for low-resource servers, using the bash variable CI="true") or in parallel.
Removed default file type filter in wget crawler, as it has issues with URLs without extension.
Bicleaner model training and dictionary generation options reworked:
- bicleaner will enable or disable Bicleaner, and bicleanerModel will contain the path to the model.
- Bicleaner model training now needs to be explicitly enabled with bicleanerGenerateModel, instead of checking whether the model provided through the bicleanerModel config setting exists.
- Dictionary generation now needs to be enabled through generateDic, instead of checking whether the dictionary exists.
Updated Python requirements.
Minor bug fixes.
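To make the reworked options concrete, here is a hedged sketch of the decision logic they imply. The option names bicleaner, bicleanerModel, bicleanerGenerateModel and generateDic come from the notes above; the `plan_steps` helper, the step names, the error raised for a missing model path and the placeholder path are hypothetical, and only illustrate that training and dictionary generation are driven by explicit flags rather than by checking whether files exist.

```python
def plan_steps(config):
    """Return the optional steps implied by explicit config flags.

    Nothing here inspects the filesystem: the decision is made from the
    flags alone (illustrative logic, not Bitextor's own code).
    """
    steps = []
    if config.get("generateDic", False):
        steps.append("generate-dictionary")
    if config.get("bicleaner", False):
        if "bicleanerModel" not in config:
            # Assumed behaviour for this sketch only.
            raise ValueError("bicleaner is enabled but bicleanerModel is not set")
        if config.get("bicleanerGenerateModel", False):
            steps.append("train-bicleaner-model")
        steps.append("run-bicleaner")
    return steps

example = {
    "bicleaner": True,
    "bicleanerModel": "/path/to/model.yaml",  # placeholder path
    "bicleanerGenerateModel": True,
}
steps = plan_steps(example)
```

With these flags the sketch plans model training followed by cleaning; dropping bicleanerGenerateModel would plan cleaning only, even if the model file did not exist yet.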

Notes

The bitextor-v8.1.1.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.1.1.zip package or cloning the repository at the v8.1.1 tag.

We will support Bitextor 8.x branch until the next major version is released.


24 Sep 09:59

lpla

v8.1

31cae7a

The Lost Word: Jurassic Warc (PG-8.1 rating)

"Oh my God! A snake! Help me!", Dr. Robert Burke

v8.1 Changelog

Major rework on paths and installation folders to allow Bitextor to be installed in a specific location
- Check out installation instructions and details in INSTALL.md
Replaced Tensorflow and Keras in the dictionary-based document aligner with scikit-learn
General clean up of Python code
Updated submodules and Python requirements versions

Notes

The bitextor-v8.1.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.1.zip package or cloning the repository at the v8.1 tag.

We will support Bitextor 8.x branch until the next major version is released.


07 Jul 09:57

lpla

v8.0.1

06acec8

v8.0.1

Deferred crawling standoff annotation reconstruction script has been rewritten for better performance
- It uses an LRU dict as a limited-size, in-memory hash cache
- Uses native warcio and Moses sentence splitter (Python port)
Fix bitextor-buildTMX.py dedup option
- Dedup was keeping the sentence strings from the occurrence with the best Bifixer score, but the other columns (URL, deferred crawling standoff annotation, Bicleaner score...) from the last occurrence
Bitextor now checks whether a provided host is valid
Updated submodules
- warc2text removed URLs lowercasing
Added more tests to the CI, including Bitextor with deferred crawling standoff annotation and its reconstruction.
Updated requirements and submodules to their latest stable version.
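The dedup fix above can be pictured as follows: for each duplicated sentence pair, the whole row with the best score is kept, rather than mixing the best-scoring text with the metadata of the last occurrence. This is a minimal sketch, not the code in bitextor-buildTMX.py; the column names (`src`, `trg`, `url`, `score`), the tie-breaking rule and the `dedup_best` helper are hypothetical.

```python
def dedup_best(rows, key=("src", "trg"), score="score"):
    """Keep one complete row per key: the one with the highest score.

    Ties keep the earliest row. Output follows first-seen key order.
    """
    best = {}
    for row in rows:
        k = tuple(row[c] for c in key)
        if k not in best or row[score] > best[k][score]:
            best[k] = row  # keep the entire row, never a mix of rows
    return list(best.values())

rows = [
    {"src": "Hello", "trg": "Hola", "url": "a.example/1", "score": 0.4},
    {"src": "Hello", "trg": "Hola", "url": "b.example/2", "score": 0.9},
    {"src": "Hello", "trg": "Hola", "url": "c.example/3", "score": 0.7},
    {"src": "Bye", "trg": "Adiós", "url": "a.example/1", "score": 0.5},
]
unique = dedup_best(rows)
```

Here the surviving "Hello" row carries the URL of the best-scoring occurrence (b.example/2), not the URL of the last one seen.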

Notes

The bitextor-v8.0.1.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.0.1.zip package or cloning the repository at the v8.0.1 tag.

We will support Bitextor 8.x branch until the next major version is released.


20 Apr 14:13

lpla

v8.0.0

cd869ca

Kill Bill-ingual: Vol. 8

"We have unfinished business.", Beatrix

v8.0 Changelog

Deep rewrite of Bitextor Snakefile for a vast performance improvement.
- Some config parameters and intermediate generated files also changed, so reusing old config files and transient or permanent folders from old runs would introduce issues.
- Snakemake project structure now matches the standard.
- We are now listed in a comprehensive catalog of standards-compliant, public Snakemake workflows maintained by the official Snakemake developers.
- All features from previous Bitextor version work.
  - Machine translation system training now should be performed manually.
Added a new crawler: linguacrawl, specialized in full TLD crawling.
Added a new method for deferred crawling only using Murmurhash hashes at the sentence alignment step.
- A reconstructor is also provided: deferred-annotation-reconstructor.sh
Added sharding, which groups domains into 1 GB shards for a more balanced job running, done via giashard (Golang Internet Archive SHARDing).
A new WARC processor has been implemented in C++: warc2text
- It is faster than the previous text extraction tool giawarc (now deprecated) and warc2preprocess.
- Although it has the same features as giawarc, it still lacks features like PDF processing or boilerplate removal that are available in warc2preprocess.
Multiple improvements to bitextor-warc2htmlwarc.py and bitextor-warc2preprocess.py:
- Added lxml as a text extraction parsing library option, and html5lib as an optional, additional parser
  - html5lib is the cleanest supported parser but also the slowest
- Deleted alcazar as all code and references from upstream vanished.
- Fixed ‘simple’ text extraction parser for some table tags and new HTML5 tags.
- ftfy is now disabled by default.
New translation based document aligner written in C++ (document-aligner folder)
- Faster, with lower memory requirements than the previous Python code.
Moses tokenizers are now used by default through an efficient wrapper.
- This will run by default if "wordTokenizers" is not defined in Bitextor configuration.
- This is the recommended option if your language is supported by Moses.
Moses sentence splitter original script has been replaced with a faster port by Mediacloud.
- This will run by default if "sentenceSplitters" is not defined in Bitextor configuration.
- This is the recommended option if your language is supported by the latest Moses release version of the sentence splitter script.
Added support for Biroamer
Deprecated autotools and replaced them with CMake.
Refactored and updated requirements and submodules for lots of performance and security improvements.
- Now you can ignore the Python dependencies from modules you don't need to run by commenting those lines in requirements.txt before installing them.
- Updated Snakemake to v6.0.5.
- Refactored bleualign-cpp code to improve efficiency and memory requirements.
- pdf-extract now processes text with sentence-join (consult Bitextor documentation for instructions)
- Deleted old and deprecated files and folders, like slurm, nmt workflow for MarianNMT or pdf-extract (replaced by wrappers in WARC processors).
General system stability improvements to enhance the user's experience.
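As a rough picture of what size-based sharding achieves, the sketch below packs domains into shards of at most a given size so that each job gets a comparable amount of data. It is not giashard's algorithm; the greedy first-fit strategy, the `shard_domains` helper, the 1 GiB limit and the domain sizes are assumptions made for illustration.

```python
GIB = 1024 ** 3

def shard_domains(domain_sizes, limit=GIB):
    """Greedily group domain -> size_in_bytes entries into shards <= limit.

    Domains are placed largest first into the first shard with room.
    A domain larger than the limit gets a shard of its own.
    """
    shards = []  # each shard: {"domains": [...], "size": int}
    ordered = sorted(domain_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for domain, size in ordered:
        for shard in shards:
            if shard["size"] + size <= limit:
                shard["domains"].append(domain)
                shard["size"] += size
                break
        else:
            # No existing shard has room: open a new one.
            shards.append({"domains": [domain], "size": size})
    return shards

MIB = 1024 ** 2
sizes = {
    "a.example": 700 * MIB,
    "b.example": 600 * MIB,
    "c.example": 300 * MIB,
    "d.example": 200 * MIB,
}
shards = shard_domains(sizes)
```

With these toy sizes the four domains end up in two shards of roughly similar size instead of four jobs of very different length.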

Conda release builds are up.
Docker builds have the same automatic build system, adding nightlies from Github master branch pushes (edge tag in Dockerhub).
Continuous integration has been activated through Github Actions.
Discussions are now open in Github! Use them to chat about releases or topics that don't fit in issues section.
Discord server is also up for a more live chat with other users and developers! Also there are some bots to keep you updated with some news about Bitextor development and related projects.

Notes

The bitextor-v8.0.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.0.zip package or cloning the repository at the v8.0 tag.

We will support Bitextor 8.x branch until the next major version is released.


30 Jul 18:00

lpla

v8.0.0-pre

65b0cd1

pre-8.0.0 Paracrawl release (Pre-release)

v8.0.0-pre Changelog

Deep rewrite of Bitextor Snakefile for a vast performance improvement.
- Still missing dictionary-based document aligner and hunalign options and rules, will be integrated soon.
- We recommend revising Bitextor README.md to check new option naming or formats.
- Some intermediate files also changed, so reusing old runs would introduce issues.
Added sharding mode, which groups domains into 1 GB shards for a more balanced job running.
- It uses giashard tool.
Added lxml as a text extraction parsing library option to bitextor-warc2htmlwarc.py, and html5lib as an optional, additional parser.
- This is needed for proper deferred crawling in newest Bitextor code.
  - Deferred crawling is still only supported under warc2preprocess preprocessor.
- html5lib is the cleanest supported parser (like a web browser) but also the slowest.
Fixed simple text extraction parser in bitextor-warc2preprocess.py for some table tags and new HTML5 tags.
ftfy is now disabled by default.
Moses sentence splitter and tokenizer are now used by default through an efficient Python wrapper.
- This will happen if wordTokenizers and sentenceSplitters are not defined.
- This is the recommended option if your language is supported by these scripts.
Updated README.md.
Refactored and updated requirements and submodules for lots of performance improvements.
- Now you can ignore the Python dependencies from modules you don't need to run by commenting those lines in requirements.txt before installing them.
- Deferred crawling functions now can be easily imported.
- Refactored bleualign-cpp code.
  - Faster, with lower memory requirements.
- New translation based document aligner written in C++.
  - Faster, with lower memory requirements than the previous Python code.
- New base64 scripts from kpu/preprocess and cache fixes.
- Bifixer now filters out sentence pairs if one side has more than 1024 characters.
General system stability improvements to enhance the user's experience.

Notes

Docker image will be updated once v8.0.0 gets released.

The bitextor-v8.0.0-pre.zip package includes submodule code, but you still need to compile binaries like bleualign. If you build the project after cloning the git repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.0.0-pre.zip package or cloning the repository at the v8.0.0-pre tag.


06 Mar 11:31

lpla

v7.3.2

058d06f

v7.3.2

Fixed warc2htmlwarc.py optional non-compressed output.
Fixed bicleaner and bifixer cached call from Bitextor, improving performance.
Fixed paths in test files.
Fixed heritrix waiting time while creating initial crawling files.
Fixed some deprecation errors from exceptions and old options.
Fixed TMX and TXT deduplicated output, now writes first occurrence text of a deduplicated sentence.
Fixed reproducibility issues using bicleaner cached call by creating a Bitextor optional parameter called bicleanerCacheWithSents.
Updated submodules to fix some bugs.
- Bifixer: fixed crash on empty segments.
- Bicleaner: version 0.13, less aggressive hardrules for short sentences (3-word sentences).
Fixed cld3 input in bitextor-warc2preprocess.py, which was causing most documents to be detected as 'English'.
Fixed text extracted from <span> elements by adding a space after their content, in the warc2preprocess 'simple' text extractor.
Updated some requirements.txt for security and dependency issues.
Updated latest docker image and tagged as v7.3.2.

Notes

We have started integrating Bitextor 8.0 development branches into the master branch. If you need more stable code rather than the latest features, please use released versions/tags or the stable 7.x branch.

The bitextor-v7.3.2.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.3.2.zip package or cloning the repository at the v7.3.2 tag.

We will support Bitextor 7.x branch until Bitextor 8 is released.


20 Feb 14:29

lpla

v7.3.1

6981aaa

v7.3.1

Fixed example and test config files typos and new Bicleaner model filenames
Fixed tilde paths (~ as /home/user) when used in config files
Fixed warcio HTTPHeader modification without recalculating content length (reported upstream for more details)
Fixed bitextor-warc2htmlwarc.py stdin and stdout run mode.
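For context on the tilde fix above: a leading ~ is shell shorthand and is not resolved automatically when a path is read from a config file, so it has to be expanded explicitly. A minimal sketch with the Python standard library follows; the `expand_config_paths` helper and the example config are hypothetical, not Bitextor's code.

```python
import os

def expand_config_paths(config, path_keys):
    """Return a copy of config with a leading '~' expanded in the given keys."""
    expanded = dict(config)
    for key in path_keys:
        if key in expanded:
            expanded[key] = os.path.expanduser(expanded[key])
    return expanded

# Toy config: only the keys listed as paths are touched.
config = {"dataDir": "~/bitextor/data", "lang1": "en"}
resolved = expand_config_paths(config, ["dataDir"])
```

The original dictionary is left untouched, so the unexpanded value stays available for logging or for writing the config back out.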

Notes

The bitextor-v7.3.zip package includes submodule code and binaries. If you build the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.3.zip package or cloning the repository at the v7.3 tag.

We will support Bitextor 7.x branch until Bitextor 8 is released.


12 Feb 10:21

lpla

v7.3

711646e

Morty Python: Crawling Corpus, S07E03

"Always look on the end (side) of life", PEP-373

v7.3 Changelog

Added support for Heritrix crawler and installation instructions.
Added plainTextHashes option (incremental recrawling using mmh3).
WARC file read and write processes are now standardized (individually compressed records in gzip format).
Integrated cld3 in both giawarc and warc2preprocess WARC processors, with optional install and use instructions.
Added several optional WARC preprocessing variables like onlyPreprocessing, preprocessLangs and targetLangs to allow processing more than two languages in the same run.
- This changed the type of some basic variables, like wordTokenizers and sentenceSplitters, and added new ones like reverseOutputPair.
Added morphological analysers option (like Apertium) to improve document and hunalign sentence alignment.
Restructured the output and temporary files folders.
- We added dataDir as a folder with the data produced during WARC preprocessing step.
- Preprocessed documents are now stored in one file per language.
Automated Bicleaner model training if a model is not provided by the user.
Updated README.md.
Updated Docker image and added dockerfile.
Updated requirements and submodules for lots of performance improvements.
Dropped support of EOL Python 2.
General system stability improvements to enhance the user's experience.
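The "individually compressed records" layout mentioned above means each record is its own gzip member and the file is the concatenation of those members, which lets a reader start decompressing at any record boundary. The sketch below shows that property with the standard gzip module on placeholder byte strings; it does not build valid WARC records and is not Bitextor's reader or writer. The helpers `write_members` and `read_from` are hypothetical.

```python
import gzip
import io

def write_members(records):
    """Compress each record as a separate gzip member and concatenate them."""
    buf = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(buf.tell())  # byte offset where this member starts
        buf.write(gzip.compress(record))
    return buf.getvalue(), offsets

def read_from(data, offset):
    """Decompress every member from a given member offset onward."""
    return gzip.decompress(data[offset:])

records = [b"record-1\n", b"record-2\n", b"record-3\n"]
data, offsets = write_members(records)
everything = read_from(data, 0)
tail = read_from(data, offsets[1])
```

Because every member is self-contained, reading can begin at the second record's offset without touching the first one.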

Notes

We will support Bitextor 7.x branch until Bitextor 8 is released.
