Releases: bitextor/bitextor
Bitextor 8.3: Snake Runner, the Sentence Retirer
"I've seen things you people wouldn't believe.", Roy Batty, The Preverticant
What's Changed
- Neural tools (Vecalign and Neural Document Aligner) integration by @cgr71ii in #235
- CI and tests updates and fixes by @cgr71ii in #238
- Range of paragraphs count using option `paragraphIdentification` by @lpla in #241
- Document pair output file by @aarongaliano in #242
- Update Bicleaner(-AI) submodules given new Bicleaner Hardrules by @lpla in #244
- Remove Linguacrawl from Bitextor by @aarongaliano in #248
  - Linguacrawl is still compatible with Bitextor regarding the WARC format, but crawling management should be performed manually
- Metadata code refactorization by @cgr71ii in #245
- Now you can use compatible documents (like PDFs, TXTs, HTMLs) in the Bitextor input without encapsulating them into WARC or Prevertical formats! Check `directories` and `directioriesFile` documentation, by @aarongaliano in #247
- `PDFprocessing` option (previously `PDFextract`). Now it is a list that allows you to choose whether to use pdf2html, pdfextract or Apache Tika (new PDF processor), by @aarongaliano in #247
  - Now you can use warc2html (e.g. to process PDFs) with warc2text, by @aarongaliano in #247
- New Bitextor `multilang` option (if activated, warc2text will extract content in different languages from the same document), by @aarongaliano in #247
- New Bitextor argument `bicleanerExtraArgs` to pass extra arguments to Bicleaner(-AI) by @lpla in #250
- Add fastspell apt dependencies to Dockerfile by @aliciannz in #249
- Updated the scikit-learn base dependency to 1.1.3, including new models for the dictionary-based document aligner by @aarongaliano in #243
- New L2 normalization in TF-IDF translation-based document aligner by @lpla in #252
- Updated Python requirements, submodules, and documentation.
- Minor bug fixes and changes (including #253)
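The L2 normalization listed above for the TF-IDF document aligner can be illustrated with a minimal Python sketch. This is not the aligner's code (the function names and the weights are made up for illustration); it only shows the general idea that once vectors are scaled to unit length, their dot product equals their cosine similarity, so document length no longer dominates the score.

```python
import math


def l2_normalize(vector):
    """Scale a sparse vector (term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0.0:
        return dict(vector)  # leave all-zero vectors unchanged
    return {term: w / norm for term, w in vector.items()}


def dot(a, b):
    """Dot product of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(term, 0.0) for term, w in a.items())


# Hypothetical TF-IDF weights: doc_b has the same term profile as doc_a,
# but every weight is twice as large (for example, a longer document).
doc_a = {"snake": 3.0, "warc": 4.0}
doc_b = {"snake": 6.0, "warc": 8.0}

raw_score = dot(doc_a, doc_b)  # grows with vector length
score = dot(l2_normalize(doc_a), l2_normalize(doc_b))  # cosine similarity
print(raw_score, round(score, 6))
```

Without normalization the score of the pair grows with the size of the weights; with it, two documents pointing in the same direction score 1 regardless of length.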
New Contributors
- @aarongaliano made their first contribution in #242
- @aliciannz made their first contribution in #249
Full Changelog: v8.2...v8.3
Notes
The bitextor-v8.3.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v8.3.zip tarball or cloning the repo at the v8.3 tag.
We will support Bitextor 8.x branch until the next major version is released.
Bitextor 8.2: Snow White and the Hunspell
"I told you to run.", The Huntsman
What's Changed
- Prevertical2text integration by @cgr71ii in #223
- Paragraph identification by @cgr71ii in #225
- Change default sentence splitter (now it is Loomchild's Segment) and Bicleaner AI integration by @cgr71ii in #226
- Use headers for descriptive column names in TSV input/output files by @cgr71ii in #227
- Add pip optional dependencies by @cgr71ii in #229
Full Changelog: v8.1.1...v8.2
Notes
The bitextor-v8.2.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v8.2.zip tarball or cloning the repo at the v8.2 tag.
We will support Bitextor 8.x branch until the next major version is released.
v8.1.1
- Added support for Fedora installation. Check INSTALL.md for `dnf` commands.
- Fixed `tests/run-tests.sh` to run those tests either sequentially (low-resource server, using bash variable `CI="true"`) or in parallel.
- Removed default file type filter in `wget` crawler, as it has issues with URLs without extension.
- Bicleaner model training and dictionary generation options reworked:
  - `bicleaner` will enable or disable Bicleaner, and `bicleanerModel` will contain the path to the model.
  - Bicleaner model training will need to be explicitly enabled with `bicleanerGenerateModel` instead of checking whether the model provided through the `bicleanerModel` config setting exists.
  - Dictionary generation will need to be set through `generateDic` instead of checking whether the dictionary exists.
- Updated Python requirements.
- Minor bug fixes.
Notes
The bitextor-v8.1.1.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v8.1.1.zip tarball or cloning the repo at the v8.1.1 tag.
We will support Bitextor 8.x branch until the next major version is released.
The Lost Word: Jurassic Warc (PG-8.1 rating)
"Oh my God! A snake! Help me!", Dr. Robert Burke
v8.1 Changelog
- Major rework on paths and installation folders to allow Bitextor to be installed in a specific location
- Check out installation instructions and details in INSTALL.md
- Replaced Tensorflow and Keras in the dictionary-based document aligner with scikit-learn
- General clean up of Python code
- Updated submodules and Python requirements versions
Notes
The bitextor-v8.1.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v8.1.zip tarball or cloning the repo at the v8.1 tag.
We will support Bitextor 8.x branch until the next major version is released.
v8.0.1
- Deferred crawling standoff annotation reconstruction script has been rewritten for better performance
  - This one benefits from LRU dict as a limited-size hash memory-based cache
  - Uses native warcio and Moses sentence splitter (Python port)
- Fix `bitextor-buildTMX.py` dedup option
  - Dedup was keeping sentences strings from the best score from Bifixer, but the other columns from the last occurrence (url, deferred crawling standoff annotation, bicleaner score...)
- Bitextor now checks whether a provided host is valid
- Updated submodules
  - `warc2text` removed URLs lowercasing
- Added more tests to the CI, including Bitextor with deferred crawling standoff annotation and its reconstruction.
- Updated requirements and submodules to their latest stable version.
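The dedup bug described above (sentence text taken from the best-scored occurrence, remaining columns from the last one) can be illustrated with a small Python sketch. It is not the code of `bitextor-buildTMX.py`; the column names (`hash`, `src`, `url`, `score`) and the data are hypothetical. The point is that selecting the whole row at once keeps every column consistent with the chosen occurrence.

```python
def dedup_best_row(rows, key_col, score_col):
    """Keep, for each key, the complete row with the highest score.

    Ties keep the first row seen. Selecting whole rows means the text,
    URL and score always come from the same occurrence.
    """
    best = {}
    for row in rows:
        key = row[key_col]
        if key not in best or row[score_col] > best[key][score_col]:
            best[key] = row
    return list(best.values())


# Hypothetical rows: two occurrences share the same dedup key "a1".
rows = [
    {"hash": "a1", "src": "Hello", "url": "http://example.com/1", "score": 0.9},
    {"hash": "a1", "src": "hello", "url": "http://example.com/2", "score": 0.4},
    {"hash": "b2", "src": "Bye", "url": "http://example.com/3", "score": 0.7},
]

deduped = dedup_best_row(rows, "hash", "score")
for row in deduped:
    print(row["src"], row["url"], row["score"])
```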
Notes
The bitextor-v8.0.1.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v8.0.1.zip tarball or cloning the repo at the v8.0.1 tag.
We will support Bitextor 8.x branch until the next major version is released.
Kill Bill-ingual: Vol. 8
"We have unfinished business.", Beatrix
v8.0 Changelog
- Deep rewrite of Bitextor Snakefile for a vast performance improvement.
- Some config parameters and intermediate generated files also changed, so reusing old config files and transient or permanent folders from old runs would introduce issues.
- Snakemake project structure now matches the standard.
- Now we are listed on a comprehensive catalog of standards compliant, public, Snakemake workflows from the official Snakemake developers
- All features from previous Bitextor version work.
- Machine translation system training now should be performed manually.
- Added a new crawler: linguacrawl, specialized in full TLD crawling.
- Added a new method for deferred crawling only using Murmurhash hashes at the sentence alignment step.
  - A reconstructor is also provided: `deferred-annotation-reconstructor.sh`
- Added sharding, which groups domains into 1 GB shards for a more balanced job running, done via giashard (Golang Internet Archive SHARDing).
- A new WARC processor has been implemented in C++: warc2text
  - It is faster than the previous text extraction tools `giawarc` (now deprecated) and `warc2preprocess`.
  - Although it has the same features as giawarc, it still lacks features like PDF processing or boilerplate removal that are available in `warc2preprocess`.
- Multiple improvements to `bitextor-warc2htmlwarc.py` and `bitextor-warc2preprocess.py`:
  - Added `lxml` text extraction parsing library option, and `html5lib` as optional and additional parsing
    - `html5lib` is the cleanest supported parser but also the slowest
  - Deleted `alcazar` as all code and references from upstream vanished.
  - Fixed ‘simple’ text extraction parser for some table tags and new HTML5 tags.
  - `ftfy` is now disabled by default.
- New translation based document aligner written in C++ (`document-aligner` folder)
  - Faster and less memory requirements than the previous Python code.
- Moses tokenizers are now used by default through an efficient wrapper.
- This will run by default if "wordTokenizers" is not defined in Bitextor configuration.
- This is the recommended option if your language is supported by Moses.
- Moses sentence splitter original script has been replaced with a faster port by Mediacloud.
- This will run by default if "sentenceSplitters" is not defined in Bitextor configuration.
- This is the recommended option if your language is supported by the latest Moses release version of the sentence splitter script.
- Added support for Biroamer
- Deprecated autotools and replaced them with CMake.
- Refactored and updated requirements and submodules for lots of performance and security improvements.
- Now you can ignore the Python dependencies from modules you don't need to run by commenting those lines in requirements.txt before installing them.
- Updated Snakemake to v6.0.5:
- Refactored bleualign-cpp code to improve efficiency and memory requirements.
- pdf-extract now processes text with sentence-join (consult Bitextor documentation for instructions)
- Deleted old and deprecated files and folders, like `slurm`, `nmt` workflow for MarianNMT or `pdf-extract` (replaced by wrappers in WARC processors).
- General system stability improvements to enhance the user's experience.
- Conda release builds are up.
- Docker builds have the same automatic build system, adding nightlies from Github master branch pushes (`edge` tag in Dockerhub).
- Continuous integration has been activated through Github Actions.
- Discussions are now open in Github! Use them to chat about releases or topics that don't fit in issues section.
- Discord server is also up for a more live chat with other users and developers! Also there are some bots to keep you updated with some news about Bitextor development and related projects.
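The sharding item above (domains grouped into shards of about 1 GB for more balanced jobs) can be pictured with the following Python sketch. It is only an illustration of size-capped grouping and makes no claim about how giashard actually assigns domains; the function name, domain names and sizes are all hypothetical.

```python
def group_into_shards(domain_sizes, max_bytes):
    """Greedily pack whole domains into shards of at most max_bytes each.

    Domains are never split; a domain larger than max_bytes ends up in a
    shard of its own.
    """
    shards = []
    current, current_size = [], 0
    for domain, size in sorted(domain_sizes.items()):
        # Close the current shard if adding this domain would exceed the cap.
        if current and current_size + size > max_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(domain)
        current_size += size
    if current:
        shards.append(current)
    return shards


MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical crawled sizes per domain.
sizes = {"a.example": 600 * MB, "b.example": 500 * MB, "c.example": 300 * MB}

shards = group_into_shards(sizes, GB)
print(shards)
```

With these made-up sizes, `a.example` fills a shard alone because adding `b.example` would exceed 1 GB, while `b.example` and `c.example` fit together.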
Notes
The bitextor-v8.0.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v8.0.zip tarball or cloning the repo at the v8.0 tag.
We will support Bitextor 8.x branch until the next major version is released.
pre-8.0.0 Paracrawl release
v8.0.0-pre Changelog
- Deep rewrite of Bitextor Snakefile for a vast performance improvement.
  - Still missing dictionary-based document aligner and `hunalign` options and rules, will be integrated soon.
  - We recommend revising Bitextor README.md to check new option naming or formats.
  - Some intermediate files also changed, so reusing old runs would introduce issues.
- Added sharding mode, which groups domains into 1 GB shards for a more balanced job running.
- It uses giashard tool.
- Added `lxml` text extraction parsing library option to `bitextor-warc2htmlwarc.py` and `html5lib` optional and additional parsing.
  - This is needed for proper deferred crawling in newest Bitextor code.
    - Deferred crawling is still only supported under `warc2preprocess` preprocessor.
  - `html5lib` is the cleanest supported parser (like a web browser) but also the slowest.
- Fixed `simple` text extraction parser in `bitextor-warc2preprocess.py` for some table tags and new HTML5 tags.
- `ftfy` is now disabled by default.
- Moses sentence splitter and tokenizer are now used by default through an efficient Python wrapper.
  - This will happen if `wordTokenizers` and `sentenceSplitters` are not defined.
  - This is the recommended option if your language is supported by these scripts.
- Updated README.md.
- Refactored and updated requirements and submodules for lots of performance improvements.
  - Now you can ignore the Python dependencies from modules you don't need to run by commenting those lines in `requirements.txt` before installing them.
  - Deferred crawling functions now can be easily imported.
  - Refactored `bleualign-cpp` code.
    - Faster and less memory requirements.
  - New translation based document aligner written in C++.
    - Faster and less memory requirements than the previous Python code.
  - New base64 scripts from `kpu/preprocess` and `cache` fixes.
  - Bifixer now filters sentence pairs if one side has more than 1024 characters.
- General system stability improvements to enhance the user's experience.
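The Bifixer length filter mentioned in the list above (dropping sentence pairs when one side has more than 1024 characters) amounts to a check like the one below. This is a sketch, not Bifixer's code; the function name and the sample pairs are hypothetical, and only the 1024-character limit comes from the changelog.

```python
MAX_CHARS = 1024  # limit stated in the changelog


def keep_pair(src, trg, max_chars=MAX_CHARS):
    """Return True when neither side exceeds max_chars characters."""
    return len(src) <= max_chars and len(trg) <= max_chars


# Hypothetical sentence pairs: the second one has an overlong source side.
pairs = [
    ("short source", "short target"),
    ("x" * 1025, "short target"),
]

kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept))
```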
Notes
Docker image will be updated once v8.0.0 gets released.
The bitextor-v8.0.0-pre.zip tarball includes submodule code, but you still need to compile binaries like bleualign. If you build the project after cloning the git repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v8.0.0-pre.zip tarball or cloning the repo at the v8.0.0-pre tag.
v7.3.2
- Fixed `warc2htmlwarc.py` optional non-compressed output.
- Fixed `bicleaner` and `bifixer` cached call from Bitextor, improving performance.
- Fixed paths in test files.
- Fixed `heritrix` waiting time while creating initial crawling files.
- Fixed some deprecation errors from exceptions and old options.
- Fixed TMX and TXT deduplicated output, which now writes the first occurrence text of a deduplicated sentence.
- Fixed reproducibility issues using `bicleaner` cached call by creating a Bitextor optional parameter called `bicleanerCacheWithSents`.
- Updated submodules to fix some bugs.
  - Bifixer: fixed crash on empty segments.
  - Bicleaner: version 0.13, less aggressive hardrules for short sentences (3-word sentences).
- Fixed `cld3` input in `bitextor-warc2preprocess.py`, which was causing most documents to be detected as 'English'.
- Fixed extracted text from `<span>` by adding a space after their content, in the `warc2preprocess` text extractor `simple`.
- Updated some `requirements.txt` for security and dependency issues.
- Updated latest docker image and tagged as `v7.3.2`.
Notes
We started integrating Bitextor 8.0 development branches into master branch. If you don't need latest features but a more stable code, please use released versions/tags or the stable branch 7.x.
The bitextor-v7.3.2.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.3.2.zip tarball or cloning the repo at the v7.3.2 tag.
We will support Bitextor 7.x branch until Bitextor 8 is released.
v7.3.1
- Fixed example and test config files typos and new Bicleaner model filenames
- Fixed tilde paths (~ as `/home/user`) when used in config files
- Fixed warcio HTTPHeader modification without recalculating content length (reported upstream for more details)
- Fixed `bitextor-warc2htmlwarc.py` stdin and stdout run mode.
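As an illustration of the tilde fix listed above, the sketch below expands a leading `~` in path-valued config entries using Python's standard `os.path.expanduser`. It is not Bitextor's own code; the helper name is hypothetical, and `dataDir` is used only as an example of a path setting.

```python
import os


def expand_config_paths(config, path_keys):
    """Return a copy of config with a leading ~ expanded in the given keys."""
    expanded = dict(config)
    for key in path_keys:
        if key in expanded:
            expanded[key] = os.path.expanduser(expanded[key])
    return expanded


# Hypothetical config fragment with a home-relative path.
config = {"dataDir": "~/bitextor/data"}

resolved = expand_config_paths(config, ["dataDir"])
print(resolved["dataDir"])
```

The original dictionary is left untouched, so the unexpanded value is still available for logging or error messages.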
Notes
The bitextor-v7.3.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.3.zip tarball or cloning the repo at the v7.3 tag.
We will support Bitextor 7.x branch until Bitextor 8 is released.
Morty Python: Crawling Corpus, S07E03
"Always look on the end (side) of life", PEP-373
v7.3 Changelog
- Added support for Heritrix crawler and installation instructions.
- Added `plainTextHashes` option (incremental recrawling using mmh3).
- WARC files read and write processes are standardized now (individually compressed records in gzip format).
- Integrated `cld3` in both `giawarc` and `warc2preprocess` WARC processors, with optional install and use instructions.
- Added several optional WARC preprocessing variables like `onlyPreprocessing`, `preprocessLangs` and `targetLangs` to allow processing more than two languages in the same run.
  - This changed the type of some basic variables, like `wordTokenizers` and `sentenceSplitters`, and added new ones like `reverseOutputPair`.
- Added morphological analysers option (like Apertium) to improve document and `hunalign` sentence alignment.
- Restructured the output and temporary files folders.
  - We added `dataDir` as a folder with the data produced during WARC preprocessing step.
  - Preprocessed documents are now stored in one file per language.
- Automated bicleaner model training if not provided by the user.
- Updated README.md.
- Updated Docker image and added dockerfile.
- Updated requirements and submodules for lots of performance improvements.
- Dropped support of EOL Python 2.
- General system stability improvements to enhance the user's experience.
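The "individually compressed records in gzip format" convention mentioned in the list above can be sketched with Python's standard `gzip` module. This is an illustration only, not Bitextor's WARC code: the record contents are placeholders rather than valid WARC records, and the function name is hypothetical.

```python
import gzip


def write_records(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


# Placeholder payloads standing in for WARC records.
records = [
    b"WARC/1.0\r\nrecord one\r\n\r\n",
    b"WARC/1.0\r\nrecord two\r\n\r\n",
]
blob = write_records(records)

# A concatenation of gzip members is itself a valid gzip stream,
# so the whole file can still be decompressed in one pass.
whole = gzip.decompress(blob)

# Because each record is a separate member, a reader that knows where a
# record starts can decompress that record alone.
offset = len(gzip.compress(records[0]))
second = gzip.decompress(blob[offset:])
print(second)
```

Compressing per record costs a little compression ratio compared with one stream over the whole file, but it allows random access to single records by byte offset.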
Notes
The bitextor-v7.3.zip tarball includes submodule code and binaries. If you build the project after cloning the repository, you first need to run `git submodule update --init --recursive`. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.3.zip tarball or cloning the repo at the v7.3 tag.
We will support Bitextor 7.x branch until Bitextor 8 is released.