Releases: bitextor/monotextor
Monotextor v1.1: Title in the Sky
Added
- Apply Monofixer to document titles.
- Detect sensitive data in paragraphs.
- Compressed preverticals support.
- New paragraph id format (prevertical2text).
- Remove tabs, newlines, and carriage returns that generate additional lines or fields when normalization is disabled (Monofixer).
- Detect Serbo-Croatian script (FastSpell).
- Automatic installation of Hunspell dictionaries (FastSpell).
Fixed
- Python 3.10 compatibility.
- Check that Monocleaner model exists.
- Snakemake always running everything despite no file changes.
- Fix encoding errors in sentence splitting that produced unexpected offsets in document metadata.
- Fix warning format when the paragraph id exceeds the total number of paragraphs.
- Monotextor imports in bitextor_split.
- Correct names in stat files.
Changed
- Group Serbo-Croatian under hbs (FastSpell).
- Better langid coverage for Icelandic (FastSpell).
- Filter sentences by Monocleaner score and language id.
- Remove hardcoded Monocleaner threshold.
- Use pigz in rules that are parallelized.
- Updated installation instructions.
- Update Snakemake.
- Update lxml.
Monotextor v1.0: Princess Monotextor
- First implementation of the pipeline to process monolingual data from prevert files.
- Monocleaner and Monofixer integration.
- Skip sentence splitting to process paragraph-level data.
- Propagate document and paragraph metadata from prevert files to the paragraph-level format.