Releases: LanguageMachines/ticcltools
v0.10
[Ko van der Sloot]
- LDcalc:
- No longer filter out n-grams with common parts. Was too aggressive
- Removed some more commented-out old code
- chainclean: added a --caseless option. (Default is true)
- Removed Roaring versions of the code. Lacked maintenance for years.
- internally shifting towards UnicodeString in general
- a lot of C++ cleanup, with some refactoring, splitting up long blobs of code
v0.9
v0.8
v0.7.1
v0.7
[Martin Reynaert]
- updated man pages
- updated README.md
[Ko van der Sloot]
Numerous bug fixes and additions. Added a .so for common functions.
The bitType has been changed to uint64_t (the biggest int possible), which
triggered some code adaptations (values < 0 are not possible).
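As an illustration of the kind of adaptation an unsigned type can require, here is a minimal sketch. It assumes the code needs the difference between two such values; the `bitType` alias below only mirrors the name in this note, and `abs_diff` is a hypothetical helper, not a function taken from the ticcltools sources.

```cpp
#include <cstdint>

// Hypothetical alias mirroring the release note; the real definition
// lives in the ticcltools sources.
using bitType = std::uint64_t;

// Absolute difference of two unsigned values.
// Writing `a - b` directly would wrap around to a very large number
// when b > a, because a uint64_t can never hold a value below zero.
inline bitType abs_diff(bitType a, bitType b) {
  return a > b ? a - b : b - a;
}
```

With a signed type, a check such as `a - b < 0` would work; with `uint64_t` it is always false, so comparisons have to be made before subtracting.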
-
TICCL-unk:
- some changes in UNK detection
- added a --hemp option
- create a .fore.clean file when a background corpus is merged in
-
TICCL-stats:
- added a -n option to use a newline as delimiter
-
TICCL-indexer(NT):
- better and faster implementation
- added --confstats option
-
TICCL-LDcalc:
- added a --follow option for debugging purposes
- fix for #30
- added --low and --high parameters
-
TICCL-rank:
- added a --follow option for debugging purposes
- added --subtractartifrqfeature1 and --subtractartifrqfeature2 options
- replaced pairs_combined ranking by median ranking
- added an n-gram filter
-
TICCL-chain:
-
TICCL-chainclean: new module to clean chain ranked files
-
TICCL-anahash:
- accept lexicons without frequencies too. (also simple word lists)
- added a -o option
v0.6
Intermediate release, with a lot of new code to handle n-grams.
A lot of refactoring was also done, for clearer and more maintainable code.
This is still work in progress.
-
TICCL-unk:
- more extensive acronym detection
- fixed artifreq problems in 'clean' punctuated words
- added filters for 'unwanted' characters
- added a ligature filter to convert evil ligatures
- normalize all hyphens to a 'normal' one (-)
- use a better definition of punctuation (unicode character class is not
good enough to decide)
-
TICCL-lexstat:
- the 'separator' symbol should get freq=0, so it isn't counted
- the clip value is added to the output filename
-
TICCL-indexer:
- indexer and indexerNT now produce the same output, using different
strategies when a --foci file is used.
-
TICCL-LDcalc:
major overhaul for n-grams
- added an ngram point column to the output (so NOT backward compatible!)
- produce a '.short' list for short word corrections
- produce a '.ambi' file with a list of n-grams related to short words
- prune a lot of ngrams from the output
-
TICCL-rank:
- output is sorted now
- honor the ngram-points from the new LDcalc. (so NOT backward compatible!)
-
TICCL-chain: new module to chain ranked files
-
TICCL-lexclean:
- added a -x option for 'inverse' alphabet
-
TICCL-anahash:
- added a --list option to produce a list of words and anagram values
-
added metadata file: codemeta.json
v0.5
- updated configuration, also for Mac OSX
- use of more ticcutils stuff: diacriticsfilter
- added a TICCL-mergelex program
- the OMP_THREAD_LIMIT environment variable was ignored sometimes
- TICCL-unk:
- fixed a problem in artifreq handling
- changed acronym detection (work in progress)
- added -o option
- TICCL-lexstat:
- added TTR output
- added -o option
- TICCL-indexer:
- now also handles a --foci file, with some speed-up
- added a -t option
- TICCL-LDcalc:
- be less picky on a few wrong lines in the data
- added some tests
- when libroaring is installed, roaring versions of some modules are built (experimental)
- updated man pages