Cache management now works in Data-Juicer, thanks to a new serialization method.
Cache compression is supported: caches that are not currently in use are compressed automatically and decompressed when needed, which saves up to 50% of disk space.
Distributed data processing with Ray is supported now. #21
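To make the cache compression idea above concrete, here is a minimal sketch using only Python's standard library. It is not Data-Juicer's implementation: the helper names compress_cache and decompress_cache, the gzip format, and the file name are all assumptions chosen for illustration. It only shows the general pattern of packing an idle cache file and restoring it on demand.

```python
import gzip
import os
import shutil
import tempfile
from typing import Tuple


def compress_cache(path: str) -> str:
    """Gzip a cache file and delete the uncompressed original (hypothetical helper)."""
    compressed = path + ".gz"
    with open(path, "rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return compressed


def decompress_cache(compressed: str) -> str:
    """Restore a cache file from its gzip archive and delete the archive."""
    path = compressed[: -len(".gz")]
    with gzip.open(compressed, "rb") as src, open(path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(compressed)
    return path


def demo() -> Tuple[int, int, bool]:
    """Round-trip a repetitive cache file; return (raw size, packed size, intact)."""
    payload = b'{"text": "hello world"}\n' * 10_000
    with tempfile.TemporaryDirectory() as tmp:
        cache = os.path.join(tmp, "cache-0001.bin")  # made-up file name
        with open(cache, "wb") as f:
            f.write(payload)
        raw_size = os.path.getsize(cache)
        archive = compress_cache(cache)
        packed_size = os.path.getsize(archive)
        restored = decompress_cache(archive)
        with open(restored, "rb") as f:
            intact = f.read() == payload
    return raw_size, packed_size, intact
```

The actual saving depends on the data and the compressor; the "up to 50%" figure comes from the release notes, not from this sketch.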
Config system optimization:
Keep only text_keys and remove the previous, misleading arguments text_key(s)_to_process/load. #13
A new argument, export_in_parallel, is added to control whether the result datasets are exported in parallel. #17
Display the config table after config parsing is ready. #17
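The config changes above can be pictured with a small stand-in parser. This sketch uses Python's standard argparse, not Data-Juicer's own config system; only the argument names text_keys and export_in_parallel come from these notes, while the default values, the flag style, and the table layout are assumptions.

```python
import argparse


def parse_config(argv=None) -> argparse.Namespace:
    """Parse the two arguments named in the notes (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="illustrative config parser")
    parser.add_argument("--text_keys", nargs="+", default=["text"],
                        help="field names that hold the text to process")
    parser.add_argument("--export_in_parallel", action="store_true",
                        help="export the result datasets in parallel")
    return parser.parse_args(argv)


def config_table(cfg: argparse.Namespace) -> str:
    """Render the parsed config as a simple two-column table."""
    rows = sorted(vars(cfg).items())
    width = max(len(key) for key, _ in rows)
    return "\n".join(f"{key.ljust(width)} | {value!r}" for key, value in rows)


if __name__ == "__main__":
    cfg = parse_config(["--text_keys", "text", "title", "--export_in_parallel"])
    # Show the table only after parsing has finished.
    print(config_table(cfg))
```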
Others
Replace original string constants with constant enums. #13
Expand the checkpoint protection range to cover the exporting process. #14
Remove the storage of extra intermediate variables in document_simhash_deduplicator to reduce memory usage. #14
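As an illustration of the move from string constants to enums listed above, the following sketch shows the general benefit: a mistyped key fails immediately instead of silently creating a new field. The class StatsKey, its members, and record_stat are hypothetical and are not Data-Juicer's actual constants.

```python
from enum import Enum


class StatsKey(str, Enum):
    """Hypothetical stat names; not Data-Juicer's actual constants."""
    NUM_WORDS = "num_words"
    LANG_SCORE = "lang_score"


def record_stat(stats: dict, key, value: float) -> dict:
    """Store a value under a validated key; unknown keys raise ValueError."""
    stats[StatsKey(key).value] = value
    return stats
```

With plain strings, a typo such as "num_wrods" would be stored without complaint; here StatsKey("num_wrods") raises ValueError.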