07 Mar 12:24

HYLcool

156ed20

Release v0.2.0: Multimodal Support & DJ-SORA

New Features

🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
💥 "BetterMixture": our second data-centric LLM competition has kicked off and will end soon. #174

New OPs

Multimodal

video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

Video

Filter

video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
video_resolution_filter: filters samples according to the resolution of videos in them. #227
video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
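
Most of the filters above share one pattern: compute a per-sample statistic and keep the sample only if it falls inside a configured range. Below is a minimal Python sketch of that pattern for the aspect ratio r=w/h. The function name and the min_ratio/max_ratio parameters are illustrative assumptions, not Data-Juicer's actual API.

```python
from fractions import Fraction


def keep_by_aspect_ratio(width, height, min_ratio, max_ratio):
    """Return True when r = width / height lies in [min_ratio, max_ratio].

    Hypothetical helper: ratios are given as strings such as "9/21" and
    compared exactly with Fraction to avoid float rounding at the bounds.
    """
    ratio = Fraction(width, height)
    return Fraction(min_ratio) <= ratio <= Fraction(max_ratio)


# A 1920x1080 video (r = 16/9) is inside [9/21, 21/9]; a 3840x1080 one is not.
print(keep_by_aspect_ratio(1920, 1080, "9/21", "21/9"))
print(keep_by_aspect_ratio(3840, 1080, "9/21", "21/9"))
```

The same range check applies to duration, resolution, or any other scalar statistic once it has been computed for the sample.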

Mapper

video_split_by_scene_mapper: splits videos into scene clips. #227
video_split_by_duration_mapper: splits videos by specified duration interval. #227
video_split_by_key_frame_mapper: splits videos by their keyframes. #227
video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
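
As an illustration of splitting by a fixed duration interval, the sketch below computes the (start, end) time ranges that such a split would produce. It only does the interval arithmetic; the function name and the decision to keep a shorter final segment are assumptions, and the actual cutting (for example with ffmpeg) is left out.

```python
def duration_segments(total_duration, split_duration):
    """Return consecutive (start, end) intervals covering the whole video.

    Hypothetical helper: the last interval may be shorter than
    split_duration when the total is not an exact multiple.
    """
    if split_duration <= 0:
        raise ValueError("split_duration must be positive")
    segments = []
    start = 0.0
    while start < total_duration:
        end = min(start + split_duration, total_duration)
        segments.append((start, end))
        start = end
    return segments


# A 25-second video split every 10 seconds gives three clips.
print(duration_segments(25.0, 10.0))
```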

Deduplicator

video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
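
Exact-match deduplication is commonly done by hashing the raw bytes of each item and keeping only the first sample per digest. The sketch below shows that general technique; the "video_bytes" key and the choice of SHA-256 are assumptions for illustration, not a description of how video_deduplicator is implemented.

```python
import hashlib


def deduplicate_exact(samples, key="video_bytes"):
    """Keep the first sample for each distinct byte content.

    Hypothetical helper: each sample is a dict holding raw bytes under `key`.
    """
    seen = set()
    kept = []
    for sample in samples:
        digest = hashlib.sha256(sample[key]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


samples = [
    {"id": 1, "video_bytes": b"clip-a"},
    {"id": 2, "video_bytes": b"clip-b"},
    {"id": 3, "video_bytes": b"clip-a"},  # byte-identical to sample 1
]
print([s["id"] for s in deduplicate_exact(samples)])
```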

Audio

audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (SNRs, computed with the Non-Negative Matrix Factorization algorithm) are within a specified range. #189
audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

Image

image_blur_mapper: adds random noise to images to blur them. #180
image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

Document Updates

"Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what they look like.
Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
OP Insight Visualization Demo code: adds a demo to visualize how each OP works.

Bugs Fixed

Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
Fix the bug that some images will be lost when converting their paths to absolute paths. #178
Fix the dependency problems of OPs that depend on other OPs. #181
Fix the bug that the predict.py tool gets stuck on the help page. #183
Fix face_area_filter: constrains the detection coordinates within the image. #202
Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
Fix or update invalid links in Data-Juicer. #201 #219

Others

Optimize the model management module. #196 #227
Optimize the unit test actions. #195 #196 #216 #227
Optimize the multiprocessing strategy; model inference efficiency can also be improved thanks to GPU support. #203 #217 #222 #227
Update the docker image with JDK. #208
Support more multimodal (video) dataset conversion tools: #227
- InternVid: 234M video-caption data
- Youku-mPLUG: 36TB video-caption data
- Video-ChatGPT: 100k video-instruction data
Optimize the generated multimodal data storage. #227
Support running data-juicer process jobs on Aliyun PAI-DLC. #227
Better support for multi-machine distributed data processing in Ray mode. #227

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

@liuyanyi helps to fix a bug in quality classifier tools. #183
@co63oc helps to fix some typos. #215
@liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
@zhenqincn helps to add more papers to the Awesome LLM Data doc. #226

Contributors

co63oc, liuyanyi, and zhenqincn

05 Jan 09:31

HYLcool

v0.1.3

a3c8310

Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed

New Features

Data-Juicer now supports Python 3.7-3.10!
- We released a pybind version of simhash-py library named simhash-pybind to solve the Python version limitation problem.
- We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
- A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
- Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, ...
- Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
Auto-HPO tools are now available. They help users find better hyperparameters for OPs, either according to specified objective functions or with simple 3-sigma rules alone. #65 #140
Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP replace_content_mapper. #143
Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160
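
For readers unfamiliar with the 3-sigma rule mentioned above: it derives a range of mean ± 3 standard deviations from the observed values of a statistic, and values outside that range are treated as outliers. The sketch below shows only this arithmetic; how the Auto-HPO tools apply it to OP hyperparameters is not specified here, and the function name is an assumption.

```python
import statistics


def three_sigma_range(values):
    """Return (low, high) = mean -/+ 3 * population standard deviation.

    Hypothetical helper illustrating the generic 3-sigma rule.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


# Example: per-sample word counts observed in an analysis pass.
word_counts = [120, 135, 128, 110, 142, 131, 125]
low, high = three_sigma_range(word_counts)
print(f"keep samples with word count in [{low:.1f}, {high:.1f}]")
```

A range obtained this way could serve as the min/max thresholds of a range-based Filter.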

New OPs

Text

chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc) #51
remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
text_action_filter: keeps samples containing action verbs in their texts. #122
text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
remove_repeat_sentences_mapper: removes repeated sentences in the text. #149
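
The behavior described for replace_content_mapper corresponds to a plain regular-expression substitution. The sketch below uses Python's re.sub to show the idea; the function name, its parameters, and the simplified email pattern are illustrative assumptions rather than the OP's real signature.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+\.[\w.-]+"


def replace_content(text, pattern, repl=""):
    """Replace every match of `pattern` in `text` with `repl`.

    Hypothetical helper: an empty `repl` removes the matches, while a
    placeholder string keeps a marker where the content was.
    """
    return re.sub(pattern, repl, text)


text = "Contact alice@example.com today."
print(replace_content(text, EMAIL_PATTERN, "<EMAIL>"))
print(replace_content(text, EMAIL_PATTERN))
```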

Image

image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72

Multimodal

image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139
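
To make the cosine-similarity criterion concrete, here is a small sketch that scores two feature vectors and applies a range check. In practice the vectors would come from a model such as CLIP; here they are hand-written toy values, and both function names and the min_score/max_score parameters are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


def keep_by_similarity(image_vec, text_vec, min_score, max_score):
    """Keep the sample when the similarity lies in [min_score, max_score]."""
    return min_score <= cosine_similarity(image_vec, text_vec) <= max_score


# Toy 2-D features: identical directions match, orthogonal ones do not.
print(keep_by_similarity([1.0, 0.0], [1.0, 0.0], 0.1, 1.0))
print(keep_by_similarity([1.0, 0.0], [0.0, 1.0], 0.1, 1.0))
```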

Bugs fixed

Pin pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
Fix the bug where OPs nlpaug_en_mapper and nlpcda_zh_mapper generate indefinite numbers of augmented samples. #76
Fix the bug where maximum_line_length_filter might generate stats of inconsistent types (int vs. float), which leads to an error when processing datasets. #147
Fix the missing dataset_dir attribute when the input dataset path is remote or a mixture of several datasets. #155 #157
Fix command-line argument parsing errors in some cases. #108 #165
Store simhash value as string type to avoid errors from PyArrow. #168 #170

Others

Dependency importing optimization: some dependencies are now required and imported only when they are actually used. #35 #82
Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
Optimize the cache directory selection logic. #43
Support limiting the number of samples when mixing datasets. #86
Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
OP language_id_score_filter supports keeping samples in multiple languages now. #125 #151

Acknowledgement

Here we thank public contributors for their PRs to make Data-Juicer better!

@JONGSKY helps to remove some unnecessary code. #85
@xuruidong helps to fix several broken links in the README doc. #142

Contributors

xuruidong and JONGSKY

28 Sep 06:32

HYLcool

v0.1.2

5bd715d

Release v0.1.2: more core functions are available now.

New OPs

nlpaug_en_mapper: simple data augmentation using nlpaug library for English corpus. #17
nlpcda_zh_mapper: simple data augmentation using nlpcda library for Chinese corpus. #17
token_num_filter: filters out samples by the number of tokens in them. HF tokenizers are supported. #24

New features

OP Fusion #14
- Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
Cache management #19
- Cache management now works in Data-Juicer thanks to a new serialization method.
- Cache compression is supported: caches are automatically compressed when they are no longer needed and decompressed when needed again, which saves up to 50% of disk space.
Distributed data processing with Ray is supported now. #21
Config system optimization:
- Only keep text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
- A new argument export_in_parallel is added to control whether to export the result datasets in parallel. #17
- Display the config table after config parsing is ready. #17
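
The idea behind OP Fusion can be sketched in a few lines: when several Filters need the same intermediate result (a "contextual variable" such as the tokenized words), it is computed once per sample and shared. The code below is a simplified illustration with hypothetical filter functions and a plain dict as the shared context; it does not reproduce Data-Juicer's actual fusion mechanism.

```python
def words_of(text, context):
    """Tokenize once per sample and cache the result in the shared context."""
    if "words" not in context:
        context["words"] = text.split()
        context["tokenize_calls"] = context.get("tokenize_calls", 0) + 1
    return context["words"]


def min_words_filter(text, context, min_words=3):
    """Keep samples with at least `min_words` words."""
    return len(words_of(text, context)) >= min_words


def max_avg_word_length_filter(text, context, max_len=10):
    """Keep non-empty samples whose average word length is at most `max_len`."""
    words = words_of(text, context)
    return bool(words) and sum(map(len, words)) / len(words) <= max_len


def fused_filter(text, filters):
    """Run all filters against one shared context for this sample.

    Returns (keep, tokenize_calls) so the saved work is visible.
    """
    context = {}
    keep = all(f(text, context) for f in filters)
    return keep, context.get("tokenize_calls", 0)


# Both filters need the words, but tokenization runs only once.
print(fused_filter("data juicer fuses filters",
                   [min_words_filter, max_avg_word_length_filter]))
```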

Others

Replace original string constants with constant enums. #13
Expand the checkpoint protection range to cover the exporting process. #14
Remove extra intermediate variables storage in document_simhash_deduplicator to save more memory. #14
Docs updates. #15 #16
The PyPI package is available. You can now install data-juicer with pip install py-data-juicer. #23
Docker building is available now. The official docker image for Docker Hub is in progress. #23
Deploy the unit tests for Data-Juicer. #29

11 Aug 05:31

yxdyc

v0.1.0

d4ab729

Release v0.1.0, the first internal version for open-source

Summarization - Table of Contents

Data-Juicer: A Data-Centric Text Processing System for Large Language Models
Table of Contents
- Features
- Prerequisites
- Installation
- Quick Start
  - Data Processing
  - Data Analysis
  - Data Visualization
  - Build Up Config Files
  - Preprocess raw data (Optional)
- Documentation | 文档
- Data Recipes
- Demos
- License
- Contributing
- References

Features

Broad Range of Operators: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.
Specialized Toolkits: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.
Systematic & Reusable: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.
Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.
Comprehensive Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, SFT, en, zh, and more scenarios.
User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.
Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.
Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.

Releases: modelscope/data-juicer