# Refined open source datasets by Data-Juicer

We found that some "bad" samples still remain in existing processed datasets (e.g., RedPajama, The Pile), so we use Data-Juicer to refine them and feed the refined data to LLMs for better performance.

We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe: for each stat computed over a dataset (e.g., perplexity, text length), samples whose value falls outside the range [μ − 3σ, μ + 3σ] are treated as outliers, so the filter thresholds are set to those bounds.
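As a concrete illustration, here is a minimal sketch of this thresholding in plain Python/NumPy; the synthetic stat values and the perplexity framing are illustrative and not tied to any particular recipe:

```python
import numpy as np

def three_sigma_bounds(values: np.ndarray) -> tuple[float, float]:
    """Filter thresholds covering mean ± 3 standard deviations of a stat."""
    mean, std = values.mean(), values.std()
    return mean - 3 * std, mean + 3 * std

# Synthetic per-sample stat (think: perplexity): mostly well-behaved values
# plus a few extreme outliers that the 3-sigma rule should cut off.
rng = np.random.default_rng(0)
stat = np.concatenate([rng.normal(350, 40, 10_000), [6_000, 9_000, 12_000]])

lo, hi = three_sigma_bounds(stat)
kept = (stat >= lo) & (stat <= hi)
print(f"thresholds = [{lo:.1f}, {hi:.1f}], keeping {kept.sum()}/{stat.size} samples")
```

In a recipe, the resulting lower/upper bounds then map onto the min/max hyperparameters of the corresponding filter op.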

## Before and after refining for Pretraining Text Datasets

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Github Code | 73,208,524<br>+ 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml<br>stack-code-refine.yaml<br>redpajama-stack-code-deduplicate.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama<br>The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama<br>The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
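
Each of the configs above is a standard Data-Juicer recipe, so a refined subset can be reproduced by pointing Data-Juicer's processing entry point at it. A minimal sketch, assuming the `tools/process_data.py` entry point from Data-Juicer's quick start and a recipe path under `configs/data_juicer_recipes/` (both paths may differ in your checkout; input/output paths are set inside the recipe itself):

```python
import subprocess

# Reproduce the refined arXiv subset by running its recipe with
# Data-Juicer's processing tool; the recipe path is illustrative.
subprocess.run(
    [
        "python", "tools/process_data.py",
        "--config", "configs/data_juicer_recipes/redpajama-arxiv-refine.yaml",
    ],
    check=True,
)
```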

## Before and after refining for Alpaca-CoT Datasets

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | 39 Subsets of Alpaca-CoT |
| Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | 28 Subsets of Alpaca-CoT |

## Before and after refining for Multimodal Datasets

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | LLaVA-1.5 |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun<br>ModelScope<br>HuggingFace | InternVid (606k)<br>Panda-70M (605k)<br>MSR-VTT (6k) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun<br>ModelScope | InternVid (606k)<br>Panda-70M (2,599k)<br>Pexels (198k)<br>MSR-VTT (6k) |

## Evaluation Results

- LLaVA pretrain (LCS-558k): models pretrained on the refined dataset and fine-tuned on the original instruction dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B (baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (refined pretrain dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the refined datasets outperform the baseline (T2V-Turbo) on VBench. T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), which in turn is the teacher model of Data-Juicer (DJ, 228k). Please refer to the Sandbox for more details.
| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |

## For Video Datasets

We provide an example recipe for processing video datasets, general-video-refine-example.yaml, to help users make better use of the video-related OPs. It applies three types of OPs:

- Text-Only: improve the dataset quality based on the video captions.
- Video-Only: improve the dataset quality based on the video features.
- Text-Video: improve the dataset quality based on the alignment between text and videos.

Users can start processing their own video datasets from this recipe; a sketch of how the three OP types combine is shown below.
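
As an illustration of how these three OP types combine, here is a minimal, self-contained sketch in plain Python; the stat functions, field names, and thresholds are stand-ins for real Data-Juicer OPs (e.g., caption length filters, video duration filters, and a CLIP-style text-video similarity filter), not its actual API or defaults:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    caption: str           # video caption text
    duration_s: float      # a video-only feature (stand-in for real video stats)
    text_video_sim: float  # precomputed text-video alignment score (e.g., CLIP)

def text_only_ok(s: Sample) -> bool:
    # Text-Only OP: judge quality from the caption alone.
    return 2 <= len(s.caption.split()) <= 100

def video_only_ok(s: Sample) -> bool:
    # Video-Only OP: judge quality from video features alone.
    return 2.0 <= s.duration_s <= 60.0

def text_video_ok(s: Sample) -> bool:
    # Text-Video OP: judge the alignment between caption and video.
    return s.text_video_sim >= 0.25

samples = [
    Sample("a dog runs on the beach", 8.4, 0.31),
    Sample("x", 8.4, 0.31),                      # dropped by Text-Only
    Sample("a timelapse of clouds", 0.5, 0.40),  # dropped by Video-Only
    Sample("a cat plays piano", 12.0, 0.05),     # dropped by Text-Video
]
kept = [s for s in samples
        if text_only_ok(s) and video_only_ok(s) and text_video_ok(s)]
print(f"kept {len(kept)}/{len(samples)} samples")  # -> kept 1/4 samples
```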