We found that there are still some "bad" samples in existing processed datasets (e.g., RedPajama, The Pile), so we use Data-Juicer to refine them and feed the refined data to LLMs for better performance.
We use a simple 3-σ rule to set the hyperparameters for the OPs in each recipe.
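Concretely, for a statistic computed on a subset (e.g., per-sample perplexity), the rule keeps samples whose value falls within mean ± 3 × std, so a filter's bound is set accordingly. The fragment below is only an illustrative sketch with made-up numbers, not an excerpt from an actual recipe; the op and parameter names should be checked against the OP documentation, and the real thresholds are in the linked config files.

```yaml
# Hypothetical example: if the analysis reports a mean perplexity of 1500 with a
# standard deviation of 700 on a subset, the 3-sigma rule sets the upper bound
# to 1500 + 3 * 700 = 3600.
process:
  - perplexity_filter:
      lang: en
      max_ppl: 3600
```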
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama |
GitHub Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml stack-code-refine.yaml redpajama-stack-code-deduplicate.yaml | Aliyun ModelScope HuggingFace | RedPajama, The Stack |
StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun ModelScope HuggingFace | RedPajama, The Pile |
EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun ModelScope HuggingFace | 39 Subsets of Alpaca-CoT |
Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun ModelScope HuggingFace | 28 Subsets of Alpaca-CoT |
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun ModelScope HuggingFace | LLaVA-1.5 |
Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun ModelScope HuggingFace | InternVid (606k), Panda-70M (605k), MSR-VTT (6k) |
Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun ModelScope | InternVid (606k), Panda-70M (2,599k), Pexels (198k), MSR-VTT (6k) |
- LLaVA pretrain (LCS-558k): the model pretrained on the refined dataset and fine-tuned on the original instruction dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-1.5-13B (baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
LLaVA-1.5-13B (refined pretrain dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the refined datasets outperform the baseline (T2V-Turbo) on VBench. T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is in turn the teacher model of Data-Juicer (DJ, 228k). Please refer to Sandbox for more details.
model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
---|---|---|---|---|---|---|---|---|---|
T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |
model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
---|---|---|---|---|---|---|---|---|---|---|
T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |
We also provide an example recipe for video dataset processing, general-video-refine-example.yaml, to help users make better use of the video-related OPs. It applies three types of OPs:
- Text-Only: to improve the dataset quality according to the video captions.
- Video-Only: to improve the dataset quality according to the video features.
- Text-Video: to improve the dataset quality according to the alignment between text and videos.

Users can start processing their own video datasets based on this recipe; a sketch of the three types of OPs is shown below.
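As a rough illustration only (a hypothetical fragment, not the actual contents of general-video-refine-example.yaml; op names, parameters, and thresholds should be verified against the OP documentation), a recipe combining the three types of OPs might look like this:

```yaml
process:
  # Text-Only: filter samples by properties of the video captions.
  - language_id_score_filter:
      lang: en
      min_score: 0.8
  # Video-Only: filter samples by properties of the videos themselves.
  - video_duration_filter:
      min_duration: 2       # seconds
      max_duration: 60
  # Text-Video: filter samples by how well the video frames match the caption.
  - video_frames_text_similarity_filter:
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.2
```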