Release v0.1.0, the first internal version for open-source release

@yxdyc yxdyc released this 11 Aug 05:31
· 222 commits to main since this release
d4ab729

Summarization - Table of Contents

  • Data-Juicer: A Data-Centric Text Processing System for Large Language Models
  • Table of Contents
    • Features
    • Prerequisites
    • Installation
    • Quick Start
      • Data Processing
      • Data Analysis
      • Data Visualization
      • Build Up Config Files
      • Preprocess raw data (Optional)
    • Documentation | 文档
    • Data Recipes
    • Demos
    • License
    • Contributing
    • References

Features

  • Broad Range of Operators: Ships with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and more.

  • Specialized Toolkits: Provides specialized toolkits such as the Text Quality Classifier, Dataset Splitter, Analysers, and Evaluators to extend your dataset handling capabilities.

  • Systematic & Reusable: Provides a systematic library of reusable config recipes and OPs, designed to work independently of specific datasets, models, or tasks.

  • Data-in-the-loop: Supports detailed data analysis with automated report generation for a deeper understanding of your dataset. Combined with real-time, multi-dimensional automatic evaluation, this enables a feedback loop at multiple stages of the LLM development process.

  • Comprehensive Processing Recipes: Offers tens of pre-built data processing recipes for pre-training, SFT, English (en), Chinese (zh), and other scenarios.

  • User-Friendly Experience: Designed for simplicity, with comprehensive documentation, quick-start guides, and demo configs. Existing configs are easy to adapt by adding or removing OPs.

  • Flexible & Extensible: Supports most data formats (e.g., jsonl, parquet, csv) and flexible combinations of OPs. You can also implement your own OPs for custom data processing.

  • Enhanced Efficiency: Provides a fast data processing pipeline with reduced memory usage.
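
To make the idea of combining OPs concrete, here is a minimal, self-contained Python sketch of how Mapper, Filter, and Deduplicator steps might be chained. It is an illustration only and does not use Data-Juicer's actual API: the names `whitespace_mapper`, `min_words_filter`, `exact_deduplicator`, and `run_pipeline`, as well as the `"text"` field, are all hypothetical.

```python
import hashlib
import re
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical sample type: one record with a "text" field (an assumption).
Sample = Dict[str, str]


def whitespace_mapper(sample: Sample) -> Sample:
    """Mapper-style OP: return an edited copy with collapsed whitespace."""
    return {**sample, "text": re.sub(r"\s+", " ", sample["text"]).strip()}


def min_words_filter(min_words: int) -> Callable[[Sample], bool]:
    """Filter-style OP factory: keep samples with at least `min_words` words."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"].split()) >= min_words
    return keep


def exact_deduplicator(samples: List[Sample]) -> List[Sample]:
    """Deduplicator-style OP: drop samples whose text was already seen."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def run_pipeline(samples: Iterable[Sample],
                 ops: List[Tuple[str, Callable]]) -> List[Sample]:
    """Apply the OPs in order; each entry is a (kind, callable) pair."""
    data = list(samples)
    for kind, op in ops:
        if kind == "mapper":
            data = [op(s) for s in data]
        elif kind == "filter":
            data = [s for s in data if op(s)]
        elif kind == "deduplicator":
            data = op(data)
        else:
            raise ValueError(f"unknown OP kind: {kind}")
    return data


raw = [
    {"text": "Hello   world,  this is   a test."},
    {"text": "Hello world, this is a test."},
    {"text": "too short"},
]
ops = [
    ("mapper", whitespace_mapper),
    ("filter", min_words_filter(3)),
    ("deduplicator", exact_deduplicator),
]
cleaned = run_pipeline(raw, ops)
print(cleaned)
```

Because each step is an independent callable, adding or removing an OP only changes the `ops` list, which mirrors the "adding or removing OPs from existing configs" idea described above. In Data-Juicer itself the pipeline is driven by config files; see the project documentation for the real interface.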