OpenDataLab

OpenDataLab provides access to numerous significant open-source datasets.

🔥🔥🔥 OpenDataLab provides an ecosystem of high-quality datasets for the community. It offers:

🌟 Extensive open data resources for AI models

● High-speed and simple way to access open datasets
● 7,700+ large-scale, high-quality open datasets for large models
● 1200+ Open datasets for Computer Vision
● 200+ Open datasets from CVPR
● Categorized datasets for hot topics

✨ Open-source data processing toolkits

● Data acquisition toolkits supporting large datasets
● Data acquisition toolkits supporting many kinds of tasks
● Open-source intelligent toolbox for labeling

💫 Dataset description language

● Format standardization
● DSDL: Dataset Description Language
● Define a CV dataset by DSDL
● 100+ CV datasets standardized by OpenDataLab

Check out our tutorial videos (in Chinese) to get started.


📣 We have launched an upgraded feature that lets authors upload datasets on their own. We invite you to use it to promote your open-source datasets, AI research results, and more, so that more people can find, obtain, and use your datasets.

For an introduction to the self-service dataset upload feature, see the help doc. You can create and share your dataset by following our guidelines. 💪

If you have any questions or run into problems, please feel free to contact us at [email protected].

Popular repositories

  1. MinerU Public

    A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

    Python 25.2k 1.9k

  2. PDF-Extract-Kit Public

    A Comprehensive Toolkit for High-Quality PDF Content Extraction

    Python 6.5k 435

  3. labelU Public

    A data annotation toolbox that supports image, audio, and video data.

    Python 980 99

  4. DocLayout-YOLO Public

    DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

    Python 804 58

  5. LabelLLM Public

    The Open-Source Data Annotation Platform

    TypeScript 641 53

  6. WanJuan1.0 Public

    WanJuan 1.0 multimodal corpus

    555 28

Repositories

Showing 10 of 38 repositories
  • OHR-Bench Public

    OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

    Python 58 11 1 0 Updated Feb 1, 2025
  • MinerU Public

    A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

    Python 25,171 AGPL-3.0 1,904 44 5 Updated Jan 27, 2025
  • labelU Public

    A data annotation toolbox that supports image, audio, and video data.

    Python 980 Apache-2.0 99 14 0 Updated Jan 22, 2025
  • labelU-Kit Public

    Data annotation component library, provided as NPM packages

    TypeScript 73 Apache-2.0 19 2 2 Updated Jan 22, 2025
  • LabelLLM Public

    The Open-Source Data Annotation Platform

    TypeScript 641 Apache-2.0 53 10 0 Updated Jan 17, 2025
  • OmniDocBench Public

    A Comprehensive Benchmark for Document Parsing and Evaluation

    Python 218 Apache-2.0 20 11 0 Updated Jan 17, 2025
  • DocLayout-YOLO Public

    DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

    Python 804 AGPL-3.0 58 7 1 Updated Jan 16, 2025
  • WanJuan3.0 Public

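    WanJuan3.0 ("万卷·丝路", "WanJuan Silk Road") is a comprehensive plain-text corpus that collects publicly available web information, literature, patents, and other materials from multiple countries and regions. Its total data size exceeds 1.2 TB and its total token count exceeds 300B, which places it at an internationally leading level. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese), each containing more than 150 GB of data.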

    11 MIT 0 0 0 Updated Jan 10, 2025
  • VHM Public

    VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

    Python 73 Apache-2.0 6 0 1 Updated Jan 8, 2025
  • UrBench Public

    [AAAI 2025] This repo contains the evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios”

    Python 9 1 0 0 Updated Jan 7, 2025
