A curated list of resources dedicated to Synthetic Data.
If you want to contribute to this list, read the contribution guidelines first. Please add your favourite synthetic data resource by raising a pull request.
Also, a listed repository should be deprecated if:
- The repository's owner explicitly states that the library is no longer maintained.
- It has had no commits for a long time (2-3 years).
Contents
- Research Summaries and Trends
- Tutorials
- Libraries
- Academic Papers
- Services
- Prominent Synthetic Data Research Labs
- Datasets
Introductions and Guides to Synthetic Data
Blogs and Newsletters
- The Unreasonable Effectiveness of Recurrent Neural Networks - Andrej Karpathy's intro to RNNs.
Videos and Online Courses
Open Source Generative Synthetic Data Models, Libraries and Frameworks | Back to Top
- gretel-synthetics - Generative models for structured and unstructured text, tabular, and multivariate time-series data featuring differentially private learning.
- SDV - Synthetic Data Generator for tabular, relational, and time series data.
- Synthea - Synthetic Patient Population Simulator.
- ydata-synthetic - Synthetic structured data generators.
- Contrastive Unpaired Translation - Contrastive unpaired image-to-image translation with faster and lighter training than CycleGAN.
- StyleGAN 3 - Official PyTorch implementation of StyleGAN3 from NeurIPS 2021.
- Jukebox - OpenAI's generative model for music.
- AirSim - A simulator for drones, cars and more, built on Unreal and Unity engines.
- Nvidia Dataset Synthesizer - NDDS is a UE4 plugin from NVIDIA to empower computer vision researchers to export high-quality synthetic images with metadata.
- OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
- Unity Perception - Perception toolkit for sim2real training and validation in Unity.
Academic Papers
- Modeling Tabular Data using Conditional GAN (2019) Xu et al. [pdf]
- Generative Modeling by Estimating Gradients of the Data Distribution (2021) Yang Song [pdf]
- Diffusion Models are Autoencoders (2021) S. Dieleman [pdf]
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015) J. Sohl-Dickstein et al. [pdf]
- Deep Learning with Differential Privacy (2016) Abadi et al. [pdf]
- An Efficient DP-SGD Mechanism for Large Scale NLP Models (2021) Dupuy et al. [pdf]
- PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (2018) Jordon et al. [pdf]
- Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence (2021) Cao et al. [pdf]
- Differentially Private Fine-tuning of Language Models (2022) Yu et al. [pdf]
Synthetic Data as an API with higher-level functionality such as model training, fine-tuning, and generation | Back to Top
- List of Synthetic Data Startups in 2021 - Not all of these necessarily have APIs.
Datasets
- HuggingFace Datasets - Library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
- Google Cloud Public Datasets - Publicly available and free machine learning and analytics datasets.
- Kaggle Datasets - Data science and machine learning datasets.
- /r/datasets - A place to share, find, and discuss Datasets.
License - CC0