amysteier/awesome-synthetic-data


awesome-synthetic-data

A curated list of resources dedicated to Synthetic Data

If you want to contribute to this list, read the contribution guidelines first. To add your favourite synthetic data resource, open a pull request.

A listed repository should be marked as deprecated if either of the following holds:

  • The repository's owner explicitly states that the library is not maintained.
  • The repository has had no commits for a long time (2 to 3 years).
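
The two deprecation criteria above can be sketched as a small check. This is only an illustration, not tooling that ships with this list: the function name `should_deprecate`, its parameters, and the choice of two years (the lower end of the 2 to 3 year range) as the cutoff are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: "a long time" is read as the lower bound of the 2-3 year range.
STALE_AFTER = timedelta(days=2 * 365)


def should_deprecate(
    last_commit: datetime,
    declared_unmaintained: bool,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if a listed repository meets either deprecation criterion.

    last_commit: timezone-aware timestamp of the most recent commit.
    declared_unmaintained: True if the owner explicitly says the library
        is not maintained.
    now: reference time; defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    is_stale = (now - last_commit) > STALE_AFTER
    return declared_unmaintained or is_stale
```

A maintainer could call this with the last commit date read from the repository page; how that date is obtained is left open here.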

Contents

Research Summaries and Trends

Back to Top

Tutorials

Back to Top

Reading Content

Back to Top

Introductions and Guides to Synthetic Data

Blogs and Newsletters

Videos and Online Courses

Back to Top

Libraries

Open Source Generative Synthetic Data Models, Libraries and Frameworks | Back to Top

Text, Tabular and Time-Series

  • gretel-synthetics - Generative models for structured and unstructured text, tabular, and multivariate time-series data, featuring differentially private learning.
  • SDV - Synthetic Data Generator for tabular, relational, and time series data.
  • Synthea - Synthetic Patient Population Simulator.
  • ydata-synthetic - Synthetic structured data generators.

Image

Audio

  • Jukebox - OpenAI's generative model for music.

Simulation

  • AirSim - Simulator for drones, cars and more, built on the Unreal and Unity engines.
  • NVIDIA Dataset Synthesizer (NDDS) - UE4 plugin from NVIDIA that lets computer vision researchers export high-quality synthetic images with metadata.
  • OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
  • Unity Perception - Perception toolkit for sim2real training and validation in Unity.

Academic Papers

Back to Top

Language Models

Generative Adversarial Networks (GANs)

  • Modeling Tabular Data using Conditional GAN (2019) Xu et al. [pdf]

Diffusion Models

  • Generative Modeling by Estimating Gradients of the Data Distribution (2021) Song [pdf]
  • Diffusion Models are Autoencoders (2021) Dieleman [pdf]
  • Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015) Sohl-Dickstein et al. [pdf]

Fair AI

Algorithmic Privacy

  • Deep Learning with Differential Privacy (2016) Abadi et al. [pdf]
  • An Efficient DP-SGD Mechanism for Large Scale NLP Models (2021) Dupuy et al. [pdf]
  • PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (2018) Jordon et al. [pdf]
  • Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence (2021) Cao et al. [pdf]
  • Differentially Private Fine-tuning of Language Models (2022) Yu et al. [pdf]

Services

Synthetic data as an API, with higher-level functionality such as model training, fine-tuning, and generation | Back to Top

Prominent Synthetic Data Research Labs

Back to Top

Datasets

Back to Top

  • HuggingFace Datasets - Library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
  • Google Cloud Public Datasets - Publicly available and free machine learning and analytics datasets.
  • Kaggle Datasets - Data science and machine learning datasets.
  • /r/datasets - A place to share, find, and discuss Datasets.

License

License - CC0
