From f6fef18c66674b01d602e321a8954198397e4a9a Mon Sep 17 00:00:00 2001
From: Jim Schwoebel <30424731+jim-schwoebel@users.noreply.github.com>
Date: Sat, 12 Jun 2021 18:25:47 -0400
Subject: [PATCH] Update README.md

---
 README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/README.md b/README.md
index ce6477d..49cdc3d 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,10 @@ There are two main types of audio datasets: speech datasets and audio event/musi
 * [DAPS Dataset](https://archive.org/details/daps_dataset) - DAPS consists of 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books (which provides about 14 minutes of data per speaker).
 * [Deep Clustering Dataset](https://www.merl.com/demos/deep-clustering) - Training deep discriminative embeddings to solve the cocktail party problem.
 * [DEMoS](https://zenodo.org/record/2544829) - 9365 emotional and 332 neutral samples produced by 68 native speakers (23 females, 45 males); 7 emotions: anger, sadness, happiness, fear, surprise, disgust, and the secondary emotion guilt.
+* [DES](http://kom.aau.dk/~tb/speech/Emotions/) - 4 speakers (2 males and 2 females); 5 emotions: neutral, surprise, happiness, sadness, and anger.
 * [DIPCO](https://arxiv.org/abs/1909.13447) - Dinner Party Corpus - Participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of 10 sessions, each 15 to 45 minutes long.
+* [EEKK](https://metashare.ut.ee/repository/download/4d42d7a8463411e2a6e4005056b40024a19021a316b54b7fb707757d43d1a889/) - 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger, and neutral.
+* [Emo-DB](http://emodb.bilderbar.info/index-1280.html) - 800 recordings spoken by 10 actors (5 males and 5 females); 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust.
 * [EmoFilm](https://zenodo.org/record/1326428) - 1115 audio instances of sentences extracted from various films.
 * [EmoSynth](https://zenodo.org/record/3727593) - 144 audio files labelled by 40 listeners; emotion (no speech) defined in terms of valence and arousal.
 * [Emotional Voices Database](https://github.com/numediart/EmoV-DB) - various emotions with 5 voice actors (amused, angry, disgusted, neutral, sleepy).
@@ -33,11 +36,15 @@ There are two main types of audio datasets: speech datasets and audio event/musi
 * [EmotionTTS](https://github.com/emotiontts/emotiontts_open_db) - Recordings and their associated transcriptions by a diverse group of speakers - 4 emotions: general, joy, anger, and sadness.
 * [Emov-DB](https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg!mYwUnI4K) - Recordings for 4 speakers - 2 males and 2 females; the emotional styles are neutral, sleepiness, anger, disgust, and amused.
 * [EMOVO](http://voice.fub.it/activities/corpora/emovo/index.html) - 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness.
+* [eNTERFACE05](http://www.enterface.net/enterface05/docs/results/databases/project2_database.zip) - Videos by 42 subjects from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness, and disgust.
 * [Free Spoken Digit Dataset](https://github.com/Jakobovski/free-spoken-digit-dataset) - 4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations.
 * [Flickr Audio Caption](https://groups.csail.mit.edu/sls/downloads/flickraudio/) - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.
 * [GEMEP corpus](https://www.unige.ch/cisa/gemep) - 10 actors portraying 10 states; 12 emotions: amusement, anxiety, cold anger (irritation), despair, hot anger (rage), fear (panic), interest, joy (elation), pleasure (sensory), pride, relief, and sadness. Plus, 5 additional emotions: admiration, contempt, disgust, surprise, and tenderness.
+* [IEMOCAP](https://sail.usc.edu/iemocap/iemocap_release.htm) - 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration, and neutral.
 * [ISOLET Data Set](https://data.world/uci/isolet) - This 38.7 GB dataset helps predict which letter-name was spoken, a simple classification task.
 * [JL corpus](https://www.kaggle.com/tli725/jl-corpus) - 2400 recordings of 240 sentences by 4 actors (2 males and 2 females); 5 primary emotions: angry, sad, neutral, happy, excited. 5 secondary emotions: anxious, apologetic, pensive, worried, enthusiastic.
+* [Keio-ESD](http://research.nii.ac.jp/src/en/Keio-ESD.html) - A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joyful, disgusting, downgrading, funny, worried, gentle, relief, indignation, shameful, etc.
+* [LEGO Corpus](https://www.ultes.eu/ressources/lego-spoken-dialogue-corpus/) - 347 dialogs with 9,083 system-user exchanges; emotions classified as garbage, non-angry, slightly angry, and very angry.
 * [Libriadapt](https://github.com/akhilmathurs/libriadapt) - Primarily designed to facilitate domain adaptation research for ASR models; it covers three types of domain shift in the data.
 * [Libri-CSS](https://github.com/chenzhuo1011/libri_css) - Derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
 * [LibriMix](https://github.com/JorisCos/LibriMix) - LibriMix is an open-source dataset for source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it. It will also enable cross-dataset experiments.
@@ -58,6 +65,8 @@ There are two main types of audio datasets: speech datasets and audio event/musi
 * [The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)](https://zenodo.org/record/1188976#.XrC7a5NKjOR) - RAVDESS contains 7356 files (total size: 24.8 GB) from 24 professional actors (12 female, 12 male), vocalizing two lexically matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions.
 * [sample_voice_data](https://github.com/jim-schwoebel/sample_voice_data) - 52 audio files per class (males and females) for testing purposes.
 * [SAVEE Dataset](http://kahlan.eps.surrey.ac.uk/savee/) - 4 male actors in 7 different emotions, 480 British English utterances in total.
+* [SEMAINE](https://semaine-db.eu/) - 95 dyadic conversations from 21 subjects, each conversing with a partner who plays one of four emotionally stereotyped characters; 5 FeelTrace annotations: activation, valence, dominance, power, intensity.
+* [SER Datasets](https://github.com/SuperKogito/SER-datasets) - A collection of datasets for the purpose of emotion recognition/detection in speech.
 * [SEWA](https://db.sewaproject.eu/) - More than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal.
 * [ShEMO](https://github.com/mansourehk/ShEMO) - 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native Persian speakers; 6 emotions: anger, fear, happiness, sadness, neutral, and surprise.
 * [SparseLibriMix](https://github.com/popcornell/SparseLibriMix) - An open-source dataset for source separation in noisy environments with variable overlap ratio. Due to insufficient noise material, this is a test-set-only version.
@@ -67,6 +76,7 @@ There are two main types of audio datasets: speech datasets and audio event/musi
 * [Spoken Wikipedia Corpora](https://nats.gitlab.io/swc/) - 38 GB in size, available in formats both with and without audio.
 * [Tatoeba](https://tatoeba.org/eng/downloads) - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
 * [Ted-LIUM](https://www.openslr.org/51/) - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).
+* [TESS](https://tspace.library.utoronto.ca/handle/1807/24487) - 2800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
 * [Thorsten dataset](https://github.com/thorstenMueller/deep-learning-german-tts/) - German-language dataset, 22,668 recorded phrases, 23 hours of audio, average phrase length 52 characters.
 * [TIMIT dataset](https://catalog.ldc.upenn.edu/LDC93S1) - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic, and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (requires payment).
 * [URDU-Dataset](https://github.com/siddiquelatif/urdu-dataset) - 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
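Several of the corpora above encode their labels directly in the filenames. As a minimal sketch of working with one of them, the snippet below loads the Free Spoken Digit Dataset and recovers each clip's digit and speaker from its `{digit}_{speaker}_{index}.wav` naming convention (e.g. `0_jackson_0.wav`); it assumes the repository has been cloned locally, that `librosa` is installed, and the `RECORDINGS_DIR` path is illustrative.

```python
# Minimal sketch: index the Free Spoken Digit Dataset by parsing its
# {digit}_{speaker}_{index}.wav filenames. RECORDINGS_DIR is an assumed
# local clone location, not part of the dataset's documentation.
from pathlib import Path

import librosa

RECORDINGS_DIR = Path("free-spoken-digit-dataset/recordings")  # assumed path

samples = []
for wav_path in sorted(RECORDINGS_DIR.glob("*.wav")):
    # e.g. "0_jackson_0" -> digit "0", speaker "jackson", take index "0"
    digit, speaker, _ = wav_path.stem.split("_")
    audio, sr = librosa.load(wav_path, sr=None)  # sr=None keeps the native rate
    samples.append({"digit": int(digit), "speaker": speaker,
                    "duration_s": len(audio) / sr})

print(f"{len(samples)} clips from {len({s['speaker'] for s in samples})} speakers")
```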