From 80a500a3c1e92a1ef89e1f563a6e2401837cc26e Mon Sep 17 00:00:00 2001
From: Wannaphong Phatthiyaphaibun
Date: Sun, 12 May 2024 19:18:03 +0700
Subject: [PATCH] Add Thai Dialect Corpus

---
 docs/tasks/speech-recognition.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/tasks/speech-recognition.md b/docs/tasks/speech-recognition.md
index 6555ff7..71b4577 100644
--- a/docs/tasks/speech-recognition.md
+++ b/docs/tasks/speech-recognition.md
@@ -13,7 +13,8 @@
 | Lotus Cell | Thai Speech corpus over the phone. (not full corpus) | 11 hours | CC BY-SA-NC 3.0 | NECTEC | [Mirror from @korakot: GitHub](https://github.com/korakot/corpus/releases/download/v1.0/LOTUS-cell-v1.0.zip) |
 | Thai Elderly Speech dataset by Data Wow and VISAI | Thai Elderly Speech dataset, consisting of 17 hours 11 minutes (19,200 files). The files are divided into 2 categories: Health care (health issues and services) and Smart Home (using Smart Home devices in household contexts). | 17 hours 11 minutes | CC BY-SA 4.0 | VISAI AI Company Limited and Data Wow Company Limited | [VISAI AI Company Limited and Data Wow Company Limited](https://github.com/VISAI-DATAWOW/Thai-Elderly-Speech-dataset/releases/tag/v1.0.0) |
 | FLEURS | Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. | | CC BY | Google | [huggingface](https://huggingface.co/datasets/google/fleurs) |
-| XTREME-S | The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. | | CC BY | Google | [huggingface](https://huggingface.co/datasets/google/xtreme_s) |
+| XTREME-S | The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. | | CC-BY-SA 4.0 | Google | [huggingface](https://huggingface.co/datasets/google/xtreme_s) |
+| Thai Dialect Corpus | Corpus of Central Thai dialect and three other Thai dialects (Khummuang, Korat, and Pattani). | | CC BY | Chulalongkorn University | [[Github](https://github.com/SLSCU/thai-dialect-corpus) |
 
 ### Software