This repo collects learning resources for speech recognition, including courses, books, tutorials, papers, and toolkits. (Continuously updated.)
- (Recommended) Automatic Speech Recognition (ASR) 2018-2019 Lectures, School of Informatics, University of Edinburgh [Website]
- Speech recognition, EECS E6870 - Spring 2016, Columbia University [Website]
- CS224N: Natural Language Processing with Deep Learning, Stanford [Website] [Video(Winter 2021)] [Video(Winter 2017)]
- CS224S: Spoken Language Processing (Winter 2021), Stanford [Website]
- DLHLP: Deep Learning for Human Language Processing, Spring 2020, Hung-yi Lee [Website] [Video(Spring 2020)]
- Microsoft DEV287x: Speech Recognition Systems, 2019 [Website]
- Speech Recognition: From Beginner to Expert (语音识别从入门到精通), 2019, Lei Xie (NOT FREE) [Website]
- Introduction to Digital Speech Processing (數位語音處理概論), National Taiwan University, Lin-shan Lee [Website]
- Fundamentals of Speech Recognition, Lawrence Rabiner, Biing-Hwang Juang, 1993 [Book]
- Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001 [Book]
- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Daniel Jurafsky & James H. Martin [Website] [Book 3rd Ed]
- Automatic Speech Recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014 [Book]
- Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, 1999 [Website] [Book]
- Analyzing Deep Learning: Speech Recognition in Practice (《解析深度学习:语音识别实践》), Dong Yu, Li Deng, Publishing House of Electronics Industry
- Kaldi Speech Recognition in Action (《Kaldi 语音识别实战》), Guoguo Chen, Publishing House of Electronics Industry
- Speech Recognition: Principles and Applications (《语音识别:原理与应用》), Qingyang Hong, Publishing House of Electronics Industry
- Fundamentals of Speech Recognition (《语音识别基本法》), Zhiyuan Tang, Publishing House of Electronics Industry
- Statistical Learning Methods (《统计学习方法》), Hang Li, Tsinghua University Press
- Speech Signal Processing (《语音信号处理》), Jiqing Han, Tsinghua University Press
- Speech Signal Processing (《语音信号处理》), Li Zhao, China Machine Press
- HMM: Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286. [Paper] (a minimal forward-algorithm sketch follows this list)
- EM: Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models[J]. International Computer Science Institute, 1998, 4(510): 126. [Paper] (a minimal GMM EM sketch follows this list)
- CTC: Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376. [Paper]
- WFST
- An Introduction to Weighted Automata in Machine Learning, Awni Hannun, 2021. [PDF] (a toy weighted-acceptor sketch follows this list)
- k2
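The forward algorithm at the core of Rabiner's HMM tutorial above is short enough to sketch directly. Below is a minimal NumPy illustration with made-up parameters (a 2-state, 3-symbol toy model), not code from the paper:

```python
import numpy as np

# Toy model: 2 hidden states, 3 observation symbols (all values invented).
pi = np.array([0.6, 0.4])                    # initial state distribution
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])                   # A[i, j] = P(state j | state i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])              # B[i, k] = P(symbol k | state i)

def forward(obs):
    """Return P(obs | model) by summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]                # initialization (t = 0)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]        # induction (t -> t + 1)
    return alpha.sum()                       # termination

print(forward([0, 1, 2]))                    # likelihood of a short symbol sequence
```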
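Likewise, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture, in the spirit of the Bilmes tutorial above; the data and initial values are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians (purely for illustration).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])

# Initial guesses for mixture weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[n, k] = P(component k | x_n).
    r = w * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)   # should roughly recover the two generating Gaussians
```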
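To make the weighted-automata material above concrete, here is a toy weighted acceptor in plain Python, scored in the tropical semiring (path weight = sum of arc weights, combination over paths = min). The states, labels, and weights are invented for illustration; production systems use OpenFst or k2 for the same ideas at scale.

```python
import math

# Arcs: (src_state, dst_state, label, weight). State 0 is the start state,
# state 2 is final. All values are invented for this toy example.
arcs = [
    (0, 1, "a", 0.5),
    (0, 1, "b", 1.5),
    (1, 1, "b", 0.3),
    (1, 2, "c", 0.2),
]
start, final = 0, {2}

def best_path_weight(labels):
    """Min total weight over paths from start to a final state accepting `labels`."""
    frontier = {start: 0.0}                        # state -> best weight so far
    for lab in labels:
        nxt = {}
        for (src, dst, arc_lab, w) in arcs:
            if arc_lab == lab and src in frontier:
                cand = frontier[src] + w           # tropical "times" = +
                if cand < nxt.get(dst, math.inf):  # tropical "plus" = min
                    nxt[dst] = cand
        frontier = nxt
    return min((w for s, w in frontier.items() if s in final), default=math.inf)

print(best_path_weight(["a", "b", "c"]))           # 0.5 + 0.3 + 0.2 = 1.0
print(best_path_weight(["a", "c"]))                # 0.5 + 0.2 = 0.7
```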
Listed in no particular order.
- kaldi [Github] [Doc]
- next-gen Kaldi [Github]
- k2: FSA/FST algorithms, differentiable, with PyTorch compatibility. [Github] [Doc]
- icefall: Speech recognition recipes using k2. [Github] [Doc]
- sherpa: Streaming and non-streaming ASR server for next-gen Kaldi. [Github] [Doc]
- sherpa-onnx: Real-time speech recognition using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, x86_64 servers, and WebSocket server/client, with C/C++, Python, Kotlin, C#, and Go APIs. [Github] [Doc]
- sherpa-ncnn: Real-time speech recognition using next-gen Kaldi with ncnn, without an Internet connection. Supports iOS, Android, Raspberry Pi, VisionFive2, etc. [Github] [Doc]
- lhotse: Tools for handling speech data in machine learning projects. [Github] [Doc]
- snowfall (deprecated) [Github]
- FunASR [Github] [Doc]
- A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models.
- Gao, Z., Li, Z., Wang, J., Luo, H., Shi, X., Chen, M., ... & Zhang, S. (2023). FunASR: A Fundamental End-to-End Speech Recognition Toolkit. arXiv preprint arXiv:2305.11013.
- espnet/espnet2 [Github]
- Watanabe S, Hori T, Karita S, et al. Espnet: End-to-end speech processing toolkit[J]. arXiv preprint arXiv:1804.00015, 2018.
- wenet [Github]
- Yao Z, Wu D, Wang X, et al. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit[J]. arXiv preprint arXiv:2102.01547, 2021.
- Zhang B, Wu D, Yao Z, et al. Unified streaming and non-streaming two-pass end-to-end model for speech recognition[J]. arXiv preprint arXiv:2012.05481, 2020.
- Wu D, Zhang B, Yang C, et al. U2++: Unified two-pass bidirectional end-to-end model for speech recognition[J]. arXiv preprint arXiv:2106.05642, 2021.
- NeMo [Github] [Doc]
- NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS).
- Fairseq [Github] [Doc]
- Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
- speechbrain [Github] [Doc]
- SpeechBrain is an open-source and all-in-one conversational AI toolkit based on PyTorch.
- paddlespeech [Github] [Doc]
- PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
- eesen (R.I.P.) [Github]
- Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding[C]//2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015: 167-174.
- warp_ctc [Github]
- A fast parallel implementation of CTC, on both CPU and GPU (a minimal CTC loss usage sketch follows at the end of this list).
- HTK
- Sphinx
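Since warp_ctc above implements the CTC criterion from the Graves et al. paper cited in the tutorials section, a minimal usage sketch may help. This one uses PyTorch's built-in torch.nn.CTCLoss rather than warp_ctc itself, with arbitrary shapes and random data purely for illustration:

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 20     # input frames, batch size, output classes (index 0 = blank)
S = 10                  # target label length per utterance

logits = torch.randn(T, N, C, requires_grad=True)        # stand-in acoustic model output
log_probs = logits.log_softmax(dim=2)                    # CTCLoss expects log-probs of shape (T, N, C)
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # labels 1..C-1 (0 is reserved for blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                          # gradients flow back to `logits`
print(loss.item())
```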