The code is implemented with PyTorch Lightning. We use Hydra for configuration and Weights & Biases for logging.
conda env create -f environment.yml
Please change the environment prefix, if necessary, and follow the instructions in fairseq_manual/setup_data_utils.py to set up fairseq.
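If the environment should live in a non-default location, the prefix entry of environment.yml can be edited before running the command above; the path shown here is only a placeholder:

```yaml
# prefix entry in environment.yml -- replace with the desired environment location
prefix: /path/to/conda/envs/<env_name>
```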
- LRS3, VoxCeleb2, and LRS2 can be downloaded using the following links:
- The landmarks for face alignment can be obtained using the following face detection and face alignment repos: RetinaFace and 2-D FAN. Alternatively, they can be downloaded from this repo.
- The following command can be run to crop the mouths. Please ensure that the name of each landmark file matches that of the corresponding video (except for the file extension):
python preprocessing/extract_mouths.py --src_dir ${SOURCE_DIR} --tgt_dir ${TARGET_DIR} --landmarks_dir ${LANDMARKS_DIR}
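A hypothetical invocation (the directory paths below are placeholders, not paths assumed by the repo):

python preprocessing/extract_mouths.py --src_dir /data/lrs3/videos --tgt_dir /data/lrs3/cropped_mouths --landmarks_dir /data/lrs3/landmarks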
- The videos and the audio should be placed in directories whose names are defined in conf/data/default.yaml (alternatively, the arguments for the directory names can be changed). The root of the directory paths should be set in data/dataset/paths; a purely illustrative layout is sketched below.
- Download the CSV files with the data paths and tokenised labels for the LRS3 test set, LRS3 trainval set, LRS3 training set, LRS3 validation set, and LRS3 + VoxCeleb2.
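For orientation only, a hypothetical directory layout might look as follows; the actual sub-directory names are those defined in conf/data/default.yaml and may differ:

```
${DATA_ROOT}/
├── lrs3/
│   ├── video/        # mouth crops produced by preprocessing/extract_mouths.py
│   └── audio/        # corresponding .wav files
└── vox2/
    ├── video/
    └── audio/
```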
The following command can be run to produce the transcription for the example video in the repo (example.avi and example.wav). Please make sure to download a Base model from below and provide the path to the .pth file as shown:
python demo.py model.pretrained_model_path=${MODEL_PATH}
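For example, with a hypothetical checkpoint location:

python demo.py model.pretrained_model_path=checkpoints/base_lrs3.pth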
We provide the checkpoints necessary to reproduce the main results (see Table 3 of the paper). The corresponding bash scripts can be used to run the inference code. Please fill in the missing values in each .sh script (e.g., set the path of the downloaded test-data CSV for the "data.dataset.test_csv" argument); an illustrative example is given after the tables below.
Low-resource setting (word error rate, %; V = visual-only, A = audio-only, AV = audio-visual):

Model | Pre-training Dataset | V (%) | A (%) | AV (%) | Checkpoint | Bash Script |
---|---|---|---|---|---|---|
Base | LRS3 | 36.0 | 3.2 | 3.0 | Download | scripts/test/base_low_resource_lrs3.sh |
Base Plus | LRS3+Vox2-en | 28.4 | 2.6 | 2.5 | Download | scripts/test/baseplus_low_resource_lrs3vox2.sh |
Large | LRS3+Vox2-en | 26.9 | 2.4 | 2.4 | Download | scripts/test/large_low_resource_lrs3vox2.sh |
High-resource setting (word error rate, %; V = visual-only, A = audio-only, AV = audio-visual):

Model | Pre-training Dataset | V (%) | A (%) | AV (%) | Checkpoint | Bash Script |
---|---|---|---|---|---|---|
Base | LRS3 | 34.3 | 1.9 | 1.6 | Download | scripts/test/base_high_resource_lrs3.sh |
Base Plus | LRS3+Vox2-en | 26.5 | 1.6 | 1.3 | Download | scripts/test/baseplus_high_resource_lrs3vox2.sh |
Large | LRS3+Vox2-en | 22.3 | 1.2 | 1.1 | Download | scripts/test/large_high_resource_lrs3vox2.sh |
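As an illustration, the overrides to fill in inside a test script typically look like the following; both paths are placeholders, and the checkpoint key is assumed here to be the same model.pretrained_model_path used by the demo:

data.dataset.test_csv=/path/to/lrs3_test.csv
model.pretrained_model_path=/path/to/large_lrs3vox2.pth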
The following scripts can be used to pre-train each of our models with our self-supervised approach. Please note that we use the CTC and attention losses only to "probe" the performance of the representations throughout training; crucially, no corresponding gradient is passed back to the encoder, which would otherwise corrupt the self-supervised nature of the task (a gradient-stopping sketch is given after the table below).
Model | Pre-training Dataset | Checkpoint | Bash Script |
---|---|---|---|
Base | LRS3 | Download | scripts/self/base_lrs3.sh |
Base Plus | LRS3+Vox2-en | Download | scripts/self/baseplus_lrs3vox2.sh |
Large | LRS3+Vox2-en | Download | scripts/self/large_lrs3vox2.sh |
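A minimal sketch of the gradient-stopping idea described above, assuming a generic encoder and probe heads; the module and variable names are illustrative and do not correspond to this codebase:

```python
import torch
import torch.nn.functional as F

def probe_losses(encoder, ctc_head, att_decoder, video, targets, target_lengths):
    # Self-supervised forward pass: gradients flow into the encoder as usual.
    features = encoder(video)

    # Detach before the probes so that the CTC/attention losses only train the
    # probe heads and never update the encoder.
    probe_features = features.detach()

    # CTC probe: expects log-probabilities of shape (T, B, vocab).
    log_probs = ctc_head(probe_features).log_softmax(-1)
    input_lengths = torch.full(
        (log_probs.size(1),), log_probs.size(0), dtype=torch.long
    )
    ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)

    # Attention probe: a small decoder producing logits of shape (B, L, vocab).
    att_logits = att_decoder(probe_features, targets)
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), targets)

    return ctc_loss, att_loss
```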
The following scripts can be used to train each of the models for the low- and high-resource settings. To obtain the best results (those presented in Table 3 of the paper), the path to the self-supervised pre-trained model needs to be specified in the bash scripts ("model.pretrained_model_path"); an example override is shown after the tables below. The model can also be trained from scratch by leaving that argument blank.
Low-resource setting:

Model | Dataset | Bash Script |
---|---|---|
Base | LRS3 | scripts/semi/base_low_resource_lrs3.sh |
Base Plus | LRS3+Vox2-en | scripts/semi/baseplus_low_resource_lrs3vox2.sh |
Large | LRS3+Vox2-en | scripts/semi/large_low_resource_lrs3vox2.sh |
High-resource setting:

Model | Dataset | Bash Script |
---|---|---|
Base | LRS3 | scripts/semi/base_high_resource_lrs3.sh |
Base Plus | LRS3+Vox2-en | scripts/semi/baseplus_high_resource_lrs3vox2.sh |
Large | LRS3+Vox2-en | scripts/semi/large_high_resource_lrs3vox2.sh |
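For example, to fine-tune the Base model from its self-supervised checkpoint, the corresponding argument in scripts/semi/base_low_resource_lrs3.sh would be set along the following lines (the checkpoint path is a placeholder):

model.pretrained_model_path=/path/to/base_lrs3_ssl_checkpoint.pth

Leaving the argument blank instead trains the model from scratch.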
If you find this repo useful for your research, please consider citing the following:
@article{haliassos2024unified,
title={Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs},
author={Haliassos, Alexandros and Mira, Rodrigo and Chen, Honglie and Landgraf, Zoe and Petridis, Stavros and Pantic, Maja},
journal={arXiv preprint arXiv:2411.02256},
year={2024}
}