Add reproduction for SLAM-Omni #189

Closed

wants to merge 170 commits

170 commits
d66af4a
init
cwx-worst-one Sep 18, 2024
dedd657
9.20
cwx-worst-one Sep 20, 2024
b6a8478
9.20
cwx-worst-one Sep 20, 2024
f751db8
9.21
cwx-worst-one Sep 21, 2024
d6d6d97
9.22
cwx-worst-one Sep 22, 2024
76b439a
9.23
cwx-worst-one Sep 23, 2024
0f3abac
Update finetune_cwx.sh script
cwx-worst-one Sep 23, 2024
3233ce5
Update finetune_cwx.sh script: Adjust dataset split size and enable f…
cwx-worst-one Sep 23, 2024
e4dde82
9.23
cwx-worst-one Sep 23, 2024
9f49036
9.23
cwx-worst-one Sep 23, 2024
bf0d4c6
9.23
cwx-worst-one Sep 23, 2024
1c19ec8
generate
cwx-worst-one Sep 23, 2024
96721e2
9.23 inference has some issues
cwx-worst-one Sep 23, 2024
d54a5eb
9.24
cwx-worst-one Sep 24, 2024
e755212
fix padding_token in snac_utils.py
cwx-worst-one Sep 24, 2024
b0e404f
9.24
cwx-worst-one Sep 24, 2024
6e89942
track for layer loss
cwx-worst-one Sep 25, 2024
4ea1cc7
docker
cwx-worst-one Sep 25, 2024
8d1be75
Update Dockerfile and pyproject.toml
cwx-worst-one Sep 25, 2024
0c3e407
Update pyproject.toml to allow direct references in metadata
cwx-worst-one Sep 25, 2024
cf3d4a6
Update requirements.txt
cwx-worst-one Sep 25, 2024
9481720
Update requirements.txt to remove unused dependencies
cwx-worst-one Sep 25, 2024
4d96076
Remove unused dependencies from requirements.txt
cwx-worst-one Sep 25, 2024
5eddade
1
cwx-worst-one Sep 25, 2024
11ab973
1
cwx-worst-one Sep 25, 2024
bea2e89
1
cwx-worst-one Sep 25, 2024
f0311b7
1
cwx-worst-one Sep 25, 2024
7ae8d7a
9.25
cwx-worst-one Sep 25, 2024
e87dc3d
9.25
cwx-worst-one Sep 25, 2024
bae1714
9.26
cwx-worst-one Sep 26, 2024
14faf30
9.26
cwx-worst-one Sep 26, 2024
38fba5b
9.26
cwx-worst-one Sep 26, 2024
efe6351
9.29
cwx-worst-one Sep 29, 2024
fdd82ec
9.29
cwx-worst-one Sep 29, 2024
7c973ba
9.29
cwx-worst-one Sep 29, 2024
621e92a
9.29
cwx-worst-one Sep 29, 2024
3fc8b61
9.30
cwx-worst-one Sep 30, 2024
6fc689d
9.30
cwx-worst-one Sep 30, 2024
a708b19
9.30
cwx-worst-one Sep 30, 2024
5e18879
9.30
cwx-worst-one Sep 30, 2024
59dcd16
10.7
cwx-worst-one Oct 7, 2024
7e84708
10.7
cwx-worst-one Oct 7, 2024
e4a6003
10.7
cwx-worst-one Oct 7, 2024
1362515
10.7
cwx-worst-one Oct 7, 2024
cef91e7
10.8
cwx-worst-one Oct 8, 2024
b29bb9f
10.8
cwx-worst-one Oct 8, 2024
df15ad0
10.8
cwx-worst-one Oct 8, 2024
a2fce40
10.8
cwx-worst-one Oct 8, 2024
d145773
10.8
cwx-worst-one Oct 8, 2024
ac6155c
debug !!!
cwx-worst-one Oct 9, 2024
141d87e
slam-omni v0
cwx-worst-one Oct 9, 2024
d8a1d54
10.9
cwx-worst-one Oct 9, 2024
195c10c
10.9
cwx-worst-one Oct 9, 2024
3fbe0c6
wonderful day!
cwx-worst-one Oct 9, 2024
888e209
10.10
cwx-worst-one Oct 10, 2024
edcf096
10.10
cwx-worst-one Oct 10, 2024
1e6d5f1
10.10
cwx-worst-one Oct 10, 2024
1871b47
10.10
cwx-worst-one Oct 10, 2024
2626a0a
10.10
cwx-worst-one Oct 10, 2024
f501eda
10.10
cwx-worst-one Oct 10, 2024
aabfd02
10.10
cwx-worst-one Oct 10, 2024
3b62b5e
10.11
cwx-worst-one Oct 11, 2024
0a4f6d1
10.12
cwx-worst-one Oct 12, 2024
0c15b6e
10.12
cwx-worst-one Oct 12, 2024
1dc1905
life
cwx-worst-one Oct 12, 2024
e1c311f
sadness
cwx-worst-one Oct 12, 2024
ed8bfb0
10.14
cwx-worst-one Oct 14, 2024
655a0f6
Merge pull request #151 from cwx-worst-one/main
cwx-worst-one Oct 14, 2024
ec86c78
Merge pull request #152 from X-LANCE/main
cwx-worst-one Oct 14, 2024
c1a270a
enjoy yourself
cwx-worst-one Oct 14, 2024
4b418b0
pride
cwx-worst-one Oct 14, 2024
c98f1c2
1
cwx-worst-one Oct 14, 2024
c72da32
whisper support
cwx-worst-one Oct 14, 2024
fc0983a
Gluttony
cwx-worst-one Oct 14, 2024
664e30f
Gluttony
cwx-worst-one Oct 14, 2024
7d5e142
Gluttony
cwx-worst-one Oct 14, 2024
b24808d
lust
cwx-worst-one Oct 15, 2024
832e9ad
[cwx-worst-one] add streaming inference
cwx-worst-one Oct 15, 2024
4372ac1
Merge pull request #155 from X-LANCE/s2s
cwx-worst-one Oct 16, 2024
ab22dcb
[cwx-worst-one] Agony
cwx-worst-one Oct 16, 2024
fee1421
sloth
cwx-worst-one Oct 18, 2024
cecd5d7
Merge branch 'dev-cwx-my' into dev-cwx
cwx-worst-one Oct 18, 2024
ff8d2ee
Merge pull request #158 from X-LANCE/dev-cwx
cwx-worst-one Oct 18, 2024
ba419af
agony
cwx-worst-one Oct 19, 2024
2d39d26
agony
cwx-worst-one Oct 19, 2024
ec10aa2
greed
cwx-worst-one Oct 22, 2024
18fc658
greed
cwx-worst-one Oct 22, 2024
bb478fe
greed
cwx-worst-one Oct 22, 2024
6b8f264
greed
cwx-worst-one Oct 22, 2024
9484d5e
greed
cwx-worst-one Oct 22, 2024
77a6daf
add cosyvoice
cwx-worst-one Oct 22, 2024
eaf32b4
lust
cwx-worst-one Oct 23, 2024
24c9c56
lust
cwx-worst-one Oct 23, 2024
8e79221
lust
cwx-worst-one Oct 23, 2024
2fdab67
lust
cwx-worst-one Oct 23, 2024
9421566
envy
cwx-worst-one Oct 25, 2024
1004538
sloth
cwx-worst-one Oct 26, 2024
ea1bb36
add group decoding for CosyVoice and add logging info
cwx-worst-one Oct 28, 2024
b010300
1
cwx-worst-one Oct 28, 2024
2dec8b0
update audio repetition penalty
cwx-worst-one Oct 28, 2024
9b419d8
update online inference of CV
cwx-worst-one Oct 29, 2024
9d9647f
add readme
cwx-worst-one Oct 30, 2024
cef5d6e
modify shell script
cwx-worst-one Oct 30, 2024
22b9fbf
fix wandb
cwx-worst-one Oct 30, 2024
1d15626
update generate
cwx-worst-one Oct 30, 2024
ee9be97
1
cwx-worst-one Oct 30, 2024
a2a6f1f
update
cwx-worst-one Nov 1, 2024
3152858
update generate
cwx-worst-one Nov 1, 2024
c1001ac
update generate
cwx-worst-one Nov 1, 2024
b760507
update generate
cwx-worst-one Nov 4, 2024
90b87ee
1
cwx-worst-one Nov 5, 2024
431ad19
update script
cwx-worst-one Nov 6, 2024
5a11d43
update
cwx-worst-one Nov 15, 2024
ae2d3f7
11.17
cwx-worst-one Nov 17, 2024
5a0a825
11.18
cwx-worst-one Nov 18, 2024
e57d15a
11.18
cwx-worst-one Nov 18, 2024
3b0fcbc
update prompt
cwx-worst-one Nov 20, 2024
5f93f6f
file clean
cwx-worst-one Nov 26, 2024
0eed70e
update re
cwx-worst-one Nov 26, 2024
e6e490d
file clean
cwx-worst-one Nov 26, 2024
54bc3a2
11.26
cwx-worst-one Nov 26, 2024
21e0362
update
cwx-worst-one Nov 26, 2024
ef4bb63
clean script
cwx-worst-one Nov 27, 2024
bb67424
clean
cwx-worst-one Nov 27, 2024
0a2c107
fix
cwx-worst-one Nov 27, 2024
beae32c
clean
cwx-worst-one Nov 28, 2024
9f24ac0
clean
cwx-worst-one Nov 28, 2024
8b634c9
clean
cwx-worst-one Nov 28, 2024
2184d2c
1
cwx-worst-one Dec 3, 2024
6d4942c
update
cwx-worst-one Dec 3, 2024
212993c
update
cwx-worst-one Dec 3, 2024
2453b58
clean
cwx-worst-one Dec 3, 2024
b792ee2
update
cwx-worst-one Dec 3, 2024
57ace40
update
cwx-worst-one Dec 4, 2024
9c197c1
update asr
cwx-worst-one Dec 7, 2024
e3f810e
add readme
cwx-worst-one Dec 7, 2024
4df702d
update
cwx-worst-one Dec 11, 2024
350a115
update
cwx-worst-one Dec 12, 2024
d055338
update
cwx-worst-one Dec 12, 2024
98a1bc8
update readme
cwx-worst-one Dec 17, 2024
ce7ccfd
update download
cwx-worst-one Dec 17, 2024
a1519a7
update
cwx-worst-one Dec 18, 2024
4f0edef
update requirement
cwx-worst-one Dec 18, 2024
ca1fb06
update readme
cwx-worst-one Dec 18, 2024
2cdd57b
clean
cwx-worst-one Dec 18, 2024
f47b449
clean
cwx-worst-one Dec 18, 2024
b3d0f6d
update script
cwx-worst-one Dec 18, 2024
13946c8
update
cwx-worst-one Dec 18, 2024
6f58c72
update
cwx-worst-one Dec 18, 2024
6c2aef5
update multi-round
cwx-worst-one Dec 19, 2024
07ca5be
update readme
cwx-worst-one Dec 20, 2024
2b2f33f
update readme
cwx-worst-one Dec 20, 2024
85191fb
update readme
cwx-worst-one Dec 22, 2024
bd5e5aa
Merge pull request #183 from X-LANCE/main
cwx-worst-one Dec 24, 2024
c1e26c1
feat: add encoder_fairseq_dir path to fine-tuning and inference scripts
cwx-worst-one Dec 24, 2024
42514c2
Revert "merge latest main branch"
cwx-worst-one Dec 24, 2024
a000404
Merge pull request #184 from X-LANCE/revert-183-main
cwx-worst-one Dec 24, 2024
569590c
fix: update num_latency_tokens to 0 and set OMP_NUM_THREADS in finetu…
cwx-worst-one Dec 24, 2024
58b1ed0
Revert "Revert "merge latest main branch""
cwx-worst-one Dec 24, 2024
207d299
Merge pull request #185 from X-LANCE/revert-184-revert-183-main
cwx-worst-one Dec 24, 2024
737cf76
update
cwx-worst-one Dec 27, 2024
2a401f4
fix: set num_latency_tokens to 0 in inference and pretraining scripts
cwx-worst-one Jan 17, 2025
b08b681
update
cwx-worst-one Jan 21, 2025
425d438
update main readme
cwx-worst-one Jan 21, 2025
26f3832
Update README.md
cwx-worst-one Jan 21, 2025
98797c2
Merge pull request #188 from X-LANCE/main
cwx-worst-one Jan 21, 2025
221e893
update
cwx-worst-one Jan 21, 2025
ec7fb64
Update README.md
cwx-worst-one Jan 22, 2025
229765a
Update README.md
cwx-worst-one Jan 22, 2025
4830fe1
update train_utils
cwx-worst-one Jan 22, 2025
5 changes: 5 additions & 0 deletions .gitignore
@@ -3,13 +3,18 @@ __pycache__
.ipynb_checkpoints
.vscode
debug.py
debug.ipynb
debug.sh
.idea/*
transformers
wandb/
log/
*.log
outputs/
data/
jobs/
debug/
audio/

examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b_noself.sh
examples/asr_librispeech/scripts/decode_hubert_xtralarge_linear_vicuna_7b_copy.sh
31 changes: 30 additions & 1 deletion README.md
@@ -28,6 +28,22 @@ developers to train custom multimodal large language model (MLLM), focusing on
6. [Citation](#citation)

# News
- [Update Jan. 22, 2025] 🔥🔥🔥 Full reproduction of [SLAM-Omni](examples/s2s/README.md) is now supported.
![](docs/slam-omni-model.png)
- SLAM-Omni is a **timbre-controllable** voice interaction system that requires only **single-stage training** and minimal resources to achieve high-quality, end-to-end speech dialogue, supporting multi-turn conversations in both Chinese and English. ([paper](https://arxiv.org/abs/2412.15649), [demo](https://slam-omni.github.io))
- We have fully reproduced the **training and inference** processes of SLAM-Omni and open-sourced all related training datasets. The provided code framework theoretically supports all codec-based spoken dialogue models. Additionally, we offer the reproduction code for [Mini-Omni](https://github.com/gpt-omni/mini-omni).

<table class="center">
<tr>
<td width=50% style="border: none">
<video controls autoplay loop src="https://github.com/user-attachments/assets/73597edb-0d66-453b-b10c-8cf8dd3cae18" muted="false"></video>
</td>
<td width=50% style="border: none">
<video controls autoplay loop src="https://github.com/user-attachments/assets/7a797491-0509-4da8-8662-f2107bd8856a" muted="false"></video>
</td>
</tr>
</table>

- [Update Nov. 17, 2024] Recipes for [LLM-Based Contextual ASR](examples/contextual_asr/README.md) have been supported.
- [Update Nov. 5, 2024] Recipes for [speech emotion captioning (SEC)](examples/sec_emotioncaps/README.md) with [emotion2vec](https://github.com/ddlBoJack/emotion2vec) as the encoder have been supported.
- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) with [EAT](https://github.com/cwx-worst-one/EAT) as the encoder have been supported.
@@ -94,13 +110,17 @@ We provide reference implementations of various LLM-based speech, audio, and music
- Text-to-Speech (TTS)
- [VALL-E-X](examples/vallex/README.md)
- [Speech Emotion Captioning (SEC)](examples/sec_emotioncaps/README.md)
- Voice Interaction System
- [SLAM-Omni](examples/s2s/README.md)

- **Audio Task**
- [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
- [SLAM-AAC](examples/slam_aac/README.md)
- [DRCap](examples/drcap_zeroshot_aac/README.md)

- Spatial Audio Understanding
- [BAT](examples/seld_spatialsoundqa/README.md)

- **Music Task**
- [Music Caption (MC)](examples/mc_musiccaps/README.md)

@@ -163,6 +183,15 @@ CoT-ST:
}
```

SLAM-Omni:
```bibtex
@article{chen2024slam,
title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
journal={arXiv preprint arXiv:2412.15649},
year={2024}
}
```

## Audio Task
SLAM-AAC:
@@ -191,4 +220,4 @@ BAT:
journal={Proc. ICML},
year={2024}
}
```
Binary file added docs/slam-omni-model.png
147 changes: 147 additions & 0 deletions examples/s2s/README.md
@@ -0,0 +1,147 @@
# SLAM-Omni
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![arXiv](https://img.shields.io/badge/arXiv-2412.15649-B31B1B.svg)](https://arxiv.org/abs/2412.15649) [![GitHub Demo Page](https://img.shields.io/badge/Github-Demo%20Page-orange.svg)](https://slam-omni.github.io/) [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

(*Reproduction of the [paper](https://arxiv.org/abs/2412.15649) SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training.*)

## Environment Setup
Set up the environment using the following commands after preparing the SLAM-LLM environment:
```bash
pip install -r ./examples/s2s/requirements.txt
```

Alternatively, you can use our provided Docker image:
```bash
docker pull worstchan/slam-omni:v0
docker run -it --gpus all --name slam-omni worstchan/slam-omni:v0 /bin/bash
```

## Data Preparation

Our project supports two data formats: **Parquet** and **JSONL**. The open-source datasets are available on the Hugging Face Hub in **Parquet** format. Example usage is provided in [this notebook](./demo/demo_data/demo.ipynb).

### Supported Datasets
We provide three re-synthesized datasets for SLAM-Omni training:
- [VoiceAssistant-400K](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni): Single-round English dialogue dataset.
- [UltraChat-300K](https://huggingface.co/datasets/worstchan/UltraChat-300K-SLAM-Omni): Multi-round English dialogue dataset.
- [Belle_1.4M](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni): Multi-round Chinese dialogue dataset.

#### Usage
You can load any of these datasets using the following code:
```python
from datasets import load_dataset

# Replace "DATASET_NAME" with one of the following:
# - "worstchan/VoiceAssistant-400K-SLAM-Omni"
# - "worstchan/UltraChat-300K-SLAM-Omni"
# - "worstchan/Belle_1.4M-SLAM-Omni"

ds = load_dataset("DATASET_NAME")
```

### JSONL
We also support the JSONL format for its concise structure. Below is an example:
```jsonl
{"key": "1", "source_wav": "/xxx/1.wav", "source_text": "Can you recommend some Chinese food for me?", "target_wav": "/xxx/1.wav", "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu for a mix of flavors and textures in Chinese cuisine. These dishes offer a good balance of savory, spicy, and crispy elements."}
```
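
For reference, here is a minimal Python sketch for iterating over records in this format (the file path is a placeholder, and the field names simply mirror the example record above):

```python
import json

# Minimal sketch: stream SLAM-Omni-style JSONL records.
# "data.jsonl" is a placeholder path; field names follow the example above.
with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        print(sample["key"], sample["source_text"], "->", sample["target_text"])
```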

## Checkpoints
We reproduced the single-stage fine-tuning results of SLAM-Omni with a group size of **3**. The following checkpoints are available for download:
- [Single-Round Dialogue (English)](https://drive.google.com/drive/folders/1ZmM1h5ZTvS-piuN-msmctmZdi51GWLAu?usp=sharing): Trained on VoiceAssistant-400K.
- [Multi-Round Dialogue (English)](https://drive.google.com/drive/folders/1xBNrqR2LWC0uEjezjx4aUgdsbstisboS?usp=sharing): Trained on VoiceAssistant-400K and UltraChat-300K.
- [Multi-Round Dialogue (Chinese)](https://drive.google.com/drive/folders/1sExIp-UDdL37gb-mh9YlhuDIib0-wUVP?usp=sharing): Trained on Belle_1.4M.


## Training

You can pre-train the S2S model using TTS or ASR tasks with our provided scripts, though we recommend proceeding directly to fine-tuning. Alternatively, you may directly train a TTS or ASR model under the SLAM-Omni framework. For detailed instructions, refer to the [pre-training README](./scripts/pretrain/README.md).

### Fine-tuning
We provide two primary fine-tuning options for **SLAM-Omni** modeling:
```bash
# Fine-tune with grouping strategy (Recommended)
bash ./examples/s2s/scripts/finetune/finetune_s2s_group.sh

# Fine-tune without grouping
bash ./examples/s2s/scripts/finetune/finetune_s2s.sh
```

We also include scripts for reproducing [Mini-Omni](https://github.com/gpt-omni/mini-omni). Note that this requires the original [VoiceAssistant-400K](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K) dataset for training:
```bash
bash ./examples/s2s/scripts/finetune/mini-omni/finetune_s2s.sh
```

#### Note💫
Our framework theoretically supports the training of **all codec-based spoken dialogue models**. Simply re-synthesize the target tokens (e.g., CosyVoice2 tokens) during training for compatibility.

## Inference
We provide scripts for both **online** and **batch** inference. You can use the trained model or the provided checkpoints for inference. For detailed guidance, refer to the [inference README](./scripts/inference/README.md).



### Online Inference
Run the following commands for real-time inference:

```bash
# Multi-turn (Recommended)
bash ./examples/s2s/scripts/inference/inference_s2s_online_multi-round.sh

# Single-turn
bash ./examples/s2s/scripts/inference/inference_s2s_online.sh
```

For Mini-Omni modeling, use the following commands:
```bash
# Single-turn non-streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online.sh

# Single-turn streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online_stream.sh
```


### Batch Inference

For batch inference, ensure the data format matches the training format (**Parquet** or **JSONL**). Use the following commands:

```bash
# SLAM-Omni framework
bash ./examples/s2s/scripts/inference/inference_s2s_batch.sh

# Mini-Omni framework
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_batch.sh
```

## TODO
- [ ] Add evaluation scripts.
- [ ] Add streaming inference scripts for SLAM-Omni.


<!-- ## Gradio Demo -->

## Citation
SLAM-Omni:
```bibtex
@article{chen2024slam,
title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
journal={arXiv preprint arXiv:2412.15649},
year={2024}
}
```
Mini-Omni:
```bibtex
@article{xie2024mini,
title={Mini-omni: Language models can hear, talk while thinking in streaming},
author={Xie, Zhifei and Wu, Changqiao},
journal={arXiv preprint arXiv:2408.16725},
year={2024}
}
```

## Acknowledgement
- We borrow some code from [Mini-Omni](https://github.com/gpt-omni/mini-omni) for SNAC-based modeling.
- We borrow some code from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for the vocoder.


## License
Our code is released under the MIT License. The Chinese dialogue model is licensed under GPL-3.0 due to its use of Belle data and is intended for research purposes only.
Binary file added examples/s2s/audio_prompt/en/prompt_1.wav
Binary file added examples/s2s/audio_prompt/en/prompt_2.wav
Binary file added examples/s2s/audio_prompt/en/prompt_3.wav
Binary file added examples/s2s/audio_prompt/en/prompt_4.wav
Binary file added examples/s2s/audio_prompt/en/prompt_5.wav
Binary file added examples/s2s/audio_prompt/en/prompt_6.wav
Binary file added examples/s2s/audio_prompt/zh/prompt_1.wav
Binary file added examples/s2s/audio_prompt/zh/prompt_2.wav
Binary file added examples/s2s/audio_prompt/zh/prompt_3.wav
Binary file added examples/s2s/audio_prompt/zh/prompt_4.wav
Binary file added examples/s2s/audio_prompt/zh/prompt_5.wav
Binary file added examples/s2s/audio_prompt/zh/prompt_6.wav
19 changes: 19 additions & 0 deletions examples/s2s/conf/ds_config.json
@@ -0,0 +1,19 @@
{
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        }
    }
}
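
For orientation, here is a minimal sketch of how a config like this is typically consumed by the DeepSpeed engine. The toy model below is a hypothetical stand-in; the actual entry point is `deepspeed_finetune_s2s.py`:

```python
import deepspeed
import torch.nn as nn

# Hypothetical stand-in model; real training is driven by deepspeed_finetune_s2s.py.
model = nn.Linear(512, 512)

# deepspeed.initialize reads the micro batch size, fp16 flag, and ZeRO stage-3
# settings (with CPU optimizer offload) from the JSON config above.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="examples/s2s/conf/ds_config.json",
)
```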
3 changes: 3 additions & 0 deletions examples/s2s/conf/prompt.yaml
@@ -0,0 +1,3 @@
dataset_config:
# we put the prompt here because the Hydra override in the shell script only supports a small subset of characters
prompt: "Conduct a spoken conversation with the user. "
2 changes: 2 additions & 0 deletions examples/s2s/conf/prompt_asr.yaml
@@ -0,0 +1,2 @@
dataset_config:
prompt: "Transcribe the provided audio into accurate text. "
4 changes: 4 additions & 0 deletions examples/s2s/conf/prompt_tts.yaml
@@ -0,0 +1,4 @@
dataset_config:
# we put the prompt here because the Hydra override in the shell script only supports a small subset of characters
# prompt: "Transcribe speech to text. Output the transcription directly without redundant content. Ensure that the output is not duplicated. "
prompt: "Generate a natural and expressive spoken version of the given text. "
47 changes: 47 additions & 0 deletions examples/s2s/deepspeed_finetune_s2s.py
@@ -0,0 +1,47 @@
from slam_llm.pipeline.finetune_deepspeed import main as train
from slam_llm.utils.deepspeed_utils import deepspeed_main_wrapper

import logging
from dataclasses import dataclass, field
from omegaconf import DictConfig, ListConfig, OmegaConf
from s2s_config import ModelConfig, TrainConfig, DataConfig, LogConfig


@dataclass
class RunConfig:
    dataset_config: DataConfig = field(default_factory=DataConfig)
    model_config: ModelConfig = field(default_factory=ModelConfig)
    train_config: TrainConfig = field(default_factory=TrainConfig)
    log_config: LogConfig = field(default_factory=LogConfig)
    debug: bool = field(default=False, metadata={"help": "Use pdb when true"})
    metric: str = field(default="acc", metadata={"help": "The metric for evaluation"})
    deepspeed_config: str = field(default="examples/asr_librispeech/conf/ds_config.json", metadata={"help": "Path to the DeepSpeed config JSON file"})


@deepspeed_main_wrapper(config_name=None, version_base=None)
def main_hydra(cfg: DictConfig):
    run_config = RunConfig()
    cfg = OmegaConf.merge(run_config, cfg)

    def to_plain_list(cfg_item):
        if isinstance(cfg_item, ListConfig):
            return OmegaConf.to_container(cfg_item, resolve=True)
        elif isinstance(cfg_item, DictConfig):
            return {k: to_plain_list(v) for k, v in cfg_item.items()}
        else:
            return cfg_item

    # keep the merged config as a DictConfig rather than converting to plain containers
    # kwargs = to_plain_list(cfg)
    kwargs = cfg
    log_level = getattr(logging, kwargs.get("log_level", "INFO").upper())

    logging.basicConfig(level=log_level)

    if kwargs.get("debug", False):
        import pdb
        pdb.set_trace()

    train(kwargs)


if __name__ == "__main__":
    main_hydra()