
Clarification on Joint vs. Separate Training of ASR and TTS Adapters #110

Open · aidenyzhang opened this issue Oct 25, 2024 · 1 comment

@aidenyzhang
Hello Team,

I have a couple of questions regarding the training process of the ASR and TTS adapters as described in your report, specifically in stage 1 (modality alignment).

Are the TTS and ASR adapters trained jointly or separately during this stage? I'm curious about the approach taken for modality alignment.
[image: stage 1 (modality alignment) figure from the report]

For the ASR adapter, I'd also like to confirm the training flow and inquire about the specific loss functions used:

Could you please confirm if the flowchart provided for the ASR adapter is accurate? If so, which loss functions are employed during the training process?

[image: proposed training flowchart for the ASR adapter]

@mini-omni (Contributor)
> Are the TTS and ASR adapters trained jointly or separately during this stage?

Since the TTS and ASR alignment processes share no parameters during training, training them jointly or separately should give the same result.
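For intuition, here is a minimal PyTorch sketch (module names and shapes are hypothetical, not taken from the mini-omni codebase) of why disjoint parameter sets make this true: the gradient of the summed loss with respect to each adapter equals the gradient of that adapter's own loss, so one joint optimizer step does exactly what two separate steps would.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two adapters; in stage 1 they share no parameters.
asr_adapter = nn.Linear(512, 768)  # audio-encoder features -> LLM embedding space
tts_adapter = nn.Linear(768, 512)  # LLM hidden states -> audio-token space

opt = torch.optim.AdamW(
    list(asr_adapter.parameters()) + list(tts_adapter.parameters()), lr=1e-4
)

audio_feats = torch.randn(4, 100, 512)
llm_hidden = torch.randn(4, 100, 768)

# Placeholder losses; the real objectives are cross-entropy over each task's targets.
asr_loss = asr_adapter(audio_feats).pow(2).mean()  # depends only on asr_adapter
tts_loss = tts_adapter(llm_hidden).pow(2).mean()   # depends only on tts_adapter

# d(asr_loss)/d(tts_adapter) == 0 and vice versa, so this joint update is
# identical to running two separate single-task updates.
(asr_loss + tts_loss).backward()
opt.step()
```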

> Could you please confirm if the flowchart provided for the ASR adapter is accurate? If so, which loss functions are employed during the training process?

Yes, it looks accurate. The ASR targets are the text tokens, and the loss function is cross-entropy, the same next-token prediction loss the LLM is trained with.
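As a concrete illustration, here is a toy PyTorch sketch of that objective (component names, dimensions, and the frame-aligned targets are all simplifying assumptions, not the actual mini-omni pipeline, which conditions the frozen LLM on the projected audio as a prefix and applies the usual shifted next-token loss): the trainable adapter maps audio features into the LLM's hidden space, and cross-entropy is computed against the transcript tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, audio_dim, llm_dim = 32000, 512, 768

# Hypothetical components: a trainable adapter in front of a frozen LLM.
adapter = nn.Linear(audio_dim, llm_dim)      # trained in stage 1
llm = nn.TransformerEncoder(                 # toy stand-in for the frozen LLM
    nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(llm_dim, vocab_size)     # frozen output projection
for p in list(llm.parameters()) + list(lm_head.parameters()):
    p.requires_grad_(False)

audio_feats = torch.randn(2, 100, audio_dim)           # e.g. audio encoder output
text_tokens = torch.randint(0, vocab_size, (2, 100))   # transcript token ids

# Project audio into the LLM space and get per-position logits.
hidden = llm(adapter(audio_feats))
logits = lm_head(hidden)

# Cross-entropy against the text tokens, i.e. the LLM's token-prediction loss.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), text_tokens.reshape(-1))
loss.backward()  # gradients flow only into the adapter; the LLM stays frozen
```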
