
Clarification on Joint vs. Separate Training of ASR and TTS Adapters #110

Open · aidenyzhang opened this issue Oct 25, 2024 · 1 comment

@aidenyzhang
Hello Team,

I have a couple of questions regarding the training process of the ASR and TTS adapters as described in your report, specifically in stage 1 (modality alignment).

Are the TTS and ASR adapters trained jointly or separately during this stage? I'm curious about the approach taken for modality alignment.
[image: stage 1 (modality alignment) figure from the report]

For the ASR adapter, I'd also like to confirm the training flow and inquire about the specific loss functions used:

Could you please confirm if the flowchart provided for the ASR adapter is accurate? If so, which loss functions are employed during the training process?

[image: proposed training flowchart for the ASR adapter]

@mini-omni (Contributor)
> Are the TTS and ASR adapters trained jointly or separately during this stage?

Since the TTS and ASR alignment processes share no parameters during training, training them jointly or separately should give the same result.
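For intuition, here is a minimal PyTorch sketch (module names and shapes are hypothetical, not taken from the mini-omni codebase) of why disjoint parameter sets make this true: the gradient of the summed loss with respect to each adapter equals the gradient of that adapter's own loss, so one joint optimizer step does exactly what two separate steps would.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two adapters; in stage 1 they share no parameters.
asr_adapter = nn.Linear(512, 768)  # audio-encoder features -> LLM embedding space
tts_adapter = nn.Linear(768, 512)  # LLM hidden states -> audio-token space

opt = torch.optim.AdamW(
    list(asr_adapter.parameters()) + list(tts_adapter.parameters()), lr=1e-4
)

audio_feats = torch.randn(4, 100, 512)
llm_hidden = torch.randn(4, 100, 768)

# Placeholder losses; the real objectives are cross-entropy over each task's targets.
asr_loss = asr_adapter(audio_feats).pow(2).mean()  # depends only on asr_adapter
tts_loss = tts_adapter(llm_hidden).pow(2).mean()   # depends only on tts_adapter

# d(asr_loss)/d(tts_adapter) == 0 and vice versa, so this joint update is
# identical to running two separate single-task updates.
(asr_loss + tts_loss).backward()
opt.step()
```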

> Could you please confirm if the flowchart provided for the ASR adapter is accurate? If so, which loss functions are employed during the training process?

Yes, it looks accurate. The ASR targets are the text tokens, and the loss function is cross-entropy, the same next-token prediction loss the LLM is trained with.
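As a concrete illustration, here is a toy PyTorch sketch of that objective (component names, dimensions, and the frame-aligned targets are all simplifying assumptions, not the actual mini-omni pipeline, which conditions the frozen LLM on the projected audio as a prefix and applies the usual shifted next-token loss): the trainable adapter maps audio features into the LLM's hidden space, and cross-entropy is computed against the transcript tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, audio_dim, llm_dim = 32000, 512, 768

# Hypothetical components: a trainable adapter in front of a frozen LLM.
adapter = nn.Linear(audio_dim, llm_dim)      # trained in stage 1
llm = nn.TransformerEncoder(                 # toy stand-in for the frozen LLM
    nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(llm_dim, vocab_size)     # frozen output projection
for p in list(llm.parameters()) + list(lm_head.parameters()):
    p.requires_grad_(False)

audio_feats = torch.randn(2, 100, audio_dim)           # e.g. audio encoder output
text_tokens = torch.randint(0, vocab_size, (2, 100))   # transcript token ids

# Project audio into the LLM space and get per-position logits.
hidden = llm(adapter(audio_feats))
logits = lm_head(hidden)

# Cross-entropy against the text tokens, i.e. the LLM's token-prediction loss.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), text_tokens.reshape(-1))
loss.backward()  # gradients flow only into the adapter; the LLM stays frozen
```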
