I have a couple of questions regarding the training process of the ASR and TTS adapters as described in your report, specifically at stage 1 (modality alignment).
Are the TTS and ASR adapters trained jointly or separately during this stage? I'm curious about the approach taken for modality alignment.
For the ASR adapter, I'd also like to confirm the training flow and inquire about the specific loss functions used:
Could you please confirm if the flowchart provided for the ASR adapter is accurate? If so, which loss functions are employed during the training process?
Are the TTS and ASR adapters trained jointly or separately during this stage?
The TTS and ASR alignment processes share no parameters during training, so training them jointly or separately should give the same result.
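To spell out why, here is a short sketch under the assumption stated above (no shared trainable parameters). The symbols $\theta_{A}$, $\theta_{T}$, $\mathcal{L}_{\text{ASR}}$ and $\mathcal{L}_{\text{TTS}}$ are illustrative names for the two adapters' parameters and losses, not notation from the report:

$$
\nabla_{\theta_{A}}\bigl(\mathcal{L}_{\text{ASR}}(\theta_{A}) + \mathcal{L}_{\text{TTS}}(\theta_{T})\bigr) = \nabla_{\theta_{A}}\mathcal{L}_{\text{ASR}}(\theta_{A}),
\qquad
\nabla_{\theta_{T}}\bigl(\mathcal{L}_{\text{ASR}}(\theta_{A}) + \mathcal{L}_{\text{TTS}}(\theta_{T})\bigr) = \nabla_{\theta_{T}}\mathcal{L}_{\text{TTS}}(\theta_{T})
$$

Each adapter receives the same gradient whether the losses are summed or optimized on their own. In practice the two setups could still differ slightly if something couples the updates, for example gradient clipping by a global norm computed over both adapters.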
Could you please confirm if the flowchart provided for the ASR adapter is accurate? If so, which loss functions are employed during the training process?
Yes, it looks accurate. The ASR targets are the text tokens, and the loss function is cross-entropy, the same loss the LLM uses for next-token prediction.
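For illustration, a minimal sketch of token-level cross-entropy over text targets, written in plain Python so it runs without any framework. The function name `token_cross_entropy`, the toy logits, and the `ignore_index=-100` masking of positions that have no text target are assumptions made for this example, not details confirmed in this thread or the report.

```python
import math

def token_cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over the positions that have a text-token target.

    logits:  one list of vocabulary scores per position.
    targets: one target token id per position; positions equal to
             ignore_index are skipped (assumed masking convention).
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # no text target at this position
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the target token.
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 3 tokens, 3 positions.
# The first position is treated as a prompt position with no text target.
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [-100, 0, 2]
loss = token_cross_entropy(logits, targets)
print(f"loss = {loss:.4f}")
```

In a real training loop the logits would come from the LLM's output head, and a framework loss such as PyTorch's `torch.nn.CrossEntropyLoss` (whose default `ignore_index` is also -100) would typically replace the hand-written loop.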