From 3f553f0ba0f65986e6b168799c9fc767bae6f937 Mon Sep 17 00:00:00 2001
From: Bai-YT
Our method reduces the computation of the core step of diffusion-based text-to-audio generation by
a factor of 400 and enables on-device generation, while observing minimal performance degradation in
- Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP Scores.
+ Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP Scores.
This benchmark
demonstrates how our single-step models stack up against previous methods,
- most of which mostly require hundreds of generation steps.
+ most of which require hundreds of generation steps.
+
+ Main Experiment Results
+ Generation Time is the time in minutes to generate the entire validation set (882 samples).
+ ↑: higher is better; ↓: lower is better.
-
- # queries (↓)
- CLAPT (↑) CLAPA (↑)
- FAD (↓) FD (↓) KLD (↓)
+
+ Model Queries ↓ Generation Time ↓
+ Subjective Quality ↑ Subjective Text Align ↑
+ CLAPT ↑ CLAPA ↑
+ FAD ↓ FD ↓ KLD ↓
- Diffusion (Baseline) 400
- 24.57 72.79
- 1.908 19.57 1.350
+
+
+ AudioLDM-L (Baseline) 400
+ - -
+ - - -
+ 2.08 27.12 1.86
+
+
TANGO (Baseline)
+ 400 168
+ 4.136 4.064
+ 24.10 72.85
+ 1.631 20.11 1.362
-
Consistency + CLAP FT (Ours) 1
- 24.69 72.54
- 2.406 20.97 1.358
+ ConsistencyTTA + CLAP-FT
+ 1 2.3
+ 3.830 4.064
+ 24.69 72.54
+ 2.406 20.97 1.358
-
+ Consistency (Ours) 1
- 22.50 72.30
- 2.575 22.08 1.354
+ ConsistencyTTA
+ 1 2.3
+ 3.902 4.010
+ 22.50 72.30
+ 2.575 22.08 1.354
+
+
Ground Truth -
+ - - -
+ 26.71 100
+ - - -
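
The headline numbers in the table above can be turned into per-sample latency and speedup figures with a little arithmetic. The following is a minimal sketch, not part of the project code: the dictionary layout and the helper name `seconds_per_sample` are illustrative assumptions, and the values are copied from the TANGO (Baseline) and ConsistencyTTA rows. The end-to-end wall-clock speedup comes out smaller than the 400x reduction in queries, presumably because generation time also covers work outside the core diffusion step.

```python
# Values copied from the Main Experiment Results table.
# The names below are illustrative, not taken from the ConsistencyTTA codebase.
VALIDATION_SAMPLES = 882

# model name -> (queries per sample, minutes to generate the whole validation set)
results = {
    "TANGO (Baseline)": (400, 168.0),
    "ConsistencyTTA": (1, 2.3),
}


def seconds_per_sample(total_minutes, n_samples=VALIDATION_SAMPLES):
    """Average generation time per sample, in seconds."""
    return total_minutes * 60.0 / n_samples


baseline_queries, baseline_minutes = results["TANGO (Baseline)"]
ours_queries, ours_minutes = results["ConsistencyTTA"]

# Reduction in the number of model queries (the "core step").
query_reduction = baseline_queries / ours_queries

# End-to-end wall-clock speedup on the validation set.
wall_clock_speedup = baseline_minutes / ours_minutes

for name, (_, minutes) in results.items():
    print(f"{name}: {seconds_per_sample(minutes):.2f} s per sample")
print(f"Query reduction: {query_reduction:.0f}x")
print(f"Wall-clock speedup: {wall_clock_speedup:.1f}x")
```

The two ratios measure different things: the first counts calls to the diffusion model, the second the time for the full generation pipeline as reported in the table.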
Ablation Studies on Distillation Settings
+
+
+
+ Based on these results, we can conclude that:
+
+
+
+
+ Guidance Method
+ CFG Weight
+ Teacher Solver
+ Noise Schedule
+ FAD ↓
+ FD ↓
+ KLD ↓
+
+
+ Unguided
+ 1
+ DDIM
+ Uniform
+ 13.48
+ 45.75
+ 2.409
+
+
+ External CFG
+ 3
+ DDIM
+ Uniform
+ 8.565
+ 38.67
+ 2.015
+
+
+ Heun
+ Karras
+ 7.421
+ 39.36
+ 1.976
+
+
+ CFG Distillation with Fixed Weight
+ 3
+ Heun
+ Karras
+ 5.702
+ 33.18
+ 1.494
+
+
+ Uniform
+ 3.859
+ 27.79
+ 1.421
+
+
+ CFG Distillation with Random Weight
+ 4
+ Heun
+ Uniform
+ 3.180
+ 27.92
+ 1.394
+
+
+
+ 6
+ 2.975
+ 28.63
+ 1.378
+
+
- @article{bai2023accelerating,
+ @inproceedings{bai2024accelerating,
  author = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
- title = {Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
- journal={arXiv preprint arXiv:2309.10740},
- year = {2023}
+ title = {ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
+ booktitle = {INTERSPEECH},
+ year = {2024}
}