TMLR Paper
Bai-YT committed Aug 18, 2024
1 parent 04640d4 commit 3f553f0
Showing 5 changed files with 214 additions and 53 deletions.
144 changes: 125 additions & 19 deletions consistency_tta/index.html
@@ -97,38 +97,144 @@ <h2>Main Experiment Results</h2>
<p>
Our method reduces the computation of the core step of diffusion-based text-to-audio generation by
a factor of 400 and enables on-device generation, with minimal performance degradation in
Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP Scores.
Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP Scores.<br>
Generation Time is the time in minutes to generate the entire validation set (882 samples).<br>
<i>↑: higher is better; ↓: lower is better.</i>
</p>
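The CLAP<sub>T</sub> and CLAP<sub>A</sub> columns report CLAP-style alignment scores, which at their core are cosine similarities between embeddings (text-audio and audio-audio, respectively). A minimal sketch of the underlying computation — the toy vectors below are placeholders; the actual scores use the CLAP model's learned embeddings, and the exact checkpoint and scaling are not specified here:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for CLAP text/audio embeddings.
text_emb = np.array([1.0, 0.0, 1.0])
audio_emb = np.array([1.0, 0.0, 0.5])

score = 100.0 * cosine_similarity(text_emb, audio_emb)  # CLAP-style score
print(round(score, 2))  # 94.87
```

Higher scores indicate that the generated audio's embedding sits closer to the text prompt's (CLAP<sub>T</sub>) or the reference audio's (CLAP<sub>A</sub>) embedding.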
<table class="result-table">
<thead>
<tr class="result-row">
<th class="result-head"></th> <th class="result-head"># queries (↓)</th>
<th class="result-head">CLAP<sub>T</sub> (↑)</th> <th class="result-head">CLAP<sub>A</sub> (↑)</th>
<th class="result-head">FAD (↓)</th> <th class="result-head">FD (↓)</th> <th class="result-head">KLD (↓)</th>
<th class="result-head"></th>
<th class="result-head-2">Model Queries<br></th> <th class="result-head-2">Generation Time<br></th>
<th class="result-head">Subjective Quality<br></th> <th class="result-head">Subjective Text Align<br></th>
<th class="result-head-2">CLAP<sub>T</sub><br></th> <th class="result-head-2">CLAP<sub>A</sub><br></th>
<th class="result-head-2">FAD<br></th> <th class="result-head-2">FD<br></th> <th class="result-head-2">KLD<br></th>
</tr>
</thead>
<tbody>
<tr class="result-row" style="color: #a0a0a0">
<td class="result-data">Diffusion (Baseline)</td> <td class="result-data">400</td>
<td class="result-data">24.57</td> <td class="result-data">72.79</td>
<td class="result-data">1.908</td> <td class="result-data">19.57</td> <td class="result-data">1.350</td>
<tr class="result-row-2" style="color: #898989">
<td class="result-data-small">AudioLDM-L (Baseline)</td> <td class="result-data-2">400</td>
<td class="result-data-2">-</td> <td class="result-data">-</td>
<td class="result-data">-</td> <td class="result-data-2">-</td> <td class="result-data-2">-</td>
<td class="result-data-2-400">2.08</td> <td class="result-data-2">27.12</td> <td class="result-data-2">1.86</td>
</tr>
<tr class="result-row-2" style="color: #898989">
<td class="result-data-small">TANGO (Baseline)</td>
<td class="result-data-2">400</td> <td class="result-data-2">168</td>
<td class="result-data"><b>4.136</b></td> <td class="result-data"><b>4.064</b></td>
<td class="result-data-2-400">24.10</td> <td class="result-data-2"><b>72.85</b></td>
<td class="result-data-2"><b>1.631</b></td> <td class="result-data-2"><b>20.11</b></td> <td class="result-data-2">1.362</td>
</tr>
<tr class="result-row">
<td class="result-data">Consistency + CLAP FT (Ours)</td> <td class="result-data">1</td>
<td class="result-data">24.69</td> <td class="result-data">72.54</td>
<td class="result-data">2.406</td> <td class="result-data">20.97</td> <td class="result-data">1.358</td>
<td class="result-data-small">ConsistencyTTA + CLAP-FT</td>
<td class="result-data-2"><b>1</b></td> <td class="result-data-2"><b>2.3</b></td>
<td class="result-data">3.830</td> <td class="result-data"><b>4.064</b></td>
<td class="result-data-2"><b>24.69</b></td> <td class="result-data-2-400">72.54</td>
<td class="result-data-2">2.406</td> <td class="result-data-2-400">20.97</td> <td class="result-data-2-400">1.358</td>
</tr>
<tr class="result-row">
<td class="result-data">Consistency (Ours)</td> <td class="result-data">1</td>
<td class="result-data">22.50</td> <td class="result-data">72.30</td>
<td class="result-data">2.575</td> <td class="result-data">22.08</td> <td class="result-data">1.354</td>
<td class="result-data-small">ConsistencyTTA</td>
<td class="result-data-2"><b>1</b></td> <td class="result-data-2"><b>2.3</b></td>
<td class="result-data-400">3.902</td> <td class="result-data">4.010</td>
<td class="result-data-2">22.50</td> <td class="result-data-2">72.30</td>
<td class="result-data-2">2.575</td> <td class="result-data-2">22.08</td>
<td class="result-data-2"><b>1.354</b></td>
</tr>
<tr class="result-row-2-small" style="color: #898989">
<td class="result-data-small">Ground Truth</td> <td class="result-data-2">-</td>
<td class="result-data-2">-</td> <td class="result-data">-</td> <td class="result-data">-</td>
<td class="result-data-2">26.71</td> <td class="result-data-2">100</td>
<td class="result-data-2">-</td> <td class="result-data-2">-</td> <td class="result-data-2">-</td>
</tr>
</tbody>
</table>
<p>
<a href="https://paperswithcode.com/sota/audio-generation-on-audiocaps" target="_blank">This benchmark</a>
demonstrates how our single-step models stack up against previous methods,
most of which mostly require hundreds of generation steps.
most of which require hundreds of generation steps.
</p>
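The query counts above follow directly from the sampling procedure: an iterative diffusion sampler with external classifier-free guidance issues two model queries per step (e.g., 200 steps × 2 = 400), while a consistency model maps noise to data in a single query. A schematic sketch — `model` here is a stand-in identity function purely to illustrate the query accounting, not an actual generator:

```python
def diffusion_generate(model, latent, num_steps=200, cfg=True):
    """Iterative sampler: two queries per step with external CFG
    (conditional + unconditional), one per step otherwise."""
    queries = 0
    for _ in range(num_steps):
        queries += 2 if cfg else 1
        latent = model(latent)  # denoising update (details elided)
    return latent, queries

def consistency_generate(model, latent):
    """Consistency model: noise-to-data mapping in one query."""
    return model(latent), 1

identity = lambda z: z
_, q_diff = diffusion_generate(identity, 0.0)    # 200 steps x 2 queries
_, q_cons = consistency_generate(identity, 0.0)  # single query
print(q_diff, q_cons)  # 400 1
```

This is the source of the 400× reduction in the "Model Queries" column.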
</section>

<section class="section">
<h2>Ablation Studies on Distillation Settings</h2>
<p>
<table class="result-table">
<thead>
<tr class="result-row">
<th class="result-head">Guidance Method</th>
<th class="result-head">CFG Weight</th>
<th class="result-head">Teacher Solver</th>
<th class="result-head">Noise Schedule</th>
<th class="result-head-2">FAD ↓</th>
<th class="result-head-2">FD ↓</th>
<th class="result-head-2">KLD ↓</th>
</tr>
</thead>
<tbody>
<tr class="result-row-2">
<td class="result-data-small">Unguided</td>
<td class="result-data-small">1</td>
<td class="result-data-small">DDIM</td>
<td class="result-data-small">Uniform</td>
<td class="result-data-2">13.48</td>
<td class="result-data-2">45.75</td>
<td class="result-data-2">2.409</td>
</tr>
<tr class="result-row-2">
<td class="result-data-small" rowspan="2">External CFG</td>
<td class="result-data-small" rowspan="2">3</td>
<td class="result-data-small">DDIM</td>
<td class="result-data-small">Uniform</td>
<td class="result-data-2">8.565</td>
<td class="result-data-2">38.67</td>
<td class="result-data-2">2.015</td>
</tr>
<tr class="result-row-2">
<td class="result-data-small">Heun</td>
<td class="result-data-small">Karras</td>
<td class="result-data-2">7.421</td>
<td class="result-data-2">39.36</td>
<td class="result-data-2">1.976</td>
</tr>
<tr class="result-row-2">
<td class="result-data-small" rowspan="2">CFG Distillation<br>with Fixed Weight</td>
<td class="result-data-small" rowspan="2">3</td>
<td class="result-data-small" rowspan="2">Heun</td>
<td class="result-data-small">Karras</td>
<td class="result-data-2">5.702</td>
<td class="result-data-2">33.18</td>
<td class="result-data-2">1.494</td>
</tr>
<tr class="result-row-2">
<td class="result-data-small">Uniform</td>
<td class="result-data-2">3.859</td>
<td class="result-data-2"><b>27.79</b></td>
<td class="result-data-2">1.421</td>
</tr>
<tr class="result-row-2">
<td class="result-data-small" rowspan="3">CFG Distillation<br>with Random Weight</td>
<td class="result-data-small">4</td>
<td class="result-data-small" rowspan="2">Heun</td>
<td class="result-data-small" rowspan="2">Uniform</td>
<td class="result-data-2-400">3.180</td>
<td class="result-data-2-400">27.92</td>
<td class="result-data-2-400">1.394</td>
</tr>
<tr class="result-row-2">
<td class="result-data-small">6</td>
<td class="result-data-2"><b>2.975</b></td>
<td class="result-data-2">28.63</td>
<td class="result-data-2"><b>1.378</b></td>
</tr>
</tbody>
</table>
Based on these results, we can conclude that:
<ul>
<li>CFG distillation with a random weight is more effective than a fixed weight,
which in turn is more effective than external CFG.</li>
<li>Heun is a better teacher solver than DDIM, and
the Uniform noise schedule outperforms the Karras noise schedule.</li>
</ul>
</p>
</section>
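All of the guided variants compared in this ablation build on the standard classifier-free guidance (CFG) rule, which extrapolates from the unconditional model output toward the conditional one with a guidance weight. A minimal sketch with toy arrays — in the actual model the inputs are latent tensors of noise predictions, not 2-vectors:

```python
import numpy as np

def classifier_free_guidance(eps_cond: np.ndarray,
                             eps_uncond: np.ndarray,
                             w: float) -> np.ndarray:
    """Standard CFG: extrapolate from the unconditional prediction
    toward the conditional one with guidance weight w (w=1 is unguided)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 1.0])
print(classifier_free_guidance(eps_c, eps_u, 3.0))  # w=3, as in the fixed-weight rows
```

"External CFG" applies this rule at inference time, costing two model queries per step; CFG distillation instead bakes the guided behavior into the student, with the weight either fixed during training or sampled randomly and provided to the student as an additional input.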

@@ -156,11 +262,11 @@ <h2>Human Evaluation</h2>
<h2>Citing Our Work (BibTeX)</h2>
<div id="bibtex1" class="bibtex" onclick="copyToClipboard('bibtex1')">
<i class="far fa-copy copy-icon"></i>
<pre>@article{bai2023accelerating,
<pre>@inproceedings{bai2024accelerating,
author = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
title = {Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
journal={arXiv preprint arXiv:2309.10740},
year = {2023}
title = {ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
booktitle = {INTERSPEECH},
year = {2024}
}</pre>
</div>
</section>
Binary file modified consistency_tta/poster.pdf
Binary file not shown.
46 changes: 42 additions & 4 deletions consistency_tta/styles.css
@@ -46,7 +46,7 @@ header h3 {

.section {
font-weight: 400;
max-width: 850px;
max-width: 900px;
margin: 20px auto;
padding: 20px;
padding-top: 10px;
@@ -248,13 +248,14 @@ tr td:last-child {

/* Base table styles */
.result-table {
max-width: 840px;
max-width: 880px;
width: 100%;
border-collapse: collapse;
margin: auto;
margin: 12px auto;
box-shadow: 0 0 15px rgba(0, 0, 0, 0.1);
overflow: hidden;
font-weight: 300;
}

/* Header styles */
@@ -266,9 +267,20 @@
padding: 8px 12px;
}

.result-head-2 {
background-color: #55687a; /* Slightly lighter background color */
color: #ecf0f1;
font-weight: bold;
text-align: left;
padding: 8px 12px;
}

/* Row & data styles */
.result-row:nth-of-type(odd) {
background-color: #f7f9fc;
.result-row {
background-color: #e7e9ee;
}
.result-row-2 {
background-color: #ffffff;
}

.result-row:hover {
@@ -279,6 +291,32 @@
.result-data {
padding: 7px 12px;
border-bottom: 1px solid #e7ebef;
background-color: #dfe3f241;
font-size: 1.15em;
}
.result-data-400 {
padding: 7px 12px;
border-bottom: 1px solid #e7ebef;
background-color: #dfe3f241;
font-size: 1.15em;
font-weight: 400;
}
.result-data-2 {
padding: 7px 12px;
border-bottom: 1px solid #e7ebef;
font-size: 1.15em;
}
.result-data-2-400 {
padding: 7px 12px;
border-bottom: 1px solid #e7ebef;
font-size: 1.15em;
font-weight: 400;
}
.result-data-small {
padding: 7px 12px;
border-bottom: 1px solid #e7ebef;
background-color: #dfe3f241;
font-weight: 400;
}

/* Optional: Add transitions for smoother hover effects */
25 changes: 20 additions & 5 deletions index.html
@@ -111,10 +111,11 @@ <h2>About</h2>

<p>
I have interned at
<a href="https://www.adobe.com" target="_blank">Adobe</a>,
<a href="https://www.microsoft.com" target="_blank">Microsoft</a>,
<a href="https://scale.com" target="_blank">Scale AI</a>,
<a href="https://www.hondajet.com" target="_blank">Honda Aircraft Company</a>, and
<a href="https://www.tesla.com" target=&ldquo;blank&rdquo;>Tesla, Inc</a>.
<a href="https://www.tesla.com" target="_blank">Tesla</a>.
</p>
</section>

@@ -123,7 +124,7 @@ <h2>About</h2>
<h2>Research interests</h2>

<p>
My interests include generative models (particularly audio), robust deep learning, (convex) optimization, and controls.
My interests include generative models (particularly audio/music), robust deep learning, (convex) optimization, and controls.
</p>
<p>
Specifically, I enjoy working on ensuring the adversarial robustness of neural networks, addressing
@@ -140,15 +141,29 @@ <h2>Research interests</h2>
<h2>News</h2>

<ul>
<li><p>
<b>August 2024:</b> Our paper
<a href="https://arxiv.org/abs/2402.02263" target="_blank">
“MixedNUTS: Training-Free Accuracy-Robustness Balance via Nonlinearly Mixed Classifiers”</a>
has been accepted to <a href="https://jmlr.org/tmlr"
target="_blank">Transactions on Machine Learning Research (TMLR)</a>.
</p></li>

<li><p>
<b>July 2024:</b> I served as a reviewer for the
<a href="https://neurips.cc/Conferences/2024" target="_blank">NeurIPS 2024</a> conference.
</p></li>

<li><p>
<b>June 2024:</b> New paper
<a href="https://arxiv.org/abs/2406.03589" target="_blank">
“Ranking Manipulation for Conversational Search Engines”</a>
by Samuel Pfrommer, <b>Yatong Bai</b>, Tanmay Gautam, and Somayeh Sojoudi.
Project code <a href="https://github.com/spfrommer/ranking_manipulation"
target="_blank">Here</a> and
<a href="https://github.com/spfrommer/ranking_manipulation_data_pipeline"
target=&ldquo;blank&rdquo;>Here</a>.
target="_blank">on GitHub</a>.
This work proposes the "RAGDOLL" dataset, which is available on
<a href="https://huggingface.co/datasets/Bai-YT/RAGDOLL"
target="_blank">Hugging Face Datasets</a>.
</p></li>

<li><p>