<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>MHD</title>
<link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" rel="stylesheet">
<script crossorigin="anonymous"
integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo"
src="https://code.jquery.com/jquery-3.3.1.slim.min.js"></script>
<script crossorigin="anonymous"
integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1"
src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js"></script>
<script crossorigin="anonymous"
integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM"
src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js"></script>
</head>
<body>
<div class="container">
<center>
<h3>
Multimodal Humor Dataset: Predicting Laughter tracks for Sitcoms
</h3>
<h6 style="color: #9A2617">Badri N. Patro<sup>1*</sup>  Mayank Lunayach<sup>1*</sup>  Deepankar Srivastav<sup>1</sup>  Sarvesh<sup>1</sup>  Hunar Singh<sup>1</sup>  Vinay P. Namboodiri<sup>1</sup></h6>
<div style="color: gray"><sup>1</sup><a href="https://www.iitk.ac.in/">IIT Kanpur</a>, *equal contribution</div>
<br>
<h6>
In WACV 2021
</h6>
</center>
<hr>
<div style="text-align: center">
<div class="thumbnail">
<div class="embed-responsive embed-responsive-4by3" style="width: 50%; left: 27%">
<iframe class="embed-responsive-item" src="demo_video.3gp"></iframe>
</div>
<div class="caption" style="color: #6d757d">Task example</div>
</div>
</div>
<br><br>
<hr>
<h4>Download Dataset</h4>
<ul>
<li>
<a href="example_dataset.json" download>Dataset sample (consisting of randomly sampled 400 dialogues)</a>
</li>
<li>
<a href="Dataset.zip" download>Full dataset (compressed in a zip file)</a>
<br><br>
The dataset folder has the following directory structure:
<br>
<code>
|-- DT_{<b>N</b>}<br>
| |-- Raw<br>
| | |-- S{<b>M</b>}<br>
| | | |-- The Big Bang_S0{<b>M</b>}{<b>I</b>}.json<br>
| |-- test.json<br>
| |-- train.json<br>
| `-- val.json<br>
<br>
</code>
where <code><b>N</b></code> is the number of dialogue turns in that sub-dataset, <code><b>M</b></code> is the season of the series (1 to 5), and <code><b>I</b></code> is the episode number within that season (01, 02, and so on).
Episode-level extracted dialogues are in the <code>Raw</code> folder; the train, val, and test splits are in <code>train.json</code>, <code>val.json</code>, and <code>test.json</code>, respectively. A minimal loading sketch follows this list.
</ul>
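<p>
For illustration, here is a minimal Python sketch for reading one sub-dataset's splits. It assumes the zip is extracted to a <code>Dataset</code> folder and that each split file holds a JSON array of dialogue records; the folder name <code>DT_4</code> (i.e., <code><b>N</b></code> = 4) is only an example.
</p>
<pre><code>import json
from pathlib import Path

# Assumption: the zip was extracted to ./Dataset; DT_4 is the 4-turn sub-dataset.
root = Path("Dataset/DT_4")

def load_split(name):
    """Load one split; assumes each file holds a JSON array of dialogues."""
    with open(root / f"{name}.json", encoding="utf-8") as f:
        return json.load(f)

train, val, test = (load_split(s) for s in ("train", "val", "test"))
print(len(train), len(val), len(test))
</code></pre>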
<hr>
<h4>Additional Plots/Figures</h4>
<br>
<table class="table table-striped table-bordered">
<caption style="text-align: center">t-SNE plots of visual dialogs</caption>
<tr>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53881679-474cb100-403a-11e9-9d92-c71a0c634fe9.png"
width="100%">
</td>
</tr>
<tr>
<td width="60%">
A t-SNE plot made by randomly selecting 1,500 images from each of the humorous and non-humorous
sets, each image being the last frame of a visual dialog turn. Visual models can sometimes cheat
by detecting an incidental pattern in humorous/non-humorous visual dialogs, such as a specific
camera angle; the plot above suggests that no such pattern is present. To make the plot easier
to read, each image is replaced by a dot in the corresponding plot shown below. (The current
plot is slightly scaled up for visibility.)
</td>
</tr>
<tr>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53881680-47e54780-403a-11e9-823b-79386e56c39e.png"
width="100%">
</td>
</tr>
<tr>
<td width="60%">
A green dot represents a humorous sample and a red dot a non-humorous sample. The two classes
appear randomly intermixed, suggesting the absence of any such bias.
</tr>
</table>
<br>
<br>
<table class="table table-striped table-bordered">
<caption style="text-align: center"> Bar plots drawn for the word distribution of dialogs spoken by Top
6 Speakers in our dataset. Similarity in the top 20 set of each plot suggests that humor/non-humor
is not biased due to a particular speaker.
</caption>
<tr>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746598-83a17900-8492-11e9-9479-62cdd7b618e2.png"
width="100%">
</td>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746604-8dc37780-8492-11e9-8f8d-b41152fc466b.png"
width="100%">
</td>
</tr>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746601-8c924a80-8492-11e9-8ca0-2909a7c8b681.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746605-8e5c0e00-8492-11e9-91b2-0344751e5cfe.png"
width="100%">
</td>
</tr>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746602-8d2ae100-8492-11e9-9482-cc834d219e5a.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746603-8dc37780-8492-11e9-9502-5312f6985737.png"
width="100%">
</td>
</tr>
</table>
<br>
<br>
<table class="table table-striped table-bordered">
<caption style="text-align: center">Other Dataset Statistics
</caption>
<tr>
<td width="33.33%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53884884-88e15a00-4042-11e9-99a5-8fdd7a46ce68.png"
width="100%">
</td>
<td width="33.33%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53952853-f13e4300-40f7-11e9-9444-7d9a030dc4ae.png"
width="100%">
</td>
<td width="33.33%">
<img src="https://user-images.githubusercontent.com/48205355/53885089-ea092d80-4042-11e9-94c3-b4690723cb32.png"
width="100%">
</td>
</tr>
<tr>
<td width="33.33%">The figure showing average time per turn in a Dialog, across the Dataset.</td>
<td width="33.33%">The figure showing average dialog time, across the Dataset.</td>
<td width="33.33%">The figure showing contribution of each speaker in generating humor, across the
Dataset.
</td>
</tr>
</table>
<br><br>
<h4> MSAM model </h4>
<table class="table table-striped table-bordered">
<caption>The proposed Multimodal Self-Attention Model (MSAM) for the laughter-detection task. Features of each joint dialogue turn are obtained with a multimodal self-attention network; a sequential network then produces the final feature vector, which is fed to a binary classifier. A rough code sketch of this pipeline follows the figure.</caption>
<tr>
<td width="60%">
<img align="center" src="assets/m.png" width="100%">
</td>
</tr>
</table>
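<p>
The sketch below, in PyTorch, is only a rough illustration of the pipeline described in the
caption: self-attention over the fused modality features of each turn, a sequential network over
turns, and a binary classifier. The module choices (GRU, mean pooling) and all dimensions are our
assumptions, not the paper's exact architecture.
</p>
<pre><code>import torch
import torch.nn as nn

class MSAMSketch(nn.Module):
    """Illustrative sketch of the MSAM pipeline (assumed details, not the paper's exact model)."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.seq = nn.GRU(d_model, d_model, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)  # humor / non-humor logits

    def forward(self, turns):
        # turns: (batch, n_turns, n_modality_tokens, d_model) -- fused text/video
        # features per dialogue turn, assumed to be extracted upstream.
        b, t, m, d = turns.shape
        x = turns.reshape(b * t, m, d)
        x, _ = self.self_attn(x, x, x)      # self-attention over modality tokens
        x = x.mean(dim=1).reshape(b, t, d)  # one feature vector per turn
        _, h = self.seq(x)                  # sequential network over the turns
        return self.classifier(h[-1])       # final binary prediction

logits = MSAMSketch()(torch.randn(2, 4, 3, 256))
print(logits.shape)  # torch.Size([2, 2])
</code></pre>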
<br><br>
<h4> Qualitative results </h4>
<table class="table table-striped table-bordered">
<caption>
Randomly sampled results of the MSAM model for each prediction category, (correct/incorrect) × (humor/non-humor). E.g., a humor label in a red box means the ground-truth label was non-humor but the predicted label was humor.
</caption>
<tr>
<td>
<img align="center" src="assets/q.png" width="100%">
</td>
</tr>
</table>
<br><br>
<h4> Explaining humor </h4>
<table class="table table-striped table-bordered">
<caption>
The left column visualizes attention at the word level; the right column visualizes attention at the turn level.
</caption>
<tr>
<td>
<img align="center" src="assets/c.png" width="100%">
</td>
</tr>
</table>
<br><br>
<h4> Baseline Models </h4>
<table class="table table-striped table-bordered">
<caption style="text-align: center">Fusion Models</caption>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53886442-0e1a3e00-4046-11e9-87a3-259d68593d62.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53886443-0e1a3e00-4046-11e9-808b-6e50ec5a9b04.png"
width="100%">
</td>
</tr>
<tr>
<td align="center" width="50%">Text based Fusion Model (TFM)</td>
<td align="center" width="50%">Video based Fusion Model (VFM)</td>
</tr>
</table>
<table class="table table-striped table-bordered">
<caption style="text-align: center">Attention Models</caption>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53887017-5f76fd00-4047-11e9-97d6-7e690101f2a9.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53887018-600f9380-4047-11e9-94b9-062446d18306.png"
width="100%">
</td>
</tr>
<tr>
<td align="center" width="50%">Text based Attention Model (TAM)</td>
<td align="center" width="50%">Video based Attention Model (VAM)</td>
</tr>
</table>
<br><br>
</div>
</body>
</html>