<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>MHD</title>
<link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" rel="stylesheet">
<script crossorigin="anonymous"
integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo"
src="https://code.jquery.com/jquery-3.3.1.slim.min.js"></script>
<script crossorigin="anonymous"
integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1"
src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js"></script>
<script crossorigin="anonymous"
integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM"
src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js"></script>
</head>
<body>
<div class="container">
<center>
<h3>
Multimodal Humor Dataset: Predicting Laughter tracks for Sitcoms
</h3>
<h6 style="color: #9A2617">Badri N. Patro<sup>1*</sup>  Mayank Lunayach<sup>1*</sup>  Deepankar Srivastav<sup>1</sup>  Sarvesh<sup>1</sup>  Hunar Singh<sup>1</sup>  Vinay P. Namboodiri<sup>1</sup></h6>
<div style="color: gray"><sup>1</sup><a href="https://www.iitk.ac.in/">IIT Kanpur</a>, *equal contribution</div>
<br>
<h6>
In WACV 2021
</h6>
</center>
<hr>
<div style="text-align: center">
<div class="thumbnail">
<div class="embed-responsive embed-responsive-4by3" style="width: 50%; left: 27%">
<iframe class="embed-responsive-item" src="demo_video.3gp"></iframe>
</div>
<div class="caption" style="color: #6d757d">Task example</div>
</div>
</div>
<br><br>
<hr>
<h4>Download Dataset</h4>
<ul>
<li>
<a href="example_dataset.json" download>Dataset sample (consisting of randomly sampled 400 dialogues)</a>
</li>
<li>
<a href="Dataset.zip" download>Full dataset (compressed in a zip file)</a>
<br><br>
The dataset folder has the following directory structure:
<br>
<code>
|-- DT_{<b>N</b>}<br>
| |-- Raw<br>
| | |-- S{<b>M</b>}<br>
| | | |-- The Big Bang_S0{<b>M</b>}{<b>I</b>}.json<br>
| |-- test.json<br>
| |-- train.json<br>
| `-- val.json<br>
<br>
</code>
where <code><b>N</b></code> is the number of dialogue turns in that sub-dataset, <code><b>M</b></code> is the season of the series (1 to 5), and <code><b>I</b></code> is the episode number within that season (01, 02, and so on).
Episode-level extracted dialogues are in the <code>Raw</code> folder; the train, val, and test splits are in <code>train.json</code>, <code>val.json</code>, and <code>test.json</code>, respectively. A minimal loading sketch follows this list.
</ul>
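<p>
For illustration, here is a minimal Python sketch for reading one sub-dataset's splits. It assumes the zip is extracted to a <code>Dataset</code> folder and that each split file holds a JSON array of dialogue records; the folder name <code>DT_4</code> (i.e., <code><b>N</b></code> = 4) is only an example.
</p>
<pre><code>import json
from pathlib import Path

# Assumption: the zip was extracted to ./Dataset; DT_4 is the 4-turn sub-dataset.
root = Path("Dataset/DT_4")

def load_split(name):
    """Load one split; assumes each file holds a JSON array of dialogues."""
    with open(root / f"{name}.json", encoding="utf-8") as f:
        return json.load(f)

train, val, test = (load_split(s) for s in ("train", "val", "test"))
print(len(train), len(val), len(test))
</code></pre>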
<hr>
<h4>Additional Plots/Figures</h4>
<br>
<table class="table table-striped table-bordered">
<caption style="text-align: center">t-SNE plots of visual dialogs</caption>
<tr>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53881679-474cb100-403a-11e9-9d92-c71a0c634fe9.png"
width="100%">
</td>
</tr>
<tr>
<td width="60%">
A t-SNE plot made by randomly selecting 1,500 images from each of the humorous and non-humorous
sets, each image being the last frame of a visual dialog turn. Visual models can sometimes cheat
by detecting an incidental pattern in humorous/non-humorous visual dialogs, such as a specific
camera angle; the plot above suggests that no such pattern is present. To make the plot easier
to read, each image is replaced by a dot in the corresponding plot shown below. (The current
plot is slightly scaled up for visibility.)
</td>
</tr>
<tr>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53881680-47e54780-403a-11e9-823b-79386e56c39e.png"
width="100%">
</td>
</tr>
<tr>
<td width="60%">
A green dot represents a humorous sample and a red dot a non-humorous sample. The two classes
appear randomly intermixed, suggesting the absence of any such bias.
</tr>
</table>
<br>
<br>
<table class="table table-striped table-bordered">
<caption style="text-align: center"> Bar plots drawn for the word distribution of dialogs spoken by Top
6 Speakers in our dataset. Similarity in the top 20 set of each plot suggests that humor/non-humor
is not biased due to a particular speaker.
</caption>
<tr>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746598-83a17900-8492-11e9-9479-62cdd7b618e2.png"
width="100%">
</td>
<td>
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746604-8dc37780-8492-11e9-8f8d-b41152fc466b.png"
width="100%">
</td>
</tr>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746601-8c924a80-8492-11e9-8ca0-2909a7c8b681.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746605-8e5c0e00-8492-11e9-91b2-0344751e5cfe.png"
width="100%">
</td>
</tr>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746602-8d2ae100-8492-11e9-9482-cc834d219e5a.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/21227893/58746603-8dc37780-8492-11e9-9502-5312f6985737.png"
width="100%">
</td>
</tr>
</table>
<br>
<br>
<table class="table table-striped table-bordered">
<caption style="text-align: center">Other Dataset Statistics
</caption>
<tr>
<td width="33.33%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53884884-88e15a00-4042-11e9-99a5-8fdd7a46ce68.png"
width="100%">
</td>
<td width="33.33%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53952853-f13e4300-40f7-11e9-9444-7d9a030dc4ae.png"
width="100%">
</td>
<td width="33.33%">
<img src="https://user-images.githubusercontent.com/48205355/53885089-ea092d80-4042-11e9-94c3-b4690723cb32.png"
width="100%">
</td>
</tr>
<tr>
<td width="33.33%">The figure showing average time per turn in a Dialog, across the Dataset.</td>
<td width="33.33%">The figure showing average dialog time, across the Dataset.</td>
<td width="33.33%">The figure showing contribution of each speaker in generating humor, across the
Dataset.
</td>
</tr>
</table>
<br><br>
<h4> MSAM model </h4>
<table class="table table-striped table-bordered">
<caption>The proposed Multimodal Self-Attention Model (MSAM) for the laughter-detection task. Features of each joint dialogue turn are obtained with a multimodal self-attention network; a sequential network then produces the final feature vector, which is fed to a binary classifier. A rough code sketch of this pipeline follows the figure.</caption>
<tr>
<td width="60%">
<img align="center" src="assets/m.png" width="100%">
</td>
</tr>
</table>
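<p>
The sketch below, in PyTorch, is only a rough illustration of the pipeline described in the
caption: self-attention over the fused modality features of each turn, a sequential network over
turns, and a binary classifier. The module choices (GRU, mean pooling) and all dimensions are our
assumptions, not the paper's exact architecture.
</p>
<pre><code>import torch
import torch.nn as nn

class MSAMSketch(nn.Module):
    """Illustrative sketch of the MSAM pipeline (assumed details, not the paper's exact model)."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.seq = nn.GRU(d_model, d_model, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)  # humor / non-humor logits

    def forward(self, turns):
        # turns: (batch, n_turns, n_modality_tokens, d_model) -- fused text/video
        # features per dialogue turn, assumed to be extracted upstream.
        b, t, m, d = turns.shape
        x = turns.reshape(b * t, m, d)
        x, _ = self.self_attn(x, x, x)      # self-attention over modality tokens
        x = x.mean(dim=1).reshape(b, t, d)  # one feature vector per turn
        _, h = self.seq(x)                  # sequential network over the turns
        return self.classifier(h[-1])       # final binary prediction

logits = MSAMSketch()(torch.randn(2, 4, 3, 256))
print(logits.shape)  # torch.Size([2, 2])
</code></pre>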
<br><br>
<h4> Qualitative results </h4>
<table class="table table-striped table-bordered">
<caption>
Randomly sampled results of the MSAM model for each prediction category, (correct/incorrect) × (humor/non-humor). E.g., a humor label in a red box means the ground-truth label was non-humor but the predicted label was humor.
</caption>
<tr>
<td>
<img align="center" src="assets/q.png" width="100%">
</td>
</tr>
</table>
<br><br>
<h4> Explaining humor </h4>
<table class="table table-striped table-bordered">
<caption>
The left column visualizes attention at the word level; the right column visualizes attention at the turn level.
</caption>
<tr>
<td>
<img align="center" src="assets/c.png" width="100%">
</td>
</tr>
</table>
<br><br>
<h4> Baseline Models </h4>
<table class="table table-striped table-bordered">
<caption style="text-align: center">Fusion Models</caption>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53886442-0e1a3e00-4046-11e9-87a3-259d68593d62.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53886443-0e1a3e00-4046-11e9-808b-6e50ec5a9b04.png"
width="100%">
</td>
</tr>
<tr>
<td align="center" width="50%">Text based Fusion Model (TFM)</td>
<td align="center" width="50%">Video based Fusion Model (VFM)</td>
</tr>
</table>
<table class="table table-striped table-bordered">
<caption style="text-align: center">Attention Models</caption>
<tr>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53887017-5f76fd00-4047-11e9-97d6-7e690101f2a9.png"
width="100%">
</td>
<td width="50%">
<img align="center"
src="https://user-images.githubusercontent.com/48205355/53887018-600f9380-4047-11e9-94b9-062446d18306.png"
width="100%">
</td>
</tr>
<tr>
<td align="center" width="50%">Text based Attention Model (TAM)</td>
<td align="center" width="50%">Video based Attention Model (VAM)</td>
</tr>
</table>
<br><br>
</div>
</body>
</html>