add Visualization
liutaocode committed Jan 24, 2024
1 parent 308d86d commit d51115c
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -98,6 +98,7 @@ Move the `slider` to preview the positions and ID information of faces on differ
* The result can `not` be directly converted to exactly the same [RTTM](./rttms/all.rttm), as some durations or face IDs are adjusted and off-screen speech is not included in this part. Note that the facial identifiers in each video are unique and also differ from the identifiers in the [RTTM](./rttms/all.rttm) mentioned above.
* Different from the above-mentioned cropped faces, the annotation here refers to the bounding box of the unprocessed face in the original video.
* **Why are we releasing it now?** Our initial experiments were conducted using a training set based on cropped faces. However, we realized that facial tagging is extremely important for multi-modal speaker diarization. Consequently, following the publication of our work, we decided to embark on a frame-by-frame review process. The undertaking is massive, involving the inspection of approximately 120,000 video frames and ensuring that the IDs remain consistent throughout each video. We also conducted a second round of checks for added accuracy. It is only after this meticulous process that we are now able to release the dataset for public use.
* **What is the relationship between audio labels and visual labels?** You can use this [link](https://github.com/X-LANCE/MSDWILD/tree/master/visualization) to visualize the relationship; a minimal parsing sketch is also given after this list.
* We regard this as merely **supplementary** material for this dataset. Possible future work we envision includes training an end-to-end multimodal speaker diarization model that incorporates facial location information, and an evaluation method for multimodal speaker diarization that takes the face location into account.
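
As a rough illustration of how the audio labels relate to the frame-level face annotations, below is a minimal Python sketch (not part of the repository's tooling) that parses a standard RTTM file and lists the speaker labels active at a given frame's timestamp. The file id, frame rate, and frame index are placeholders, and since the face identifiers do not match the RTTM identifiers, the comparison tells you how many and which speakers are active at that moment rather than giving a direct ID match.

```python
# Minimal sketch: parse a standard RTTM file and list the speakers active at
# a given frame's timestamp.  Not the repository's own code; the frame rate
# and frame index below are placeholders.
from collections import defaultdict

def load_rttm(path):
    """Return {file_id: [(onset, offset, speaker), ...]} from an RTTM file."""
    segments = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8 or fields[0] != "SPEAKER":
                continue
            onset, dur = float(fields[3]), float(fields[4])
            segments[fields[1]].append((onset, onset + dur, fields[7]))
    return segments

def active_speakers(segments, file_id, t):
    """Speaker labels whose segments cover time t (in seconds)."""
    return {spk for onset, offset, spk in segments.get(file_id, []) if onset <= t < offset}

if __name__ == "__main__":
    rttm = load_rttm("rttms/all.rttm")
    fps, frame_idx = 25.0, 250              # placeholders: frame rate and frame index
    file_id = next(iter(rttm))              # pick any recording present in the RTTM
    t = frame_idx / fps
    # Compare this set with the face IDs shown on the same frame by the
    # visualization tool; the two label spaces are not aligned, so match by
    # count and timing rather than by raw ID.
    print(file_id, t, active_speakers(rttm, file_id, t))
```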

