<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>MSDWILD: MULTI-MODAL SPEAKER DIARIZATION DATASET IN THE WILD</title>
<link rel="stylesheet" type="text/css" href="css/bootstrap.min.css">
<link rel="stylesheet" type="text/css" href="css/user.css">
</head>
<body>
<div class="container">
<header>
<h1>MSDWILD: MULTI-MODAL SPEAKER DIARIZATION DATASET IN THE WILD <small class="color_fade">Online Samples</small>
</h1>
</header>
<h4> Authors:</h4>
<div style="font-size: medium;"> Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Yanmin Qian, Kai Yu </div>
<h4> Abstract:</h4>
<div style="font-size: medium;"> Speaker diarization in real-world acoustic environments is a challenging task attracting increasing interest from both academia and industry. Although it is widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild, a benchmark dataset for multi-modal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without heavy editing such as camera switching. Both audio and video are released. In particular, MSDWild contains a large portion of naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.</div>
<hr class="hr_line">
<h3>Demo1</h3>
<p>We use the <a href="https://www.robots.ox.ac.uk/~vgg/software/via/">VIA Annotator</a> to label speaker turns for diarization. We will provide the videos and their labels in RTTM format.</p>
<video width="960" height="540" controls>
<source src="samples/demo1/main.mp4" type="video/mp4">
</video>
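<p>As an illustration of how RTTM labels can be read, here is a minimal sketch. It assumes the standard space-separated RTTM layout (type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, lookahead). The sample rows and the helper names <code>parse_rttm</code> and <code>speaking_time</code> are hypothetical and are not taken from the MSDWild release.</p>

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    file_id: str
    speaker: str
    onset: float     # turn start, in seconds
    duration: float  # turn length, in seconds


def parse_rttm(text):
    """Parse the SPEAKER rows of an RTTM document into Turn records."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # A SPEAKER row needs at least the first eight standard fields
        # (up to the speaker name); anything else is skipped.
        if len(fields) >= 8 and fields[0] == "SPEAKER":
            turns.append(
                Turn(fields[1], fields[7], float(fields[3]), float(fields[4]))
            )
    return turns


def speaking_time(turns):
    """Sum turn durations per speaker name."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn.speaker] += turn.duration
    return dict(totals)


# Made-up rows for illustration only (hypothetical file ID and speakers).
SAMPLE = """\
SPEAKER demo1 1 0.00 2.50 <NA> <NA> spk0 <NA> <NA>
SPEAKER demo1 1 2.00 1.50 <NA> <NA> spk1 <NA> <NA>
SPEAKER demo1 1 4.00 1.00 <NA> <NA> spk0 <NA> <NA>
"""

if __name__ == "__main__":
    for speaker, seconds in sorted(speaking_time(parse_rttm(SAMPLE)).items()):
        print(speaker, seconds)
```

<p>In the sample rows, the first two turns overlap between 2.00 s and 2.50 s, which is the kind of overlapped speech the dataset is meant to cover.</p>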
<p>We also provide additional talking-face videos for training models of the audio-visual relation.</p>
<table class="table">
<tbody>
<tr> <td> Speaking Faces </td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/0.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/1.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/2.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/3.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/4.mp4" type="video/mp4"> </video></td>
</tr>
<tr>
<td> </td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/5.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/6.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/speaking_mp4s/7.mp4" type="video/mp4"> </video></td>
</tr>
<tr> <td> Not Speaking Faces </td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/0.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/1.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/2.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/3.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/4.mp4" type="video/mp4"> </video></td>
</tr>
<tr>
<td> </td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/5.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/6.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/7.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/8.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/9.mp4" type="video/mp4"> </video></td>
</tr>
<tr>
<td> </td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/10.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/11.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/12.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/13.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/14.mp4" type="video/mp4"> </video></td>
</tr>
<tr>
<td> </td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/15.mp4" type="video/mp4"> </video></td>
<td> <video width="150" height="150" controls> <source src="samples/demo1/silence_mp4s/16.mp4" type="video/mp4"> </video></td>
</tr>
</tbody>
</table>
<h4> Download:</h4>
<table class="table">
<tbody>
<tr>
<td> <a href="https://github.com/X-LANCE/MSDWILD">Download</a> </td>
</tr>
</tbody>
</table>
<hr>
The dataset is for academic use <b>ONLY</b>.
<hr>
</div>
</body>