IMPORTANT: 1.0.3 VAD v5 is much worse than 1.0.2 or 1.0.1 VAD v4 for some certain audio data. WHY? #944

ckgithub2019 · 2024-07-31T11:01:39Z

In tests on audio files that contain only spoken dialogue, VAD v5 seems to be better than v4.

But for the music lyrics transcription I tested, VAD v5 in 1.0.3 is much worse than VAD v4 in 1.0.2 or 1.0.1. The v5 model and the related vad.py cut the audio off early (more than 3 minutes of this music test file). If I swap in the v4 model file and the 1.0.1 vad.py, with the same parameter values, v4 transcribes the whole music file. Why?

Here is my transcription result:
[00:00:00 -> 00:00:02] What's the trick? I wish I knew
[00:00:02 -> 00:00:05] I'm so done with thinkin' through
[00:00:05 -> 00:00:07] All the things I could've been
[00:00:07 -> 00:00:10] And I know you wonder too
[00:00:10 -> 00:00:12] All it takes is that one look you do
[00:00:12 -> 00:00:14] When I run right back to you
[00:00:14 -> 00:00:16] You cross the line
[00:00:16 -> 00:00:19] and it's time to say a few
[00:00:19 -> 00:00:22] What's point in sayin' that
[00:00:22 -> 00:00:24] when you know how I'll react
[00:00:24 -> 00:00:26] You think you can just take it back
[00:00:26 -> 00:00:29] But shit, just don't work like that
[00:00:29 -> 00:00:38] You're the drug that I'm addicted to
[00:00:38 -> 00:01:37] Cause when you know who falls
[00:01:37 -> 00:01:38] Cause when it-

zh-plus · 2024-07-31T16:43:08Z

You must carefully tune the parameters for Silero VAD-v5 according to your audio.

Related issues: #925, #934
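To make the tuning advice concrete, here is a minimal sketch of how a threshold, a minimum silence duration, and a padding value can turn per-frame speech probabilities into segments. It is a simplified illustration, not Silero's or this project's actual vad.py logic. `threshold` and `speech_pad_ms` are the parameters discussed in this thread; `min_silence_duration_ms`, the 32 ms frame length, and the function itself are assumptions made for the example.

```python
from typing import List, Tuple


def speech_segments(
    probs: List[float],
    frame_ms: int = 32,
    threshold: float = 0.5,
    min_silence_duration_ms: int = 2000,
    speech_pad_ms: int = 400,
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments.

    Simplified rule: a segment opens at the first frame whose probability
    reaches the threshold and closes once the silence that follows has
    lasted at least min_silence_duration_ms.
    """
    total_ms = len(probs) * frame_ms
    segments = []
    start = None          # start of the currently open segment, in ms
    silence_start = None  # where the current run of silence began, in ms

    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= threshold:
            if start is None:
                start = t
            silence_start = None  # speech resumed, forget the pause
        elif start is not None:
            if silence_start is None:
                silence_start = t
            if t + frame_ms - silence_start >= min_silence_duration_ms:
                segments.append((start, silence_start))
                start, silence_start = None, None

    if start is not None:  # audio ended while a segment was still open
        segments.append((start, silence_start if silence_start is not None else total_ms))

    # Pad both sides, clamp to the audio bounds, and merge any overlaps.
    merged: List[Tuple[int, int]] = []
    for s, e in segments:
        s, e = max(0, s - speech_pad_ms), min(total_ms, e + speech_pad_ms)
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

In this sketch, frames whose probability sits just below the threshold (as sung vocals over music might) are dropped entirely, and padding cannot recover them. That would be consistent with `speech_pad_ms` helping less than a lower `threshold` or a longer `min_silence_duration_ms`.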

ckgithub2019 · 2024-08-01T01:51:07Z

> You must carefully tune the parameters for Silero VAD-v5 according to your audio.
>
> Related issues: #925, #934

Thanks.

I tried different speech_pad_ms values (400, 600, 800, 1000), as mentioned in related issue #925, but the results did not improve significantly.

Here is the test result with speech_pad_ms from 600 to 1000 for Silero VAD-v5. A lot of content is still lost:
[00:00:00 -> 00:00:02] What's the trick? I wish I knew
[00:00:02 -> 00:00:05] I'm so done with thinkin' through
[00:00:05 -> 00:00:07] All the things I could've been
[00:00:07 -> 00:00:10] And I know you wonder too
[00:00:10 -> 00:00:12] All it takes is that one look you do
[00:00:12 -> 00:00:14] When I run right back to you
[00:00:14 -> 00:00:16] You cross the line
[00:00:16 -> 00:00:19] and it's time to say a few
[00:00:19 -> 00:00:22] What's point in sayin' that
[00:00:22 -> 00:00:24] when you know how I'll react
[00:00:24 -> 00:00:27] You think you can just take it back
[00:00:27 -> 00:00:29] But shit, just don't work like that
[00:00:29 -> 00:00:32] You're the drug that I'm addicted to
[00:00:38 -> 00:01:37] Cause when it all falls down
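Lost content in listings like the one above can be located automatically. The sketch below parses lines in the `[HH:MM:SS -> HH:MM:SS] text` format and flags gaps between segments, as well as segments that span far more time than their word count suggests. The function names and both limits (`max_gap`, `max_seconds_per_word`) are hypothetical values chosen for illustration.

```python
import re
from typing import List, Tuple

LINE = re.compile(r"\[(\d+):(\d+):(\d+) -> (\d+):(\d+):(\d+)\]\s*(.*)")


def parse(lines: List[str]) -> List[Tuple[int, int, str]]:
    """Parse transcript lines into (start_s, end_s, text) tuples."""
    out = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            h1, m1, s1, h2, m2, s2 = (int(x) for x in m.groups()[:6])
            out.append((h1 * 3600 + m1 * 60 + s1, h2 * 3600 + m2 * 60 + s2, m.group(7)))
    return out


def suspicious(
    segments: List[Tuple[int, int, str]],
    max_seconds_per_word: float = 3.0,
    max_gap: int = 3,
) -> List[str]:
    """Report gaps between segments and segments that are long but nearly empty."""
    report = []
    prev_end = None
    for start, end, text in segments:
        if prev_end is not None and start - prev_end > max_gap:
            report.append(f"gap {prev_end}s -> {start}s")
        words = max(1, len(text.split()))
        if (end - start) / words > max_seconds_per_word:
            report.append(f"sparse {start}s -> {end}s: {text!r}")
        prev_end = end
    return report


lines = [
    "[00:00:29 -> 00:00:32] You're the drug that I'm addicted to",
    "[00:00:38 -> 00:01:37] Cause when it all falls down",
]
for item in suspicious(parse(lines)):
    print(item)
```

Running the same check on the v4 and v5 outputs of one file gives a quick side-by-side view of where v5 drops material.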

zh-plus · 2024-08-01T14:10:20Z

Better to directly change the threshold to 0.4/0.3/0.2.
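A sketch of sweeping those thresholds, assuming the library under discussion is faster-whisper and that VAD options are passed through `transcribe(..., vad_filter=True, vad_parameters=...)`. The thread does not name the library, so treat that as an assumption. The helper names, the `"small"` model size, and the use of summed segment duration as a rough coverage measure are illustrative choices as well.

```python
from typing import Dict, Iterable, List


def vad_grid(thresholds: Iterable[float], speech_pad_ms: int = 400) -> List[Dict]:
    """Build one VAD parameter dict per threshold to try."""
    return [{"threshold": t, "speech_pad_ms": speech_pad_ms} for t in thresholds]


def sweep(audio_path: str, thresholds: Iterable[float], model_size: str = "small") -> Dict[float, float]:
    """Transcribe once per threshold and return seconds of transcribed audio.

    The import is kept inside the function so the grid helper above can be
    used without faster-whisper installed.
    """
    from faster_whisper import WhisperModel  # assumed dependency

    model = WhisperModel(model_size)
    results = {}
    for params in vad_grid(thresholds):
        segments, _info = model.transcribe(
            audio_path, vad_filter=True, vad_parameters=params
        )
        # Sum of segment durations: a rough proxy for how much audio survived VAD.
        results[params["threshold"]] = sum(s.end - s.start for s in segments)
    return results
```

Calling `sweep("song.mp3", [0.5, 0.4, 0.3, 0.2])` (the file name is a placeholder) would show whether coverage grows as the threshold drops.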

ckgithub2019 · 2024-08-02T05:02:51Z

> Better to directly change the threshold to 0.4/0.3/0.2.

Not really. I changed the threshold to 0.4/0.3/0.2 and saw a small improvement, but it still cannot transcribe the whole audio, whereas VAD v4 handles it well. I have no idea how to fix it. Moreover, when the speaker pauses or is not very fluent, VAD v5 readily cuts a sentence into word-by-word segments, while VAD v4 does not.

CheshireCC · 2024-08-03T03:56:49Z

I also think that the V5 version is worse than V4. Why?

aligokalppeker · 2024-08-05T16:48:33Z

I've tested the VAD V5 model in its bare form, independent of the timestamping and cropping code, and I can say that it is worse than V4. It definitely misses some of the silence/activity parts. Adjusting the parameters in the code may improve the V5 results to a degree, but not significantly, and once you reduce the threshold it starts to produce more false positives.
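The trade-off described here can be measured at the frame level when reference speech/non-speech labels are available. The sketch below is a generic evaluation helper, not the procedure used in the test above. The probabilities and labels are made-up values that only show the mechanics.

```python
from typing import List, Tuple


def frame_error_rates(
    probs: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate) for one threshold.

    probs:  per-frame speech probabilities from the VAD model
    labels: 1 for frames that really contain speech, 0 otherwise
    """
    speech = sum(labels)
    nonspeech = len(labels) - speech
    # A miss is a speech frame scored below the threshold.
    misses = sum(1 for p, y in zip(probs, labels) if y and p < threshold)
    # A false alarm is a non-speech frame scored at or above the threshold.
    false_alarms = sum(1 for p, y in zip(probs, labels) if not y and p >= threshold)
    return (
        misses / speech if speech else 0.0,
        false_alarms / nonspeech if nonspeech else 0.0,
    )


# Synthetic example: lowering the threshold trades misses for false alarms.
probs = [0.9, 0.45, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
for th in (0.5, 0.25):
    print(th, frame_error_rates(probs, labels, th))
```

Computing both rates for V4 and V5 on the same labeled audio would show whether any threshold lets V5 match V4 on both counts.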

ckgithub2019 · 2024-08-06T08:20:12Z

> I've tested the VAD V5 model in its bare form, independent of the timestamping and cropping code, and I can say that it is worse than V4. It definitely misses some of the silence/activity parts. Adjusting the parameters in the code may improve the V5 results to a degree, but not significantly, and once you reduce the threshold it starts to produce more false positives.

Couldn't agree more. Moreover, when multiple voices overlap, such as during a meeting discussion, the effect is much worse. For example:
...
[00:14:03 -> 00:14:03] About
[00:14:03 -> 00:14:03] The
[00:14:03 -> 00:14:04] Changes
[00:14:04 -> 00:14:04] So
[00:14:04 -> 00:14:04] Oh
[00:14:04 -> 00:14:04] I
[00:14:04 -> 00:14:04] Can
[00:14:04 -> 00:14:05] Tell
[00:14:05 -> 00:14:05] Him
[00:14:05 -> 00:14:06] Yeah
[00:14:06 -> 00:14:06] I
[00:14:06 -> 00:14:06] Can
[00:14:06 -> 00:14:06] Tell
[00:14:06 -> 00:14:06] Him
[00:14:06 -> 00:14:07] Tomorrow
[00:14:07 -> 00:14:07] We
[00:14:07 -> 00:14:07] Start
[00:14:07 -> 00:14:07] Nine
[00:14:07 -> 00:14:08] Autonomous
[00:14:08 -> 00:14:08] All
[00:14:08 -> 00:14:09] Right
[00:14:09 -> 00:14:09] Yeah
[00:14:09 -> 00:14:09] 10
[00:14:09 -> 00:14:10] To
[00:14:10 -> 00:14:10] 11
[00:14:11 -> 00:14:15] A.M
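As a stopgap for output like this, adjacent fragments can be merged after transcription. The sketch below joins segments separated by short gaps; `merge_fragments`, `max_gap`, and `max_len` are hypothetical names and values. It only tidies the presentation and does not address the fact that the VAD cut the audio into tiny chunks in the first place.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text)


def merge_fragments(
    segments: List[Segment], max_gap: float = 0.5, max_len: float = 10.0
) -> List[Segment]:
    """Join consecutive segments whose gap is at most max_gap seconds.

    A merged segment is never allowed to grow beyond max_len seconds.
    """
    merged: List[Segment] = []
    for start, end, text in segments:
        if merged:
            p_start, p_end, p_text = merged[-1]
            if start - p_end <= max_gap and end - p_start <= max_len:
                merged[-1] = (p_start, end, f"{p_text} {text}")
                continue
        merged.append((start, end, text))
    return merged


# The first three and last three fragments from the listing above, in seconds.
fragments = [
    (843, 843, "About"),
    (843, 843, "The"),
    (843, 844, "Changes"),
    (849, 850, "To"),
    (850, 850, "11"),
    (851, 855, "A.M"),
]
for seg in merge_fragments(fragments):
    print(seg)
```

On the VAD side, the analogous change is requiring a longer silence before a segment is closed, so that short pauses between words do not split a sentence.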

aligokalppeker · 2024-08-06T08:56:48Z

Unfortunately, this issue should be opened on the Silero repository, and they should fix this issue with a new model.
