IMPORTANT: 1.0.3 VAD v5 is much worse than 1.0.2 or 1.0.1 VAD v4 for certain audio data. WHY? #944

Open
ckgithub2019 opened this issue Jul 31, 2024 · 8 comments

Comments

@ckgithub2019

All falls down.zip

For audio files that contain only human voice dialogue, VAD v5 seems to be better than v4.

But for the music lyrics transcription I tested, 1.0.3 with VAD v5 is much worse than 1.0.2 or 1.0.1 with VAD v4. v5 and the related vad.py cut off audio content early (more than 3 minutes in this music test file). If I replace the v5 model file with the v4 one and use the 1.0.1 vad.py, with the same parameter values, v4 can fully transcribe the content of this music file. Why?

Here is my transcription result:
[00:00:00 -> 00:00:02] What's the trick? I wish I knew
[00:00:02 -> 00:00:05] I'm so done with thinkin' through
[00:00:05 -> 00:00:07] All the things I could've been
[00:00:07 -> 00:00:10] And I know you wonder too
[00:00:10 -> 00:00:12] All it takes is that one look you do
[00:00:12 -> 00:00:14] When I run right back to you
[00:00:14 -> 00:00:16] You cross the line
[00:00:16 -> 00:00:19] and it's time to say a few
[00:00:19 -> 00:00:22] What's point in sayin' that
[00:00:22 -> 00:00:24] when you know how I'll react
[00:00:24 -> 00:00:26] You think you can just take it back
[00:00:26 -> 00:00:29] But shit, just don't work like that
[00:00:29 -> 00:00:38] You're the drug that I'm addicted to
[00:00:38 -> 00:01:37] Cause when you know who falls
[00:01:37 -> 00:01:38] Cause when it-

@zh-plus
Contributor

zh-plus commented Jul 31, 2024

You must carefully tune the parameters for Silero VAD-v5 according to your audio.

Related issue: #925, #934

@ckgithub2019
Author

> You must carefully tune the parameters for Silero VAD-v5 according to your audio.
>
> Related issue: #925, #934

Thanks.

I tried different speech_pad_ms values (400, 600, 800, 1000), as mentioned in related issue #925, but the improvement in the results is not significant.

Here is the test result for speech_pad_ms from 600 to 1000 with Silero VAD-v5. A lot of content is still being lost:
[00:00:00 -> 00:00:02] What's the trick? I wish I knew
[00:00:02 -> 00:00:05] I'm so done with thinkin' through
[00:00:05 -> 00:00:07] All the things I could've been
[00:00:07 -> 00:00:10] And I know you wonder too
[00:00:10 -> 00:00:12] All it takes is that one look you do
[00:00:12 -> 00:00:14] When I run right back to you
[00:00:14 -> 00:00:16] You cross the line
[00:00:16 -> 00:00:19] and it's time to say a few
[00:00:19 -> 00:00:22] What's point in sayin' that
[00:00:22 -> 00:00:24] when you know how I'll react
[00:00:24 -> 00:00:27] You think you can just take it back
[00:00:27 -> 00:00:29] But shit, just don't work like that
[00:00:29 -> 00:00:32] You're the drug that I'm addicted to
[00:00:38 -> 00:01:37] Cause when it all falls down

@zh-plus
Contributor

zh-plus commented Aug 1, 2024

Better to directly change the threshold to 0.4/0.3/0.2.
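
As a rough sketch of how the values mentioned in this thread could be swept systematically: the snippet below assumes the project is faster-whisper and that `WhisperModel.transcribe` accepts `vad_filter` and a `vad_parameters` dict with `threshold` and `speech_pad_ms`. The helper names, the model size, and the audio path are hypothetical. It reports how far into the file the last segment reaches, which is the symptom described above.

```python
from itertools import product


def vad_parameter_grid(thresholds=(0.4, 0.3, 0.2), pads_ms=(400, 600, 800, 1000)):
    """Return every combination of the threshold / speech_pad_ms values tried in this thread."""
    return [
        {"threshold": t, "speech_pad_ms": p}
        for t, p in product(thresholds, pads_ms)
    ]


def transcribe_with(model, audio_path, vad_parameters):
    """Run one transcription; return (end of last segment in seconds, segment count)."""
    # Assumption: `model` behaves like faster_whisper.WhisperModel.
    segments, _info = model.transcribe(
        audio_path, vad_filter=True, vad_parameters=vad_parameters
    )
    segments = list(segments)  # the segments are produced lazily, so consume them here
    last_end = segments[-1].end if segments else 0.0
    return last_end, len(segments)


def sweep(audio_path, model_size="small"):
    """Print coverage for each parameter combination (model size is an arbitrary choice)."""
    from faster_whisper import WhisperModel  # imported lazily; requires faster-whisper

    model = WhisperModel(model_size)
    for params in vad_parameter_grid():
        print(params, transcribe_with(model, audio_path, params))
```

Calling `sweep("audio.wav")` would print one line per combination. If the last segment end stays far short of the file length for every combination, that supports the reports in this thread that tuning alone does not recover the missing content.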

@ckgithub2019
Author

> Better to directly change the threshold to 0.4/0.3/0.2.

Not really. I tried changing the threshold to 0.4/0.3/0.2 and saw a small improvement, but it still cannot transcribe the whole audio, whereas VAD v4 handles it well. I have no idea how to fix it; it's weird. Moreover, VAD v5 easily cuts a whole sentence into word-by-word segments when the speaker pauses or is not very fluent, while VAD v4 is fine.

@CheshireCC

I also think the v5 version is worse than v4. Why?

@aligokalppeker

aligokalppeker commented Aug 5, 2024

I've tested the VAD v5 model on its own, independent of the timestamping and cropping code, and I can say that it is worse than v4. It definitely misses some of the silence/activity regions. Adjusting the parameters in the code may improve the v5 results to some extent, but not significantly, and once you reduce the threshold it starts to produce more false positives.
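
The trade-off can be illustrated with a toy example. This is not the Silero or vad.py implementation, only a bare per-frame threshold over made-up probabilities; the real pipeline also applies minimum-duration and padding rules that are omitted here.

```python
def speech_frames(probs, threshold):
    """Return the indices of frames whose speech probability reaches the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]


# Hypothetical per-frame speech probabilities (invented for illustration, not model output).
# Frames 0-2: clear speech; frames 3-5: singing over music, scored low; frames 6-8: music only.
probs = [0.9, 0.8, 0.85, 0.35, 0.32, 0.4, 0.25, 0.1, 0.22]

for threshold in (0.5, 0.3, 0.2):
    print(threshold, speech_frames(probs, threshold))
```

In this invented data, a threshold of 0.5 drops the sung frames 3-5, 0.3 recovers them, and 0.2 also lets in the music-only frames 6 and 8. If the model scores real vocals close to the level of the background, no single threshold separates them cleanly.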

@ckgithub2019
Author

> I've tested the VAD v5 model on its own, independent of the timestamping and cropping code, and I can say that it is worse than v4. It definitely misses some of the silence/activity regions. Adjusting the parameters in the code may improve the v5 results to some extent, but not significantly, and once you reduce the threshold it starts to produce more false positives.

Couldn't agree more. Moreover, when multiple voices overlap, such as during a meeting discussion, the result is much worse. For example:
...
[00:14:03 -> 00:14:03] About
[00:14:03 -> 00:14:03] The
[00:14:03 -> 00:14:04] Changes
[00:14:04 -> 00:14:04] So
[00:14:04 -> 00:14:04] Oh
[00:14:04 -> 00:14:04] I
[00:14:04 -> 00:14:04] Can
[00:14:04 -> 00:14:05] Tell
[00:14:05 -> 00:14:05] Him
[00:14:05 -> 00:14:06] Yeah
[00:14:06 -> 00:14:06] I
[00:14:06 -> 00:14:06] Can
[00:14:06 -> 00:14:06] Tell
[00:14:06 -> 00:14:06] Him
[00:14:06 -> 00:14:07] Tomorrow
[00:14:07 -> 00:14:07] We
[00:14:07 -> 00:14:07] Start
[00:14:07 -> 00:14:07] Nine
[00:14:07 -> 00:14:08] Autonomous
[00:14:08 -> 00:14:08] All
[00:14:08 -> 00:14:09] Right
[00:14:09 -> 00:14:09] Yeah
[00:14:09 -> 00:14:09] 10
[00:14:09 -> 00:14:10] To
[00:14:10 -> 00:14:10] 11
[00:14:11 -> 00:14:15] A.M
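
As a possible stopgap for output like the above, fragments separated by little or no gap could be merged after transcription. The sketch below is a hypothetical post-processing helper, not part of the library, and it does not fix the detection itself. It parses lines in the format shown in this thread, whose timestamps only have one-second resolution.

```python
import re

# Matches lines such as "[00:14:09 -> 00:14:10] To"
LINE = re.compile(r"\[(\d+):(\d+):(\d+) -> (\d+):(\d+):(\d+)\] (.*)")


def parse(lines):
    """Turn transcript lines into (start_s, end_s, text) tuples, skipping other lines."""
    out = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        h1, m1, s1, h2, m2, s2 = (int(g) for g in m.groups()[:6])
        out.append((h1 * 3600 + m1 * 60 + s1, h2 * 3600 + m2 * 60 + s2, m.group(7)))
    return out


def merge(segments, max_gap_s=0):
    """Join each segment to the previous one when the gap between them is at most max_gap_s."""
    merged = []
    for start, end, text in segments:
        if merged and start - merged[-1][1] <= max_gap_s:
            prev_start, _prev_end, prev_text = merged[-1]
            merged[-1] = (prev_start, end, prev_text + " " + text)
        else:
            merged.append((start, end, text))
    return merged


sample = [
    "[00:14:09 -> 00:14:09] 10",
    "[00:14:09 -> 00:14:10] To",
    "[00:14:10 -> 00:14:10] 11",
    "[00:14:11 -> 00:14:15] A.M",
]
print(merge(parse(sample)))
```

With `max_gap_s=0`, the first three fragments collapse into one segment and the last stays separate because it starts a second later. A larger `max_gap_s` merges more aggressively, at the risk of joining different speakers' sentences.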

@aligokalppeker

Unfortunately, this issue should be opened on the Silero repository, and they should fix this issue with a new model.


4 participants