IMPORTANT: 1.0.3 VAD v5 is much worse than 1.0.2 or 1.0.1 VAD v4 for some certain audio data. WHY? #944
Comments
Thanks. I tried different speech_pad_ms values (400, 600, 800, 1000), as related issue #925 suggested, but the improvement in results is not significant. Here is the test result for speech_pad_ms from 600 to 1000 with Silero VAD v5. A lot of content is still being lost.
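For anyone who wants to repeat this kind of sweep, here is a rough sketch. It assumes faster-whisper is installed and uses its `vad_filter` / `vad_parameters` options on `WhisperModel.transcribe`; the helper names (`covered_seconds`, `sweep_speech_pad`, `faster_whisper_runner`), the model size, and the idea of using total transcribed seconds as a proxy for "how much content survived" are my own assumptions, not something from this thread.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def covered_seconds(segments: Iterable[Segment]) -> float:
    """Total duration covered by the segments (rough proxy for kept content)."""
    return sum(max(0.0, end - start) for start, end in segments)


def sweep_speech_pad(
    transcribe: Callable[[int], Iterable[Segment]],
    pad_values: Iterable[int],
) -> Dict[int, float]:
    """Run `transcribe` once per speech_pad_ms value and report covered seconds."""
    return {pad: covered_seconds(transcribe(pad)) for pad in pad_values}


def faster_whisper_runner(audio_path: str, model_size: str = "small"):
    """Hypothetical adapter: returns a function usable with sweep_speech_pad.

    The import is local so the helpers above work without faster-whisper.
    """
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)

    def run(pad_ms: int) -> List[Segment]:
        segments, _info = model.transcribe(
            audio_path,
            vad_filter=True,
            vad_parameters={"speech_pad_ms": pad_ms},
        )
        return [(s.start, s.end) for s in segments]

    return run
```

Usage would look like `sweep_speech_pad(faster_whisper_runner("song.mp3"), [400, 600, 800, 1000])`, where `song.mp3` is a placeholder path. If the covered seconds barely change across pad values, padding is probably not the limiting factor.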
Better to directly change the
Not really, I tried to change the
I also think the V5 version is worse than V4. Why?
I've tested the VAD V5 model in its bare form, independent of the timestamping and cropping logic, and I can say that it is worse than V4. It definitely misses some of the silence/activity parts. Adjusting the parameters in the code with the V5 model may improve the results to some extent, but not significantly. When you reduce the threshold, it starts to detect more false positives.
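To make the threshold trade-off concrete, here is a toy sketch. It is not Silero's actual post-processing, just a simplified illustration: it takes made-up per-frame speech probabilities, keeps frames at or above a threshold, then pads and merges the resulting segments. The function name `speech_segments`, the 32 ms frame size, and the probability values are all assumptions for the example.

```python
from typing import List, Sequence, Tuple


def speech_segments(
    probs: Sequence[float],
    threshold: float = 0.5,
    frame_ms: int = 32,
    pad_ms: int = 0,
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into padded (start_ms, end_ms) segments."""
    raw: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech begins at this frame
        elif p < threshold and start is not None:
            raw.append((start * frame_ms, i * frame_ms))
            start = None
    total_ms = len(probs) * frame_ms
    if start is not None:
        raw.append((start * frame_ms, total_ms))  # speech runs to the end

    # Pad each segment on both sides and merge segments that now touch or overlap.
    merged: List[Tuple[int, int]] = []
    for s, e in raw:
        s, e = max(0, s - pad_ms), min(total_ms, e + pad_ms)
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


# Made-up probabilities: clear speech in frames 2-3, weak activity in frames 1 and 6.
probs = [0.1, 0.4, 0.9, 0.8, 0.3, 0.1, 0.45, 0.2]
print(speech_segments(probs, threshold=0.5))
print(speech_segments(probs, threshold=0.35))
```

With the higher threshold only the clear speech is kept; lowering it also admits the weak frames, which are either quiet speech that was being missed or false positives. If the model's probabilities are poor on a given input, no threshold separates the two cleanly.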
Couldn't agree more. Moreover, when multiple voices overlap, such as during a meeting discussion, the effect is much worse, for example:
Unfortunately, this issue should be opened on the Silero repository, and they should fix it with a new model.
All falls down.zip
When testing some audio files that contain only human voice dialogue, the v5 VAD seems to be better than v4.
But for the music lyrics transcription I tested, 1.0.3 with VAD v5 is much worse than 1.0.2 or 1.0.1 with VAD v4: v5 and the related vad.py truncate the audio content early (more than 3 minutes of this music test file). If I replace the v5 model file with the v4 version and the 1.0.1 vad.py, keeping the same parameter values, v4 can fully transcribe the content of this music file. Why?
Here is my transcription result:
[00:00:00 -> 00:00:02] What's the trick? I wish I knew
[00:00:02 -> 00:00:05] I'm so done with thinkin' through
[00:00:05 -> 00:00:07] All the things I could've been
[00:00:07 -> 00:00:10] And I know you wonder too
[00:00:10 -> 00:00:12] All it takes is that one look you do
[00:00:12 -> 00:00:14] When I run right back to you
[00:00:14 -> 00:00:16] You cross the line
[00:00:16 -> 00:00:19] and it's time to say a few
[00:00:19 -> 00:00:22] What's point in sayin' that
[00:00:22 -> 00:00:24] when you know how I'll react
[00:00:24 -> 00:00:26] You think you can just take it back
[00:00:26 -> 00:00:29] But shit, just don't work like that
[00:00:29 -> 00:00:38] You're the drug that I'm addicted to
[00:00:38 -> 00:01:37] Cause when you know who falls
[00:01:37 -> 00:01:38] Cause when it-
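The jump from 00:00:38 to 00:01:37 on a single short lyric line is the visible symptom here. A small sketch for spotting such lines automatically in output of this format follows; the function names and the 15-second cutoff are arbitrary assumptions for illustration.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches lines such as "[00:00:38 -> 00:01:37] some text".
LINE_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2}) -> (\d{2}):(\d{2}):(\d{2})\]\s*(.*)"
)


def parse_line(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (start_s, end_s, text), or None if the line is not a transcript line."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    h1, m1, s1, h2, m2, s2 = (int(x) for x in m.groups()[:6])
    return (h1 * 3600 + m1 * 60 + s1, h2 * 3600 + m2 * 60 + s2, m.group(7))


def suspicious_segments(
    lines: Iterable[str], max_seconds: int = 15
) -> List[Tuple[int, int, str]]:
    """Segments whose duration exceeds max_seconds (likely dropped content)."""
    flagged = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[1] - parsed[0] > max_seconds:
            flagged.append(parsed)
    return flagged


sample = [
    "[00:00:29 -> 00:00:38] You're the drug that I'm addicted to",
    "[00:00:38 -> 00:01:37] Cause when you know who falls",
    "[00:01:37 -> 00:01:38] Cause when it-",
]
for start, end, text in suspicious_segments(sample):
    print(f"{end - start}s for: {text}")
```

Running the same check on the v4 and v5 outputs for one file gives a quick way to compare where each version drops content.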