Squash long words at window and sentence boundaries. #1114
I've converted this PR to a draft, because I think case 2 (i.e. the original heuristic) should probably be improved further. It looks like it could make the 2nd and 3rd words of a window overlap. It also still looks susceptible to transporting the first two words a long distance if the 3rd word of a window begins a new sentence.
From the original heuristic:

    if len(word_durations) >= 2 and word_durations[1] > max_duration:
        boundary = max(end_times[2] / 2, end_times[2] - max_duration)
        end_times[0] = start_times[1] = boundary

I think that was supposed to be:

    if len(word_durations) >= 2 and word_durations[1] > max_duration:
        boundary = max(end_times[1] / 2, end_times[1] - max_duration)
        end_times[0] = start_times[1] = boundary
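The effect of the index can be checked with a small sketch. It uses plain Python lists and made-up timings; the helper `squash_second_word` and its `ref` parameter are illustrative assumptions, not code from this PR. With `ref=2` (the original snippet) a long third word pushes the boundary past the end of the second word, while `ref=1` keeps the second word inside its own span.

```python
def squash_second_word(start_times, end_times, max_duration, ref):
    """Squash an over-long second word; `ref` is the index used for the boundary.

    `ref=2` reproduces the original snippet, `ref=1` the corrected one.
    This helper and its `ref` parameter are illustrative, not part of the PR.
    """
    starts, ends = list(start_times), list(end_times)  # leave the inputs untouched
    if len(starts) >= 2 and ends[1] - starts[1] > max_duration:
        boundary = max(ends[ref] / 2, ends[ref] - max_duration)
        ends[0] = starts[1] = boundary
    return starts, ends


# Made-up timings: a short first word, a stretched second word, a long third word.
starts = [0.0, 0.2, 3.0]
ends = [0.2, 3.0, 6.0]

buggy_starts, buggy_ends = squash_second_word(starts, ends, max_duration=1.0, ref=2)
fixed_starts, fixed_ends = squash_second_word(starts, ends, max_duration=1.0, ref=1)

print(buggy_starts[1], buggy_ends[1])  # 5.0 3.0 -> second word starts after it ends
print(fixed_starts[1], fixed_ends[1])  # 2.0 3.0 -> second word is 1.0 s long
```

In the `ref=2` case the second word's start (5.0) also lands inside the third word's span (3.0 to 6.0), which is the overlap described above.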
I fixed the above issue and a similar issue, so I've now removed the draft status of the PR.
Hi Ryan, I tried your improvements, and they work really well. However, I wanted to ask if you could tell me how you got the subtitles with the underlines highlighting the words, because passing the srt file straight to moviepy shows the tag instead of underlining the words. I thought it was due to the font I was using, but it was not. Hope you can help me.
Thanks, @seba-aguila . To generate the video above, I used:
* Squash long words at window and sentence boundaries. * Formatting requirements. * Fix squashing logic to point to correct words. --------- Co-authored-by: Jong Wook Kim <[email protected]>
This PR improves the heuristic to squash long words at window/sentence boundaries.
The previous heuristic squashed any of the first two words of a window if they were too long, since long words at the start tend to indicate words being stretched out to cover silence at the start of the window.
This PR makes the following 3 improvements:
Test example: https://audio2.redcircle.com/episodes/6b196013-8672-43d9-be52-4332b3207d93/stream.mp3
BEFORE
69-whiskey_subbed-35-orig.mp4
AFTER
69-whiskey_subbed-35-pr.mp4
Example of 1. (detecting a gap between sentences mid-window)
In this example, both segments are within the same window. The audio exhibits a 3 second pause between sentences, but the timestamps elongate the first word of the second sentence ("Along") to cover that silence.
Before:
After:
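A minimal sketch of the idea behind case 1, not the code in this PR: if a word follows a sentence-ending word and is much longer than a typical word, its start is moved later so the silence is no longer attributed to it. The `Word` class, the punctuation tuple, the "twice the median" threshold, and the timings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Word:
    text: str
    start: float
    end: float


SENTENCE_END = (".", "!", "?")  # assumed set of sentence-ending marks


def squash_after_sentence_end(words, factor=2.0):
    """Shorten any over-long word that directly follows a sentence end.

    The word keeps its end time; its start moves later so that its duration
    is at most `factor` times the median word duration (an assumed threshold).
    """
    durations = [w.end - w.start for w in words if w.end > w.start]
    if not durations:
        return words
    max_duration = factor * median(durations)
    for prev, cur in zip(words, words[1:]):
        follows_sentence = prev.text.rstrip().endswith(SENTENCE_END)
        if follows_sentence and cur.end - cur.start > max_duration:
            cur.start = cur.end - max_duration
    return words


# Made-up timings: "Along" has been stretched over a pause of roughly 3 seconds.
words = [
    Word("ends", 0.0, 0.4),
    Word("here.", 0.4, 0.8),
    Word("Along", 0.8, 4.2),
    Word("the", 4.2, 4.4),
    Word("way", 4.4, 4.8),
]
squash_after_sentence_end(words)
print(words[2])  # "Along" now starts near 3.4 s instead of 0.8 s
```

After squashing, the interval between the end of "here." and the new start of "Along" is left unassigned, which matches the pause in the audio.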
Example of 2. (first 2 words of a window incorrectly classified as long)
In this example, the second segment starts a new window. However, the second word of this window ("unapologetic") is naturally long while the word before it ("and") is short, so the heuristic incorrectly classifies it as elongated. As a result, "and unapologetic" gets transported 1.5 seconds into the future after squashing.
Before:
After:
It's a minor change to the original heuristic, but I didn't want to modify it too much since I don't have the original test cases that may have inspired it. My inclination would have been to remove it and generalise (1) to handle sentence and window boundaries the same way. I have never seen a case where the first TWO words were both elongated; I've only seen the word immediately touching a boundary become elongated (at least since the introduction of the new word-level timestamps feature).
So I've left it in for now. If you have test cases where the first two words were elongated, we can come back to this point.
Example of 3. (end of window being skipped and last word incorrectly elongated)
In this example, the second segment starts a new window. At the end of the first window, the word "you're" is elongated to almost 4 seconds, which masks the words "going to need to understand because" from the original audio. Those words are never transcribed: they are skipped over because the next window starts too late.
Before:
After:
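As a rough illustration of case 3 only, and not the code in this PR: if the last word of a window is clamped to a maximum duration, the end of the clamped word gives an earlier point from which the next window could resume. The function `clamp_last_word`, the `(text, start, end)` tuple layout, the `max_duration` value, and the timings are hypothetical.

```python
def clamp_last_word(words, max_duration):
    """Clamp an over-long final word and return (new_words, resume_time).

    `words` is a list of (text, start, end) tuples. `resume_time` is the end
    of the (possibly clamped) last word, or None if there are no words.
    """
    if not words:
        return [], None
    text, start, end = words[-1]
    if end - start > max_duration:
        end = start + max_duration  # drop the stretched tail of the word
    return words[:-1] + [(text, start, end)], end


# Made-up timings: "you're" has been stretched to almost 4 seconds.
window = [("what", 24.0, 24.3), ("you're", 24.3, 28.2)]
clamped, resume_time = clamp_last_word(window, max_duration=0.6)
print(clamped[-1], resume_time)
```

Here the suggested resume point moves from 28.2 s back to about 24.9 s, so audio that was hidden under the stretched word would fall inside the next window.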