Squash long words at window and sentence boundaries. #1114

ryanheise · 2023-03-18T08:52:45Z

This PR improves the heuristic to squash long words at window/sentence boundaries.

The previous heuristic squashed any of the first two words of a window if they were too long, since long words at the start tend to indicate words being stretched out to cover silence at the start of the window.

This PR makes the following 3 improvements:

1. It adds a heuristic to detect long words at sentence boundaries in the middle of the window, and takes these to indicate words being stretched to cover silent gaps between sentences. The long boundary words are squashed accordingly. This prevents many cases where the first word in a sentence comes in seconds too early.
2. It modifies the original heuristic, which squashed either of the first two words if they were long, so that it no longer squashes the second word unless the first word is also long. This is because the long-ness of a word "mid window" is less significant if it comes after a short word, except at a sentence boundary, which is already handled by case 1 above.
3. If token inference doesn't complete the last sentence in a window, that last sentence might be discarded, and timestamp alignment may then come up with a super elongated duration for the last word to cover the span where that last sentence would have been. Since that last word's end timestamp determines the start of the next window, a chunk of that last sentence will be skipped over. In such cases, we can still get an accurate timestamp for the end of the segment from the last timestamp token in the finished sequence. So if the last word of the window is super elongated, and the last timestamp token looks reasonable, we prefer it.
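Improvements 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Word` class, the `squash_long_words` function, the punctuation set, and the choice of "twice the median word duration" as the cap are all assumptions made for the example. It also assumes sentence-final punctuation arrives as its own timed word, as in the subtitle examples in this description.

```python
from dataclasses import dataclass
from statistics import median

# Assumed set of sentence-final punctuation marks (hypothetical).
SENTENCE_END_MARKS = frozenset(".。!！?？")


@dataclass
class Word:
    text: str
    start: float
    end: float


def squash_long_words(words: list[Word], factor: float = 2.0) -> float:
    """Squash over-long words at sentence and window boundaries, in place.

    Returns the duration cap that was applied (0.0 if there were no timed words).
    """
    durations = [w.end - w.start for w in words if w.end > w.start]
    if not durations:
        return 0.0
    max_duration = factor * median(durations)

    # Improvement 1: long words touching a sentence boundary mid-window.
    for i in range(1, len(words)):
        if words[i].end - words[i].start > max_duration:
            if words[i].text.strip() in SENTENCE_END_MARKS:
                # Punctuation stretched forward over a pause: pull its end back.
                words[i].end = words[i].start + max_duration
            elif words[i - 1].text.strip() in SENTENCE_END_MARKS:
                # First word of a sentence stretched backward: push its start forward.
                words[i].start = words[i].end - max_duration

    # Improvement 2: at the window start, only touch the second word
    # when the first word is also too long.
    if words and words[0].end - words[0].start > max_duration:
        if len(words) > 1 and words[1].end - words[1].start > max_duration:
            boundary = max(words[1].end / 2, words[1].end - max_duration)
            words[0].end = words[1].start = boundary
        words[0].start = max(0.0, words[0].end - max_duration)
    return max_duration
```

With this shape, a naturally long second word such as "unapologetic" following a short "and" is left alone, while a stretched "." or a stretched sentence-initial word is trimmed back to the cap.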

Test example: https://audio2.redcircle.com/episodes/6b196013-8672-43d9-be52-4332b3207d93/stream.mp3

BEFORE

69-whiskey_subbed-35-orig.mp4

AFTER

69-whiskey_subbed-35-pr.mp4

Example of 1. (detecting a gap between sentences mid-window)

In this example, both segments are within the same window. The audio exhibits a 3 second pause between sentences, but the timestamps elongate the first word of the second sentence ("Along") to cover that silence.

Before:

49
00:00:28,860 --> 00:00:30,100
and unapologetic<u>.</u>

50
00:00:30,100 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group

After:

49
00:00:28,860 --> 00:00:29,620
and unapologetic<u>.</u>

50
00:00:32,900 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group
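The effect on the two boundary words can be checked by computing durations from the cue timestamps above. The helper names below are hypothetical; only the timestamps come from the example.

```python
def srt_ms(stamp: str) -> int:
    """Convert an SRT timestamp such as '00:00:28,860' to milliseconds."""
    hms, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def duration_ms(start: str, end: str) -> int:
    """Duration of one cue in milliseconds."""
    return srt_ms(end) - srt_ms(start)


# Before: "." and "Along" together cover the whole pause.
period_before = duration_ms("00:00:28,860", "00:00:30,100")  # 1240 ms
along_before = duration_ms("00:00:30,100", "00:00:33,660")   # 3560 ms

# After: both words are capped at the same duration, leaving a silent gap.
period_after = duration_ms("00:00:28,860", "00:00:29,620")   # 760 ms
along_after = duration_ms("00:00:32,900", "00:00:33,660")    # 760 ms
gap_after = duration_ms("00:00:29,620", "00:00:32,900")      # 3280 ms
```

Both squashed words end up 760 ms long, which suggests (though the output alone does not prove it) that a single duration cap was applied to each, and the 3280 ms left between them matches the roughly 3 second pause in the audio.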

Example of 2. (first 2 words of a window incorrectly classified as long)

In this example, the second segment starts a new window. However, the second word of this window ("unapologetic") is naturally long while the word before it ("and") is short, so it is incorrectly classified by the heuristic. As a result, "and unapologetic" gets transported 1.5 seconds into the future after squashing.

Before:

46
00:00:26,580 --> 00:00:27,180
A show once restrained by rules and boundaries now comes straight to you raw,<u> uncensored</u>

47
00:00:28,580 --> 00:00:29,340
<u>and</u> unapologetic.

48
00:00:29,340 --> 00:00:28,860
and<u> unapologetic</u>.

49
00:00:28,860 --> 00:00:30,100
and unapologetic<u>.</u>

50
00:00:30,100 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group

After:

46
00:00:26,580 --> 00:00:27,180
A show once restrained by rules and boundaries now comes straight to you raw,<u> uncensored</u>

47
00:00:27,180 --> 00:00:27,480
<u>and</u> unapologetic.

48
00:00:27,480 --> 00:00:28,860
and<u> unapologetic</u>.

49
00:00:28,860 --> 00:00:29,620
and unapologetic<u>.</u>

50
00:00:32,900 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group

It's a minor change to the original heuristic, but I didn't want to modify it too much since I didn't have the original test cases that may have inspired it. My inclination would have been to get rid of it and just generalise (1) to handle both sentence and window boundaries the same way. I have never seen a case where the first TWO words were both elongated; I've only seen the words immediately touching a boundary become elongated (at least since the introduction of the new word-level timestamps feature).

So I've left this in there for now, and if you had test cases where the first two words were elongated, you might be able to come back to this point.

Example of 3. (end of window being skipped and last word incorrectly elongated)

In this example, the second segment starts a new window. At the end of the first window, the word "you're" is elongated to almost 4 seconds, which masks the words "going to need to understand because" from the original audio. Those words are never transcribed: they are skipped over because the next window starts too late.

Before:

1244
00:07:47,980 --> 00:07:51,340
and girls, these are all the things your parents didn't want you to understand about, but<u> you're</u>

1245
00:07:54,800 --> 00:07:55,280
<u>okay</u>.

After:

1245
00:07:47,980 --> 00:07:48,360
and girls, these are all the things your parents didn't want you to understand about, but<u> you're</u>

1246
00:07:48,360 --> 00:07:48,520
<u>going</u> to need to understand because it's okay.
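The decision in case 3 can be sketched as a small function. This is only an illustration under stated assumptions: `pick_window_end` is a hypothetical name, "super elongated" is taken to mean longer than a caller-supplied `max_duration`, and "looks reasonable" is taken to mean that the last timestamp token falls strictly inside the stretched word. The PR's actual criteria may differ.

```python
from typing import Optional


def pick_window_end(
    last_word_start: float,
    last_word_end: float,
    last_timestamp: Optional[float],
    max_duration: float,
) -> float:
    """Return the end time to use for the last word of a window."""
    too_long = last_word_end - last_word_start > max_duration
    # Assumed meaning of "looks reasonable": a timestamp token exists and lies
    # strictly inside the stretched word, so using it can only shorten the word.
    reasonable = (
        last_timestamp is not None
        and last_word_start < last_timestamp < last_word_end
    )
    return last_timestamp if too_long and reasonable else last_word_end


# Numbers from the example above, in seconds: "you're" spans 467.98 to 471.34.
# The timestamp token value 468.36 is assumed to equal the "After" end time,
# and the 0.8 second cap is made up for illustration.
window_end = pick_window_end(467.98, 471.34, 468.36, max_duration=0.8)
```

Because the chosen end time also determines where the next window starts, picking the earlier timestamp lets the following window begin before "going to need to understand because" instead of after it.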

ryanheise · 2023-03-21T06:46:51Z

> So I've left this in there for now, and if you had test cases where the first two words were elongated, you might be able to come back to this point.

I've converted this PR to a draft, because I think case 2 (i.e. the original heuristic) probably should be improved further. It looks like it could make the 2nd and 3rd word of a window overlap. And it also looks like it is still susceptible to transporting the first two words a long distance if the 3rd word of a window begins a new sentence.

ryanheise · 2023-03-21T07:16:35Z

In the original heuristic, start_times[1] could be shifted right beyond both end_times[1] and start_times[2]:

        if len(word_durations) >= 2 and word_durations[1] > max_duration:
            boundary = max(end_times[2] / 2, end_times[2] - max_duration)
            end_times[0] = start_times[1] = boundary

I think that was supposed to be:

        if len(word_durations) >= 2 and word_durations[1] > max_duration:
            boundary = max(end_times[1] / 2, end_times[1] - max_duration)
            end_times[0] = start_times[1] = boundary
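A small numeric check illustrates the difference between the two indices. The helper `squash_second_word` and the timings are made up for the illustration; only the boundary formula is taken from the snippets above.

```python
def squash_second_word(start_times, end_times, max_duration, index):
    """Apply the boundary formula using end_times[index]; return new lists."""
    start_times, end_times = list(start_times), list(end_times)
    boundary = max(end_times[index] / 2, end_times[index] - max_duration)
    end_times[0] = start_times[1] = boundary
    return start_times, end_times


# Hypothetical window: the second word (index 1) is stretched to 2.5 seconds.
start_times = [0.0, 0.5, 3.0]
end_times = [0.5, 3.0, 6.0]
max_duration = 1.0

# Index 2: boundary = max(3.0, 5.0) = 5.0, so word 1 now starts at 5.0,
# after its own end (3.0) and after the start of word 2 (3.0).
buggy_starts, buggy_ends = squash_second_word(start_times, end_times, max_duration, 2)

# Index 1: boundary = max(1.5, 2.0) = 2.0, so word 1 spans 2.0 to 3.0.
fixed_starts, fixed_ends = squash_second_word(start_times, end_times, max_duration, 1)
```

With index 2 the second word ends up with a negative duration and overlaps the third word; with index 1 it stays inside its original span.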

ryanheise · 2023-03-21T13:21:14Z

I fixed the above issue and a similar issue, and so I've now removed the draft status of the PR.

seba-aguila · 2023-03-28T06:09:53Z

Hi Ryan, I tried your improvements, and they work really well. However, I wanted to ask how you got the subtitles with the underlines highlighting the words, because passing the srt file directly to moviepy shows the tag instead of underlining the words. I thought it was due to the font I was using, but it was not. I hope you can help.

ryanheise · 2023-03-28T06:56:12Z

Thanks, @seba-aguila. To generate the video above, I used:

ffmpeg -f lavfi -i color=size=720x120:rate=25:color=black -i input.mp3 -vf "subtitles=input.srt:force_style='Fontsize=70'" -shortest output.mp4
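For anyone driving this from a Python script, the same command can be assembled and run with `subprocess`. This is a sketch that assumes `ffmpeg` is on the `PATH`; the function names are hypothetical, and file names containing characters that are special to ffmpeg filter syntax (such as `:` or `'`) would need extra escaping that is not handled here.

```python
import subprocess


def build_ffmpeg_command(audio: str, srt: str, output: str, font_size: int = 70) -> list[str]:
    """Build the ffmpeg argument list for a black video with burned-in subtitles."""
    return [
        "ffmpeg",
        "-f", "lavfi", "-i", "color=size=720x120:rate=25:color=black",
        "-i", audio,
        "-vf", f"subtitles={srt}:force_style='Fontsize={font_size}'",
        "-shortest", output,
    ]


def render_subtitled_video(audio: str, srt: str, output: str) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(audio, srt, output), check=True)
```

Calling `render_subtitled_video("input.mp3", "input.srt", "output.mp4")` runs the same command as above. Passing the arguments as a list avoids shell quoting issues around the `-vf` filter string.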

* Squash long words at window and sentence boundaries. * Formatting requirements. * Fix squashing logic to point to correct words. --------- Co-authored-by: Jong Wook Kim <[email protected]>


ryanheise added 2 commits March 18, 2023 17:28

Squash long words at window and sentence boundaries.

6771ef9

Formatting requirements.

ab3f38d

ryanheise marked this pull request as draft March 21, 2023 06:11

Fix squashing logic to point to correct words.

63e2c6b

ryanheise marked this pull request as ready for review March 21, 2023 13:17

ryanheise mentioned this pull request Apr 2, 2023

Implement max line width and max line count, and make word highlighting optional #1184

Merged

Merge branch 'main' into truncate-long-words

4a6a3ea

jongwook merged commit 255887f into openai:main Apr 11, 2023

guillaumekln mentioned this pull request Jun 6, 2023

Different transcription with word_timestamps SYSTRAN/faster-whisper#280

Closed

ryanheise mentioned this pull request Jun 21, 2023

Improve timestamp heuristics. #1461

Merged

ryanheise deleted the truncate-long-words branch November 18, 2023 03:44

linsen20220222 pushed a commit to zebra-media/whisper that referenced this pull request Nov 29, 2024

go : implement SetSplitOnWord (openai#1114)

a2684cd

* Go binding: Implement SetSplitOnWord * Add comment for consistency
