Add first version of time-constrained ORC-WER (tcORC-WER) #52

thequilo · 2024-01-26T09:52:58Z

This PR adds code to compute the tcORC-WER.

Example:

python -m meeteval.wer tcorcwer -h hyp.stm -r ref.stm --collar 5

The current version was tested on Libri-CSS with a system that produces 8 streams. The computation finished within 10 minutes and used less than 2 GB of RAM (a huge improvement over ORC-WER!). These requirements should drop further when the number of streams is smaller.

The code is not fully optimized yet and contains many TODOs. I'll work on some of these TODOs and update the PR during the next few days.

I'll merge main back into this PR once #50 is merged.

boeddeker · 2024-01-29T21:06:50Z

meeteval/wer/wer/time_constrained_orc.py

+    # Add a segment index to the reference so that we can later find words that
+    # come from the same segment
+    for i, s in enumerate(reference):
+        s['segment_index'] = i


Hmm. Should the assignment be before the filter?

It shouldn't make a difference, because the segment index is only used for grouping here, not for sorting. But it could be easier to understand what's happening if the index assignment is moved before the filter operation.

Hmm. The filter will remove empty segments. So the assignment will not be valid for the input of this function.

Should the assignment also consider the empty segments? Or do we drop the empty segments in the apply assignment function?

We could add this to a TODO list and solve it later. It is not important for the start of the challenge.
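A minimal sketch of the concern raised above. It assumes segments are plain dicts with a `words` key and that the filter drops segments whose `words` are empty; both are stand-ins for illustration, not MeetEval's actual data structures, and the helper names `index_after_filter` and `index_before_filter` are hypothetical. It shows that indices assigned after the filter count only the non-empty segments, so anything keyed by them would no longer line up with the original input, whereas indices assigned before the filter keep their original positions.

```python
# Hypothetical segments; plain dicts stand in for MeetEval's segment objects.
reference = [
    {'speaker': 'A', 'words': 'hello world'},
    {'speaker': 'B', 'words': ''},  # empty segment, removed by the filter
    {'speaker': 'A', 'words': 'good morning'},
]


def index_after_filter(segments):
    """Filter first, then enumerate: indices count only non-empty segments."""
    kept = [dict(s) for s in segments if s['words'].strip()]
    for i, s in enumerate(kept):
        s['segment_index'] = i
    return kept


def index_before_filter(segments):
    """Enumerate first, then filter: indices refer to positions in the input."""
    indexed = [dict(s, segment_index=i) for i, s in enumerate(segments)]
    return [s for s in indexed if s['words'].strip()]


after = [s['segment_index'] for s in index_after_filter(reference)]
before = [s['segment_index'] for s in index_before_filter(reference)]
print(after)   # [0, 1]
print(before)  # [0, 2]
```

For grouping words by segment, either variant works, since the indices only need to be distinct. The difference matters only if the indices are later used to map results back onto the unfiltered input.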

tests/test_time_constrained_orc_matching.py

thequilo added 15 commits January 26, 2024 10:44

Add first version of time-constrained ORC-WER (tcORC-WER)

1c8c203

flake8

b2f4e3c

flake8

35492a1

Fix sum over empty error rate list for Python <3.8

ed3afa7

Add tests for tcORC-WER

524d34c

Add missing *

636979d

Sample strings with spaces, so multiple words

f5f4664

Only compute matching on overlapping streams

66d4fca

Remove an optimization that didn't bring a speedup

aa63b38

Compile with -O3

5fb2277

Merge branch 'main' into tcorcwer

7a49165

Add reference_sort option to tcORC-WER

db595bd

Add tcORC-WER to api

08c6c47

Add testcase for tcorcwer Python api

2ae792c

Mention tcORC-WER in README

3c60812

boeddeker reviewed Jan 29, 2024

tests/test_time_constrained_orc_matching.py (resolved)

thequilo added 4 commits January 30, 2024 06:25

Add test that tcpWER is an upper bound on tcORC-WER

8827214

Remove designated initializers

175fba5

Compute self-overlap

04f5083

Disable warnings for word-order changes on the reference for tcORC-WER

4f336a3

boeddeker approved these changes Jan 30, 2024

thequilo merged commit f3d92c7 into main Jan 30, 2024
6 checks passed

thequilo deleted the tcorcwer branch September 3, 2024 11:52
