
feat: optimize alignment processing and remove vocabulary size limitations #100

Open · wants to merge 6 commits into master
Conversation

@thevilledev commented Jan 29, 2025

  1. Implement integer-based word mapping:
  • Replaced chr() mapping with integer lists in _word2char()
  • Renamed function to _word2int()
  • Removed 0x10FFFF vocabulary size limitation
  • Removed redundant empty string ValueError
  2. Performance improvements:
  • Directly use RapidFuzz's opcodes instead of editops + conversion
  • Single-pass error counting during alignment processing
  • Single-pass word2int implementation per type
  • Benchmarks show 3-10% speedup across batch sizes
  3. New test coverage:
  • Added tests for large vocabularies (1M+ unique words)
  • Verified hash collision resilience
  • Added speed benchmarks for regression testing

This prevents failures when processing large vocabularies.
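For context, `chr()` only accepts code points up to 0x10FFFF, so a `chr()`-based word map can represent at most 0x110000 distinct words. A quick illustration:

```python
# chr() accepts code points 0 through 0x10FFFF (1,114,112 values), so a
# chr()-based word map runs out of room once the vocabulary grows past that.
highest = chr(0x10FFFF)  # highest valid code point; succeeds

try:
    chr(0x110000)  # one past the limit
    overflowed = False
except ValueError:
    overflowed = True
```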

References

Add validation and documentation for Unicode code point limit (0x10FFFF) in
word mapping implementation. Changes include:

- Add vocabulary size check in `_word2char` to prevent `chr()` overflow
- Add comprehensive docstrings to `_word2char` and `process_words` functions
- Add "Known Limitations" section to usage docs with workarounds
- Document all potential ValueErrors in public API

This prevents unexpected failures when processing large vocabularies and
helps users handle the limitation appropriately.

Signed-off-by: Ville Vesilehto <[email protected]>
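A guard along the lines described in this commit could be sketched as follows (the helper name is hypothetical; per the commit message, the actual check lived inside `_word2char`):

```python
import sys
from collections.abc import Sized

# sys.maxunicode == 0x10FFFF, so chr() can produce 0x110000 distinct characters.
MAX_VOCAB = sys.maxunicode + 1

def check_vocab_size(vocabulary: Sized) -> None:
    # Hypothetical helper sketching the check described above.
    if len(vocabulary) > MAX_VOCAB:
        raise ValueError(
            f"Vocabulary of {len(vocabulary)} unique words exceeds the "
            f"{MAX_VOCAB} Unicode code points available to chr()."
        )
```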
@jitsi-jenkins

Hi, thanks for your contribution!
If you haven't already done so, could you please make sure you sign our CLA (https://jitsi.org/icla for individuals and https://jitsi.org/ccla for corporations)? We would unfortunately be unable to merge your patch unless we have that piece :(.

@thevilledev
Author

CLA ✅

@nikvaessen
Copy link
Collaborator

nikvaessen commented Jan 30, 2025

Thanks a lot for looking into this issue! I was not aware of this limitation. I think it might be worth it to change the word hashing implementation. We could map each unique word to a Python integer instead. I'm not sure how much slower this is; maybe we should only do it if the vocab size is larger than 0x10FFFF words.

Can you share your thoughts behind throwing a ValueError, or do you agree that it might be better to fix the limitation?

@thevilledev
Author

Hey, thanks for the quick reply! I actually had a hashing implementation ready, but it broke basically all the tests that rely on chr(). Totally agree that hashing would be the right thing to do; this is just an immediate fix. I can also work on that refactoring, but should we merge this in the meantime?

@nikvaessen
Collaborator

nikvaessen commented Jan 30, 2025

from itertools import chain
from typing import List

def _word2char(reference: List[List[str]], hypothesis: List[List[str]]):
    # map each unique word to an integer
    vocabulary = set(chain(*reference, *hypothesis))

    if "" in vocabulary:
        raise ValueError(
            "Empty strings cannot be a word. "
            "Please ensure that the given transform removes empty strings."
        )

    word2int = dict(zip(vocabulary, range(len(vocabulary))))

    reference_ints = [
        [word2int[w] for w in sentence] for sentence in reference
    ]
    hypothesis_ints = [
        [word2int[w] for w in sentence] for sentence in hypothesis
    ]

    return reference_ints, hypothesis_ints

Seems to pass all test cases for me.

It seems a bit slower, but not significantly:




current 
--------------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------------
Name (time in us)            Min                    Max                  Mean                StdDev                Median                   IQR            Outliers          OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1            19.1780 (1.0)          59.8240 (1.0)         20.5117 (1.0)          1.7625 (1.0)         20.3230 (1.0)          0.4490 (1.0)       251;441  48,752.6335 (1.0)       11379           1
test_speed_n10           92.6020 (4.83)        221.6960 (3.71)        97.2698 (4.74)         5.0923 (2.89)        96.6030 (4.75)         1.0650 (2.37)      204;976  10,280.6836 (0.21)       7027           1
test_speed_n100         782.6560 (40.81)     1,802.7650 (30.13)      834.0676 (40.66)      108.0104 (61.28)      812.2400 (39.97)       16.0885 (35.83)       19;50   1,198.9436 (0.02)        439           1
test_speed_n1000      7,985.9050 (416.41)   16,603.6110 (277.54)   9,299.2639 (453.36)   1,798.1539 (>1000.0)  8,550.4255 (420.73)   1,560.5880 (>1000.0)       8;7     107.5354 (0.00)        104           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

int mapping

---------------------------------------------------------------------------------------------- benchmark: 4 tests ----------------------------------------------------------------------------------------------
Name (time in us)            Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers          OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1            19.9990 (1.0)          69.8270 (1.0)          24.7692 (1.0)          2.6409 (1.0)          24.9580 (1.0)          0.9015 (1.0)     1686;1891  40,372.6539 (1.0)        9912           1
test_speed_n10           94.5230 (4.73)        923.4460 (13.22)       110.9225 (4.48)        15.8243 (5.99)        112.0380 (4.49)        15.2380 (16.90)     249;115   9,015.3040 (0.22)       6126           1
test_speed_n100         825.8250 (41.29)    11,821.3680 (169.30)    1,047.6752 (42.30)      669.3954 (253.47)    1,001.9540 (40.15)      169.1627 (187.65)      11;22     954.4943 (0.02)        865           1
test_speed_n1000      8,345.1510 (417.28)   20,286.1870 (290.52)   10,408.9054 (420.24)   2,130.1998 (806.62)   10,090.4130 (404.30)   1,230.6775 (>1000.0)       7;7      96.0716 (0.00)         88           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




- Replace chr-based word mapping with integer-based approach to eliminate
  the 0x10FFFF vocabulary size limit
- Update word2char to map words to integers instead of Unicode characters
- Remove vocabulary size check and related error handling
- Adjust process_words to work with integer lists instead of character strings
- Update test suite to validate large vocabulary handling and hash collisions
- Remove documentation references to vocabulary size limitation

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev thevilledev changed the title feat: add vocabulary size limit check and documentation feat: implement integer-based word mapping Jan 30, 2025
- Replace editops + Opcodes.from_editops with direct opcodes call
- Process alignment chunks and error counts in a single loop
- Remove redundant sum operations over edit operations
- Update AlignmentChunk indices to use RapidFuzz's native opcode ranges
- Maintain existing error rate calculation logic while improving efficiency

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev thevilledev changed the title feat: implement integer-based word mapping feat: optimize alignment processing and remove vocabulary size limitations Jan 30, 2025
@thevilledev
Author

Yeah, my bad - I was wrapping them into strings (and breaking things in the process). I missed that rapidfuzz.distance.Levenshtein.editops() supports generic sequences. Updated the PR based on the comments, thanks!

I took some optimisations from my broken implementation and aligned it with the integer-based maps. Benchmark results from my machine:

- current main branch:

------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------
Name (time in us)            Min                   Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             9.5000 (1.0)         42.6670 (1.0)          9.9302 (1.0)        2.1570 (1.0)          9.6670 (1.0)       0.1250 (1.0)          8;41  100,702.4429 (1.0)         521           1
test_speed_n10           61.0830 (6.43)       194.6250 (4.56)        62.1981 (6.26)       2.2764 (1.06)        61.8330 (6.40)      0.3330 (2.66)     555;1103   16,077.6569 (0.16)      12013           1
test_speed_n100         572.9580 (60.31)      657.8750 (15.42)      577.7681 (58.18)      4.9037 (2.27)       576.7500 (59.66)     3.0410 (24.33)       97;85    1,730.7982 (0.02)       1582           1
test_speed_n1000      5,842.5000 (615.00)   9,644.5410 (226.04)   5,948.0211 (598.98)   412.8883 (191.42)   5,890.1245 (609.30)   42.0625 (336.50)       3;17      168.1231 (0.00)        160           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

- with proposed changes:

------------------------------------------------------------------------------------------- benchmark: 4 tests -------------------------------------------------------------------------------------------
Name (time in us)            Min                    Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             8.8330 (1.0)          29.6250 (1.0)          9.0443 (1.0)        0.5102 (1.0)          9.0000 (1.0)       0.0830 (1.0)        42;230  110,566.5630 (1.0)        4508           1
test_speed_n10           55.1670 (6.25)        189.3330 (6.39)        56.1595 (6.21)       2.2208 (4.35)        55.8750 (6.21)      0.2500 (3.01)     667;1140   17,806.4399 (0.16)      13683           1
test_speed_n100         518.7500 (58.73)    14,773.5410 (498.68)     561.5938 (62.09)    457.9989 (897.63)     539.5000 (59.94)    18.7193 (225.53)       6;96    1,780.6464 (0.02)       1757           1
test_speed_n1000      5,291.5420 (599.07)    9,306.3340 (314.14)   5,417.3031 (598.97)   476.7528 (934.39)   5,326.2500 (591.81)   31.1145 (374.87)       5;30      184.5937 (0.00)        179           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I replaced editops with direct opcodes usage for faster alignment processing, with single-pass error counting during chunking. This shows consistent speedups across all batch sizes, roughly 7-10% on smaller ones and 3-9% on larger ones, with lower standard deviation.
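The single-pass pattern can be sketched with the stdlib `difflib`, whose `get_opcodes()` returns `(tag, i1, i2, j1, j2)` tuples of the same shape as RapidFuzz's opcodes (the PR itself uses `rapidfuzz.distance.Levenshtein.opcodes`, which guarantees a minimal Levenshtein alignment; `difflib` stands in here only to keep the sketch dependency-free):

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def count_errors(ref: List[int], hyp: List[int]):
    # Tally hits/substitutions/deletions/insertions and collect alignment
    # chunks in a single pass over the opcodes.
    hits = subs = dels = ins = 0
    chunks: List[Tuple[str, int, int, int, int]] = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "equal":
            hits += i2 - i1
        elif tag == "replace":
            subs += i2 - i1
        elif tag == "delete":
            dels += i2 - i1
        elif tag == "insert":
            ins += j2 - j1
        chunks.append((tag, i1, i2, j1, j2))  # chunk built in the same loop
    return hits, subs, dels, ins, chunks
```

With `ref = [0, 1, 2]` and `hyp = [0, 9, 2]`, this yields two hits and one substitution.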

Signed-off-by: Ville Vesilehto <[email protected]>
Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev
Author

Style issues fixed as well

Optimize the _word2int function by:
1. Using defaultdict with auto-incrementing IDs to avoid building a vocabulary set upfront
2. Removing the redundant empty string check, since it's handled by transformations

This change improves performance by:
- Eliminating the need to build a full vocabulary set upfront
- Removing unnecessary empty string validation that's already handled by transforms
- Using a single pass through words instead of multiple iterations

The empty string check was redundant since the transformation pipeline
(e.g. ReduceToListOfListOfWords) already handles this case appropriately.

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev
Author

I realised the empty-string ValueError is redundant, as it's already handled by the transformation pipeline (ReduceToListOfListOfWords). So I simplified the word2int implementation even further. Latest benchmarks:

main:

------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------
Name (time in us)            Min                   Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             9.4160 (1.0)         44.7080 (1.0)          9.7707 (1.0)        1.1321 (1.0)          9.6660 (1.0)       0.1240 (1.0)        31;259  102,347.1622 (1.0)        2855           1
test_speed_n10           60.8750 (6.47)       189.2090 (4.23)        62.1888 (6.36)       2.0044 (1.77)        61.7500 (6.39)      0.3330 (2.69)     880;1768   16,080.0733 (0.16)      12746           1
test_speed_n100         569.3750 (60.47)      755.7910 (16.91)      576.5351 (59.01)     15.3150 (13.53)      573.1045 (59.29)     3.6875 (29.74)      77;162    1,734.4996 (0.02)       1624           1
test_speed_n1000      5,834.7910 (619.67)   9,961.1670 (222.81)   5,937.2898 (607.66)   454.3998 (401.37)   5,861.1660 (606.37)   38.9167 (313.84)       3;21      168.4270 (0.00)        155           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


proposed changes:

------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------
Name (time in us)            Min                   Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             8.5830 (1.0)        337.6660 (1.0)          9.1340 (1.0)        7.7864 (1.0)          8.7500 (1.0)       0.1240 (1.0)        11;261  109,480.8948 (1.0)        2966           1
test_speed_n10           53.6250 (6.25)       793.7090 (2.35)        54.8579 (6.01)       8.6824 (1.12)        54.3340 (6.21)      0.2920 (2.35)      90;1165   18,228.9011 (0.17)      14511           1
test_speed_n100         503.0000 (58.60)      660.7500 (1.96)       507.9502 (55.61)      8.4541 (1.09)       506.5840 (57.90)     3.0830 (24.86)      56;100    1,968.6970 (0.02)       1857           1
test_speed_n1000      5,164.0830 (601.66)   9,340.0000 (27.66)    5,323.5881 (582.83)   563.3457 (72.35)    5,201.0210 (594.40)   76.6460 (618.09)       4;16      187.8432 (0.00)        180           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@nikvaessen
Collaborator

nikvaessen commented Jan 31, 2025

I think Poetry does not support Python 3.8 anymore. I've moved the GitHub workflow on the main branch over to uv, so tests shouldn't fail after a rebase.
