
feat: optimize alignment processing and remove vocabulary size limitations #100

Open · wants to merge 6 commits into master
Conversation

@thevilledev commented Jan 29, 2025

  1. Implement integer-based word mapping:
  • Replaced chr() mapping with integer lists in _word2char()
  • Renamed function to _word2int()
  • Removed 0x10FFFF vocabulary size limitation
  • Removed redundant empty string ValueError
  2. Performance improvements:
  • Directly use RapidFuzz's opcodes instead of editops + conversion
  • Single-pass error counting during alignment processing
  • Single-pass word2int implementation per type
  • Benchmarks show 3-10% speedup across batch sizes
  3. New test coverage:
  • Added tests for large vocabularies (1M+ unique words)
  • Verified hash collision resilience
  • Added speed benchmarks for regression testing

This prevents failures when processing large vocabularies.
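For context, `chr()` only accepts code points up to 0x10FFFF, so a `chr()`-based word map can represent at most 0x110000 distinct words. A quick illustration:

```python
# chr() accepts code points 0 through 0x10FFFF (1,114,112 values), so a
# chr()-based word map runs out of room once the vocabulary grows past that.
highest = chr(0x10FFFF)  # highest valid code point; succeeds

try:
    chr(0x110000)  # one past the limit
    overflowed = False
except ValueError:
    overflowed = True
```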

References

Add validation and documentation for Unicode code point limit (0x10FFFF) in
word mapping implementation. Changes include:

- Add vocabulary size check in `_word2char` to prevent `chr()` overflow
- Add comprehensive docstrings to `_word2char` and `process_words` functions
- Add "Known Limitations" section to usage docs with workarounds
- Document all potential ValueErrors in public API

This prevents unexpected failures when processing large vocabularies and
helps users handle the limitation appropriately.

Signed-off-by: Ville Vesilehto <[email protected]>
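A guard along the lines described in this commit could be sketched as follows (the helper name is hypothetical; per the commit message, the actual check lived inside `_word2char`):

```python
import sys
from collections.abc import Sized

# sys.maxunicode == 0x10FFFF, so chr() can produce 0x110000 distinct characters.
MAX_VOCAB = sys.maxunicode + 1

def check_vocab_size(vocabulary: Sized) -> None:
    # Hypothetical helper sketching the check described above.
    if len(vocabulary) > MAX_VOCAB:
        raise ValueError(
            f"Vocabulary of {len(vocabulary)} unique words exceeds the "
            f"{MAX_VOCAB} Unicode code points available to chr()."
        )
```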
@jitsi-jenkins

Hi, thanks for your contribution!
If you haven't already done so, could you please make sure you sign our CLA (https://jitsi.org/icla for individuals and https://jitsi.org/ccla for corporations)? We would unfortunately be unable to merge your patch unless we have that piece :(.

@thevilledev
Author

CLA ✅

@nikvaessen
Copy link
Collaborator

nikvaessen commented Jan 30, 2025

Thanks a lot for looking into this issue! I was not aware of this limitation. I think it might be worth it to change the word hashing implementation. We could map each unique word to a Python integer instead. I'm not sure how much slower this is; maybe we should only do it if the vocab size is larger than 0x10FFFF words.

Can you share your thoughts behind throwing a ValueError, or do you agree that it might be better to fix the limitation?

@thevilledev
Author

Hey, thanks for the quick reply! I actually had a hashing implementation ready, but it broke basically all the tests that rely on chr(). Totally agree that hashing would be the right thing to do; this is just an immediate fix. I can also work on that refactoring, but should we merge this in the meantime?

@nikvaessen
Collaborator

nikvaessen commented Jan 30, 2025

from itertools import chain
from typing import List

def _word2char(reference: List[List[str]], hypothesis: List[List[str]]):
    # map each unique word to an integer
    vocabulary = set(chain(*reference, *hypothesis))

    if "" in vocabulary:
        raise ValueError(
            "Empty strings cannot be a word. "
            "Please ensure that the given transform removes empty strings."
        )

    word2int = dict(zip(vocabulary, range(len(vocabulary))))

    reference_ints = [
        [word2int[w] for w in sentence] for sentence in reference
    ]
    hypothesis_ints = [
        [word2int[w] for w in sentence] for sentence in hypothesis
    ]

    return reference_ints, hypothesis_ints

Seems to pass all test cases for me.

It seems a bit slower, but not significantly:




current 
--------------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------------
Name (time in us)            Min                    Max                  Mean                StdDev                Median                   IQR            Outliers          OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1            19.1780 (1.0)          59.8240 (1.0)         20.5117 (1.0)          1.7625 (1.0)         20.3230 (1.0)          0.4490 (1.0)       251;441  48,752.6335 (1.0)       11379           1
test_speed_n10           92.6020 (4.83)        221.6960 (3.71)        97.2698 (4.74)         5.0923 (2.89)        96.6030 (4.75)         1.0650 (2.37)      204;976  10,280.6836 (0.21)       7027           1
test_speed_n100         782.6560 (40.81)     1,802.7650 (30.13)      834.0676 (40.66)      108.0104 (61.28)      812.2400 (39.97)       16.0885 (35.83)       19;50   1,198.9436 (0.02)        439           1
test_speed_n1000      7,985.9050 (416.41)   16,603.6110 (277.54)   9,299.2639 (453.36)   1,798.1539 (>1000.0)  8,550.4255 (420.73)   1,560.5880 (>1000.0)       8;7     107.5354 (0.00)        104           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

int mapping

---------------------------------------------------------------------------------------------- benchmark: 4 tests ----------------------------------------------------------------------------------------------
Name (time in us)            Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers          OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1            19.9990 (1.0)          69.8270 (1.0)          24.7692 (1.0)          2.6409 (1.0)          24.9580 (1.0)          0.9015 (1.0)     1686;1891  40,372.6539 (1.0)        9912           1
test_speed_n10           94.5230 (4.73)        923.4460 (13.22)       110.9225 (4.48)        15.8243 (5.99)        112.0380 (4.49)        15.2380 (16.90)     249;115   9,015.3040 (0.22)       6126           1
test_speed_n100         825.8250 (41.29)    11,821.3680 (169.30)    1,047.6752 (42.30)      669.3954 (253.47)    1,001.9540 (40.15)      169.1627 (187.65)      11;22     954.4943 (0.02)        865           1
test_speed_n1000      8,345.1510 (417.28)   20,286.1870 (290.52)   10,408.9054 (420.24)   2,130.1998 (806.62)   10,090.4130 (404.30)   1,230.6775 (>1000.0)       7;7      96.0716 (0.00)         88           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




- Replace chr-based word mapping with integer-based approach to eliminate
  the 0x10FFFF vocabulary size limit
- Update word2char to map words to integers instead of Unicode characters
- Remove vocabulary size check and related error handling
- Adjust process_words to work with integer lists instead of character strings
- Update test suite to validate large vocabulary handling and hash collisions
- Remove documentation references to vocabulary size limitation

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev thevilledev changed the title feat: add vocabulary size limit check and documentation feat: implement integer-based word mapping Jan 30, 2025
- Replace editops + Opcodes.from_editops with direct opcodes call
- Process alignment chunks and error counts in a single loop
- Remove redundant sum operations over edit operations
- Update AlignmentChunk indices to use RapidFuzz's native opcode ranges
- Maintain existing error rate calculation logic while improving efficiency

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev thevilledev changed the title feat: implement integer-based word mapping feat: optimize alignment processing and remove vocabulary size limitations Jan 30, 2025
@thevilledev
Author

Yeah, my bad - I was wrapping them into strings (and breaking things in the process). I missed that rapidfuzz.distance.Levenshtein.editops() supports generic sequences. Updated the PR based on the comments, thanks!

I took some optimisations from my broken implementation and aligned it with the integer-based maps. Benchmark results from my machine:

- current main branch:

------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------
Name (time in us)            Min                   Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             9.5000 (1.0)         42.6670 (1.0)          9.9302 (1.0)        2.1570 (1.0)          9.6670 (1.0)       0.1250 (1.0)          8;41  100,702.4429 (1.0)         521           1
test_speed_n10           61.0830 (6.43)       194.6250 (4.56)        62.1981 (6.26)       2.2764 (1.06)        61.8330 (6.40)      0.3330 (2.66)     555;1103   16,077.6569 (0.16)      12013           1
test_speed_n100         572.9580 (60.31)      657.8750 (15.42)      577.7681 (58.18)      4.9037 (2.27)       576.7500 (59.66)     3.0410 (24.33)       97;85    1,730.7982 (0.02)       1582           1
test_speed_n1000      5,842.5000 (615.00)   9,644.5410 (226.04)   5,948.0211 (598.98)   412.8883 (191.42)   5,890.1245 (609.30)   42.0625 (336.50)       3;17      168.1231 (0.00)        160           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

- with proposed changes:

------------------------------------------------------------------------------------------- benchmark: 4 tests -------------------------------------------------------------------------------------------
Name (time in us)            Min                    Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             8.8330 (1.0)          29.6250 (1.0)          9.0443 (1.0)        0.5102 (1.0)          9.0000 (1.0)       0.0830 (1.0)        42;230  110,566.5630 (1.0)        4508           1
test_speed_n10           55.1670 (6.25)        189.3330 (6.39)        56.1595 (6.21)       2.2208 (4.35)        55.8750 (6.21)      0.2500 (3.01)     667;1140   17,806.4399 (0.16)      13683           1
test_speed_n100         518.7500 (58.73)    14,773.5410 (498.68)     561.5938 (62.09)    457.9989 (897.63)     539.5000 (59.94)    18.7193 (225.53)       6;96    1,780.6464 (0.02)       1757           1
test_speed_n1000      5,291.5420 (599.07)    9,306.3340 (314.14)   5,417.3031 (598.97)   476.7528 (934.39)   5,326.2500 (591.81)   31.1145 (374.87)       5;30      184.5937 (0.00)        179           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I replaced editops with direct opcodes usage for faster alignment processing, with single-pass error counting during chunking. This shows consistent speedups across all batch sizes, roughly 7-10% on smaller ones and 3-9% on larger ones, with lower standard deviation.
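The single-pass pattern can be sketched with the stdlib `difflib`, whose `get_opcodes()` returns `(tag, i1, i2, j1, j2)` tuples of the same shape as RapidFuzz's opcodes (the PR itself uses `rapidfuzz.distance.Levenshtein.opcodes`, which guarantees a minimal Levenshtein alignment; `difflib` stands in here only to keep the sketch dependency-free):

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def count_errors(ref: List[int], hyp: List[int]):
    # Tally hits/substitutions/deletions/insertions and collect alignment
    # chunks in a single pass over the opcodes.
    hits = subs = dels = ins = 0
    chunks: List[Tuple[str, int, int, int, int]] = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "equal":
            hits += i2 - i1
        elif tag == "replace":
            subs += i2 - i1
        elif tag == "delete":
            dels += i2 - i1
        elif tag == "insert":
            ins += j2 - j1
        chunks.append((tag, i1, i2, j1, j2))  # chunk built in the same loop
    return hits, subs, dels, ins, chunks
```

With `ref = [0, 1, 2]` and `hyp = [0, 9, 2]`, this yields two hits and one substitution.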

Signed-off-by: Ville Vesilehto <[email protected]>
Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev
Author

Style issues fixed as well

Optimize the _word2int function by:
1. Using defaultdict with auto-incrementing IDs to avoid building a vocabulary set upfront
2. Removing the redundant empty string check, since it's handled by transformations

This change improves performance by:
- Eliminating the need to build a full vocabulary set upfront
- Removing unnecessary empty string validation that's already handled by transforms
- Using a single pass through words instead of multiple iterations

The empty string check was redundant since the transformation pipeline
(e.g. ReduceToListOfListOfWords) already handles this case appropriately.

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev
Author

I realised the empty-string ValueError is redundant, as it's already handled by the transformation pipeline (ReduceToListOfListOfWords). So I simplified the word2int implementation even further. Latest benchmarks:

main:

------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------
Name (time in us)            Min                   Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             9.4160 (1.0)         44.7080 (1.0)          9.7707 (1.0)        1.1321 (1.0)          9.6660 (1.0)       0.1240 (1.0)        31;259  102,347.1622 (1.0)        2855           1
test_speed_n10           60.8750 (6.47)       189.2090 (4.23)        62.1888 (6.36)       2.0044 (1.77)        61.7500 (6.39)      0.3330 (2.69)     880;1768   16,080.0733 (0.16)      12746           1
test_speed_n100         569.3750 (60.47)      755.7910 (16.91)      576.5351 (59.01)     15.3150 (13.53)      573.1045 (59.29)     3.6875 (29.74)      77;162    1,734.4996 (0.02)       1624           1
test_speed_n1000      5,834.7910 (619.67)   9,961.1670 (222.81)   5,937.2898 (607.66)   454.3998 (401.37)   5,861.1660 (606.37)   38.9167 (313.84)       3;21      168.4270 (0.00)        155           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


proposed changes:

------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------
Name (time in us)            Min                   Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_speed_n1             8.5830 (1.0)        337.6660 (1.0)          9.1340 (1.0)        7.7864 (1.0)          8.7500 (1.0)       0.1240 (1.0)        11;261  109,480.8948 (1.0)        2966           1
test_speed_n10           53.6250 (6.25)       793.7090 (2.35)        54.8579 (6.01)       8.6824 (1.12)        54.3340 (6.21)      0.2920 (2.35)      90;1165   18,228.9011 (0.17)      14511           1
test_speed_n100         503.0000 (58.60)      660.7500 (1.96)       507.9502 (55.61)      8.4541 (1.09)       506.5840 (57.90)     3.0830 (24.86)      56;100    1,968.6970 (0.02)       1857           1
test_speed_n1000      5,164.0830 (601.66)   9,340.0000 (27.66)    5,323.5881 (582.83)   563.3457 (72.35)    5,201.0210 (594.40)   76.6460 (618.09)       4;16      187.8432 (0.00)        180           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@nikvaessen
Collaborator

nikvaessen commented Jan 31, 2025

I think Poetry does not support Python 3.8 anymore. I've moved the GitHub workflow on the main branch over to uv, so tests shouldn't fail after a rebase.
