feat: optimize alignment processing and remove vocabulary size limitations #100
Conversation
Add validation and documentation for the Unicode code point limit (0x10FFFF) in the word mapping implementation. Changes include:

- Add a vocabulary size check in `_word2char` to prevent `chr()` overflow
- Add comprehensive docstrings to the `_word2char` and `process_words` functions
- Add a "Known Limitations" section to the usage docs with workarounds
- Document all potential ValueErrors in the public API

This prevents unexpected failures when processing large vocabularies and helps users handle the limitation appropriately.

Signed-off-by: Ville Vesilehto <[email protected]>
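For context, a minimal sketch of the kind of guard this commit describes, assuming a `chr()`-based `_word2char` similar to the existing one (illustrative only, not the exact implementation):

```python
from itertools import chain
from typing import List, Tuple


def _word2char(reference: List[List[str]], hypothesis: List[List[str]]) -> Tuple[List[str], List[str]]:
    # Collect every unique word across reference and hypothesis sentences.
    vocabulary = set(chain(*reference, *hypothesis))

    # chr() only accepts code points up to 0x10FFFF, so a character-based
    # mapping cannot represent a larger vocabulary.
    if len(vocabulary) > 0x10FFFF:
        raise ValueError(
            "The unique vocabulary of reference and hypothesis exceeds the "
            "maximum Unicode code point (0x10FFFF) and cannot be mapped to "
            "single characters."
        )

    word2char = dict(zip(vocabulary, range(len(vocabulary))))

    # Pack each sentence into a string with one character per word.
    reference_chars = ["".join(chr(word2char[w]) for w in sentence) for sentence in reference]
    hypothesis_chars = ["".join(chr(word2char[w]) for w in sentence) for sentence in hypothesis]
    return reference_chars, hypothesis_chars
```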
Hi, thanks for your contribution!
CLA ✅
Thanks a lot for looking into this issue! I was not aware of this limitation. I think it might be worth changing the word hashing implementation. We could map each unique word to a Python integer instead. I'm not sure how much slower this is; maybe we only do it if the vocabulary is larger than 0x10FFFF words. Can you share your thoughts behind throwing a `ValueError` here?
Hey, thanks for the quick reply! I actually had an implementation ready for hashing, but it broke basically all the tests that rely on the existing behaviour. Here is a simpler integer-based mapping instead:
```python
from itertools import chain
from typing import List


def _word2char(reference: List[List[str]], hypothesis: List[List[str]]):
    # Tokenize each word into an integer instead of a single Unicode
    # character, which removes the 0x10FFFF vocabulary size limit.
    vocabulary = set(chain(*reference, *hypothesis))

    if "" in vocabulary:
        raise ValueError(
            "Empty strings cannot be a word. "
            "Please ensure that the given transform removes empty strings."
        )

    word2int = dict(zip(vocabulary, range(len(vocabulary))))

    reference_ints = [
        [word2int[w] for w in sentence] for sentence in reference
    ]
    hypothesis_ints = [
        [word2int[w] for w in sentence] for sentence in hypothesis
    ]
    return reference_ints, hypothesis_ints
```

This seems to pass all test cases for me. It seems a bit slower, but not significantly.
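For illustration, a hypothetical usage of this mapping (the concrete integers depend on set iteration order; only the structure matters):

```python
# Hypothetical example, not from the test suite: identical words in the
# reference and hypothesis receive the same integer, so the edit distance
# over the integer lists equals the edit distance over the words.
ref = [["the", "quick", "brown", "fox"]]
hyp = [["the", "quick", "brown", "dog"]]

ref_ints, hyp_ints = _word2char(ref, hyp)
# e.g. ref_ints == [[3, 0, 2, 1]] and hyp_ints == [[3, 0, 2, 4]]
```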
- Replace the chr-based word mapping with an integer-based approach to eliminate the 0x10FFFF vocabulary size limit
- Update word2char to map words to integers instead of Unicode characters
- Remove the vocabulary size check and related error handling
- Adjust process_words to work with integer lists instead of character strings
- Update the test suite to validate large-vocabulary handling and hash collisions
- Remove documentation references to the vocabulary size limitation

Signed-off-by: Ville Vesilehto <[email protected]>
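A small sketch of why integer lists are sufficient for `process_words`: RapidFuzz's Levenshtein functions accept sequences of hashable items, not just strings, so lists of ints can replace the `chr()`-packed strings directly (illustrative; the variable names are made up):

```python
from rapidfuzz.distance import Levenshtein

# Integer-encoded sentences, e.g. "the quick brown fox" vs "the quick brown dog".
ref_ints = [0, 1, 2, 3]
hyp_ints = [0, 1, 2, 4]

# Both calls operate on the integer lists directly, no string packing needed.
print(Levenshtein.distance(ref_ints, hyp_ints))  # 1 (one substitution)
print(Levenshtein.opcodes(ref_ints, hyp_ints))   # alignment opcodes over the lists
```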
- Replace editops + Opcodes.from_editops with a direct opcodes call
- Process alignment chunks and error counts in a single loop
- Remove redundant sum operations over edit operations
- Update AlignmentChunk indices to use RapidFuzz's native opcode ranges
- Maintain the existing error rate calculation logic while improving efficiency

Signed-off-by: Ville Vesilehto <[email protected]>
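A rough sketch of the single-pass idea, not the exact code from this PR: fetch opcodes directly from RapidFuzz and derive both the alignment chunks and the error counts in one loop. The `Chunk` dataclass and `align` function below are stand-ins for the library's own structures:

```python
from dataclasses import dataclass
from typing import List, Tuple

from rapidfuzz.distance import Levenshtein


@dataclass
class Chunk:
    # Stand-in for the library's alignment chunk; field names are illustrative.
    type: str
    ref_start_idx: int
    ref_end_idx: int
    hyp_start_idx: int
    hyp_end_idx: int


def align(ref_ints: List[int], hyp_ints: List[int]) -> Tuple[List[Chunk], int, int, int, int]:
    chunks = []
    hits = substitutions = deletions = insertions = 0

    # Opcodes already cover the full sequences, so no editops -> opcodes
    # conversion and no separate counting passes are needed.
    for op in Levenshtein.opcodes(ref_ints, hyp_ints):
        ref_len = op.src_end - op.src_start
        hyp_len = op.dest_end - op.dest_start
        if op.tag == "equal":
            hits += ref_len
        elif op.tag == "replace":
            substitutions += ref_len
        elif op.tag == "delete":
            deletions += ref_len
        elif op.tag == "insert":
            insertions += hyp_len
        chunks.append(Chunk(op.tag, op.src_start, op.src_end, op.dest_start, op.dest_end))

    return chunks, hits, substitutions, deletions, insertions
```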
Yeah, my bad: I was wrapping them into strings (and breaking things in the process). I had also missed that I carried over some optimisations from my broken implementation; they are now aligned with the integer-based maps.

I replaced editops with direct opcodes usage for faster alignment processing, counting errors in a single pass while building the chunks. On my machine the benchmarks show consistent speedups across all batch sizes, roughly 7-10% on smaller ones and 3-9% on larger ones, with lower standard deviation.
Style issues fixed as well.
Optimize the _word2int function by:

1. Using defaultdict with auto-incrementing IDs to avoid building the vocabulary set
2. Removing the redundant empty string check, since it is handled by the transformations

This change improves performance by:

- Eliminating the need to build a full vocabulary set upfront
- Removing unnecessary empty string validation that is already handled by the transforms
- Using a single pass through the words instead of multiple iterations

The empty string check was redundant since the transformation pipeline (e.g. ReduceToListOfListOfWords) already handles this case appropriately.

Signed-off-by: Ville Vesilehto <[email protected]>
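A minimal sketch of the defaultdict idiom this commit describes (illustrative; the actual `_word2int` may differ in details): the dict's own `__len__` serves as the default factory, so the first lookup of an unseen word assigns it the next free integer id, in a single pass and without a precomputed vocabulary set:

```python
from collections import defaultdict
from typing import List


def _word2int(reference: List[List[str]], hypothesis: List[List[str]]):
    # Missing keys get len(word2int) as their value, i.e. an auto-incrementing id.
    word2int: defaultdict = defaultdict()
    word2int.default_factory = word2int.__len__

    reference_ints = [[word2int[w] for w in sentence] for sentence in reference]
    hypothesis_ints = [[word2int[w] for w in sentence] for sentence in hypothesis]
    return reference_ints, hypothesis_ints
```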
I realised the empty-string ValueError is redundant, as it's already handled by the transformation pipeline (ReduceToListOfListOfWords), so I simplified the word2int implementation even further and re-ran the benchmarks.
I think Poetry does not support Python 3.8 anymore; I've moved the GitHub workflow on the main branch over to uv, so the tests shouldn't fail after a rebase.
Replace `chr()` mapping with integer lists in `_word2char()` / `_word2int()`. This prevents failures when processing large vocabularies.