implement a simple max_sentencepiece_length into BPE #1228
Conversation
I like the proposal, however I think this approach is a bit too simplistic to work effectively (I could be wrong). If you're working on non-space-split languages (like Korean, I'm guessing given your GH location), then BPE will always be unbearably slow if you don't set a pre_tokenizer. The reason is that if you have a 1000-char-long sentence, for each merge it has to update all its neighbors. So as the BPE is merging pairs, the number of neighbors keeps growing, leading to a huge amount of compute just to keep track of everything. With space splitting, some words get "finished" early (like …). So here I think we should follow the same idea, and remove the neighbors when the resulting token would be larger than X. Also I don't like using …. Would you be up to changing the PR to remove the neighbors instead of just skipping the value?
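A minimal, self-contained sketch of the neighbor-removal idea above (a hypothetical function, not the tokenizers implementation): when a pair is merged inside a word, only the new neighbor pairs whose combined length stays within the limit are reported back for tracking, so oversized candidates never enter the pair-count maps in the first place.

fn merge_in_word(
    symbols: &[String],
    best: (&str, &str),
    max_len: usize,
) -> (Vec<String>, Vec<(String, String)>) {
    let merged = format!("{}{}", best.0, best.1);
    let merged_len = merged.chars().count();
    let mut out: Vec<String> = Vec::with_capacity(symbols.len());
    let mut new_pairs: Vec<(String, String)> = Vec::new();
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == best.0 && symbols[i + 1] == best.1 {
            // New left-neighbor pair created by this merge: only track it if the
            // token it would produce stays within max_len characters.
            if let Some(prev) = out.last() {
                if prev.chars().count() + merged_len <= max_len {
                    new_pairs.push((prev.clone(), merged.clone()));
                }
            }
            // Same rule for the new right-neighbor pair.
            if i + 2 < symbols.len() {
                let next = &symbols[i + 2];
                if merged_len + next.chars().count() <= max_len {
                    new_pairs.push((merged.clone(), next.clone()));
                }
            }
            out.push(merged.clone());
            i += 2;
        } else {
            out.push(symbols[i].clone());
            i += 1;
        }
    }
    (out, new_pairs)
}

On a long unsegmented sentence this keeps the tracked-pair set bounded, which is the compute saving being discussed.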
Preface
#1227 (comment) I've made a separate issue, but if this PR is where you are comfortable discussing the commit I am fine with it.
Motivation
Since you've shown interest I'll try more to implement it and show you data on my own datasets and code. I am indeed using preprocessors, in fact I am using them extensively, "ValueFromPipelineByPropertyName" to name a few.
I initially tried to use the "Split()" preprocessor regex function to implement this.
Other implementations
In the original sentencepiece library it is implemented as a function, IsValidSentencePiece, called by both BPE and Unigram. It is the same for the huggingface library's UnigramTrainer.
I think it makes sense to have separate options for the BPE trainer and the Unigram trainer in the huggingface library.
Going Forward
I'll be more than glad to adhere to idiomatic Rust patterns or your codebase guidelines. As you might have seen from the silly error in the commit, I am not well versed in Rust, but I am motivated to learn.
Currently I implemented it as an Option to make it safer, but if usize is better there would be no difficulty reimplementing it as such. How should the variable be communicated downstream? Should it be set to None as the default for the time being? As for neighbor discarding, wouldn't it be gone automatically?
I trained a tokenizer, and here are my findings, including some unexpected ones.
Thanks for all the additional info+context.
I looked more deeply into the merge part, to skip neighboring on invalid tokens. The issue is that it would leave the new potential "large" pairs in the where_to_update hashmap, meaning we're keeping track of extra potential pairs which will always get ignored afterwards.
I proposed the change I envisioned here: #1235 (feel free to just cherry-pick my commit; I can also push here, but in general we refrain from pushing to contributors' branches).
What this change modifies:
- Pairs that would lead to a merge being > max_length get ignored, so they cannot even be considered in the future, meaning the overall pair maps should be smaller and less computation is wasted.
- Keep Option in the do_train because it expresses the intention better.
- Use usize::MAX within the merge method. This just makes the code easier to read, and it doesn't seem to affect performance (it's only 1 add + 1 compare).
For the choice of u8 vs u16 vs usize: in general, don't optimize too soon. The size of such a value doesn't really matter, and you'd be surprised how often the resulting code is just as fast. Since there's only 1 such value, there isn't even a RAM benefit to it.
IMO the rule for types is:
- Choose a type that only expresses VALID choices. (Here there can be a value or None, and since we're using an unsigned type, we're de facto discarding negative values, which don't mean much.) Zero, while being a stupid choice, will still do something (and there's no type for >1).
- Choose the simplest possible type in terms of impact on the code (here usize means we avoid all those nasty as usize casts).
- If you see performance bottlenecks, THEN check if you need to update the type.
Cheers.
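As a rough sketch of the pattern described above (the helper names are assumptions, not the actual tokenizers internals): keep Option<usize> at the API boundary to express intent, and convert it to usize::MAX once before entering the merge loop so the hot path is a plain comparison.

// None means "no limit", which usize::MAX expresses for the hot loop.
fn effective_max_len(max_token_length: Option<usize>) -> usize {
    max_token_length.unwrap_or(usize::MAX)
}

// Pairs whose merged token would exceed max_len never enter the pair maps,
// so no compute is wasted tracking candidates that can only be ignored later.
fn skip_pair(new_token: &str, max_len: usize) -> bool {
    new_token.chars().count() > max_len
}

With this split, the public configuration stays an Option while the inner loop never has to branch on None.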
tokenizers/src/models/bpe/trainer.rs
Outdated
match max_merge_length {
    None | Some(0) => {
        // in case 0 was manually entered, treat as None
    }
    Some(length) => {
        if new_token.chars().count() > (length as usize) {
            continue;
        }
    }
}
Suggested change:
if let Some(max_token_length) = max_token_length {
    if new_token.chars().count() > max_token_length {
        continue;
    }
}
This is more idiomatic imo.
0 is NOT a special value. None means ignore; zero does mean zero. If things start to do weird things, it's not the problem of this function; respecting the value is what matters most.
Try to switch to usize: it makes everything easier to read, and the actual "size" isn't important optimization-wise.
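To illustrate the usize point (a hypothetical pair of helpers, not code from the PR): with a narrower integer type every comparison against chars().count() needs a cast, while usize compares directly.

// Narrow type: the cast is required because chars().count() returns usize.
fn too_long_u16(token: &str, max: u16) -> bool {
    token.chars().count() > max as usize
}

// usize: no cast needed, and the intent is just as clear.
fn too_long_usize(token: &str, max: usize) -> bool {
    token.chars().count() > max
}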
Sorry, I implemented the change, but since I had to look deeply to provide good suggestions, I pretty much had to make the change (to verify the performance impact, for instance). For propagating the option, I'll let you handle it in Rust first, as you seem eager to make this change properly. For the bindings, I think it will be easier if I create the changes, and I will ping you on the modifications so you can learn how to do it. (You're more than welcome to try to do it, but it's likely to be error-prone, boring and difficult; just want to manage expectations :) )
I appreciate you explaining the code and what is going on, while trying to educate me on the issues and offering me opportunities to contribute. I would usually prefer to make as many changes myself as possible, but since I hope to use the changed version of the tokenizer for my project I would rather not stand in your way. At least for this PR I will just take as much code as I can from your version rather than try to interject my views.
You're welcome to try! If you want to take a stab, try to mimic what exists beforehand. Just ping me when you're done! Cheers.
Tbh the scariest thing for me is not the coding but the git-fu involved.
@Narsil The second commit provides a descriptive variable name.
Can you rename it to max_token_length? I think it's better than max_merge_length. Then could you add a test showcasing on dummy data the effect after training?
If you point out a test that is known to produce >16-char tokens I could maybe duplicate it to make a new test, but I am not sure what would need to be done for a more comprehensive test.
I have sent two commits that each address part of this. I will work on tests by copying a pre-existing test in the tests/ folder; this may take a while. Edit: I think it should live in tests/training.rs.
The documentation is not available anymore as the PR was closed or merged.
The ubuntu-latest error isn't reproducible on my machine?
It should be. You changed the serialization of the tokenizer, so the tests need to be updated accordingly. (We always make sure we can load the old formats and update them to the latest ones.) You probably haven't tested with tokenizers installed from source:
Ah okay, thank you.
Also, I realize that even for debugging/development purposes, what would be the standard approach in this case? tests/training.rs is the right approach.
You can do it on a super small set; it's actually better than on a big one, since you could include the entirety of the code within the test, making it readable, and have both the training without the flag producing tokens larger than X, and with the flag not producing them. Does this make sense? (Actually, don't create a big training test if possible; that slows down the CI and produces less value than the smaller test IMO.)
What do you mean by small set? I think I grasp what you are saying but I fear I might not be understanding completely so I'll describe more specifically what I'd do.
Do you like this idea?
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs#L651 I meant adding another test along those lines. Much narrower scope.
Ah. So this is the integration test vs unit test distinction (what I suggested was the former, what you mentioned the latter). Anyway, I copied the test you mentioned and made a new one. Also, the previous commits do:
token.chars().count() <= max_token_length,
"token too long : {} , chars().count() = {}",
token,
token.chars().count()
That's a great test! Good idea on the UTF-8-intense examples!
Could you add the explicit vocab as a result? It makes the test more readable and more robust.
Let's keep the functional part too, but having the explicit values prevents "under the radar" bugs, where behavior is modified unintentionally.
You could reduce the size of those strings and max_token_length in order to keep things readable, maybe.
What do you mean by keep explicit vocabulary?
Do you mean the tokens learned by the model under that setting?
In that case I propose we make separate tests.
The current test could serve as the basis for a robust test that checks across multiple max_token_length values and added vocabulary.
I wanted to add Arabic but it renders differently depending on the IDE (due to the right-to-left script); exactly because of such behavior it would be useful to have in the future.
Another test could be added that takes a look into individual vocabularies learned, but this would be more rigid since adding more words to be learned would also change the vocabs, so we'd need to have new known good tokens to compare with.
Also, if we go for the multiple-test route, I definitely think one test with max_token_length set to 16 would be useful, since that's the value SentencePiece uses as the default for both Unigram and BPE.
No: 1 test, but with an explicit assert.
assert_eq!(tokenizer.get_vocab(), HashMapFrom([....]))
It's really much better imo. Tests don't change that often, and I have seen many bugs stay silent for too long because of this lack of explicit value testing.
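As a rough illustration of the explicit-assert idea (a standalone helper with made-up vocabulary entries, not the final test or the exact tokenizers API):

use std::collections::HashMap;

// Compare the full learned vocabulary against an explicit expected map; the
// entries below are placeholders, not real training output.
fn assert_exact_vocab(vocab: &HashMap<String, u32>) {
    let expected: HashMap<String, u32> = HashMap::from([
        ("a".to_string(), 0),
        ("b".to_string(), 1),
        ("ab".to_string(), 2),
    ]);
    assert_eq!(vocab, &expected);
}

Comparing the whole map catches silent behavior changes that a purely functional "no token longer than N" check would miss.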
Hmm, so you feel like the test already has enough Unicode coverage?
What about:
- leave the current test as-is (#1),
- add a variant of the current test with an explicit assert compare (#2).
This way we could add any new Unicode we want to #1, and #2 can serve as explicit coverage.
Is this too many tests for this feature? Considering the breadth of this feature I think it warrants at least two (hopefully three). But if you want it to be one, I'll try to pack as much into it as I can.
Added one test on top of the previous one.
It seems the failing test is due to a pickling error. Am I correct in assuming that the proper way to make it pass is to implement the Python bindings?
It seems that right now the missing pieces are the docs and the Python bindings.
Add a way for the BPE trainer to behave like the Unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the Unigram trainer, but in a different way.
If this code were to be actually integrated, some work remains to be done:
- Documentation describing the behavior and how it should be set.
- Set default==0 so it doesn't act unless set.
- Provide ways in the Python binding for the user to set the max token length.
I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult and regexes can be really slow when operating on the whole training corpus.
clarify with type usize; propagate max_length option
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers) UnigramTrainer uses max_piece_length for a similar function. Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
change max_merge_length into max_token_length
Add several max_token_length declarations that were missing (impl BpeTrainerBuilder, struct BpeTrainer). Add explanation for variable shadowing.
Move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.
clarified test documentation
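A self-contained sketch of the propagation pattern the commits above describe (the type and method names here are hypothetical stand-ins, not the actual BpeTrainerBuilder): the option is stored once, defaulted to None, and handed through the builder so downstream code receives it unchanged.

#[derive(Clone, Copy, Default)]
struct TrainerOpts {
    // None = no limit on token length.
    max_token_length: Option<usize>,
}

#[derive(Default)]
struct TrainerOptsBuilder {
    opts: TrainerOpts,
}

impl TrainerOptsBuilder {
    // Expose the option on the builder so callers can set it explicitly.
    fn max_token_length(mut self, max: Option<usize>) -> Self {
        self.opts.max_token_length = max;
        self
    }
    fn build(self) -> TrainerOpts {
        self.opts
    }
}

A caller would then write something like TrainerOptsBuilder::default().max_token_length(Some(16)).build().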
Force-pushed from 1c4507b to aa309dd.