Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement a simple max_sentencepiece_length into BPE #1228

Merged
merged 18 commits into from
May 16, 2023

Conversation

chris-ha458
Copy link
Contributor

Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way.

If this code were to be actually integrated some works to be done

Documentation describing the behavior and how it should be set. Set default==0 so it doesnt act unless set
provide ways in the python binding for the user to set max token length

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus.

@chris-ha458
Copy link
Contributor Author

I think work on the
source(trainer,model, and module), bindings and documentation would be necessary

@Narsil
Copy link
Collaborator

Narsil commented Apr 28, 2023

I like the proposal, however I think this approach is a bit too simplistic to work effectively (I could be wrong).

If you're working on non space splitted languages (like korean, I'm guessing given your GH location), then BPE will always be unbearably slow if you don't set a pre_tokenizer.

The reason is that if you have a 1000 char long sentence, for each merge it has to update all it's neighbors. So as the BPE is merging pairs, the amount of neighbors is ever growing leading to huge amount of compute to keep track of everything. With a space splitting, some words get "finished" early (like the) and as they don't have neighbors (since you can't merge on spaces) then, they are entirely discarded from the merge computation.

So here I think we should follow the same idea, and remove the neighbors when the resulting token would be larger than X.

Also I don't like using 0 for None. Rust has the concept of using Option<usize> which would be more appropriate here.

Would you be up to change the PR so remove the neighbors instead of just skipping value ?
Also I could be wrong about this issue, if you have some way of showing how this works on your data, I'd be glad to be proven wrong.

@chris-ha458
Copy link
Contributor Author

Preface

#1227 (comment)
#1230 (comment)

I've made a separate issue, but if this PR is where you are comfortable discussing the commit I am fine with it.
As for the first PR, my mistake. Git and Github is still difficult for me.

Motivation

Since you've shown interest I'll try more to implement it and show you data on my own datasets and code.
The motivation of this commit is indeed Unigram.

I am indeed using preprocessors, in fact I am using them extensively.
But due to training on extensive code alongside multilingual tokens, the preprocessors alone
(unicode, digit etc) do not stop the inclusion of certain pathologic tokens such as

"ValueFromPipelineByPropertyName"
"showCompletionInArgumentLists"
"ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡ãħ¡"
"ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg"

to name a few.
As for why this might be problematic I offer 3 reasons.

  1. is intuition. they do not seem semantically meaningful and infact, they are composed of smaller more meaningful units.
  2. Empirical testing aka "SolidGoldMagiKarp" phenomena
  3. negative evidence, where 16 characters are enough to encode 99.5% of actual words. (refer to Zipfian distribution, RetVec paper, and 99.5% of fastext vocab being shorter than 16 characters)
  4. spm also uses 16 chars

I initially tried to use the "Split()" preprocessor regex function to implement this function.
I would guess it would not be impossible, but it is very slow and it is still hard to ensure proper function.
Instead of a proper 16 char processing, I implemented a camelcase preprocessor that splits on upper/lowercase change in regex which fixed most of it. Unfortunately, long lowercases/specialcharacter tokens remain.
But I felt like it was a bandaid so I'm trying to upstream a change.

Other implementations

In the original sentencepiece library it is implemented as a function IsValidSentencePiece called by both BPE or Unigram
https://github.com/google/sentencepiece/blob/25b64fc630a126d25074f9393a33f61182dfdb49/src/trainer_interface.h#L118

It is the same for the huggingface library unigramtrainer

fn is_valid_sentencepiece(&self, char_string: &[char]) -> bool {

I think it makes sense to have separate options for bpetrainer and unigram trainer for the huggingface library.

Going Forward

Also I don't like using 0 for None. Rust has the concept of using Option which would be more appropriate here.
Would you be up to change the PR so remove the neighbors instead of just skipping value ?

I'll be more than glad to adhere to idiomatic Rust patterns or your codebase guidelines.
I was unaware of BPE neighbor behaviors so I shall try to implement it the way you describe.
However, do you still agree that the neighbors should be dropped instead of separated even when this is happening after preprocessing? Regardless, for the time being, I will try to implement them as you see fit.

As you might have seen from the silly error in the commit, I am not versed in rust but I am motivated to learn.
The following is my plan
0. Fix the silly error. learn rust

  1. First prepare a commit that implements Option
  2. prepare commit that discard neighbors

@chris-ha458
Copy link
Contributor Author

Currently I implemented the Option to make it safer.
u8 is inadequate since there is clearly tokens longer than 256 characters that also have some meaning (whether it is useful should be predetermined)
I have never seen a token whose character length is longer than 65535 ( which is minimum 64k bytes maximum 256k bytes)

but if usize is better there would be no difficulty reimplementing as such.

How should the variable be communicated to down stream? should it be set to None as default for the time being?
i left it as the default 16 characters for my own testing.

as for neighbor discarding, wouldn't it be gone automatically?
If my understanding of the code is correct, the potential merges are popped from the top by continuing without processing them,
the part_a and part_b will be gone in the next iteration.
Please let me know if I misunderstood the code or your intentions. I'll be testing them out in the meantime.

Some(length) => {
    if new_token.chars().count() >
    (length as usize) {
        continue;
    }

@chris-ha458
Copy link
Contributor Author

I trained a tokenizer and my findings
Visual inspection shows no particularly "problematic" tokens. even without my camelcase regex.
No tokens are longer than 16 characters as expected.
For english/ascii tokens this works as expected.
Many tokenizer related metrics seem positive unk rate, text compression rate, token utilization, etc.

Some unexpected findings
for unicode tokens (2+bytes) this counts the bytelevel representations instead the characters
so the tokens themselves could be 16 characters which represent as low as 5 actual characters
(Ġ is whitespace modified by ByteLevel() pretokenizer. L is a byte representing unicode leading byte C a continuing byte)
example : ĠLCCLCCLCCLCCLCC
This isnt a major issue since max_merge_length will be a modifiable parameter that defaults to 0/None/infinity
Unigram used with bytelevel also shares this behavior so i don't see any issues.

Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the additional info+context.

I looked more deeply into the merge part. to skip neighboring on invalid tokens.

The issue is that it will make the new potential "large" pairs still be in the where_to_update hashmap, meaning we're keeping track of extra potential pairs which will always get ignored afterwards.

I proposed the change I envisionned here: #1235 (Feel free to just cherry-pick my commit, I can also push here, but in general we refrain from pushing to contributor's branch).

What this change modifies:

  • Pairs that would lead to the merge being > max_length get ignored. So they cannot even be considered in the future. meaning the overall pair maps should be smaller, and less wasted computation
  • Keep Option in the do_train because it expressed better intention.
  • Use usize::MAX within merge method. This just makes the code easier to read. This doesn't seem to affect performance (it's only 1 add + 1 compare).

For the choice, of u8 vs u16 vs usize.
In general, don't optimize too soon. The size of such value doesn't really matter, and you'd be suprised how often the result code is just as fast. Since there's only 1 such value, there's not even RAM benefits for it.

IMO the rule for types is:

  • Choose a type that only expresses VALID choices. (Here there can be a value or None, and since we're using unsigned, we're de facto discarding negative values which don't mean much). Zero while being a stupid choice will do something (and there's no type for >1)
  • Choose the simplest possible type on impact to code (here usize means we avoid ll those nasty as usize)
  • If you see performance bottlenecks, THEN, check if you need to update the type.

Cheers.

Comment on lines 510 to 520
match max_merge_length {
None | Some(0) => {
// in case 0 was manually entered, treat as None
}
Some(length) => {
if new_token.chars().count() >
(length as usize) {
continue;
}
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match max_merge_length {
None | Some(0) => {
// in case 0 was manually entered, treat as None
}
Some(length) => {
if new_token.chars().count() >
(length as usize) {
continue;
}
}
}
if let Some(max_token_length) = max_token_length{
if new_token.chars().count() > max_token_length{
continue;
}
}

This is more idiomatic imo.
0 is NOT a special value. None means ignore, zero does mean zero. If things starts to do weird things it's not the problem of this function, it's respecting the value which is more important.

Try to switch to usize it makes everything easier to read, and the actual "size" isn't important optimization wise.

@Narsil
Copy link
Collaborator

Narsil commented May 1, 2023

Sorry I implemented the change, but since I had to look deeply to provide good suggestions, I pretty much had to make the change (to verify performance impact for instance).

For propagating the option, I'll let you handle it in Rust first as you seem eager to make this change properly.

For the bindings, I think it will be easier if I create the changes, and I will ping you on the modifications so you can learn how to do it. (You're more than welcome to try to do it, but it's likely to be error prone, boring and difficult, just want to manage expectations :) )

@chris-ha458
Copy link
Contributor Author

I appreciate you explaining the code and what is going on, while trying to educate me on the issues and offering me opportunities to contribute.

I would usually prefer that i make as much changes myself, but since I hope to utilize the changed version of the tokenizer for my project I would rather not stand in your way.
Nevertheless, since you suggested that I implement the propagation part I shall try to do it myself.

Atleast for this PR i will just take as much code as I can from your version rather than try to interject my views for this instance.
BTW there is a >0 type in std called NonZeroU16 but i would assume that it would not be useful performance or maintainabilitywise at this time, so I shall continue using usize for the time being.

@Narsil
Copy link
Collaborator

Narsil commented May 1, 2023

I would usually prefer that i make as much changes myself, but since I hope to utilize the changed version of the tokenizer for my project I would rather not stand in your way.

You're welcome to try ! If you want to take a stab, try to micmic what exist beforehand.

Just ping me when you're done ! cheers.

@chris-ha458
Copy link
Contributor Author

tbh the scariest thing for me is not the coding but the git fu involved.
I have merged your version to my branch and have started working.
Please feel free to point out any issues with the git portion of my contribution as you see fit.

@chris-ha458
Copy link
Contributor Author

@Narsil
One commit simply propagates the variable upstream. defaults to None and not do anything.
I can see why you said 0 should be zero and invalid value should be something else(None) now.

second commit provides a descriptive variable name.

@Narsil
Copy link
Collaborator

Narsil commented May 1, 2023

Can you rename max_merge_length vs max_token_length (it's called `max_piece_length in unigram, but piece is not really significant).

I think it's better than merge_length because you care about the token lengths, not about the merge (most users don't know what a merge is).

Then could you add a test showcasing on dummy data the effect after training ?

@chris-ha458
Copy link
Contributor Author

  1. I have concerns on the name, but it's not strictly incorrect so I don't have enough objections to it. I'll prepare a commit to that effect.

  2. Writing a test like that might be more involved than i can handle due to my current level or understanding regarding this codebase, rust and cargo testing, and bpe mechanisms.

If you point out a test that is known to produce >16 char tokens i could maybe duplicate it to make a new test, but I am not sure what would need to be done for a more comprehensive test.

@chris-ha458
Copy link
Contributor Author

chris-ha458 commented May 2, 2023

I have sent two commits that each

  1. change variable name as requested
  2. add some missing declarations.

I will work on tests by copying a pre-existing test in the tests/ folder. this may take a while.

Edit : I think it should live in tests/training.rs
maybe as fn bpe_max_token_length()

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 3, 2023

The documentation is not available anymore as the PR was closed or merged.

@chris-ha458
Copy link
Contributor Author

The ubuntu-latest error isnt reproducible in my machine?

@Narsil
Copy link
Collaborator

Narsil commented May 3, 2023

The ubuntu-latest error isnt reproducible in my machine?

It should. You changed the serialization of the tokenizer, so the tests need to be updated accordingly. (We always make sure we can load the old formats and make update the latest ones).

You probably haven't tested with tokenizers installed from source:

cd bindings/python && pip install -e . && pytest -sv tests/

@chris-ha458
Copy link
Contributor Author

ah okay. thank you.
so how should I fix it? would it need work on the bindings part or did i do something wrong on the rust side?

@chris-ha458
Copy link
Contributor Author

Also, I realize that even for the debugging/development purposes,
the let max_token_length inside the loop is highly suboptimal
I'd like to be able to

  1. Easily set it to 16 even before the python bindings are finished,
  2. set it to None/default easily so that i can write tests that compare with it on and off.

What would be the standard approach in this case?
Meanwhile I am coming to the conclusion that approaching this

tests/training.rs is the right approach.
I will closely copy a test that trains a big tokenizer and see if setting max_token_length makes the tokenizer vocab be smaller than 16.
What other things should be done?
I can think of, other corpuses, apply some of pretokenizers, seeing if vocab is okay etc
but I guess that might explode combinations, so I would like suggestions so as to what tests should be done.

@Narsil
Copy link
Collaborator

Narsil commented May 3, 2023

What other things should be done?

You can do it on a super small set, it's actually better than on a big one, since you could include the entirety of the code within the test, making it readable, and have both the training without the flag producing tokens larger than x, and with the flag not producing them.

Does this make sense ?

(Actually don't create a big training test if possible, that slows down the CI, and produces less value than the smaller test IMO)

@chris-ha458
Copy link
Contributor Author

What do you mean by small set? I think I grasp what you are saying but I fear I might not be understanding completely so I'll describe more specifically what I'd do.

  1. copy the bpe_continuing_subword_prefix_error() in training.rs and make bpe_max_token_length_check(). would it count as a "small set" since it uses a "small" corpus?
  2. remove the current assert_eq!
  3. make the code run twice, once with the bpe_max_token_length set and once without.
  4. check that there are tokens longer than max_token_length not set, and arent such tokens with it set.

Do you like this idea?

@Narsil
Copy link
Collaborator

Narsil commented May 4, 2023

https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs#L651

I meant adding another test along those lines.

Much narrower scope.

@chris-ha458
Copy link
Contributor Author

ah.

So is this the integrated test vs unit test (what I suggested was the former, what you mentioned the latter)

Anyway, I copied the test you mentioned and made a new one.
It is passing for 8,16.

Also the previous commits does :
Set max_token_length default in the proper location. The default shouldn't be 16 but rather None, but this is for debugging :)
I commented it to reflect that. Thankfully no tests fail.
Also the initial "retrieval" ? of the value from self is done outside the loop. no reason for it to be done inside.
atleast after the training started it should be an immutable variable.

token.chars().count() <= max_token_length,
"token too long : {} , chars().count() = {}",
token,
token.chars().count()
Copy link
Collaborator

@Narsil Narsil May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a great test ! Good idea on the UTF-8 intense examples !

Could you add the explicit vocab as a result ? It makes the test more readable and more robust.
Let's keep the functional part too, but having the explicit values prevents "under the radar" bugs, where behavior is modified unintentionally.|

You could reduce the size of those strings and max_token_length in order to keep things readable maybe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by keep explicit vocabulary?
Do you mean the tokens learned by the model under that setting?

In that case I propose we make separate tests.

The current test could serve as the basis for a robust test that checks across multiple max_token_length values and added vocabulary.
I wanted to add Arabic but it renders differently depending on the IDE (due to left to right script) but exactly because of such behavior it would be useful to have in the future.

Another test could be added that takes a look into individual vocabularies learned, but this would be more rigid since adding more words to be learned would also change the vocabs, so we'd need to have new known good tokens to compare with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if we go for the multiple test route, i definitely think one test with max_token_length set to 16 would be useful since that's the value Sentencepiece uses as default for both unigram and bpe

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No 1 test, but explicit assert.

assert_eq!(tokenizer.get_vocab(), HashMapFrom([....]))

It's really much better imo. Tests don't change that often, and I have seen many bugs be silent for too long for this lack of explicit value testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm so you feel like the test already has enough unicode coverage?

What about,

Leave current test (#1)

 change current test with explicit assert compare (#2)

This way we could add any new unicode we want on #1 and #2 can serve as explicit coverage.

Is this too many tests for this feature?
Considering the breadth of this feature i think this warrants at least two.(hopefully three) But if you want it to be one, I'll try to pack as much into it

@chris-ha458
Copy link
Contributor Author

added one test on top of the previous one.
Directly tests the learned vocabulary against a known good one.
Possible improvements :
Error message in case of inequal case could be improved
(currently outputs each content completely without mentioniong which one is problematic.

@chris-ha458
Copy link
Contributor Author

it seems the failing test is due to pickling error. Am I correct in assuming that the proper way to make them pass is to implement the python bindings?

@chris-ha458
Copy link
Contributor Author

It seems that right now the missing pieces are docs and python bindings
@Narsil do you have something in the works?
If not, I can start on those myself. if you have any preference for my start point, feel free to let me know.

chris-ha458 and others added 14 commits May 15, 2023 18:02
Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way.

If this code were to be actually integrated some works to be done

Documentation describing the behavior and how it should be set.
Set default==0 so it doesnt act unless set
provide ways in the python binding for the user to set max token length

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus.
Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way.

If this code were to be actually integrated some works to be done

Documentation describing the behavior and how it should be set.
Set default==0 so it doesnt act unless set
provide ways in the python binding for the user to set max token length

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus.
clarify with type usize propagate max_length option
in the documentation
https://huggingface.co/docs/tokenizers/api/trainers
unigramtrainer uses max_piece_length for similar function.
since BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
change max_merge_length into max_token_length
add several max_token_length declaration that were missing.
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.
Move default definition of max_token_length to proper location. adjust downstream variable initializations accordingly.
clarified test documentation
@Narsil Narsil force-pushed the chris-ha458-patch-1 branch from 1c4507b to aa309dd Compare May 15, 2023 16:02
@Narsil Narsil merged commit cefc41e into huggingface:main May 16, 2023
@chris-ha458 chris-ha458 deleted the chris-ha458-patch-1 branch May 16, 2023 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants