BOS and EOS #195
Conversation
I'm curious, what does padding the token IDs accomplish?
Padding with padding tokens allows you to run multiple sequences of different lengths in the same batch. "Padding" in this PR is really more prepending and appending the BOS and EOS tokens, respectively. Maybe the arguments should be named accordingly. Also, this is more of a workaround, and I would really prefer to fully support encoding of control symbols, as in AutoTokenizer, i.e. you should just be able to add the control symbols to the text and have them encoded.
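For illustration, here is a rough sketch of the two things being discussed; the helper names and the model id are assumptions, not this repo's API:

```python
from transformers import AutoTokenizer

# 1) True padding: make ragged sequences rectangular so they fit in one batch.
def pad_batch(batch: list[list[int]], pad_id: int) -> list[list[int]]:
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

# 2) The AutoTokenizer behaviour referred to above: control symbols written
#    directly in the text are recognised and encoded as single token IDs.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model id
ids = tok.encode("<s>[INST] Hello [/INST]", add_special_tokens=False)
# "<s>" maps to the single BOS id rather than being split into sub-word pieces
```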
Sure, will change!
I totally get that, hence my question a couple of days earlier about why sentencepiece. But Llama 2's repo is doing exactly this: appending EOS and BOS, tokenizing each 'role' prompt, and concatenating all of those encoded IDs into a full prompt. I wanted to implement exactly the same thing without messing with the tokenizer, so I believe this change should be merged in conjunction with the tokenizer changes.
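A paraphrased sketch of that pattern from Meta's Llama 2 chat completion code (not a verbatim copy; an `encode(text, bos, eos) -> list[int]` callable wrapping sentencepiece with optional BOS/EOS is assumed):

```python
B_INST, E_INST = "[INST]", "[/INST]"

def encode_dialog(history, last_user_msg, encode):
    """history: list of (user, assistant) pairs already completed."""
    ids: list[int] = []
    for user, assistant in history:
        # each completed exchange is wrapped in [INST]...[/INST], encoded with
        # both BOS and EOS, and the encoded pieces are concatenated
        ids += encode(f"{B_INST} {user.strip()} {E_INST} {assistant.strip()} ",
                      bos=True, eos=True)
    # the final user turn gets BOS but no EOS, so generation continues the reply
    ids += encode(f"{B_INST} {last_user_msg.strip()} {E_INST}", bos=True, eos=False)
    return ids
```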
I will later also submit a PR with the exact Llama 2 chat completion implementation, since it itself uses sentencepiece and BOS/EOS appending, and I would like this tokenizer feature to be in place for that. Would that be OK with you?
This PR just adds optional functionality to the tokenizer to pad the BOS and EOS token IDs (important for some chat formats like Llama 2 and OpenAssistant).
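Roughly, the optional behaviour the PR adds looks like the following, written here against raw sentencepiece; the flag names (`add_bos`/`add_eos`) and the model path are hypothetical, not necessarily what the PR uses:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # assumed path

def encode(text: str, add_bos: bool = False, add_eos: bool = False) -> list[int]:
    ids = sp.encode(text)              # plain token IDs, no control symbols
    if add_bos:
        ids = [sp.bos_id()] + ids      # prepend <s>
    if add_eos:
        ids = ids + [sp.eos_id()]      # append </s>
    return ids

# e.g. a Llama 2-style user turn: BOS at the start, no EOS so the model keeps generating
prompt_ids = encode("[INST] Hello [/INST]", add_bos=True, add_eos=False)
```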