
BOS and EOS #195

Merged
merged 5 commits into from
Jul 26, 2023
Conversation

SinanAkkoyun
Contributor

This PR just adds optional functionality to the tokenizer to pad EOS and BOS token IDs (important for some chat formats like Llama 2 and OpenAssistant).

@vadi2

vadi2 commented Jul 25, 2023

I'm curious, what does padding the token IDs accomplish?

@turboderp
Owner

turboderp commented Jul 25, 2023

Padding with padding tokens allows you to run multiple sequences of different lengths in the same batch. "Padding" in this PR is really more prepending and appending the BOS and EOS tokens, respectively. Maybe the arguments should be called add_bos and add_eos to avoid this confusion?

Also, this is more of a workaround and I really would prefer to fully support encoding of control symbols, like in AutoTokenizer. I.e. you should just be able to add <s> and </s> at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.
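The add_bos/add_eos behaviour being proposed could be sketched roughly like this (a minimal illustration only, not the project's actual API; the BOS_ID/EOS_ID values follow the llama-family sentencepiece convention, and fake_tokenize is a hypothetical stand-in for the real tokenizer):

```python
BOS_ID = 1  # llama-family sentencepiece convention: <s> = 1
EOS_ID = 2  # llama-family sentencepiece convention: </s> = 2

def fake_tokenize(text):
    # Toy stand-in for the real tokenizer: one fake token ID per word,
    # offset past the special IDs so they never collide with BOS/EOS.
    return [hash(w) % 1000 + 3 for w in text.split()]

def encode(text, add_bos=False, add_eos=False):
    # Tokenize, then optionally prepend BOS and/or append EOS.
    ids = fake_tokenize(text)
    if add_bos:
        ids = [BOS_ID] + ids
    if add_eos:
        ids = ids + [EOS_ID]
    return ids
```

With names like add_bos/add_eos the flags describe exactly what happens to the ID sequence, which avoids conflating this with batch padding.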

@SinanAkkoyun
Contributor Author

> Maybe the arguments should be called add_bos and add_eos to avoid this confusion?

Sure, will change!

> Also, this is more of a workaround and I really would prefer to fully support encoding of control symbols, like in AutoTokenizer. I.e. you should just be able to add `<s>` and `</s>` at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.

I totally get that, hence my question a couple of days ago about why sentencepiece, but Llama 2's repo is doing exactly this: appending EOS and BOS, tokenizing each 'role' prompt, and concatenating all of those encoded IDs into a full prompt.

I wanted to implement exactly the same without messing with the tokenizer, so I believe this change should also be merged in conjunction with the tokenizer changes.
(And doesn't decoding output the EOS and BOS strings?)
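The per-role assembly described here can be sketched as follows (illustrative only; toy_encode is a hypothetical stand-in for the real sentencepiece tokenizer, and the `[INST]` wrapping follows Meta's Llama 2 reference chat format):

```python
BOS_ID, EOS_ID = 1, 2  # llama-family sentencepiece convention

def toy_encode(text):
    # Toy stand-in for the real tokenizer: one fake token ID per word,
    # offset past the special IDs.
    return [hash(w) % 1000 + 3 for w in text.split()]

def build_chat_ids(turns):
    """Encode each (user, answer) turn separately, wrap it in
    BOS ... EOS, and concatenate all turns into one ID sequence."""
    ids = []
    for user, answer in turns:
        ids += [BOS_ID] + toy_encode(f"[INST] {user} [/INST] {answer}") + [EOS_ID]
    return ids
```

Because each turn is tokenized independently and only then concatenated, the BOS/EOS IDs land between turns exactly as the reference implementation produces them, without any special-token handling inside the tokenizer itself.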

@SinanAkkoyun
Contributor Author

SinanAkkoyun commented Jul 25, 2023

I will later also submit a PR with the exact Llama 2 chat completion implementation, as it itself uses sentencepiece with BOS/EOS appending, and I would like this tokenizer feature to be merged for that. Would that be OK with you?

@SinanAkkoyun SinanAkkoyun changed the title BOS and EOS padding BOS and EOS Jul 25, 2023