
BOS and EOS #195

Merged
merged 5 commits into from
Jul 26, 2023
Conversation

SinanAkkoyun
Contributor

This PR just adds optional functionality to the tokenizer to pad EOS and BOS token IDs (important for some chat formats like Llama 2 and OpenAssistant).

@vadi2

vadi2 commented Jul 25, 2023

I'm curious, what does padding the token IDs accomplish?

@turboderp
Owner

turboderp commented Jul 25, 2023

Padding with padding tokens allows you to run multiple sequences of different lengths in the same batch. "Padding" in this PR is really more prepending and appending the BOS and EOS tokens, respectively. Maybe the arguments should be called add_bos and add_eos to avoid this confusion?

Also, this is more of a workaround and I really would prefer to fully support encoding of control symbols, like in AutoTokenizer. I.e. you should just be able to add <s> and </s> at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.
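The add_bos/add_eos behaviour being proposed could be sketched roughly like this (a minimal illustration only, not the project's actual API; the BOS_ID/EOS_ID values follow the llama-family sentencepiece convention, and fake_tokenize is a hypothetical stand-in for the real tokenizer):

```python
BOS_ID = 1  # llama-family sentencepiece convention: <s> = 1
EOS_ID = 2  # llama-family sentencepiece convention: </s> = 2

def fake_tokenize(text):
    # Toy stand-in for the real tokenizer: one fake token ID per word,
    # offset past the special IDs so they never collide with BOS/EOS.
    return [hash(w) % 1000 + 3 for w in text.split()]

def encode(text, add_bos=False, add_eos=False):
    # Tokenize, then optionally prepend BOS and/or append EOS.
    ids = fake_tokenize(text)
    if add_bos:
        ids = [BOS_ID] + ids
    if add_eos:
        ids = ids + [EOS_ID]
    return ids
```

With names like add_bos/add_eos the flags describe exactly what happens to the ID sequence, which avoids conflating this with batch padding.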

@SinanAkkoyun
Contributor Author

> Maybe the arguments should be called add_bos and add_eos to avoid this confusion?

Sure, will change!

> Also, this is more of a workaround and I really would prefer to fully support encoding of control symbols, like in AutoTokenizer. I.e. you should just be able to add `<s>` and `</s>` at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.

I totally get that, hence my question a couple of days ago about why sentencepiece, but Llama 2's repo is doing exactly this: appending EOS and BOS, tokenizing each 'role' prompt, and concatenating all of those encoded IDs into a full prompt.

I wanted to implement exactly the same without messing with the tokenizer, so I believe this change should also be merged in conjunction with the tokenizer changes.
(And doesn't decoding output the EOS and BOS strings?)
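The per-role assembly described here can be sketched as follows (illustrative only; toy_encode is a hypothetical stand-in for the real sentencepiece tokenizer, and the `[INST]` wrapping follows Meta's Llama 2 reference chat format):

```python
BOS_ID, EOS_ID = 1, 2  # llama-family sentencepiece convention

def toy_encode(text):
    # Toy stand-in for the real tokenizer: one fake token ID per word,
    # offset past the special IDs.
    return [hash(w) % 1000 + 3 for w in text.split()]

def build_chat_ids(turns):
    """Encode each (user, answer) turn separately, wrap it in
    BOS ... EOS, and concatenate all turns into one ID sequence."""
    ids = []
    for user, answer in turns:
        ids += [BOS_ID] + toy_encode(f"[INST] {user} [/INST] {answer}") + [EOS_ID]
    return ids
```

Because each turn is tokenized independently and only then concatenated, the BOS/EOS IDs land between turns exactly as the reference implementation produces them, without any special-token handling inside the tokenizer itself.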

@SinanAkkoyun
Contributor Author

SinanAkkoyun commented Jul 25, 2023

I will later also submit a PR with the exact Llama 2 chat completion implementation, as it itself uses sentencepiece with BOS/EOS appending, and I would like this tokenizer feature to be merged for that. Would that be OK with you?

@SinanAkkoyun SinanAkkoyun changed the title BOS and EOS padding BOS and EOS Jul 25, 2023