EOS token in the prepare_redpajama script #329

Closed
LamOne1 opened this issue May 28, 2023 · 3 comments

LamOne1 commented May 28, 2023

Hello,

I noticed that you don't add an EOS token at the end of the examples in the prepare_redpajama script, so I tokenized my data without it. The model performs quite well during pretraining, but of course I ran into a problem when generating text: I don't know when to stop the generation.
I then ran the prepare_redpajama script again with tokenizer.encode(.., eos=True), but the pretraining result was very bad: the model generates text that is not related to the prompt at all, and it never generates my EOS token.

Also, I noticed that there is a difference between the prepare_redpajama scripts in lit-llama and lit-parrot:
lit-llama:

builder = packed_dataset.PackedDatasetBuilder(
            outdir=destination_path,
            prefix=set_name,
            chunk_size=chunk_size,
            sep_token=tokenizer.bos_id,
            dtype="auto",
            vocab_size=tokenizer.vocab_size,
        )

Here the sep_token is the BOS id.

lit-parrot:

builder = packed_dataset.PackedDatasetBuilder(
            outdir=destination_path,
            prefix=prefix,
            chunk_size=chunk_size,
            sep_token=tokenizer.eos_id,
            dtype="auto",
            vocab_size=tokenizer.vocab_size,
        )

Here the sep_token is the EOS id.

Is this the cause of my problem?
I'm using a different tokenizer than LLaMA, so in my case should I follow the lit-parrot approach?
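
For context, here is a rough sketch of my mental model of what the sep_token does when packing (conceptual only, not the actual PackedDatasetBuilder implementation):

import numpy as np

# Conceptual sketch: tokenized documents are written into a single stream with
# the sep_token (BOS in lit-llama, EOS in lit-parrot) marking document
# boundaries, and the stream is cut into fixed-size chunks, with the separator
# also filling the leftover tail.
def pack(token_lists, sep_id, chunk_size):
    stream = []
    for ids in token_lists:
        stream.extend(ids)
        stream.append(sep_id)              # boundary between documents
    pad = (-len(stream)) % chunk_size      # fill the tail so every chunk is full
    stream.extend([sep_id] * pad)
    return np.array(stream, dtype=np.int64).reshape(-1, chunk_size)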

LamOne1 (Author) commented May 29, 2023

I'm not sure, but this may answer my question.

Also, from my understanding of the training code, the model is not fed the whole input sequence during training. The sequence is segmented up to block_size, which means the model doesn't always see the EOS token at the end of a sequence if the sequence is too long (?)

input_ids = train_data[:, 0 : model.config.block_size].contiguous()
targets = train_data[:, 1 : model.config.block_size + 1].contiguous()
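
For example (toy values, just to illustrate the slicing), targets is the same packed row shifted by one position:

import torch

# Toy illustration of the slicing above with block_size = 4 (made-up token ids).
block_size = 4
train_data = torch.tensor([[10, 11, 12, 13, 14, 15]])    # one packed row

input_ids = train_data[:, 0:block_size].contiguous()      # tensor([[10, 11, 12, 13]])
targets = train_data[:, 1:block_size + 1].contiguous()    # tensor([[11, 12, 13, 14]])
# position t of input_ids is trained to predict position t of targets, i.e. the
# next token in the stream; an EOS further along only appears in a window that
# covers its position.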

lantiga (Collaborator) commented May 29, 2023

Hi @LamOne1, a few points that make pre-training different from supervised fine-tuning:

  • the pre-training task is next-token prediction on single tokens, not on whole sequences; so the model may just need to predict an "e" or an "m", and that is fine
  • which means you can simply pack sequences one after the other, separated by EOS (we do separate them by EOS when we build the packed dataset), and compute the loss on the predicted tokens; in this case the model also attends to extraneous sentences, but that may not matter much in practice

The format is similar to what is used here: https://huggingface.co/blog/stackllama#supervised-fine-tuning

Note that for pre-training you can also skip packing and instead add padding tokens to sequences that are shorter than the context length. This is closer to what is done conventionally; which option yields better results is still unclear, AFAIK.
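
For concreteness, a toy sketch of the padded alternative (the ids are made up; eos_id = 2 and pad_id = 0 are assumptions, not the repo's actual special-token values):

# One document per row, EOS appended, then padded up to the context length;
# the loss would typically be masked out on the pad positions.
eos_id, pad_id, block_size = 2, 0, 8
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]

rows = [d + [eos_id] + [pad_id] * (block_size - len(d) - 1) for d in docs]
# rows == [[5, 6, 7, 2, 0, 0, 0, 0],
#          [8, 9, 2, 0, 0, 0, 0, 0],
#          [10, 11, 12, 13, 2, 0, 0, 0]]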

LamOne1 (Author) commented May 31, 2023

Thank you @lantiga, your answer was clear and very helpful. I really appreciate it.

LamOne1 closed this as completed May 31, 2023