EOS token in the prepare_redpajama script #329

Closed
LamOne1 opened this issue May 28, 2023 · 3 comments

LamOne1 commented May 28, 2023

Hello,

I noticed that you don't add an EOS token at the end of the examples in the prepare_redpajama script, so I tokenized my data without it. The model performs quite well during pretraining, but of course I ran into a problem when generating text: I don't know when to stop the generation.
I then ran the prepare_redpajama script again with tokenizer.encode(.., eos=True), but the pretraining result was very bad: the model generates text that is not related to the prompt at all, and it never generates my EOS token.

Also, I noticed that there is a difference between the prepare_redpajama scripts in lit-llama and lit-parrot:
lit-llama:

builder = packed_dataset.PackedDatasetBuilder(
            outdir=destination_path,
            prefix=set_name,
            chunk_size=chunk_size,
            sep_token=tokenizer.bos_id,
            dtype="auto",
            vocab_size=tokenizer.vocab_size,
        )

Here the sep_token is the BOS id.

lit-parrot:

builder = packed_dataset.PackedDatasetBuilder(
            outdir=destination_path,
            prefix=prefix,
            chunk_size=chunk_size,
            sep_token=tokenizer.eos_id,
            dtype="auto",
            vocab_size=tokenizer.vocab_size,
        )

Here the sep_token is the EOS id.

Is this the cause of my problem?
I'm using a different tokenizer than LLaMA, so in my case should I follow the lit-parrot approach?
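
For context, here is a rough sketch of my mental model of what the sep_token does when packing (conceptual only, not the actual PackedDatasetBuilder implementation):

import numpy as np

# Conceptual sketch: tokenized documents are written into a single stream with
# the sep_token (BOS in lit-llama, EOS in lit-parrot) marking document
# boundaries, and the stream is cut into fixed-size chunks, with the separator
# also filling the leftover tail.
def pack(token_lists, sep_id, chunk_size):
    stream = []
    for ids in token_lists:
        stream.extend(ids)
        stream.append(sep_id)              # boundary between documents
    pad = (-len(stream)) % chunk_size      # fill the tail so every chunk is full
    stream.extend([sep_id] * pad)
    return np.array(stream, dtype=np.int64).reshape(-1, chunk_size)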

LamOne1 (Author) commented May 29, 2023

I'm not sure, but this may answer my question.

Also, from my understanding of the training code, the model is not fed the whole input sequence during training. The sequence is segmented up to block_size, which means the model doesn't always see the EOS token at the end of a sequence if the sequence is too long (?)

input_ids = train_data[:, 0 : model.config.block_size].contiguous()
targets = train_data[:, 1 : model.config.block_size + 1].contiguous()
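
For example (toy values, just to illustrate the slicing), targets is the same packed row shifted by one position:

import torch

# Toy illustration of the slicing above with block_size = 4 (made-up token ids).
block_size = 4
train_data = torch.tensor([[10, 11, 12, 13, 14, 15]])    # one packed row

input_ids = train_data[:, 0:block_size].contiguous()      # tensor([[10, 11, 12, 13]])
targets = train_data[:, 1:block_size + 1].contiguous()    # tensor([[11, 12, 13, 14]])
# position t of input_ids is trained to predict position t of targets, i.e. the
# next token in the stream; an EOS further along only appears in a window that
# covers its position.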

lantiga (Collaborator) commented May 29, 2023

Hi @LamOne1, a few points that make pre-training different from supervised fine-tuning:

  • the pre-training task is next-token prediction on single tokens, not on whole sequences; so the model may just need to predict an "e" or an "m", and that is fine
  • which means you can simply pack sequences one after the other, separated by EOS (we do separate them by EOS when we build the packed dataset), and compute the loss on the predicted tokens; in this case the model also attends to extraneous sentences, but that may not matter much in practice

The format is similar to what is used here: https://huggingface.co/blog/stackllama#supervised-fine-tuning

Note that for pre-training you can also skip packing and instead add padding tokens to sequences that are shorter than the context length. This is closer to what is done conventionally; which option yields better results is still unclear, AFAIK.
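
For concreteness, a toy sketch of the padded alternative (the ids are made up; eos_id = 2 and pad_id = 0 are assumptions, not the repo's actual special-token values):

# One document per row, EOS appended, then padded up to the context length;
# the loss would typically be masked out on the pad positions.
eos_id, pad_id, block_size = 2, 0, 8
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]

rows = [d + [eos_id] + [pad_id] * (block_size - len(d) - 1) for d in docs]
# rows == [[5, 6, 7, 2, 0, 0, 0, 0],
#          [8, 9, 2, 0, 0, 0, 0, 0],
#          [10, 11, 12, 13, 2, 0, 0, 0]]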

LamOne1 (Author) commented May 31, 2023

Thank you @lantiga, your answer was clear and very helpful. I really appreciate it.

LamOne1 closed this as completed May 31, 2023