EOS token in the prepare_redpajama script #329
I'm not sure, but this may answer my question. Also, from my understanding of the training code, the model is not fed the whole input sequence during training. The input is segmented into block_size chunks, which means the model doesn't always see the EOS token at the end of a sequence if the sequence is too long (?)
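To make that point concrete, here is a minimal sketch of the typical packed-stream sampling loop used in pre-training (the file name train.bin, the dtype, and the memmap layout are illustrative assumptions, not lit-llama's exact code):

```python
import numpy as np

block_size = 2048
# packed stream of token ids produced by the prepare script (layout assumed)
data = np.memmap("train.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size: int):
    # sample random offsets into the packed stream; a window can start or
    # end anywhere, so an appended EOS may fall mid-window or be cut off
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix]).astype(np.int64)
    return x, y  # inputs and shifted next-token targets
```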
Hi @LamMoh1, a few points that make pre-training different from supervised fine-tuning:
The format is similar to what is used here: https://huggingface.co/blog/stackllama#supervised-fine-tuning Note that for pre-training you can also choose not to pack, and instead add padding tokens to sequences that are shorter than the context length. This is more akin to what is done conventionally; which option yields better results is still unclear AFAIK.
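For the padding alternative mentioned above, a rough sketch of one-document-per-example preprocessing might look like this (the pad_id parameter and the -100 ignore index are assumptions, not lit-llama's actual preprocessing):

```python
import torch

def pad_example(token_ids: list[int], block_size: int, pad_id: int):
    # one document per training example, truncated to the context length
    ids = token_ids[:block_size]
    n_pad = block_size - len(ids)
    x = torch.tensor(ids + [pad_id] * n_pad, dtype=torch.long)
    # -100 is ignored by torch.nn.functional.cross_entropy, so the model
    # is never trained to predict padding tokens
    labels = torch.tensor(ids[1:] + [-100] * (n_pad + 1), dtype=torch.long)
    return x, labels
```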
Thank you @lantiga, your answer was clear and very helpful. I really appreciate it.
Hello,
I noticed that you didn't add an EOS token at the end of the examples in the prepare_redpajama script, so I tokenized my data without it. The model performs quite well in pre-training, but of course I ran into a problem when generating text: I don't know when to stop the generation.
I then re-ran the prepare_redpajama script with tokenizer.encode(.., eos=True), and found the pre-training result is very bad: the model generates text that is not related to the prompt at all, and it never generates my EOS token.
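For reference, the change described above amounts to something like the following inside the prepare script's tokenization loop (the import path, tokenizer checkpoint path, and surrounding loop are illustrative assumptions; only the encode(.., eos=True) call is taken from the issue):

```python
from pathlib import Path
from lit_llama import Tokenizer  # import path assumed from the lit-llama repo

documents = ["example document one.", "example document two."]  # placeholder data
tokenizer = Tokenizer(Path("checkpoints/lit-llama/tokenizer.model"))

all_ids = []
for text in documents:
    ids = tokenizer.encode(text, eos=True)  # eos=True appends the EOS token id
    all_ids.append(ids)  # in the real script these are written to a packed file
```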
Also, I noticed that there is a difference between the prepare_redpajama scripts in lit-llama and lit-parrot:
lit-llama: the sep_token is bos.
lit-parrot: the sep_token is eos.
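If it helps to visualize the difference, here is a toy illustration of the two packing conventions (the token ids are made up placeholders):

```python
BOS, EOS = 1, 2          # placeholder special-token ids
doc_a, doc_b = [10, 11, 12], [20, 21]

# lit-llama style: BOS leads each document -> [BOS] doc_a [BOS] doc_b
lit_llama_stream = [BOS, *doc_a, BOS, *doc_b]

# lit-parrot style: EOS trails each document -> doc_a [EOS] doc_b [EOS]
lit_parrot_stream = [*doc_a, EOS, *doc_b, EOS]
```

Either way, the separator marks document boundaries in the packed stream; it is just placed before each document in one convention and after it in the other.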
Is this the cause of my problem?
I'm using a different tokenizer than LLaMA, so in my case should I follow the lit-parrot approach?