-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SFT chatml
applied using setup_chat_format
leads to surprising updates that (potentially) break pretraining decisions
#1412
Comments
Why is setup_chat_format is implemented in a way that <|im_start|> token replaces bos_token? As far as I understand <|im_start|> and <|im_end|> are special tokens to separate user's messages in chat history, shouldn't they be added to the tokenizer as new tokens? |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Any inputs from the TRL team on this? |
I'm also having problems with this. I trained a LoRA for Llama3 using
|
did you save LoraConfig(
..,
modules_to_save = ["lm_head", "embed_tokens"])
) if you did and it still fails then probably you didn't apply |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Should we just use |
huggingface/tokenizers#1570 could also help as well |
``
Issues/questions:
setup_chat_format
seems to need more fine-grained control based on model/tokenizerI'm instruction tuning
google/gemma-7b
on a custom dataset using TRL. I came acrosssetup_chat_format
, which is a great step towards all things problematic with chat templates. After looking into updates applied to the tokenizer I'm wondering if some of the individual token id updates are problematic, as well as the resultingchat_template
update.The Gemma models require a
<bos>
token prepended to each message. If I applysetup_chat_format
to the tokenizer the newbos_token
becomes<|im_start|>
. Other issues occur such as theeos_token
is overwritten with<|im_end|>
. This leaves me with some rather surprising updates to the tokenizer (e.g.['<bos>', '<eos>', '<unk>', '<pad>']
are lost), as well as the updates to the token embeddings (why aren't added tokens being instantiated from the average of token embeddings, see 3. ?).Quick and dirty example:
This all leads to some surprising behavior and updates that ignores pretraining decisions with model/tokenizer that directly impacts downstream SFT using chatml format. This all came from seeing some rather strange loss values.
To accommodate this one has to embed
<bos>, <eos>
into some processing_fn for the dataset and keep this in mind during inference, or completely ignore Gemma's decisions during pretraining. It seems like this could be resolved with finer-grained control along with more robust templating for specific models when applyingsetup_chat_format
, which I feel is the better approach to reduce individual user effort and get more people on board with a seamless way to unify chat structure during SFT, which is already a mess as of today.I also like @philschmid work around to essentially reassign tokens that comply with chatml, without losing the model/tokenizer's trained embedding.
The text was updated successfully, but these errors were encountered: