-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generalized chat sft prompt #7655
Conversation
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor code style issues
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
"end_of_name": "\n", | ||
} | ||
else: | ||
self.special_tokens = special_tokens | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do a check to see if the tokens in special_tokens are tokenizer's special tokens or not? If not (the case with llama), can we just throw a warning that we'll use text as turn tokens which might cause incorrect merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an assert in the code
assert torch.equal(torch.tensor(target[:header_len]), torch.tensor(header_tokens))
which will throw an exception if the token merge happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is different, the token merge can still happen during multi-turn
what I mean is that if the turn tokens are not special tokens, we just say that there might be an error possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The header_len
stops at the "end_of_turn". The next token is "turn_start". If the merge happens this assert will catch it. The multiple turn has the same thing. each turn ends with "end_of_turn" and the next token is "turn_start". So this one is enough to catch it.
Also I don't see the point of just giving a warning which doesn't help the user at all.
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
# for key in turn['human_labels']: | ||
# value_set = label_values.get(key, set()) | ||
# value_set.add(turn['human_labels'][key]['value']) | ||
# label_values[key] = value_set |
Check notice
Code scanning / CodeQL
Commented-out code Note
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Yi Dong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
* fix dataset issues Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * all passed Signed-off-by: Yi Dong <[email protected]> * refactor tests Signed-off-by: Yi Dong <[email protected]> * all pass Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * use end name signal for labels Signed-off-by: Yi Dong <[email protected]> * all fixed Signed-off-by: Yi Dong <[email protected]> * update doc Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * make sure nccl not timing out Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * generate example template Signed-off-by: Yi Dong <[email protected]> * generic end of name token Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * add the chat prompt format into the config Signed-off-by: Yi Dong <[email protected]> * make sure sft working Signed-off-by: Yi Dong <[email protected]> * address reviewer comment Signed-off-by: Yi Dong <[email protected]> * fix non Signed-off-by: Yi Dong <[email protected]> * try openAI prompt Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * remove human labels from the data Signed-off-by: Yi Dong <[email protected]> * use hf dataset to clean Signed-off-by: Yi Dong <[email protected]> * reviewer comments Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]> Signed-off-by: Sasha Meister <[email protected]>
* fix dataset issues Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * all passed Signed-off-by: Yi Dong <[email protected]> * refactor tests Signed-off-by: Yi Dong <[email protected]> * all pass Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * use end name signal for labels Signed-off-by: Yi Dong <[email protected]> * all fixed Signed-off-by: Yi Dong <[email protected]> * update doc Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * make sure nccl not timing out Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * generate example template Signed-off-by: Yi Dong <[email protected]> * generic end of name token Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * add the chat prompt format into the config Signed-off-by: Yi Dong <[email protected]> * make sure sft working Signed-off-by: Yi Dong <[email protected]> * address reviewer comment Signed-off-by: Yi Dong <[email protected]> * fix non Signed-off-by: Yi Dong <[email protected]> * try openAI prompt Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * remove human labels from the data Signed-off-by: Yi Dong <[email protected]> * use hf dataset to clean Signed-off-by: Yi Dong <[email protected]> * reviewer comments Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]> Signed-off-by: Elena Rastorgueva <[email protected]>
* fix dataset issues Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * all passed Signed-off-by: Yi Dong <[email protected]> * refactor tests Signed-off-by: Yi Dong <[email protected]> * all pass Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * use end name signal for labels Signed-off-by: Yi Dong <[email protected]> * all fixed Signed-off-by: Yi Dong <[email protected]> * update doc Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * make sure nccl not timing out Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * generate example template Signed-off-by: Yi Dong <[email protected]> * generic end of name token Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * add the chat prompt format into the config Signed-off-by: Yi Dong <[email protected]> * make sure sft working Signed-off-by: Yi Dong <[email protected]> * address reviewer comment Signed-off-by: Yi Dong <[email protected]> * fix non Signed-off-by: Yi Dong <[email protected]> * try openAI prompt Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * remove human labels from the data Signed-off-by: Yi Dong <[email protected]> * use hf dataset to clean Signed-off-by: Yi Dong <[email protected]> * reviewer comments Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]>
* fix dataset issues Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * all passed Signed-off-by: Yi Dong <[email protected]> * refactor tests Signed-off-by: Yi Dong <[email protected]> * all pass Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * use end name signal for labels Signed-off-by: Yi Dong <[email protected]> * all fixed Signed-off-by: Yi Dong <[email protected]> * update doc Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * make sure nccl not timing out Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * generate example template Signed-off-by: Yi Dong <[email protected]> * generic end of name token Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * add the chat prompt format into the config Signed-off-by: Yi Dong <[email protected]> * make sure sft working Signed-off-by: Yi Dong <[email protected]> * address reviewer comment Signed-off-by: Yi Dong <[email protected]> * fix non Signed-off-by: Yi Dong <[email protected]> * try openAI prompt Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * remove human labels from the data Signed-off-by: Yi Dong <[email protected]> * use hf dataset to clean Signed-off-by: Yi Dong <[email protected]> * reviewer comments Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]>
* fix dataset issues Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * all passed Signed-off-by: Yi Dong <[email protected]> * refactor tests Signed-off-by: Yi Dong <[email protected]> * all pass Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * use end name signal for labels Signed-off-by: Yi Dong <[email protected]> * all fixed Signed-off-by: Yi Dong <[email protected]> * update doc Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * make sure nccl not timing out Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * generate example template Signed-off-by: Yi Dong <[email protected]> * generic end of name token Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * add the chat prompt format into the config Signed-off-by: Yi Dong <[email protected]> * make sure sft working Signed-off-by: Yi Dong <[email protected]> * address reviewer comment Signed-off-by: Yi Dong <[email protected]> * fix non Signed-off-by: Yi Dong <[email protected]> * try openAI prompt Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * remove human labels from the data Signed-off-by: Yi Dong <[email protected]> * use hf dataset to clean Signed-off-by: Yi Dong <[email protected]> * reviewer comments Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]>
* fix dataset issues Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * all passed Signed-off-by: Yi Dong <[email protected]> * refactor tests Signed-off-by: Yi Dong <[email protected]> * all pass Signed-off-by: Yi Dong <[email protected]> * working version Signed-off-by: Yi Dong <[email protected]> * use end name signal for labels Signed-off-by: Yi Dong <[email protected]> * all fixed Signed-off-by: Yi Dong <[email protected]> * update doc Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * make sure nccl not timing out Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * generate example template Signed-off-by: Yi Dong <[email protected]> * generic end of name token Signed-off-by: Yi Dong <[email protected]> * style fix Signed-off-by: Yi Dong <[email protected]> * add the chat prompt format into the config Signed-off-by: Yi Dong <[email protected]> * make sure sft working Signed-off-by: Yi Dong <[email protected]> * address reviewer comment Signed-off-by: Yi Dong <[email protected]> * fix non Signed-off-by: Yi Dong <[email protected]> * try openAI prompt Signed-off-by: Yi Dong <[email protected]> * remove unused imports Signed-off-by: Yi Dong <[email protected]> * remove human labels from the data Signed-off-by: Yi Dong <[email protected]> * use hf dataset to clean Signed-off-by: Yi Dong <[email protected]> * reviewer comments Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]>
What does this PR do ?
In this PR, it genialized the chat SFT dataset that it can use customized turn start/end tokens by using chat_prompt_tokens config. e.g.
after this change, the LM is not required to have "extra_id" special tokens any more to use chat SFT dataset. In this PR, also expanded the unit test to cover more LM tokenizers.
Another feature added is to overwrite the prompt_template config with the chat prompt format.