Convert generated dataset to messages format #60
Comments
Eval is expecting this as well for MMLU! Tracker issue on our side: instructlab/eval#35
The current output is consistent with what the CLI has always done. We can change it, but we have to coordinate the change across all components that depend on it, including code still in the CLI.
Yes, I see that it converts to messages, but we still need to handle all of the other columns so that nothing breaks when we concatenate with other datasets during training. I added some more details about the schema we need in my original comment.
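For context, here is a minimal sketch (toy data, not the project's actual code) of why the schemas have to line up: `datasets.concatenate_datasets` only works when every dataset exposes the same columns and features.

```python
from datasets import Dataset, concatenate_datasets

# Two toy datasets that already use the messages format with identical columns.
generated = Dataset.from_dict({
    "messages": [[{"role": "user", "content": "Q1"},
                  {"role": "assistant", "content": "A1"}]],
})
other = Dataset.from_dict({
    "messages": [[{"role": "user", "content": "Q2"},
                  {"role": "assistant", "content": "A2"}]],
})

# This only works because the column schemas match; a leftover q/a/context
# column in one of them would make concatenate_datasets raise a features
# mismatch error at training time.
combined = concatenate_datasets([generated, other])
print(combined.num_rows)  # 2
```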
The training library currently expects messages, but I believe legacy Linux training and macOS training do not.
Where do we see this conversion happening? Would this be in _gen_train_data in generate_data.py? Will we be maintaining two formats, the existing one used for QLoRA and the additional new format proposed above for full training?
Yes, you are spot on @oindrillac. See sdg/src/instructlab/sdg/generate_data.py, lines 80 to 98 (at 45ecc73).
We would want to have two separate outputs: one in the format the CLI expects for legacy training, and one in the messages format for the new training path.
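A hedged sketch of what writing both outputs side by side could look like (the file names, field names, and helper are hypothetical, not the actual generate_data.py code):

```python
import json
import os


def write_train_outputs(samples, output_dir):
    """Write a legacy-style file and a messages-style file next to each other."""
    legacy_path = os.path.join(output_dir, "train_gen.jsonl")       # legacy CLI training format
    messages_path = os.path.join(output_dir, "messages_gen.jsonl")  # new messages format

    with open(legacy_path, "w", encoding="utf-8") as legacy_f, \
         open(messages_path, "w", encoding="utf-8") as messages_f:
        for sample in samples:
            # Legacy format: keep the per-field keys the existing trainers consume.
            legacy_f.write(json.dumps({
                "system": sample.get("system", ""),
                "user": sample["question"],
                "assistant": sample["answer"],
            }) + "\n")
            # Messages format: fold the same fields into a chat-style list.
            messages_f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": sample["question"]},
                    {"role": "assistant", "content": sample["answer"]},
                ],
            }) + "\n")
```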
Cool, and based on whether …
Yes!
Or we could just always produce both?
For what it's worth, on the …
It would be in the same directory, and we can produce both formats as per @russellb's suggestion.
Currently the generated synthetic data has the question/answer/context in its own columns, while the new training API assumes the datasets are formatted in the messages format.
We will need a simple util function to run post-generation to handle the conversion (a sketch follows below).
Columns to have:
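A minimal sketch of such a post-generation util, assuming the generated dataset has `question`, `response`, and `context` columns (the column names are assumptions; the exact schema is whatever the training library settles on):

```python
from datasets import Dataset


def _to_messages(sample):
    # Fold context into the user turn when it is present.
    user_content = sample["question"]
    if sample.get("context"):
        user_content = f'{sample["context"]}\n{sample["question"]}'
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": sample["response"]},
        ],
    }


def convert_to_messages(ds: Dataset) -> Dataset:
    # map() adds the new "messages" column; remove_columns drops the per-field ones.
    return ds.map(_to_messages, remove_columns=["question", "response", "context"])
```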