Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

converts dataset format messages required for training #94

Merged
merged 3 commits into from
Jul 11, 2024

Conversation

oindrillac
Copy link
Contributor

This PR adds a function that converts dataset format to messages and saves an additional jsonl along with each generate and does that for both train and test dataset.

This resolves #60

Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @oindrillac, tested and it gives out desired output files!

@mergify mergify bot added the ci-failure label Jul 8, 2024
@oindrillac
Copy link
Contributor Author

The e2e is failing, is it because I changed the name of the output file? Will re-instating the name back fix it?

@russellb
Copy link
Member

russellb commented Jul 8, 2024

The e2e is failing, is it because I changed the name of the output file? Will re-instating the name back fix it?

yes, changing the filenames would break it

Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The e2e failure should be resolved before merging, thanks!

Copy link

github-actions bot commented Jul 8, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

Copy link

github-actions bot commented Jul 8, 2024

e2e workflow succeeded on this PR: View run, congrats!

@russellb
Copy link
Member

russellb commented Jul 8, 2024

e2e workflow succeeded on this PR: View run, congrats!

just FYI -- this job runs the full pipeline, but does not run training, as we do not have the new training method (training library) working in CI yet.

@oindrillac
Copy link
Contributor Author

Fixed the file names for E2E to pass. Seems like the E2E is failing on a training step.

@oindrillac
Copy link
Contributor Author

e2e workflow succeeded on this PR: View run, congrats!

just FYI -- this job runs the full pipeline, but does not run training, as we do not have the new training method (training library) working in CI yet.

so is the e2e expected to fail right now?

@mergify mergify bot added ci-failure and removed ci-failure labels Jul 9, 2024
@oindrillac
Copy link
Contributor Author

Spoke to @RobotSail and we are not sure what the generated_*.jsonl form is being used for right now. Added a change to drop that file and keep train_ and test_ files intact to be compatible with the CLI.

@oindrillac
Copy link
Contributor Author

ah seems like that seems to fix it, the CI passes

@oindrillac oindrillac requested a review from russellb July 9, 2024 21:21
Copy link

github-actions bot commented Jul 9, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

Copy link

github-actions bot commented Jul 9, 2024

e2e workflow failed on this PR: View run, please investigate.

@oindrillac
Copy link
Contributor Author

@russellb can you please take a quick look?

@russellb
Copy link
Member

russellb commented Jul 10, 2024

@russellb can you please take a quick look?

don't worry about that last failure. The CI job broke because of instructlab/instructlab#1471

fix is here: instructlab/instructlab#1645

Copy link

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

Copy link

e2e workflow failed on this PR: View run, please investigate.

@russellb
Copy link
Member

russellb commented Jul 10, 2024

e2e workflow failed on this PR: View run, please investigate.

Ignore again. Still broken related to instructlab/instructlab#1471

fix v2 -- instructlab/instructlab#1650

Copy link

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

Copy link

e2e workflow failed on this PR: View run, please investigate.

@russellb
Copy link
Member

e2e workflow failed on this PR: View run, please investigate.

sigh. continue to ignore these.

@oindrillac
Copy link
Contributor Author

I think this is ready for merge. @russellb is this good to go from your end

@russellb
Copy link
Member

I think this is ready for merge. @russellb is this good to go from your end

yeah, you don't need to block on me. I'll run the full sdg CI workflow again on this repo when it's fixed. Thanks for checking!

@russellb
Copy link
Member

... though I just looked and it's still removing one of the files that used to be created? (the one with a generated_ prefix)

but maybe nothing cares

@russellb russellb dismissed their stale review July 10, 2024 18:51

previously requested changes were addressed, except for the removal of the "generated_" file, but i'm not sure if there is code anywhere that cares

@oindrillac
Copy link
Contributor Author

yeah, it is removing that one since it was a duplicate, does not seem to be needed on the CLI from when @RobotSail and I checked, hopefully its just a duplicate

@russellb
Copy link
Member

russellb commented Jul 10, 2024

yeah, it is removing that one since it was a duplicate, does not seem to be needed on the CLI from when @RobotSail and I checked, hopefully its just a duplicate

I see it used in the CLI repo in functional_tests.sh

https://github.com/instructlab/instructlab/blob/e1699cf69fe70e6db58c938e14e32f1f2a9e3f2b/scripts/functional-tests.sh#L306

so if I'm reading this right, this will break all of the unit+functional test jobs on the CLI repo once we make an sdg library release with this change as-is

@oindrillac
Copy link
Contributor Author

oindrillac commented Jul 10, 2024

functional_tests.sh

Thanks for finding it. Maybe that should also use the train_.jsonl, but that should be tracked as a cli issue, I can keep it here then

Copy link

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

Copy link

e2e workflow succeeded on this PR: View run, congrats!

@russellb
Copy link
Member

both the simple and full SDG pipelines are now passing in CI based on this PR, so I'm good with it. thanks everyone!

@russellb russellb merged commit 7bf1563 into instructlab:main Jul 11, 2024
11 checks passed
Copy link
Member

@RobotSail RobotSail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert generated dataset to messages format
5 participants