converts dataset format messages required for training #94

oindrillac · 2024-07-08T18:23:42Z

This PR adds a function that converts dataset format to messages and saves an additional jsonl along with each generate and does that for both train and test dataset.

This resolves #60

Signed-off-by: Oindrilla Chatterjee <[email protected]>

src/instructlab/sdg/generate_data.py

aakankshaduggal

Thanks @oindrillac, tested and it gives out desired output files!

src/instructlab/sdg/generate_data.py

oindrillac · 2024-07-08T19:26:03Z

The e2e is failing, is it because I changed the name of the output file? Will re-instating the name back fix it?

russellb · 2024-07-08T22:27:18Z

The e2e is failing, is it because I changed the name of the output file? Will re-instating the name back fix it?

yes, changing the filenames would break it

russellb

The e2e failure should be resolved before merging, thanks!

github-actions · 2024-07-08T22:43:57Z

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

github-actions · 2024-07-08T22:56:55Z

e2e workflow succeeded on this PR: View run, congrats!

russellb · 2024-07-08T22:58:46Z

e2e workflow succeeded on this PR: View run, congrats!

just FYI -- this job runs the full pipeline, but does not run training, as we do not have the new training method (training library) working in CI yet.

oindrillac · 2024-07-09T15:24:38Z

Fixed the file names for E2E to pass. Seems like the E2E is failing on a training step.

oindrillac · 2024-07-09T15:27:22Z

e2e workflow succeeded on this PR: View run, congrats!

just FYI -- this job runs the full pipeline, but does not run training, as we do not have the new training method (training library) working in CI yet.

so is the e2e expected to fail right now?

oindrillac · 2024-07-09T21:15:01Z

Spoke to @RobotSail and we are not sure what the generated_*.jsonl form is being used for right now. Added a change to drop that file and keep train_ and test_ files intact to be compatible with the CLI.

oindrillac · 2024-07-09T21:21:03Z

ah seems like that seems to fix it, the CI passes

github-actions · 2024-07-09T21:30:37Z

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

github-actions · 2024-07-09T21:38:09Z

e2e workflow failed on this PR: View run, please investigate.

oindrillac · 2024-07-09T21:56:54Z

@russellb can you please take a quick look?

russellb · 2024-07-10T00:10:02Z

@russellb can you please take a quick look?

don't worry about that last failure. The CI job broke because of instructlab/instructlab#1471

fix is here: instructlab/instructlab#1645

github-actions · 2024-07-10T11:14:06Z

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

github-actions · 2024-07-10T11:20:39Z

e2e workflow failed on this PR: View run, please investigate.

russellb · 2024-07-10T11:59:06Z

e2e workflow failed on this PR: View run, please investigate.

Ignore again. Still broken related to instructlab/instructlab#1471

fix v2 -- instructlab/instructlab#1650

github-actions · 2024-07-10T13:56:37Z

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

github-actions · 2024-07-10T14:06:00Z

e2e workflow failed on this PR: View run, please investigate.

russellb · 2024-07-10T14:30:26Z

e2e workflow failed on this PR: View run, please investigate.

sigh. continue to ignore these.

oindrillac · 2024-07-10T15:23:12Z

I think this is ready for merge. @russellb is this good to go from your end

russellb · 2024-07-10T18:48:41Z

I think this is ready for merge. @russellb is this good to go from your end

yeah, you don't need to block on me. I'll run the full sdg CI workflow again on this repo when it's fixed. Thanks for checking!

russellb · 2024-07-10T18:50:35Z

... though I just looked and it's still removing one of the files that used to be created? (the one with a generated_ prefix)

but maybe nothing cares

previously requested changes were addressed, except for the removal of the "generated_" file, but i'm not sure if there is code anywhere that cares

oindrillac · 2024-07-10T18:51:37Z

yeah, it is removing that one since it was a duplicate, does not seem to be needed on the CLI from when @RobotSail and I checked, hopefully its just a duplicate

russellb · 2024-07-10T18:52:01Z

yeah, it is removing that one since it was a duplicate, does not seem to be needed on the CLI from when @RobotSail and I checked, hopefully its just a duplicate

I see it used in the CLI repo in functional_tests.sh

https://github.com/instructlab/instructlab/blob/e1699cf69fe70e6db58c938e14e32f1f2a9e3f2b/scripts/functional-tests.sh#L306

so if I'm reading this right, this will break all of the unit+functional test jobs on the CLI repo once we make an sdg library release with this change as-is

oindrillac · 2024-07-10T18:53:35Z

functional_tests.sh

Thanks for finding it. Maybe that should also use the train_.jsonl, but that should be tracked as a cli issue, I can keep it here then

github-actions · 2024-07-11T13:53:25Z

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

github-actions · 2024-07-11T14:05:47Z

e2e workflow succeeded on this PR: View run, congrats!

russellb · 2024-07-11T14:11:40Z

both the simple and full SDG pipelines are now passing in CI based on this PR, so I'm good with it. thanks everyone!

RobotSail

LGTM

…ab-app InstructLab macOS App

oindrillac added 2 commits July 8, 2024 14:16

added method to convert to messages for training

47db7d7

Signed-off-by: Oindrilla Chatterjee <[email protected]>

save an additional train dataset in the converted format

8054981

Signed-off-by: Oindrilla Chatterjee <[email protected]>

oindrillac mentioned this pull request Jul 8, 2024

converts dataset format messages required for training #92

Closed

oindrillac requested review from aakankshaduggal, RobotSail, russellb and shivchander July 8, 2024 18:25

mergify bot added the ci-failure label Jul 8, 2024

oindrillac force-pushed the messages branch from 7f502dc to 4554ce4 Compare July 8, 2024 18:27

mergify bot added ci-failure and removed ci-failure labels Jul 8, 2024

russellb requested changes Jul 8, 2024

View reviewed changes

src/instructlab/sdg/generate_data.py Outdated Show resolved Hide resolved

src/instructlab/sdg/generate_data.py Show resolved Hide resolved

oindrillac force-pushed the messages branch from 4554ce4 to 42f3802 Compare July 8, 2024 18:56

mergify bot removed the ci-failure label Jul 8, 2024

aakankshaduggal approved these changes Jul 8, 2024

View reviewed changes

shivchander approved these changes Jul 8, 2024

View reviewed changes

src/instructlab/sdg/generate_data.py Show resolved Hide resolved

mergify bot added the ci-failure label Jul 8, 2024

russellb previously requested changes Jul 8, 2024

View reviewed changes

oindrillac force-pushed the messages branch from 42f3802 to 7a54f78 Compare July 8, 2024 23:00

mergify bot added ci-failure and removed ci-failure labels Jul 8, 2024

mergify bot added ci-failure and removed ci-failure labels Jul 9, 2024

oindrillac force-pushed the messages branch from 886f304 to b0fcf32 Compare July 9, 2024 20:52

mergify bot removed the ci-failure label Jul 9, 2024

oindrillac requested a review from russellb July 9, 2024 21:21

aakankshaduggal approved these changes Jul 10, 2024

View reviewed changes

russellb approved these changes Jul 11, 2024

View reviewed changes

russellb merged commit 7bf1563 into instructlab:main Jul 11, 2024
11 checks passed

RobotSail reviewed Jul 19, 2024

View reviewed changes

jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024

Merge pull request instructlab#94 from instructlab/jjasghar/instructl…

c62403b

…ab-app InstructLab macOS App

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

converts dataset format messages required for training #94

converts dataset format messages required for training #94

oindrillac commented Jul 8, 2024

aakankshaduggal left a comment

oindrillac commented Jul 8, 2024

russellb commented Jul 8, 2024

russellb left a comment

github-actions bot commented Jul 8, 2024

github-actions bot commented Jul 8, 2024

russellb commented Jul 8, 2024

oindrillac commented Jul 9, 2024

oindrillac commented Jul 9, 2024

oindrillac commented Jul 9, 2024

oindrillac commented Jul 9, 2024

github-actions bot commented Jul 9, 2024

github-actions bot commented Jul 9, 2024

oindrillac commented Jul 9, 2024

russellb commented Jul 10, 2024 •

edited

Loading

github-actions bot commented Jul 10, 2024

github-actions bot commented Jul 10, 2024

russellb commented Jul 10, 2024 •

edited

Loading

github-actions bot commented Jul 10, 2024

github-actions bot commented Jul 10, 2024

russellb commented Jul 10, 2024

oindrillac commented Jul 10, 2024

russellb commented Jul 10, 2024

russellb commented Jul 10, 2024

oindrillac commented Jul 10, 2024

russellb commented Jul 10, 2024 •

edited

Loading

oindrillac commented Jul 10, 2024 •

edited

Loading

github-actions bot commented Jul 11, 2024

github-actions bot commented Jul 11, 2024

russellb commented Jul 11, 2024

RobotSail left a comment

converts dataset format messages required for training #94

converts dataset format messages required for training #94

Conversation

oindrillac commented Jul 8, 2024

aakankshaduggal left a comment

Choose a reason for hiding this comment

oindrillac commented Jul 8, 2024

russellb commented Jul 8, 2024

russellb left a comment

Choose a reason for hiding this comment

github-actions bot commented Jul 8, 2024

github-actions bot commented Jul 8, 2024

russellb commented Jul 8, 2024

oindrillac commented Jul 9, 2024

oindrillac commented Jul 9, 2024

oindrillac commented Jul 9, 2024

oindrillac commented Jul 9, 2024

github-actions bot commented Jul 9, 2024

github-actions bot commented Jul 9, 2024

oindrillac commented Jul 9, 2024

russellb commented Jul 10, 2024 • edited Loading

github-actions bot commented Jul 10, 2024

github-actions bot commented Jul 10, 2024

russellb commented Jul 10, 2024 • edited Loading

github-actions bot commented Jul 10, 2024

github-actions bot commented Jul 10, 2024

russellb commented Jul 10, 2024

oindrillac commented Jul 10, 2024

russellb commented Jul 10, 2024

russellb commented Jul 10, 2024

oindrillac commented Jul 10, 2024

russellb commented Jul 10, 2024 • edited Loading

oindrillac commented Jul 10, 2024 • edited Loading

github-actions bot commented Jul 11, 2024

github-actions bot commented Jul 11, 2024

russellb commented Jul 11, 2024

RobotSail left a comment

Choose a reason for hiding this comment

russellb commented Jul 10, 2024 •

edited

Loading

russellb commented Jul 10, 2024 •

edited

Loading

russellb commented Jul 10, 2024 •

edited

Loading

oindrillac commented Jul 10, 2024 •

edited

Loading