Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow hf dataset id to be passed via training_data_path #431

Merged

Conversation

dushyantbehl
Copy link
Contributor

@dushyantbehl dushyantbehl commented Jan 2, 2025

Description of the change

Recent commit d7f06f5 introduced support for hf dataset id inside data_config but the code had a minor fix needed to support hf dataset id being passed via training_data_path.

With this patch code now supports hf dataset id via training_data_path argument.

Related issue number

How to verify the PR

Tested with an hf dataset and added a unit test for the same.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@dushyantbehl dushyantbehl changed the title fix bug which caused error while loading hf dataset id fix: bug which caused error while loading hf dataset id Jan 2, 2025
Copy link

github-actions bot commented Jan 2, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

1 similar comment
Copy link

github-actions bot commented Jan 2, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the fix label Jan 2, 2025
@dushyantbehl
Copy link
Contributor Author

cc @anhuong @Abhishek-TAMU

@dushyantbehl dushyantbehl changed the title fix: bug which caused error while loading hf dataset id feat: Allow hf dataset id to be passed via training_data_path Jan 2, 2025
@github-actions github-actions bot added feat and removed fix labels Jan 2, 2025
@@ -130,7 +130,7 @@ def replace_text(match_obj):
if index_object not in element:
raise KeyError("Requested template string is not a valid key in dict")

return element[index_object]
return str(element[index_object])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dataset which I have tested in my unit test is lhoestq/demo1 which is used with the formatting template "### Text:{{review}} \n\n### Stars: {{star}}"
Now if you see the dataset the field star is an int which would cause an error if this element was not a string.

so this change supports any dataset formatting with dataset row types other than string as well.

dushyantbehl and others added 2 commits January 2, 2025 19:27
Signed-off-by: Dushyant Behl <[email protected]>
Signed-off-by: Abhishek <[email protected]>
Copy link
Collaborator

@Abhishek-TAMU Abhishek-TAMU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dushyantbehl for the code refactor to _try_load_dataset and adding support for hf dataset id via training_data_path. Refactored changes look good to me. Just small nit which I have pushed here.

@dushyantbehl
Copy link
Contributor Author

changes look good to me @Abhishek-TAMU. Can you please also check why the image build failed post the change.

Copy link
Collaborator

@willmj willmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as well

@willmj
Copy link
Collaborator

willmj commented Jan 2, 2025

@dushyantbehl a new release of fms-acceleration just came out which may be causing this problem with the image build.

@anhuong
Copy link
Collaborator

anhuong commented Jan 2, 2025

I also verified that this works with a different dataset to load from HF, thanks Dushyant!

Copy link
Collaborator

@Abhishek-TAMU Abhishek-TAMU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after the image build passes successfully!

@Abhishek-TAMU Abhishek-TAMU merged commit 3dc8ef7 into foundation-model-stack:main Jan 4, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants