Dataset size on CLI #345

Merged 12 commits into neulab:main on Sep 14, 2023

Conversation

@hitenvidhani (Contributor) commented Sep 10, 2023

Description

Show the dataset size on the CLI by fetching the size at runtime.

References

#253

Blocked by

NA

@hitenvidhani marked this pull request as draft September 10, 2023 20:23
@zhaochenyang20 (Collaborator) left a comment

I have a quick question. Is this API call always reliable? I don't see any error handling around the API call.
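
For illustration, a minimal sketch of what an error handler around that call could look like, assuming the size lookup goes through an HTTP helper like query built on requests; the helper name, timeout, and fallback value are assumptions, not the PR's actual code:

    import logging

    import requests


    def query(api_url: str) -> dict:
        """Fetch JSON from the datasets-server API, returning {} on any failure."""
        try:
            response = requests.get(api_url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.RequestException, ValueError) as e:
            # Log and fall back to an empty dict so callers can display "NA".
            logging.error("Dataset size lookup failed for %s: %s", api_url, e)
            return {}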

@viswavi (Collaborator) commented Sep 11, 2023

I tested this out and it seems reasonably fast (~0.1s per dataset) and accurate! While this will slow down our script by ~3 seconds for 25 datasets, I think that's ok.

+1 to Chenyang's point about API error handling: we should have default behavior if there are exceptions, e.g. log the error and return NaN as the size.

Can you also add a test case for this? I can add the test in a separate PR if you're not sure how to go about doing this in our repo. I'd suggest you just test get_dataset_size by mocking the execution of prompt2model.utils.dataset_utils.query with unittest.mock.patch.
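
For reference, a minimal sketch of such a test, assuming query returns the parsed JSON from the datasets-server /size endpoint (shaped like the response shown later in this thread) and get_dataset_size formats the reported num_bytes_memory as a two-decimal MB string; the payload shape and return format are assumptions:

    from unittest.mock import patch

    from prompt2model.utils import dataset_utils


    @patch("prompt2model.utils.dataset_utils.query")
    def test_get_dataset_size(mock_query):
        # Fake a datasets-server /size response; only the field we need is filled in.
        mock_query.return_value = {
            "size": {"dataset": {"num_bytes_memory": 80443172}}
        }
        # 80443172 bytes / 1024 / 1024 ≈ 76.72 MB.
        assert dataset_utils.get_dataset_size("yulongmannlp/dev_para") == "76.72"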

@hitenvidhani marked this pull request as ready for review September 11, 2023 20:38
@viswavi (Collaborator) left a comment

This is great work @hitenvidhani! I'm testing this now locally but if it works, this will be good to merge 😄

@viswavi self-requested a review September 12, 2023 22:56
@viswavi (Collaborator) commented Sep 12, 2023

@hitenvidhani Actually, it seems that something is not working correctly when I built and tested this locally. After starting dataset retrieval, I see:

Here are the datasets I've retrieved for you:
#	Name	Size[MB]	Description
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_para', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 556068, 'num_bytes_memory': 1097064, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
1):	yulongmannlp/dev_para	76.72	Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_orig', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 553991, 'num_bytes_memory': 1095012, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
2):	yulongmannlp/dev_orig	76.71	Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

@viswavi (Collaborator) left a comment

Found the cause of the incorrectly-displayed search results. Fix that and we're good to merge.

@hitenvidhani (Contributor, Author) commented

Thank you @viswavi, I have pushed the requested changes. Sorry about that print statement; it was added for debugging.

@zhaochenyang20 (Collaborator) left a comment

    return (
        "NA"
        if size_dict == {}
        else "{:.2f}".format(size_dict["dataset"]["num_bytes_memory"] / 1024 / 1024)
    )

There are two suggested new unit tests (sketched below):

  1. Test the case where the API call fails.
  2. Test the case where the returned dict does not contain the expected field.
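
Hedged sketches of those two cases, assuming query returns an empty dict when the API call fails and that get_dataset_size falls back to "NA" whenever the size information is missing; the fallback behavior and payload shapes are assumptions:

    from unittest.mock import patch

    from prompt2model.utils import dataset_utils


    @patch("prompt2model.utils.dataset_utils.query", return_value={})
    def test_get_dataset_size_api_failure(mock_query):
        # 1. The API call fails, so the helper should return the "NA" placeholder.
        assert dataset_utils.get_dataset_size("nonexistent/dataset") == "NA"


    @patch("prompt2model.utils.dataset_utils.query")
    def test_get_dataset_size_missing_field(mock_query):
        # 2. The response parses but lacks the expected num_bytes_memory field.
        mock_query.return_value = {"size": {"dataset": {}}}
        assert dataset_utils.get_dataset_size("some/dataset") == "NA"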

prompt2model/utils/dataset_utils.py (outdated review thread, resolved)
@neubig (Collaborator) commented Sep 13, 2023

Hey @zhaochenyang20 , thanks for the suggestion!

One counter-suggestion: because @hitenvidhani is a first-time contributor and we've already gone back and forth on this PR several times, maybe we can merge the PR for now and then do a follow-up PR for unit tests? Unit tests are important, but it's also important that we welcome new contributors (and we do!), so we can make the process a little simpler this time.

@viswavi self-requested a review September 13, 2023 13:18
@viswavi (Collaborator) left a comment

I think @zhaochenyang20's small suggestion (for the method docstring) is good. I agree with Graham that we can handle the added tests ourselves.

Once you respond to Chenyang's suggestion in the docstring, LGTM!

@zhaochenyang20 (Collaborator) commented

In this case, I'm going to merge it and we'll add the unit tests ourselves. 🤔

@zhaochenyang20 (Collaborator) left a comment

This generally looks good to me!

@neubig merged commit c39b68a into neulab:main Sep 14, 2023