Dataset size on CLI #345

Merged 12 commits into neulab:main on Sep 14, 2023

Conversation

@hitenvidhani (Contributor) commented Sep 10, 2023

Description

Show the dataset size on the CLI by fetching the size at runtime.

References

#253

Blocked by

NA

@hitenvidhani marked this pull request as draft September 10, 2023 20:23
@zhaochenyang20 (Collaborator) left a comment

I have a quick question. Is this API call always reliable? I don't see any error handling around the API call.
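
For illustration, a minimal sketch of what an error handler around that call could look like, assuming the size lookup goes through an HTTP helper like query built on requests; the helper name, timeout, and fallback value are assumptions, not the PR's actual code:

    import logging

    import requests


    def query(api_url: str) -> dict:
        """Fetch JSON from the datasets-server API, returning {} on any failure."""
        try:
            response = requests.get(api_url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.RequestException, ValueError) as e:
            # Log and fall back to an empty dict so callers can display "NA".
            logging.error("Dataset size lookup failed for %s: %s", api_url, e)
            return {}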

@viswavi (Collaborator) commented Sep 11, 2023

I tested this out and it seems reasonably fast (~0.1s per dataset) and accurate! While this will slow down our script by ~3 seconds for 25 datasets, I think that's ok.

+1 to Chenyang's point about API error handling: we should have default behavior if there are exceptions, e.g. log the error and return NaN as the size.

Can you also add a test case for this? I can add the test in a separate PR if you're not sure how to go about doing this in our repo. I'd suggest you just test get_dataset_size by mocking the execution of prompt2model.utils.dataset_utils.query with unittest.mock.patch.
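
For reference, a minimal sketch of such a test, assuming query returns the parsed JSON from the datasets-server /size endpoint (shaped like the response shown later in this thread) and get_dataset_size formats the reported num_bytes_memory as a two-decimal MB string; the payload shape and return format are assumptions:

    from unittest.mock import patch

    from prompt2model.utils import dataset_utils


    @patch("prompt2model.utils.dataset_utils.query")
    def test_get_dataset_size(mock_query):
        # Fake a datasets-server /size response; only the field we need is filled in.
        mock_query.return_value = {
            "size": {"dataset": {"num_bytes_memory": 80443172}}
        }
        # 80443172 bytes / 1024 / 1024 ≈ 76.72 MB.
        assert dataset_utils.get_dataset_size("yulongmannlp/dev_para") == "76.72"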

@hitenvidhani marked this pull request as ready for review September 11, 2023 20:38
@viswavi (Collaborator) left a comment

This is great work @hitenvidhani! I'm testing this now locally but if it works, this will be good to merge 😄

@viswavi self-requested a review September 12, 2023 22:56
@viswavi (Collaborator) commented Sep 12, 2023

@hitenvidhani Actually, it seems that something is not working correctly when I built and tested this locally. After starting dataset retrieval, I see:

Here are the datasets I've retrieved for you:
#	Name	Size[MB]	Description
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_para', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'num_bytes_original_files': 31562059, 'num_bytes_parquet_files': 15014382, 'num_bytes_memory': 80443172, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_para', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 556068, 'num_bytes_memory': 1097064, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
1):	yulongmannlp/dev_para	76.72	Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
{'size': {'dataset': {'dataset': 'yulongmannlp/dev_orig', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661}, 'configs': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'num_bytes_original_files': 31560015, 'num_bytes_parquet_files': 15012305, 'num_bytes_memory': 80441120, 'num_rows': 88661, 'num_columns': 5}], 'splits': [{'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'train', 'num_bytes_parquet_files': 14458314, 'num_bytes_memory': 79346108, 'num_rows': 87599, 'num_columns': 5}, {'dataset': 'yulongmannlp/dev_orig', 'config': 'plain_text', 'split': 'validation', 'num_bytes_parquet_files': 553991, 'num_bytes_memory': 1095012, 'num_rows': 1062, 'num_columns': 5}]}, 'pending': [], 'failed': [], 'partial': False}
2):	yulongmannlp/dev_orig	76.71	Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

@viswavi (Collaborator) left a comment

Found the cause of the incorrectly-displayed search results. Fix that and we're good to merge.

@hitenvidhani (Contributor, Author) commented

Thank you @viswavi, I have pushed the requested changes. Sorry about that print statement; it was added for debugging.

@zhaochenyang20 (Collaborator) left a comment

    return (
        "NA"
        if size_dict == {}
        else "{:.2f}".format(size_dict["dataset"]["num_bytes_memory"] / 1024 / 1024)
    )

There are two suggested new unit tests (sketched below):

  1. Test the case where the API call fails.
  2. Test the case where the returned dict does not contain the expected field.
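
Hedged sketches of those two cases, assuming query returns an empty dict when the API call fails and that get_dataset_size falls back to "NA" whenever the size information is missing; the fallback behavior and payload shapes are assumptions:

    from unittest.mock import patch

    from prompt2model.utils import dataset_utils


    @patch("prompt2model.utils.dataset_utils.query", return_value={})
    def test_get_dataset_size_api_failure(mock_query):
        # 1. The API call fails, so the helper should return the "NA" placeholder.
        assert dataset_utils.get_dataset_size("nonexistent/dataset") == "NA"


    @patch("prompt2model.utils.dataset_utils.query")
    def test_get_dataset_size_missing_field(mock_query):
        # 2. The response parses but lacks the expected num_bytes_memory field.
        mock_query.return_value = {"size": {"dataset": {}}}
        assert dataset_utils.get_dataset_size("some/dataset") == "NA"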

prompt2model/utils/dataset_utils.py (outdated review thread, resolved)
@neubig (Collaborator) commented Sep 13, 2023

Hey @zhaochenyang20 , thanks for the suggestion!

One counter-suggestion: because @hitenvidhani is a first-time contributor and we've already gone back and forth on this PR several times, maybe we can merge the PR for now and then do a follow-up PR for unit tests? Unit tests are important, but it's also important that we welcome new contributors (and we do!), so we can make the process a little simpler this time.

@viswavi self-requested a review September 13, 2023 13:18
@viswavi (Collaborator) left a comment

I think @zhaochenyang20's small suggestion (for the method docstring) is good. I agree with Graham that we can handle the added tests ourselves.

Once you respond to Chenyang's suggestion in the docstring, LGTM!

@zhaochenyang20 (Collaborator) commented

In this case, I'm going to merge it and we'll add the unit tests ourselves. 🤔

@zhaochenyang20 (Collaborator) left a comment

This generally looks good to me!

@neubig merged commit c39b68a into neulab:main Sep 14, 2023