-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data transformation capability to dataset retrieval step #385
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! This looks good but I have some suggested changes.
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
Hi @saum7800 I can take another look when you've had a moment to make the above revisions! |
Hey @neubig , I have resolved the comments that seemed like easy fixes, and left comments for a couple of them we can discuss. Please re-review whenever you get a chance. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for making the changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few changes required, particularly about deduplicating functions that do similar things and improving function/variable names
if "train" in dataset: | ||
dataset = dataset["train"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but it suggests that this function is not well-typed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I had to do it because certain ones had a train
split that had to be accessed to access samples, whereas some had direct access to samples. either way, I suggest we keep this as it is for now and resolve in a future PR, since that is not the focus for the PR. have tested that this works for now
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
for dataset in sorted_list: | ||
print(f"Trying {dataset.name}") | ||
try: | ||
canonicalized_dataset = self.canonicalize_dataset_auto( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that now we're conflating two concepts:
- canonicalization (which I thought of as just column selection)
- transformation (which may make major changes to the dataset's contents, thus not really making it more canonical, according to the definition of that word)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have merged such that reranking happens separately and colsel+ (if user specifies) transformation separately. we can discuss if any renaming is required in your re-review.
prompt2model/dataset_retriever/description_dataset_retriever.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished my first pass here.
It looks like you might need to pull/merge latest main
and update your base branch, since some of the files being modified here (e.g. parse_json_responses.py
) don't exist on main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I've left a few minor comments, all focusing on superficial things. The general design and code quality is great.
Feel free to merge this after addressing the few remaining comments.
Description
We are adding a new component
DatasetTransformer
. The code currently contains one version ofDatasetTransformer
:PromptBasedDatasetTransformer
. It is used with the Dataset Retrieval step in order to transform retrieved data to a format that is more directly relevant to the task given by a user. Here is the broad flow:PromptBasedDatasetTransformer
object, and call thetransform_data
function with thePromptSpec
and the loaded dataset.transform_data
creates a prompt and calls the APIAgent which returns a "plan" for carrying out the transformation. This prompt uses the prompt_spec and example rows of the retrieved dataset.PromptSpec
. It then requests a batch complete from the APIAgent for all the transform_prompts.