Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data transformation capability to dataset retrieval step #385

Merged
merged 11 commits into from
Jan 15, 2024

Conversation

saum7800
Copy link
Collaborator

@saum7800 saum7800 commented Dec 3, 2023

Description

We are adding a new component DatasetTransformer. The code currently contains one version of DatasetTransformer: PromptBasedDatasetTransformer. It is used with the Dataset Retrieval step in order to transform retrieved data to a format that is more directly relevant to the task given by a user. Here is the broad flow:

  1. Dataset retriever chooses a dataset
  2. Automatic column selection chooses relevant columns from the dataset
  3. We remove columns from the dataset that were not chosen by the automatic column selection
  4. We define a new PromptBasedDatasetTransformer object, and call the transform_data function with the PromptSpec and the loaded dataset.
  5. transform_data creates a prompt and calls the APIAgent which returns a "plan" for carrying out the transformation. This prompt uses the prompt_spec and example rows of the retrieved dataset.
  6. Next, it creates a list of "transform_prompts", where each transform prompt contains the "plan", one sample from the retrieved dataset, and the PromptSpec. It then requests a batch complete from the APIAgent for all the transform_prompts.
  7. Finally, we extract the "input" and "output" keys from each response, and canonicalize the inputs and outputs into a dataset.

@saum7800 saum7800 requested review from neubig and viswavi December 3, 2023 10:36
Copy link
Collaborator

@neubig neubig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! This looks good but I have some suggested changes.

prompt2model/dataset_transformer/prompt_based.py Outdated Show resolved Hide resolved
prompt2model/dataset_transformer/prompt_based.py Outdated Show resolved Hide resolved
prompt2model/dataset_transformer/prompt_based.py Outdated Show resolved Hide resolved
prompt2model_demo_auto.py Outdated Show resolved Hide resolved
prompt2model_demo_auto.py Outdated Show resolved Hide resolved
@neubig
Copy link
Collaborator

neubig commented Dec 13, 2023

Hi @saum7800 I can take another look when you've had a moment to make the above revisions!

@saum7800 saum7800 requested a review from neubig December 13, 2023 20:32
@saum7800
Copy link
Collaborator Author

Hey @neubig , I have resolved the comments that seemed like easy fixes, and left comments for a couple of them we can discuss. Please re-review whenever you get a chance. Thanks!

Copy link
Collaborator

@neubig neubig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for making the changes!

Copy link
Collaborator

@viswavi viswavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few changes required, particularly about deduplicating functions that do similar things and improving function/variable names

Comment on lines +134 to +135
if "train" in dataset:
dataset = dataset["train"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but it suggests that this function is not well-typed...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I had to do it because certain ones had a train split that had to be accessed to access samples, whereas some had direct access to samples. either way, I suggest we keep this as it is for now and resolve in a future PR, since that is not the focus for the PR. have tested that this works for now

for dataset in sorted_list:
print(f"Trying {dataset.name}")
try:
canonicalized_dataset = self.canonicalize_dataset_auto(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that now we're conflating two concepts:

  • canonicalization (which I thought of as just column selection)
  • transformation (which may make major changes to the dataset's contents, thus not really making it more canonical, according to the definition of that word)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have merged such that reranking happens separately and colsel+ (if user specifies) transformation separately. we can discuss if any renaming is required in your re-review.

prompt2model/dataset_transformer/base.py Outdated Show resolved Hide resolved
prompt2model/dataset_transformer/base.py Outdated Show resolved Hide resolved
prompt2model/dataset_transformer/prompt_based.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@viswavi viswavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished my first pass here.

It looks like you might need to pull/merge latest main and update your base branch, since some of the files being modified here (e.g. parse_json_responses.py) don't exist on main

prompt2model/utils/parse_json_responses.py Outdated Show resolved Hide resolved
tests/dataset_transformer_test.py Outdated Show resolved Hide resolved
prompt2model_demo.py Outdated Show resolved Hide resolved
prompt2model_demo.py Outdated Show resolved Hide resolved
@saum7800 saum7800 requested a review from viswavi January 14, 2024 20:28
Copy link
Collaborator

@viswavi viswavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I've left a few minor comments, all focusing on superficial things. The general design and code quality is great.

Feel free to merge this after addressing the few remaining comments.

prompt2model/dataset_transformer/base.py Outdated Show resolved Hide resolved
prompt2model/dataset_transformer/base.py Outdated Show resolved Hide resolved
prompt2model/dataset_transformer/prompt_based.py Outdated Show resolved Hide resolved
prompt2model/dataset_transformer/prompt_based.py Outdated Show resolved Hide resolved
prompt2model_demo.py Outdated Show resolved Hide resolved
prompt2model_demo.py Outdated Show resolved Hide resolved
@saum7800 saum7800 merged commit 67673b1 into main Jan 15, 2024
8 checks passed
@saum7800 saum7800 deleted the saumya_data_transform branch January 15, 2024 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants