Column Selection: Improvements #366

Merged on Oct 13, 2023 (3 commits)

@ritugala (Collaborator) commented Oct 7, 2023

Description

  • Tried to add per-column information from Hugging Face, but I couldn't find anything particularly relevant, so I added the dataset description instead. Made the corresponding changes in the base prompt as well.
  • Updated the prompt for column selection: (1) added an example where the task and the chosen dataset are mismatched, so no columns are relevant (e.g., using a machine translation dataset for a summarization task), and (2) swapped an old example for a new one.
  • Fixed a bug in truncate_row: the "..." is now appended only if len(row) > max_length.
  • Added support for datasets with nested columns, such as opus100, so that these datasets can be used and we don't fall back to generating a dataset. The solution was straightforward: Hugging Face's flatten() function plus some trivial column preprocessing.
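
For illustration, the last two bullets can be sketched in plain Python. This is a hedged sketch, not the code from this PR: the names `truncate_row_sketch` and `flatten_row_sketch`, the dict-based row format, and the default `max_length` are all assumptions, and the second function only mimics the dotted column names that flatten() produces for nested columns rather than calling the library.

```python
def truncate_row_sketch(row, max_length=50):
    """Shorten long values; append "..." only when something was cut off.

    Hypothetical stand-in for truncate_row; the real signature may differ.
    """
    truncated = {}
    for key, value in row.items():
        text = str(value)
        if len(text) > max_length:
            # Only truncated values get the ellipsis marker.
            text = text[:max_length] + "..."
        truncated[key] = text
    return truncated


def flatten_row_sketch(row, parent=""):
    """Flatten nested dicts into dotted keys, e.g. translation.en.

    Plain-Python illustration of the naming scheme, not the library call.
    """
    flat = {}
    for key, value in row.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten_row_sketch(value, name))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    nested = {"translation": {"en": "hello", "fr": "bonjour"}}
    print(flatten_row_sketch(nested))
    print(truncate_row_sketch({"text": "a" * 60}, max_length=10))
```

A value whose length is exactly max_length is left unchanged, which is the boundary case the bug fix addresses.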

@ritugala marked this pull request as ready for review on October 7, 2023 01:13
@ritugala changed the title from "Ritu column selection huggingface" to "Column Selection: Improvements" on Oct 7, 2023
@zhaochenyang20 (Collaborator) left a comment

Generally well-structured

@viswavi (Collaborator) left a comment

Looks good to me, just left two very minor suggestions! Feel free to merge after making these changes.

prompt2model/dataset_retriever/column_selection_prompt.py (outdated, resolved)
    if "train" not in dataset:
        raise ValueError("The dataset must contain a `train` split.")

    # Map flattened column names such as "translation.en" to "translation_en".
    columns_mapping = {
        col: col.replace(".", "_") for col in dataset["train"].column_names
    }

What happens if there is a column name conflict here? For example, if the dataset has two columns, foo.baz and foo_baz, both end up named foo_baz, so I think one of the columns will be overwritten.

I know this is unlikely, but it might be good to check for and handle this scenario.
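
One way to guard against the conflict described above is to detect duplicate target names while building the mapping. This is only a sketch under assumptions: `build_column_mapping` is a hypothetical helper, not a function from this PR, and raising an error is just one possible policy (appending a suffix would be another).

```python
def build_column_mapping(column_names):
    """Map dotted column names to underscore names, rejecting collisions.

    Hypothetical helper for illustration; not part of the PR.
    """
    mapping = {}
    claimed = {}  # new name -> original column that already maps to it
    for col in column_names:
        new_name = col.replace(".", "_")
        if new_name in claimed:
            raise ValueError(
                f"Columns {claimed[new_name]!r} and {col!r} "
                f"both map to {new_name!r}."
            )
        claimed[new_name] = col
        mapping[col] = new_name
    return mapping


if __name__ == "__main__":
    print(build_column_mapping(["translation.en", "translation.fr"]))
    try:
        build_column_mapping(["foo.baz", "foo_baz"])
    except ValueError as err:
        print(f"Conflict detected: {err}")
```

Failing loudly keeps a silently dropped column from reaching the column selection prompt; whether the mapping is then applied with a rename call on the dataset is left to the actual implementation.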

@viswavi (Collaborator) left a comment

@ritugala LGTM. Please test this in the CLI demo before merging.

@ritugala merged commit 9ef37f7 into main on Oct 13, 2023
@ritugala deleted the ritu-column-selection-huggingface branch on October 13, 2023 18:50