Column Selection: Improvements #366

ritugala · 2023-10-07T00:49:30Z

Description

I tried to add per-column information from Hugging Face, but couldn't find anything relevant, so I added the dataset description instead and made the corresponding changes in the base prompt.
Updated the column selection prompt by (1) adding an example where the task and the chosen dataset are mismatched, so there are no relevant columns (e.g., a machine translation dataset for a summarization task), and (2) swapping an old example for a new one.
Fixed a bug in truncate_row: the "..." is now only added if len(row) > max_length.
Added support for datasets with nested columns, such as opus100, so that these datasets can be used and we don't try to generate a dataset. The solution was straightforward: Hugging Face's flatten() function plus some trivial column preprocessing.
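For readers without the diff at hand, the last two items can be sketched roughly as follows. This is an illustration, not the code from this PR: `truncate_text` and `flatten_row` are hypothetical names, the real `truncate_row` signature is not shown here, and `flatten_row` only imitates the dotted column names that Hugging Face's `flatten()` produces for nested columns.

```python
def truncate_text(text: str, max_length: int = 50) -> str:
    """Cut text to max_length characters, appending "..." only if it was cut."""
    if len(text) > max_length:
        return text[:max_length] + "..."
    # Text already fits, so no ellipsis is added.
    return text


def flatten_row(row: dict, parent: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. translation.en."""
    flat = {}
    for key, value in row.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested columns, carrying the dotted prefix.
            flat.update(flatten_row(value, name))
        else:
            flat[name] = value
    return flat


# An opus100-style row: one nested "translation" column.
row = {"translation": {"en": "Hello", "fr": "Bonjour"}}
flat = flatten_row(row)
# Trivial preprocessing: replace "." with "_" in column names.
renamed = {name.replace(".", "_"): value for name, value in flat.items()}
```

With this sketch, `renamed` has the keys `translation_en` and `translation_fr`, which can then be offered to the column selection prompt like any other flat columns.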

zhaochenyang20

Generally well-structured

viswavi

Looks good to me, just left two very minor suggestions! Feel free to merge after making these changes.

prompt2model/dataset_retriever/column_selection_prompt.py

viswavi · 2023-10-09T19:37:32Z

prompt2model/dataset_retriever/description_dataset_retriever.py

        if "train" not in dataset:
            raise ValueError("The dataset must contain a `train` split.")
+
+        columns_mapping = {
+            col: col.replace(".", "_") for col in dataset["train"].column_names


What will happen if there is a column conflict here? For example, if we have two columns, foo.baz and foo_baz, I think both will map to foo_baz, so they will conflict and one of the columns will be erased.

I know this is unlikely but might be good to check and handle this scenario.
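One possible way to handle this, sketched under assumptions: `build_column_mapping` is a hypothetical helper, not something in this PR, and raising a `ValueError` is only one choice (skipping the dataset or adding a suffix would also work).

```python
from collections import defaultdict


def build_column_mapping(column_names: list[str]) -> dict[str, str]:
    """Map each column to its "." -> "_" name, failing loudly on collisions."""
    mapping = {col: col.replace(".", "_") for col in column_names}

    # Group the original names by the name they would be renamed to.
    sources = defaultdict(list)
    for old_name, new_name in mapping.items():
        sources[new_name].append(old_name)

    # Any target name with more than one source is a conflict.
    conflicts = {new: olds for new, olds in sources.items() if len(olds) > 1}
    if conflicts:
        raise ValueError(f"Column names collide after renaming: {conflicts}")
    return mapping
```

Called with `["foo.baz", "foo_baz"]` this raises instead of silently dropping a column; with non-conflicting names it returns the same mapping as the dict comprehension in the diff.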

viswavi

@ritugala LGTM. Please test this in the CLI demo before merging.

ritugala added 2 commits October 6, 2023 16:36

change one example for column selection

e3a2f47

updated prompt and fixed truncate bug

60dd47f

ritugala marked this pull request as ready for review October 7, 2023 01:13

ritugala requested review from neubig, zhaochenyang20 and viswavi October 7, 2023 01:15

ritugala changed the title ~~Ritu column selection huggingface~~ Column Selection: Improvements Oct 7, 2023

zhaochenyang20 reviewed Oct 7, 2023

viswavi approved these changes Oct 9, 2023

included modifications

a088e1a

viswavi approved these changes Oct 13, 2023

ritugala merged commit 9ef37f7 into main Oct 13, 2023

ritugala deleted the ritu-column-selection-huggingface branch October 13, 2023 18:50
