Checking permutations of all columns doesn't scale #12

Open
sebastko opened this issue Oct 21, 2022 · 0 comments
Hi,
The current code compares data frames by trying every permutation of the columns, without using the column names. I guess this is OK for a very small number of columns, but it doesn't scale to other datasets: in the KaggleDBQA dataset, for example, some queries return 26 columns. 26! = 403,291,461,126,605,635,584,000,000, which is pretty much infinity :) The evaluation just hangs.
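To make the growth concrete, here is a small sketch that counts the column orderings a brute-force comparison would have to try in the worst case. The function name `permutation_count` is just an illustration and is not taken from the repository.

```python
import math


def permutation_count(num_columns: int) -> int:
    """Number of column orderings a brute-force comparison may need to try."""
    return math.factorial(num_columns)


# Print the worst-case number of orderings for a few column counts.
for n in (3, 5, 10, 26):
    print(f"{n} columns -> {permutation_count(n):,} orderings")
```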

For my local experiments with KaggleDBQA, I changed the matching code to build Pandas DataFrames with named columns, then sort the columns by name before comparing the DataFrame values. The column names don't have to match exactly between ground truth and prediction, but their order after sorting needs to match.
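A rough sketch of the idea, written with plain Python lists instead of Pandas so it has no dependencies. This is not the actual patch: the names `sort_columns_by_name` and `results_match` are hypothetical, and treating rows as an unordered multiset is an assumption about how the evaluation should behave.

```python
from collections import Counter


def sort_columns_by_name(columns, rows):
    """Reorder every row so its values follow the alphabetical column order."""
    order = sorted(range(len(columns)), key=lambda i: columns[i])
    return [tuple(row[i] for i in order) for row in rows]


def results_match(gold_columns, gold_rows, pred_columns, pred_rows):
    """Compare two result sets after sorting each one's columns by name.

    Assumption: row order is ignored, so rows are compared as multisets.
    The names themselves are never compared, only used to fix the order.
    """
    if len(gold_columns) != len(pred_columns):
        return False
    gold = sort_columns_by_name(gold_columns, gold_rows)
    pred = sort_columns_by_name(pred_columns, pred_rows)
    return Counter(gold) == Counter(pred)
```

Each result set is sorted once, so the cost grows with the number of columns and rows rather than with the factorial of the column count.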

I wanted to ask here if you'd be willing to accept such a change in your repository. If yes, I can clean up the code and publish a PR.

If you're worried that it may change the results on the existing datasets, perhaps we could still do the column permutation check if the number of columns is low (maybe up to 5 or so, or whatever the maximum number of returned columns is in the datasets you officially support). What do you think?
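The fallback could look roughly like the sketch below: try permutations for narrow results and switch to name-sorted comparison for wide ones. Everything here is hypothetical, including the function name `results_match_hybrid`, the threshold `MAX_PERMUTED_COLUMNS = 5` (only the "up to 5 or so" guess from above), and the choice to compare rows as multisets.

```python
from collections import Counter
from itertools import permutations

# Hypothetical cutoff; the right value depends on the supported datasets.
MAX_PERMUTED_COLUMNS = 5


def _reorder(rows, order):
    """Multiset of rows with their values rearranged into the given order."""
    return Counter(tuple(row[i] for i in order) for row in rows)


def results_match_hybrid(gold_columns, gold_rows, pred_columns, pred_rows,
                         max_permuted_columns=MAX_PERMUTED_COLUMNS):
    n = len(gold_columns)
    if n != len(pred_columns):
        return False
    if n <= max_permuted_columns:
        # Few columns: keep the permutation check, ignoring column names.
        gold = Counter(tuple(row) for row in gold_rows)
        return any(_reorder(pred_rows, order) == gold
                   for order in permutations(range(n)))
    # Many columns: sort each side's columns by name and compare once.
    gold_order = sorted(range(n), key=lambda i: gold_columns[i])
    pred_order = sorted(range(n), key=lambda i: pred_columns[i])
    return _reorder(gold_rows, gold_order) == _reorder(pred_rows, pred_order)
```

With the cutoff at 5, the permutation branch tries at most 120 orderings, so results on narrow queries would be decided the same way as before.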
