Checking permutations of all columns doesn't scale #12

sebastko · 2022-10-21T18:14:49Z

Hi,
The current code compares data frames by trying all permutations of the columns, without using the column names. I guess this is OK for a very small number of columns, but it doesn't scale to some other datasets: in the KaggleDBQA dataset, for example, some queries return 26 columns. 26! = 403,291,461,126,605,635,584,000,000 (about 4 × 10^26), which is pretty much infinity :) The evaluation just hangs.
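To put numbers on the growth, here is a quick sketch using only the standard library. The helper name `column_orderings` is made up for illustration and is not part of the repository.

```python
import math


def column_orderings(n_cols: int) -> int:
    """Number of column orderings a brute-force matcher may have to try."""
    return math.factorial(n_cols)


if __name__ == "__main__":
    # Print how quickly the search space grows with the column count.
    for n_cols in (3, 5, 10, 26):
        print(f"{n_cols:>2} columns -> {column_orderings(n_cols):,} orderings")
```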

For my local experiments w/ KaggleDBQA, I changed the matching code to build Pandas DataFrames with named columns, then sort the columns by name before comparing the DataFrame values. The column names don't have to match exactly between ground truth and prediction, but their sorted order needs to match.
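Roughly, the idea looks like the sketch below. It is not my actual patch: to keep it dependency-free, plain lists of names and row tuples stand in for the Pandas DataFrames, the function names are hypothetical, and it assumes rows are compared in order (the real evaluation may treat row order differently).

```python
from typing import Any, List, Sequence, Tuple

Row = Sequence[Any]


def sort_columns_by_name(names: Sequence[str], rows: Sequence[Row]) -> List[Tuple[Any, ...]]:
    """Reorder every row so its columns follow the alphabetical order of `names`."""
    # Indices of the columns, ordered by column name (stable for duplicate names).
    order = sorted(range(len(names)), key=lambda i: names[i])
    return [tuple(row[i] for i in order) for row in rows]


def results_match_sorted(
    gold_names: Sequence[str],
    gold_rows: Sequence[Row],
    pred_names: Sequence[str],
    pred_rows: Sequence[Row],
) -> bool:
    """Compare two result sets after sorting each one's columns by name.

    The names themselves are never compared, only the values that end up
    in each position once both sides are sorted.
    """
    if len(gold_names) != len(pred_names):
        return False
    gold = sort_columns_by_name(gold_names, gold_rows)
    pred = sort_columns_by_name(pred_names, pred_rows)
    return gold == pred
```

The cost is one sort per side, O(n log n) in the number of columns, instead of up to n! comparisons.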

I wanted to ask here if you'd be willing to accept such a change in your repository. If yes, I can clean up the code and publish a PR.

If you're worried that it may change the results on the existing datasets, perhaps we could still do the column permutation check if the number of columns is low (maybe up to 5 or so, or whatever the maximum number of returned columns is in the datasets you officially support). What do you think?
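A minimal sketch of that fallback, under the same assumptions as before (plain lists instead of DataFrames, ordered row comparison). The threshold `MAX_PERMUTATION_COLUMNS = 5` is only the placeholder value floated above, and all names are hypothetical.

```python
from itertools import permutations
from typing import Any, List, Sequence, Tuple

Row = Sequence[Any]
MAX_PERMUTATION_COLUMNS = 5  # placeholder threshold, to be tuned per dataset


def _reorder(rows: Sequence[Row], order: Sequence[int]) -> List[Tuple[Any, ...]]:
    """Return the rows with their columns rearranged according to `order`."""
    return [tuple(row[i] for i in order) for row in rows]


def _match_by_permutation(gold_rows: Sequence[Row], pred_rows: Sequence[Row], n_cols: int) -> bool:
    """Brute force: accept if any column ordering of the prediction equals gold."""
    gold = [tuple(row) for row in gold_rows]
    return any(_reorder(pred_rows, order) == gold for order in permutations(range(n_cols)))


def _match_by_sorted_names(gold_names, gold_rows, pred_names, pred_rows) -> bool:
    """Cheap path: sort each side's columns by name, then compare values."""
    gold_order = sorted(range(len(gold_names)), key=lambda i: gold_names[i])
    pred_order = sorted(range(len(pred_names)), key=lambda i: pred_names[i])
    return _reorder(gold_rows, gold_order) == _reorder(pred_rows, pred_order)


def results_match_hybrid(gold_names, gold_rows, pred_names, pred_rows,
                         max_perm_cols: int = MAX_PERMUTATION_COLUMNS) -> bool:
    """Use permutations for narrow results, name-sorted comparison for wide ones."""
    n_cols = len(gold_names)
    if n_cols != len(pred_names):
        return False
    if n_cols <= max_perm_cols:
        return _match_by_permutation(gold_rows, pred_rows, n_cols)
    return _match_by_sorted_names(gold_names, gold_rows, pred_names, pred_rows)
```

With this split, results at or below the threshold are scored exactly as they are today, and only wider results take the name-sorted path.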

