
Add filter to show records with duplicate values in a set of columns #501

Closed
kgodey opened this issue Jul 28, 2021 · 14 comments · Fixed by #569
Labels: `needs: unblocking` (Blocked by other work), `work: backend` (Related to Python, Django, and simple SQL)

Comments

@kgodey (Contributor) commented Jul 28, 2021

Problem

The "Working with Columns" design spec adds a filter for duplicate values for a column, to assist users with resolving non-unique values when they want to set a unique constraint for a column.

Proposed solution

We need to add a filter for this to our Record API filters.

Additional context

@kgodey added this to the 07. Initial Data Types milestone Jul 28, 2021
@kgodey added the `ready` (Ready for implementation), `work: backend` (Related to Python, Django, and simple SQL), and `work: database` labels Jul 28, 2021
@mathemancer (Contributor) commented:

I think that simply finding the rows that have duplicate values should probably be its own issue in the backend. It will be a bit complex (the best way I know of involves a window function at the DB level).
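The window-function approach mentioned above can be sketched as follows. This is an illustration using SQLite's in-memory database rather than Mathesar's actual PostgreSQL backend; the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO people (name, email) VALUES (?, ?)",
    [("Ann", "a@x.com"), ("Bob", "b@x.com"), ("Ann", "a@x.com"), ("Cay", "c@x.com")],
)

# COUNT(*) OVER a partition of the candidate columns tags every row with
# the size of its duplicate group; rows with a count > 1 are duplicates.
# Unlike a plain GROUP BY, this keeps all columns of each row available.
rows = conn.execute("""
    SELECT id, name, email FROM (
        SELECT id, name, email,
               COUNT(*) OVER (PARTITION BY name, email) AS n
        FROM people
    ) WHERE n > 1
    ORDER BY id
""").fetchall()
print(rows)  # -> [(1, 'Ann', 'a@x.com'), (3, 'Ann', 'a@x.com')]
```

The same `COUNT(*) OVER (PARTITION BY ...)` construct works in PostgreSQL.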

@kgodey (Contributor, Author) commented Jul 30, 2021

@mathemancer It seems fairly simple according to this Stack Overflow answer: https://stackoverflow.com/questions/2594829/finding-duplicate-values-in-a-sql-table. Do you see any issues with the recommended approach?

@mathemancer (Contributor) commented Aug 2, 2021

> @mathemancer It seems fairly simple according to this Stack Overflow answer: https://stackoverflow.com/questions/2594829/finding-duplicate-values-in-a-sql-table. Do you see any issues with the recommended approach?

The problem with the queries from that post is that you won't be able to SELECT all the columns (as noted in some of the comments) in the resulting table. From the PostgreSQL docs:

> When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

In the case where we want to find entire rows that are duplicates (which shouldn't happen for tables with a mathesar_id), the suggested queries (or similar) work. If we want to find rows where only some subset of the columns are duplicated, though, producing the rows in their entirety for the user to work on requires more querying: fancy things with CTEs or window functions, or perhaps joining on a sub-select. (I'm assuming we want to show the whole rows; otherwise, how would the user know which ones they might want to delete or modify?)
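The join-on-a-sub-select idea mentioned above can be sketched like this, again with an illustrative SQLite in-memory table standing in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO people (name, email) VALUES (?, ?)",
    [("Ann", "a@x.com"), ("Bob", "b@x.com"), ("Ann", "a@x.com")],
)

# The inner GROUP BY ... HAVING query finds the duplicated (name, email)
# pairs; joining it back to the base table recovers the full rows,
# including ungrouped columns like id, which GROUP BY alone cannot return.
rows = conn.execute("""
    SELECT p.id, p.name, p.email
    FROM people p
    JOIN (
        SELECT name, email FROM people
        GROUP BY name, email
        HAVING COUNT(*) > 1
    ) d ON p.name = d.name AND p.email = d.email
    ORDER BY p.id
""").fetchall()
print(rows)  # -> [(1, 'Ann', 'a@x.com'), (3, 'Ann', 'a@x.com')]
```

Note that a plain equality join like this treats NULLs in the duplicate-check columns differently from the window-function version, since `NULL = NULL` is not true in SQL.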

@kgodey added the `needs: unblocking` (Blocked by other work) label and removed the `ready` (Ready for implementation) and `work: database` labels Aug 2, 2021
@kgodey (Contributor, Author) commented Aug 2, 2021

@mathemancer Thanks! I created a separate DB issue.

@eito-fis (Contributor) commented:

What should the actual interface for this look like? Would setting `get_duplicate=True` on the record list endpoint, then returning the appropriately paginated and offset duplicates, be alright?

@kgodey (Contributor, Author) commented Aug 16, 2021

@eito-fis I think it should just be another type of filter, like everything else. Those are currently being passed in via a `filters` argument, right?

@eito-fis (Contributor) commented:

@kgodey I initially didn't think filters would work, but I think it might end up being pretty clean. Will give it a shot 👍

@eito-fis (Contributor) commented:

@kgodey Having gotten started, I'm again unsure how to implement this. I think the core issue is that all of the currently supported filtering parameters are things that fit inside a WHERE condition, and as a result all work with our conjunctions and nesting of conditionals.

To support the duplication filter in the nested structure we could:

  • Allow it to appear anywhere in the filter expression. This raises the question of how to define which columns to check for duplicates.
  • Make it a special case that can only be specified once, at the top level of the filters, and nowhere else.

If we don't need the nested interactions, I think a separate `check_duplicate_cols` parameter that takes a list of columns would be best, since it avoids special cases and extra validation.

If we did want to be able to use conjunctions, I think having `check_duplicate_cols` would still be useful, since we wouldn't have to deal with differing definitions of which columns to filter on. Alternatively, we could enforce that all duplication filters look at the same columns, or update our filtering to handle multiple sets of duplication columns. The latter seems the nicest of these options, but would also be the most complicated to implement.

Sorry, this got a bit longer than expected. I guess the core question is: do we need to be able to handle the `is_duplicate` condition inside conjunctions? Or can it just be a single, top-level filter?
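To make the "single top-level filter" option concrete, here is one hypothetical sketch of how such a filter could be separated from the ordinary WHERE-style filters. The `get_duplicates` op name and the payload shape are illustrative only, not the API that actually shipped in #569:

```python
# Hypothetical shape: at most one top-level duplicate filter may appear
# in the `filters` list, carrying the columns to check for duplicates.
filters = [
    {"op": "get_duplicates", "value": ["name", "email"]},
]

def extract_duplicate_filter(filters):
    """Pull out the (at most one) top-level duplicate filter, leaving
    ordinary condition filters for the existing WHERE machinery."""
    dup_cols = None
    rest = []
    for f in filters:
        if f.get("op") == "get_duplicates":
            if dup_cols is not None:
                raise ValueError("only one duplicate filter is allowed")
            dup_cols = f["value"]
        else:
            rest.append(f)
    return dup_cols, rest

cols, rest = extract_duplicate_filter(filters)
print(cols, rest)  # -> ['name', 'email'] []
```

Validating the single-occurrence rule up front like this is what makes the special case cheap compared to supporting the filter inside arbitrary nested conjunctions.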

@kgodey (Contributor, Author) commented Aug 16, 2021

If it's easier to do a single top level condition (which it sounds like it is), let's do that. We can refactor if needed when we are finalizing our API v1 structure.

@eito-fis (Contributor) commented:

@kgodey as a separate parameter, or as part of the filters?

@kgodey (Contributor, Author) commented Aug 16, 2021

If it's the same effort to implement, let's make it part of the filters.

@kgodey (Contributor, Author) commented Aug 17, 2021

Fixed by #569. @eito-fis FYI you need to have the word "Fixes" before each issue number to auto-close issues.

@kgodey closed this as completed Aug 17, 2021
@eito-fis (Contributor) commented:

I think I had 'fixes' instead of 'Fixes' ):

@kgodey (Contributor, Author) commented Aug 17, 2021

@eito-fis You had both issues comma-separated ("Fixes A, B"); you need to have "fixes A, fixes B".

@kgodey linked a pull request Aug 17, 2021 that will close this issue