
Add filter to show records with duplicate values in a set of columns #501

Closed
kgodey opened this issue Jul 28, 2021 · 14 comments · Fixed by #569
Labels: `needs: unblocking` (Blocked by other work), `work: backend` (Related to Python, Django, and simple SQL)

Comments

@kgodey (Contributor) commented Jul 28, 2021

Problem

The "Working with Columns" design spec adds a filter for duplicate values for a column, to assist users with resolving non-unique values when they want to set a unique constraint for a column.

Proposed solution

We need to add a filter for this to our Record API filters.

Additional context

@kgodey added this to the 07. Initial Data Types milestone Jul 28, 2021
@kgodey added the `ready` (Ready for implementation), `work: backend` (Related to Python, Django, and simple SQL), and `work: database` labels Jul 28, 2021
@mathemancer (Contributor) commented:

I think that simply finding the rows that have duplicate values should probably be its own issue in the backend. It will be a bit complex (the best way I know of involves a window function at the DB level).
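The window-function approach mentioned above can be sketched as follows. This is an illustration using SQLite's in-memory database rather than Mathesar's actual PostgreSQL backend; the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO people (name, email) VALUES (?, ?)",
    [("Ann", "a@x.com"), ("Bob", "b@x.com"), ("Ann", "a@x.com"), ("Cay", "c@x.com")],
)

# COUNT(*) OVER a partition of the candidate columns tags every row with
# the size of its duplicate group; rows with a count > 1 are duplicates.
# Unlike a plain GROUP BY, this keeps all columns of each row available.
rows = conn.execute("""
    SELECT id, name, email FROM (
        SELECT id, name, email,
               COUNT(*) OVER (PARTITION BY name, email) AS n
        FROM people
    ) WHERE n > 1
    ORDER BY id
""").fetchall()
print(rows)  # -> [(1, 'Ann', 'a@x.com'), (3, 'Ann', 'a@x.com')]
```

The same `COUNT(*) OVER (PARTITION BY ...)` construct works in PostgreSQL.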

@kgodey (Contributor, Author) commented Jul 30, 2021

@mathemancer It seems fairly simple according to this Stack Overflow answer: https://stackoverflow.com/questions/2594829/finding-duplicate-values-in-a-sql-table. Do you see any issues with the recommended approach?

@mathemancer (Contributor) commented Aug 2, 2021

> @mathemancer It seems fairly simple according to this Stack Overflow answer: https://stackoverflow.com/questions/2594829/finding-duplicate-values-in-a-sql-table. Do you see any issues with the recommended approach?

The problem with the queries from that post is that you won't be able to SELECT all the columns (as noted in some of the comments) in the resulting table. From the PostgreSQL docs:

> When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

In the case where we want to find entire rows that are duplicates (which shouldn't happen for tables with a mathesar_id), the suggested queries (or similar) work. If we want to find rows where only some subset of the columns are duplicated, though, producing the rows in their entirety for the user to work on requires more querying: fancy things with CTEs or window functions, or perhaps joining on a sub-select. (I'm assuming we want to show the whole rows; otherwise, how would the user know which ones they might want to delete or modify?)
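The join-on-a-sub-select idea mentioned above can be sketched like this, again with an illustrative SQLite in-memory table standing in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO people (name, email) VALUES (?, ?)",
    [("Ann", "a@x.com"), ("Bob", "b@x.com"), ("Ann", "a@x.com")],
)

# The inner GROUP BY ... HAVING query finds the duplicated (name, email)
# pairs; joining it back to the base table recovers the full rows,
# including ungrouped columns like id, which GROUP BY alone cannot return.
rows = conn.execute("""
    SELECT p.id, p.name, p.email
    FROM people p
    JOIN (
        SELECT name, email FROM people
        GROUP BY name, email
        HAVING COUNT(*) > 1
    ) d ON p.name = d.name AND p.email = d.email
    ORDER BY p.id
""").fetchall()
print(rows)  # -> [(1, 'Ann', 'a@x.com'), (3, 'Ann', 'a@x.com')]
```

Note that a plain equality join like this treats NULLs in the duplicate-check columns differently from the window-function version, since `NULL = NULL` is not true in SQL.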

@kgodey added the `needs: unblocking` (Blocked by other work) label and removed the `ready` (Ready for implementation) and `work: database` labels Aug 2, 2021
@kgodey (Contributor, Author) commented Aug 2, 2021

@mathemancer Thanks! I created a separate DB issue.

@eito-fis (Contributor) commented:

What should the actual interface for this look like? Would setting `get_duplicate=True` on the record list endpoint, then returning the appropriately paginated and offset duplicates, be alright?

@kgodey (Contributor, Author) commented Aug 16, 2021

@eito-fis I think it should just be another type of filter, like everything else. Those are currently being passed in via a `filters` argument, right?

@eito-fis (Contributor) commented:

@kgodey I initially didn't think filters would work, but I think it might end up being pretty clean. Will give it a shot 👍

@eito-fis (Contributor) commented:

@kgodey Having gotten started, I'm again unsure how to implement this. I think the core issue is that all of the currently supported filtering parameters are things that fit inside a WHERE condition, and as a result all work with our conjunctions and nesting of conditionals.

To support the duplication filter in the nested structure we could:

  • Allow it to appear anywhere in the filter expression. This raises the question of how to define which columns to check for duplicates.
  • Make it a special case that can only be specified once, at the top level of the filters, and nowhere else.

If we don't need the nested interactions, I think a separate `check_duplicate_cols` parameter that takes a list of columns would be best, since it avoids special cases and extra validation.

If we did want to be able to use conjunctions, I think having `check_duplicate_cols` would still be useful, since we wouldn't have to deal with differing definitions of which columns to filter on. Alternatively, we could enforce that all duplication filters look at the same columns, or update our filtering to handle multiple sets of duplication columns. The latter seems the nicest of these options, but would also be the most complicated to implement.

Sorry, this got a bit longer than expected. I guess the core question is: do we need to be able to handle the `is_duplicate` condition inside conjunctions? Or can it just be a single, top-level filter?
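To make the "single top-level filter" option concrete, here is one hypothetical sketch of how such a filter could be separated from the ordinary WHERE-style filters. The `get_duplicates` op name and the payload shape are illustrative only, not the API that actually shipped in #569:

```python
# Hypothetical shape: at most one top-level duplicate filter may appear
# in the `filters` list, carrying the columns to check for duplicates.
filters = [
    {"op": "get_duplicates", "value": ["name", "email"]},
]

def extract_duplicate_filter(filters):
    """Pull out the (at most one) top-level duplicate filter, leaving
    ordinary condition filters for the existing WHERE machinery."""
    dup_cols = None
    rest = []
    for f in filters:
        if f.get("op") == "get_duplicates":
            if dup_cols is not None:
                raise ValueError("only one duplicate filter is allowed")
            dup_cols = f["value"]
        else:
            rest.append(f)
    return dup_cols, rest

cols, rest = extract_duplicate_filter(filters)
print(cols, rest)  # -> ['name', 'email'] []
```

Validating the single-occurrence rule up front like this is what makes the special case cheap compared to supporting the filter inside arbitrary nested conjunctions.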

@kgodey (Contributor, Author) commented Aug 16, 2021

If it's easier to do a single top level condition (which it sounds like it is), let's do that. We can refactor if needed when we are finalizing our API v1 structure.

@eito-fis (Contributor) commented:

@kgodey as a separate parameter, or as part of the filters?

@kgodey (Contributor, Author) commented Aug 16, 2021

If it's the same effort to implement, let's make it part of the filters.

@kgodey (Contributor, Author) commented Aug 17, 2021

Fixed by #569. @eito-fis FYI you need to have the word "Fixes" before each issue number to auto-close issues.

@kgodey closed this as completed Aug 17, 2021
@eito-fis (Contributor) commented:

I think I had 'fixes' instead of 'Fixes' ):

@kgodey (Contributor, Author) commented Aug 17, 2021

@eito-fis You had both issues comma-separated ("Fixes A, B"); you need to have "fixes A, fixes B".

@kgodey linked a pull request Aug 17, 2021 that will close this issue