fix: table read query deduplicates identical rows #2608
Merged
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=795627bf407c0478f187ad092dd7d0837c708b81
gtarpenning commented Oct 4, 2024
gtarpenning changed the title from "test: table query sortby breaks" to "fix: table read query properly deduplicates identical rows" Oct 4, 2024
gtarpenning changed the title from "fix: table read query properly deduplicates identical rows" to "fix: table read query deduplicates identical rows" Oct 7, 2024
tssweeney reviewed Oct 7, 2024
tssweeney reviewed Oct 7, 2024
  FROM (
-     SELECT {row_digests_selection} as row_digests
+     SELECT {row_digests_selection} as row_digests, row_number() OVER (PARTITION BY project_id, digest) AS rn
Since you don't actually care about ordering, I think you can just do LIMIT 1 in the inner query in both cases and reduce the overall cost of this query.
tssweeney approved these changes Oct 7, 2024
The new table query introduced in #2534 does not deduplicate rows.

This PR

Previously, we used a subquery to deduplicate table rows by project_id, digest. See the old query here:

But now, in prod, we no longer do this deduplication and instead rely on a DISTINCT clause. However, we now use row_order, which does not have the same properties as row_number, and we no longer return unique digests. Here is the current query:

In this PR we make the minimally invasive change of adding LIMIT 1 to the subquery, which restricts identical rows to the first occurrence.

Performance

tldr: identical performance

Query run in prod, on a small table that requires deduplication (but is not getting it properly):

This PR, on the same table, returning the correct result:

Testing

test_table_query_with_duplicate_row_digests should cover our bases there.

In prod:

In branch:
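
For readers who want to see the deduplication idea in isolation, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the real database. The table_rows table, its columns (val, inserted_at) and the sample data are all hypothetical and are not taken from this PR. The sketch shows one way a plain DISTINCT can fail to collapse rows that share a (project_id, digest) pair, and how numbering rows within each pair and keeping the first one does. It is not necessarily the exact mechanism behind the prod behavior described above.

```python
import sqlite3

# Hypothetical in-memory table; the real schema and query differ.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_rows (
        project_id TEXT, digest TEXT, val TEXT, inserted_at INTEGER
    );
    INSERT INTO table_rows VALUES
        ('p1', 'd1', 'row one', 1),
        ('p1', 'd1', 'row one', 2),
        ('p1', 'd2', 'row two', 3);
    """
)

# DISTINCT compares whole rows, so the two copies of digest d1 both survive
# because their (hypothetical) inserted_at values differ.
distinct_rows = conn.execute(
    "SELECT DISTINCT project_id, digest, val, inserted_at FROM table_rows"
).fetchall()

# Number rows within each (project_id, digest) group and keep only the first,
# mirroring the row_number() OVER (PARTITION BY ...) shape in the diff above.
dedup_rows = conn.execute(
    """
    SELECT project_id, digest, val
    FROM (
        SELECT project_id, digest, val,
               row_number() OVER (PARTITION BY project_id, digest) AS rn
        FROM table_rows
    )
    WHERE rn = 1
    ORDER BY digest
    """
).fetchall()

print(len(distinct_rows))  # 3: the duplicate digest is still present
print(dedup_rows)          # one row per (project_id, digest)
```

Because no ordering is requested inside the partition, which copy gets rn = 1 is arbitrary. That is fine here since the copies carry identical content, which is also why the reviewer's suggestion of taking any single row per digest is a reasonable alternative.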