Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: table read query deduplicates identical rows #2608

Merged
merged 13 commits into from
Oct 7, 2024

Conversation

gtarpenning
Copy link
Member

@gtarpenning gtarpenning commented Oct 4, 2024

New table query introduced in: #2534 does not do row deduplication.

This pr

Previously, we used a subquery for de-duplification of table rows by project_id, digest. See the old query here:

SELECT tr.digest, tr.val_dump
FROM (
    SELECT project_id, row_digest
    FROM (
        SELECT *
        FROM (
                SELECT *,
                    row_number() OVER (PARTITION BY project_id, digest) AS rn  -- <<<<<<
                FROM tables
                WHERE project_id = {project_id:String} AND digest = {digest:String}
            )
        WHERE rn = 1 -- <<<<<<
        ORDER BY project_id, digest
    )
    ARRAY JOIN row_digests AS row_digest
    WHERE digest = {digest:String}
) AS t

But now, in prod, we do not do this deduplification, instead relying on a DISTINCT clause. However, we now use row_order which does not have the same properties as row_number, and we no longer return unique digests. Here is the current query:

SELECT DISTINCT tr.digest, tr.val_dump, t.row_order
    FROM table_rows tr
    INNER JOIN (
        SELECT row_digest, row_number() OVER () AS row_order
        FROM (
            SELECT {row_digests_selection} as row_digests
            FROM tables
            WHERE project_id = {{{project_id_name}: String}}
            AND digest = {{{digest_name}: String}}
        )
        ARRAY JOIN row_digests AS row_digest
    ) AS t ON tr.digest = t.row_digest
    WHERE tr.project_id = {{{project_id_name}: String}} 
    ORDER BY row_order ASC

In this pr we make the minimally invasive change of adding LIMIT 1 to the subquery, which restricts identical rows to the first occurence

Performance:

tldr: identical performance

Query runs in prod, on a small table that requires deduplication (but is not getting it properly):

  1. Elapsed: 0.124s. Read: 50,071 rows (81.28 MB)
  2. Elapsed: 0.126s. Read: 50,071 rows (81.28 MB)
  3. Elapsed: 0.110s. Read: 50,071 rows (81.28 MB)

This pr, on the same table, returning the correct result:

  1. Elapsed: 0.099s. Read: 50,071 rows (81.28 MB)
  2. Elapsed: 0.102s. Read: 50,071 rows (81.28 MB)
  3. Elapsed: 0.093s. Read: 50,071 rows (81.28 MB)

Testing

  • Adds test when recreating an identical dataset. The act of re-creating an identical dataset inserts identical rows. This test ensures that when we read the dataset we get the original set of rows.
  • We want to be careful we aren't too aggressive in our deduplication, but the test test_table_query_with_duplicate_row_digests should cover our bases there.
  • Manual testing via SDK:

In prod:

recreating dataset
📦 Published to https://wandb.ai/griffin_wb/dedupe-prod/weave/objects/grammar-3/versions/p2BSs16ikm1jQGdSufWL2cBRXVrXUxeO5ZYHdPOzPhw
WeaveObject(ObjectRecord({'name': 'grammar-3', 'description': None, 'rows': <weave.trace.vals.WeaveTable object at 0x3229ecb20>, '_class_name': 'Dataset', '_bases': ['Object', 'BaseModel']}))
WeaveDict({'id': '0', 'sentence': 'He no likes ice cream.', 'correction': "He doesn't like ice cream."})
WeaveDict({'id': '1', 'sentence': 'She goed to the store.', 'correction': 'She went to the store.'})
WeaveDict({'id': '2', 'sentence': 'They plays video games all day.', 'correction': 'They play video games all day.'})
WeaveDict({'id': '0', 'sentence': 'He no likes ice cream.', 'correction': "He doesn't like ice cream."})
WeaveDict({'id': '1', 'sentence': 'She goed to the store.', 'correction': 'She went to the store.'})
WeaveDict({'id': '2', 'sentence': 'They plays video games all day.', 'correction': 'They play video games all day.'})
Screenshot 2024-10-04 at 12 20 44 PM

In branch:

recreating dataset
📦 Published to https://wandb.ai/griffin_wb/dedupe-prod/weave/objects/grammar-3/versions/p2BSs16ikm1jQGdSufWL2cBRXVrXUxeO5ZYHdPOzPhw
WeaveObject(ObjectRecord({'name': 'grammar-3', 'description': None, 'rows': <weave.trace.vals.WeaveTable object at 0x3200ec9d0>, '_class_name': 'Dataset', '_bases': ['Object', 'BaseModel']}))
WeaveDict({'id': '0', 'sentence': 'He no likes ice cream.', 'correction': "He doesn't like ice cream."})
WeaveDict({'id': '1', 'sentence': 'She goed to the store.', 'correction': 'She went to the store.'})
WeaveDict({'id': '2', 'sentence': 'They plays video games all day.', 'correction': 'They play video games all day.'})
Screenshot 2024-10-04 at 12 20 57 PM

@circle-job-mirror
Copy link

circle-job-mirror bot commented Oct 4, 2024

@gtarpenning gtarpenning changed the title does this break test: table query sortby breaks Oct 4, 2024
@gtarpenning gtarpenning marked this pull request as ready for review October 4, 2024 19:06
@gtarpenning gtarpenning requested a review from a team as a code owner October 4, 2024 19:06
@gtarpenning gtarpenning changed the title test: table query sortby breaks fix: table read query properly deduplicates identical rows Oct 4, 2024
@gtarpenning gtarpenning changed the title fix: table read query properly deduplicates identical rows fix: table read query deduplicates identical rows Oct 7, 2024
FROM (
SELECT {row_digests_selection} as row_digests
SELECT {row_digests_selection} as row_digests, row_number() OVER (PARTITION BY project_id, digest) AS rn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you don't actually care about ordering, i think you can just do LIMIT 1 in the inner query in both cases and reduce the overall cost of this query

@gtarpenning gtarpenning requested a review from tssweeney October 7, 2024 16:59
@gtarpenning gtarpenning merged commit 483ae68 into master Oct 7, 2024
80 checks passed
@gtarpenning gtarpenning deleted the griffin/table-query-test branch October 7, 2024 17:05
@github-actions github-actions bot locked and limited conversation to collaborators Oct 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants