fix: table read query deduplicates identical rows #2608

gtarpenning · 2024-10-04T16:24:55Z

New table query introduced in: #2534 does not do row deduplication.

This pr

introduces a new test that is breaking in prod
adds deduplication to the table query on row number

Previously, we used a subquery for de-duplification of table rows by project_id, digest. See the old query here:

SELECT tr.digest, tr.val_dump
FROM (
    SELECT project_id, row_digest
    FROM (
        SELECT *
        FROM (
                SELECT *,
                    row_number() OVER (PARTITION BY project_id, digest) AS rn  -- <<<<<<
                FROM tables
                WHERE project_id = {project_id:String} AND digest = {digest:String}
            )
        WHERE rn = 1 -- <<<<<<
        ORDER BY project_id, digest
    )
    ARRAY JOIN row_digests AS row_digest
    WHERE digest = {digest:String}
) AS t

But now, in prod, we do not do this deduplification, instead relying on a DISTINCT clause. However, we now use row_order which does not have the same properties as row_number, and we no longer return unique digests. Here is the current query:

SELECT DISTINCT tr.digest, tr.val_dump, t.row_order
    FROM table_rows tr
    INNER JOIN (
        SELECT row_digest, row_number() OVER () AS row_order
        FROM (
            SELECT {row_digests_selection} as row_digests
            FROM tables
            WHERE project_id = {{{project_id_name}: String}}
            AND digest = {{{digest_name}: String}}
        )
        ARRAY JOIN row_digests AS row_digest
    ) AS t ON tr.digest = t.row_digest
    WHERE tr.project_id = {{{project_id_name}: String}} 
    ORDER BY row_order ASC

In this pr we make the minimally invasive change of adding LIMIT 1 to the subquery, which restricts identical rows to the first occurence

Performance:

tldr: identical performance

Query runs in prod, on a small table that requires deduplication (but is not getting it properly):

Elapsed: 0.124s. Read: 50,071 rows (81.28 MB)
Elapsed: 0.126s. Read: 50,071 rows (81.28 MB)
Elapsed: 0.110s. Read: 50,071 rows (81.28 MB)

This pr, on the same table, returning the correct result:

Elapsed: 0.099s. Read: 50,071 rows (81.28 MB)
Elapsed: 0.102s. Read: 50,071 rows (81.28 MB)
Elapsed: 0.093s. Read: 50,071 rows (81.28 MB)

Testing

Adds test when recreating an identical dataset. The act of re-creating an identical dataset inserts identical rows. This test ensures that when we read the dataset we get the original set of rows.
We want to be careful we aren't too aggressive in our deduplication, but the test test_table_query_with_duplicate_row_digests should cover our bases there.
Manual testing via SDK:

In prod:

recreating dataset
📦 Published to https://wandb.ai/griffin_wb/dedupe-prod/weave/objects/grammar-3/versions/p2BSs16ikm1jQGdSufWL2cBRXVrXUxeO5ZYHdPOzPhw
WeaveObject(ObjectRecord({'name': 'grammar-3', 'description': None, 'rows': <weave.trace.vals.WeaveTable object at 0x3229ecb20>, '_class_name': 'Dataset', '_bases': ['Object', 'BaseModel']}))
WeaveDict({'id': '0', 'sentence': 'He no likes ice cream.', 'correction': "He doesn't like ice cream."})
WeaveDict({'id': '1', 'sentence': 'She goed to the store.', 'correction': 'She went to the store.'})
WeaveDict({'id': '2', 'sentence': 'They plays video games all day.', 'correction': 'They play video games all day.'})
WeaveDict({'id': '0', 'sentence': 'He no likes ice cream.', 'correction': "He doesn't like ice cream."})
WeaveDict({'id': '1', 'sentence': 'She goed to the store.', 'correction': 'She went to the store.'})
WeaveDict({'id': '2', 'sentence': 'They plays video games all day.', 'correction': 'They play video games all day.'})

In branch:

recreating dataset
📦 Published to https://wandb.ai/griffin_wb/dedupe-prod/weave/objects/grammar-3/versions/p2BSs16ikm1jQGdSufWL2cBRXVrXUxeO5ZYHdPOzPhw
WeaveObject(ObjectRecord({'name': 'grammar-3', 'description': None, 'rows': <weave.trace.vals.WeaveTable object at 0x3200ec9d0>, '_class_name': 'Dataset', '_bases': ['Object', 'BaseModel']}))
WeaveDict({'id': '0', 'sentence': 'He no likes ice cream.', 'correction': "He doesn't like ice cream."})
WeaveDict({'id': '1', 'sentence': 'She goed to the store.', 'correction': 'She went to the store.'})
WeaveDict({'id': '2', 'sentence': 'They plays video games all day.', 'correction': 'They play video games all day.'})

circle-job-mirror · 2024-10-04T16:31:21Z

Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=795627bf407c0478f187ad092dd7d0837c708b81

tests/trace/test_table_query.py

tssweeney · 2024-10-07T16:49:08Z

weave/trace_server/table_query_builder.py

        FROM (
-            SELECT {row_digests_selection} as row_digests
+            SELECT {row_digests_selection} as row_digests, row_number() OVER (PARTITION BY project_id, digest) AS rn


Since you don't actually care about ordering, i think you can just do LIMIT 1 in the inner query in both cases and reduce the overall cost of this query

gtarpenning added 2 commits October 4, 2024 09:23

does this break

297bcf1

add assert

f27f2fa

?

0c475d2

gtarpenning commented Oct 4, 2024

View reviewed changes

tests/trace/test_table_query.py Outdated Show resolved Hide resolved

gtarpenning changed the title ~~does this break~~ test: table query sortby breaks Oct 4, 2024

gtarpenning added 7 commits October 4, 2024 09:42

b

627897d

test

3ea22da

old server code

6ee8a2a

test

4206ef0

failing

7767dbf

naming

631e6ef

do row deduplication in table query

7073806

gtarpenning requested a review from jamie-rasmussen October 4, 2024 19:06

gtarpenning marked this pull request as ready for review October 4, 2024 19:06

gtarpenning requested a review from a team as a code owner October 4, 2024 19:06

gtarpenning changed the title ~~test: table query sortby breaks~~ fix: table read query properly deduplicates identical rows Oct 4, 2024

gtarpenning changed the title ~~fix: table read query properly deduplicates identical rows~~ fix: table read query deduplicates identical rows Oct 7, 2024

tssweeney reviewed Oct 7, 2024

View reviewed changes

tests/trace/test_table_query.py Show resolved Hide resolved

tssweeney reviewed Oct 7, 2024

View reviewed changes

gtarpenning added 3 commits October 7, 2024 09:50

Merge branch 'master' into griffin/table-query-test

4c80b80

test limit 1

b34b1c2

undo-comment

9e81d5d

gtarpenning requested a review from tssweeney October 7, 2024 16:59

tssweeney approved these changes Oct 7, 2024

View reviewed changes

gtarpenning merged commit 483ae68 into master Oct 7, 2024
80 checks passed

gtarpenning deleted the griffin/table-query-test branch October 7, 2024 17:05

github-actions bot locked and limited conversation to collaborators Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: table read query deduplicates identical rows #2608

fix: table read query deduplicates identical rows #2608

gtarpenning commented Oct 4, 2024 •

edited

Loading

circle-job-mirror bot commented Oct 4, 2024 •

edited

Loading

tssweeney Oct 7, 2024

fix: table read query deduplicates identical rows #2608

fix: table read query deduplicates identical rows #2608

Conversation

gtarpenning commented Oct 4, 2024 • edited Loading

Performance:

Testing

circle-job-mirror bot commented Oct 4, 2024 • edited Loading

tssweeney Oct 7, 2024

Choose a reason for hiding this comment

gtarpenning commented Oct 4, 2024 •

edited

Loading

circle-job-mirror bot commented Oct 4, 2024 •

edited

Loading