Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Django introspection query performance regression (still) #100871

Closed
timgraham opened this issue Apr 6, 2023 · 8 comments · Fixed by #101937
Closed

sql: Django introspection query performance regression (still) #100871

timgraham opened this issue Apr 6, 2023 · 8 comments · Fixed by #101937
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner

Comments

@timgraham
Copy link
Contributor

timgraham commented Apr 6, 2023

Describe the problem

A regression is causing the following query (which Django's test suite runs quite often) to take 1-2 seconds (instead of milliseconds).

To Reproduce

SELECT
    c.relname,
    CASE
        WHEN c.relispartition THEN 'p'
        WHEN c.relkind IN ('m', 'v') THEN 'v'
        ELSE 't'
    END
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('f', 'm', 'p', 'r', 'v')
    AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
    AND pg_catalog.pg_table_is_visible(c.oid)

Environment:

  • CockroachDB version v23.1.0-alpha.8-dev-af563f889
    Build Time: 2023/04/06 20:05:50

Additional context

The same issue was reported in #93955 where performance was improved a little.

Related PRs:

The full Django test suite runtime is up from ~40 minutes with 22.2.x to ~1.5 hrs (though I'm not sure it's entirely attributable to this query).

Jira issue: CRDB-26663

@timgraham timgraham added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 6, 2023
@blathers-crl
Copy link

blathers-crl bot commented Apr 6, 2023

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Apr 6, 2023
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Apr 6, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Apr 6, 2023
@yuzefovich yuzefovich added GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Apr 6, 2023
@yuzefovich
Copy link
Member

I think it's a GA-blocker until we triage further, cc @mgartner

@rafiss
Copy link
Collaborator

rafiss commented Apr 7, 2023

I believe we had previously narrowed this down to pg_table_is_visible being changed to be a pre-defined SQL function, but we should double check.

@DrewKimball DrewKimball self-assigned this Apr 11, 2023
@DrewKimball
Copy link
Collaborator

Unless changes have been made to the descriptor cache since 22.2, this might be due to some difference between how UDFs interact with the cache and how the internal executor does it. Either way, the fix might be to improve the behavior with many uncached descriptor ids.

I loaded a database with 15000 tables similar to the example here. Running the query from this issue on that database took about 1.4s on my machine. I tried marking the builtin as volatile so it wouldn't be inlined, and that increased the time to ~3s. Running the query on 22.2.2 ranged from ~600ms to ~900ms on my machine. Fetching uncached descriptors in cachedCatalogReader.GetByIDs happens twice when the query is executed, and one of those calls takes about 1s. I tried running with this diff (which may or may not be correct):

diff --git a/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go b/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go
index 0f4e0e2b0d4..ea5fbdefed0 100644
--- a/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go
+++ b/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go
@@ -308,6 +308,9 @@ func (c *cachedCatalogReader) GetByIDs(
                ids[i], ids[numUncached] = ids[numUncached], id
                numUncached++
        }
+       if numUncached > 1000 {
+               return c.ScanAll(ctx, txn)
+       }
        if numUncached > 0 && !(c.hasScanAll && !isDescriptorRequired) {
                uncachedIDs := ids[:numUncached]
                read, err := c.cr.GetByIDs(ctx, txn, uncachedIDs, isDescriptorRequired, expectedType)

to pessimistically scan for all IDs at once if too many were uncached. ScanAll takes ~300-400ms in this case, and the resulting overall execution time on my machine was consistently ~800ms.

@rafiss
Copy link
Collaborator

rafiss commented Apr 12, 2023

Nice findings @DrewKimball! let me check with others on the schema team about your patch.

@rafiss
Copy link
Collaborator

rafiss commented Apr 14, 2023

@DrewKimball I think we can use your patch. Do you plan to push it, or would your like our team to take over?

@DrewKimball
Copy link
Collaborator

I'm not sure I'd want to push it as-is - I didn't do any testing to figure out whether this was a good threshold to transition to ScanAll, just picked it to trigger for my 15,000 descriptor test case. Is there any intuition about whether 1000 is really a good value or not?

@fqazi
Copy link
Collaborator

fqazi commented Apr 17, 2023

@DrewKimball I think we can get away with dropping it a bit lower to about ~500 descriptors and treat those as reads that need every single descriptor. I believe this is a reasonable solution shorter term and long term we need some type of caching to fix these scenarios

@fqazi fqazi assigned fqazi and unassigned DrewKimball Apr 19, 2023
@fqazi fqazi added the T-sql-schema-deprecated Use T-sql-foundations instead label Apr 19, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Apr 20, 2023
Previously, the catalog descriptor would fetch descriptors
via point lookups using Get when scanning large batches
of descriptors. This was further extended to also look up
ZoneConfigs and comments in a similar way. Recently, we
we started seeing regression on the Django test suite involving the
pg_catalog tables, which tend to do read large number of
descriptors, likely linked to extra overhead linked
to both comments and zone configs in 23.1. To address this,
this patch ill now start using scans for runs of descriptor IDs for batch
scans which reduces the overall cost of fetching a large number
of descriptors hiding this cost.

Fixes: cockroachdb#100871

Release note: None
fqazi added a commit to fqazi/cockroach that referenced this issue Apr 20, 2023
Previously, the catalog descriptor would fetch descriptors
via point lookups using Get when scanning large batches
of descriptors. This was further extended to also look up
ZoneConfigs and comments in a similar way. Recently, we
we started seeing regression on the Django test suite involving the
pg_catalog tables, which tend to do read large number of
descriptors, likely linked to extra overhead linked
to both comments and zone configs in 23.1. To address this,
this patch ill now start using scans for runs of descriptor IDs for batch
scans which reduces the overall cost of fetching a large number
of descriptors hiding this cost.

Fixes: cockroachdb#100871

Release note: None
craig bot pushed a commit that referenced this issue Apr 20, 2023
101937: catalog: utilize scans for a large number of descriptors r=fqazi a=fqazi

Previously, the catalog descriptor would fetch descriptors via point lookups using Get when scanning large batches of descriptors. This was further extended to also look up ZoneConfigs and comments in a similar way. Recently, we we started seeing regression on the Django test suite involving the pg_catalog tables, which tend to do read large number of descriptors, likely linked to extra overhead linked to both comments and zone configs in 23.1. To address this, this patch ill now start using scans for runs of descriptor IDs for batch scans which reduces the overall cost of fetching a large number of descriptors hiding this cost.

Fixes: #100871

Release note: None

101943: build,release: provide way to inject build tag override, use in MAPB r=rail a=rickystewart

When we build nightly builds, we build them in "release" configuration. Because we do this, the build tag reported by these binaries from `cockroach version` has been identical to that of an actual release binary, which is very confusing. Here we update the script to build these nightlies (`make-and-publish-build-artifacts.sh`) to inject an appropriate, identifiable build tag.

It is probably wrong that we are building these nightlies as if they were "releases". We'll follow up with work to fix this and refine the build tags further.

Closes #100532.
Epic: None
Release note (build change): Update reported `Build Tag` for nightly (non-release) builds

101944: sem/tree: remove TODO r=mgartner a=mgartner

This commit removes a TODO that was addressed by #96045.

Epic: None

Release note: None

101952: ui: fix linegraph render for large clusters r=zachlite a=zachlite

We were feeding `Math.min` and `Math.max` very large arrays (length equal to ~10e7). These functions take variadic arguments, and the runtime can't handle that many arguments. As a result we blow the stack, causing uncaught range errors.

Instead, we'll use lodash min and max which are not variadic. Resolves #101377 Epic: None
Release note: None


Reproduction and Fix demonstrated on the command line:

<img width="722" alt="Screenshot 2023-04-20 at 4 56 41 PM" src="https://user-images.githubusercontent.com/5423191/233486016-fd740a98-9a0d-4fef-a6d8-67e9cb4e318e.png">


Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Zach Lite <[email protected]>
@craig craig bot closed this as completed in eb8bdeb Apr 20, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 20, 2023
Previously, the catalog descriptor would fetch descriptors
via point lookups using Get when scanning large batches
of descriptors. This was further extended to also look up
ZoneConfigs and comments in a similar way. Recently, we
we started seeing regression on the Django test suite involving the
pg_catalog tables, which tend to do read large number of
descriptors, likely linked to extra overhead linked
to both comments and zone configs in 23.1. To address this,
this patch ill now start using scans for runs of descriptor IDs for batch
scans which reduces the overall cost of fetching a large number
of descriptors hiding this cost.

Fixes: #100871

Release note: None
@exalate-issue-sync exalate-issue-sync bot removed T-sql-schema-deprecated Use T-sql-foundations instead T-sql-queries SQL Queries Team labels Apr 20, 2023
@fqazi fqazi reopened this Apr 21, 2023
fqazi added a commit that referenced this issue Apr 21, 2023
Previously, the catalog descriptor would fetch descriptors
via point lookups using Get when scanning large batches
of descriptors. This was further extended to also look up
ZoneConfigs and comments in a similar way. Recently, we
we started seeing regression on the Django test suite involving the
pg_catalog tables, which tend to do read large number of
descriptors, likely linked to extra overhead linked
to both comments and zone configs in 23.1. To address this,
this patch ill now start using scans for runs of descriptor IDs for batch
scans which reduces the overall cost of fetching a large number
of descriptors hiding this cost.

Fixes: #100871

Release note: None
@fqazi fqazi closed this as completed Apr 21, 2023
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants