sql: Django introspection query performance regression (still) #100871

timgraham · 2023-04-06T20:41:12Z

Describe the problem

A regression is causing the following query (which Django's test suite runs quite often) to take 1-2 seconds (instead of milliseconds).

To Reproduce

SELECT
    c.relname,
    CASE
        WHEN c.relispartition THEN 'p'
        WHEN c.relkind IN ('m', 'v') THEN 'v'
        ELSE 't'
    END
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('f', 'm', 'p', 'r', 'v')
    AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
    AND pg_catalog.pg_table_is_visible(c.oid)

Environment:

CockroachDB version v23.1.0-alpha.8-dev-af563f889
Build Time: 2023/04/06 20:05:50

Additional context

The same issue was reported in #93955 where performance was improved a little.

Related PRs:

The full Django test suite runtime is up from ~40 minutes with 22.2.x to ~1.5 hrs (though I'm not sure it's entirely attributable to this query).

Jira issue: CRDB-26663

blathers-crl · 2023-04-06T20:41:14Z

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

@cockroachdb/sql-sessions (found keywords: pg_,Django)
@rafiss (commented on sql: Django introspection query performance regression #93955)

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try find someone from here if you know they worked closely on the area and CC them.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

yuzefovich · 2023-04-06T21:02:50Z

I think it's a GA-blocker until we triage further, cc @mgartner

rafiss · 2023-04-07T04:17:54Z

I believe we had previously narrowed this down to pg_table_is_visible being changed to be a pre-defined SQL function, but we should double check.

DrewKimball · 2023-04-12T04:24:38Z

Unless changes have been made to the descriptor cache since 22.2, this might be due to some difference between how UDFs interact with the cache and how the internal executor does it. Either way, the fix might be to improve the behavior with many uncached descriptor ids.

I loaded a database with 15000 tables similar to the example here. Running the query from this issue on that database took about 1.4s on my machine. I tried marking the builtin as volatile so it wouldn't be inlined, and that increased the time to ~3s. Running the query on 22.2.2 ranged from ~600ms to ~900ms on my machine. Fetching uncached descriptors in cachedCatalogReader.GetByIDs happens twice when the query is executed, and one of those calls takes about 1s. I tried running with this diff (which may or may not be correct):

diff --git a/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go b/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go
index 0f4e0e2b0d4..ea5fbdefed0 100644
--- a/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go
+++ b/pkg/sql/catalog/internal/catkv/catalog_reader_cached.go
@@ -308,6 +308,9 @@ func (c *cachedCatalogReader) GetByIDs(
                ids[i], ids[numUncached] = ids[numUncached], id
                numUncached++
        }
+       if numUncached > 1000 {
+               return c.ScanAll(ctx, txn)
+       }
        if numUncached > 0 && !(c.hasScanAll && !isDescriptorRequired) {
                uncachedIDs := ids[:numUncached]
                read, err := c.cr.GetByIDs(ctx, txn, uncachedIDs, isDescriptorRequired, expectedType)

to pessimistically scan for all IDs at once if too many were uncached. ScanAll takes ~300-400ms in this case, and the resulting overall execution time on my machine was consistently ~800ms.

rafiss · 2023-04-12T16:11:08Z

Nice findings @DrewKimball! let me check with others on the schema team about your patch.

rafiss · 2023-04-14T19:28:02Z

@DrewKimball I think we can use your patch. Do you plan to push it, or would your like our team to take over?

DrewKimball · 2023-04-17T16:01:43Z

I'm not sure I'd want to push it as-is - I didn't do any testing to figure out whether this was a good threshold to transition to ScanAll, just picked it to trigger for my 15,000 descriptor test case. Is there any intuition about whether 1000 is really a good value or not?

fqazi · 2023-04-17T18:17:43Z

@DrewKimball I think we can get away with dropping it a bit lower to about ~500 descriptors and treat those as reads that need every single descriptor. I believe this is a reasonable solution shorter term and long term we need some type of caching to fix these scenarios

Previously, the catalog descriptor would fetch descriptors via point lookups using Get when scanning large batches of descriptors. This was further extended to also look up ZoneConfigs and comments in a similar way. Recently, we we started seeing regression on the Django test suite involving the pg_catalog tables, which tend to do read large number of descriptors, likely linked to extra overhead linked to both comments and zone configs in 23.1. To address this, this patch ill now start using scans for runs of descriptor IDs for batch scans which reduces the overall cost of fetching a large number of descriptors hiding this cost. Fixes: cockroachdb#100871 Release note: None

101937: catalog: utilize scans for a large number of descriptors r=fqazi a=fqazi Previously, the catalog descriptor would fetch descriptors via point lookups using Get when scanning large batches of descriptors. This was further extended to also look up ZoneConfigs and comments in a similar way. Recently, we we started seeing regression on the Django test suite involving the pg_catalog tables, which tend to do read large number of descriptors, likely linked to extra overhead linked to both comments and zone configs in 23.1. To address this, this patch ill now start using scans for runs of descriptor IDs for batch scans which reduces the overall cost of fetching a large number of descriptors hiding this cost. Fixes: #100871 Release note: None 101943: build,release: provide way to inject build tag override, use in MAPB r=rail a=rickystewart When we build nightly builds, we build them in "release" configuration. Because we do this, the build tag reported by these binaries from `cockroach version` has been identical to that of an actual release binary, which is very confusing. Here we update the script to build these nightlies (`make-and-publish-build-artifacts.sh`) to inject an appropriate, identifiable build tag. It is probably wrong that we are building these nightlies as if they were "releases". We'll follow up with work to fix this and refine the build tags further. Closes #100532. Epic: None Release note (build change): Update reported `Build Tag` for nightly (non-release) builds 101944: sem/tree: remove TODO r=mgartner a=mgartner This commit removes a TODO that was addressed by #96045. Epic: None Release note: None 101952: ui: fix linegraph render for large clusters r=zachlite a=zachlite We were feeding `Math.min` and `Math.max` very large arrays (length equal to ~10e7). These functions take variadic arguments, and the runtime can't handle that many arguments. As a result we blow the stack, causing uncaught range errors. Instead, we'll use lodash min and max which are not variadic. Resolves #101377 Epic: None Release note: None Reproduction and Fix demonstrated on the command line: <img width="722" alt="Screenshot 2023-04-20 at 4 56 41 PM" src="https://user-images.githubusercontent.com/5423191/233486016-fd740a98-9a0d-4fef-a6d8-67e9cb4e318e.png"> Co-authored-by: Faizan Qazi <[email protected]> Co-authored-by: Ricky Stewart <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Zach Lite <[email protected]>

Previously, the catalog descriptor would fetch descriptors via point lookups using Get when scanning large batches of descriptors. This was further extended to also look up ZoneConfigs and comments in a similar way. Recently, we we started seeing regression on the Django test suite involving the pg_catalog tables, which tend to do read large number of descriptors, likely linked to extra overhead linked to both comments and zone configs in 23.1. To address this, this patch ill now start using scans for runs of descriptor IDs for batch scans which reduces the overall cost of fetching a large number of descriptors hiding this cost. Fixes: #100871 Release note: None

timgraham added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 6, 2023

blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Apr 6, 2023

blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Apr 6, 2023

timgraham mentioned this issue Apr 6, 2023

sql: Django introspection query performance regression #93955

Closed

blathers-crl bot added the T-sql-queries SQL Queries Team label Apr 6, 2023

yuzefovich added GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Apr 6, 2023

DrewKimball self-assigned this Apr 11, 2023

fqazi assigned fqazi and unassigned DrewKimball Apr 19, 2023

fqazi added the T-sql-schema-deprecated Use T-sql-foundations instead label Apr 19, 2023

fqazi mentioned this issue Apr 20, 2023

catalog: utilize scans for a large number of descriptors #101937

Merged

craig bot closed this as completed in eb8bdeb Apr 20, 2023

blathers-crl bot mentioned this issue Apr 20, 2023

release-23.1: catalog: utilize scans for a large number of descriptors #101966

Merged

blathers-crl bot mentioned this issue Apr 20, 2023

release-23.1.0: catalog: utilize scans for a large number of descriptors #101968

Merged

exalate-issue-sync bot removed T-sql-schema-deprecated Use T-sql-foundations instead T-sql-queries SQL Queries Team labels Apr 20, 2023

fqazi reopened this Apr 21, 2023

fqazi closed this as completed Apr 21, 2023

dikshant mentioned this issue Apr 28, 2023

roachtest: add roachtest/workload for testing latency of ORM introspection queries #96359

Closed

timgraham mentioned this issue May 7, 2023

75% increase in Django test suite run time between CockroachDB 22.2 and 23.1 #102851

Closed

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

timgraham mentioned this issue Aug 7, 2023

pg_table_is_visible slower in CockroachDB 23.1 #108334

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: Django introspection query performance regression (still) #100871

sql: Django introspection query performance regression (still) #100871

timgraham commented Apr 6, 2023 •

edited by rafiss

Loading

blathers-crl bot commented Apr 6, 2023

yuzefovich commented Apr 6, 2023

rafiss commented Apr 7, 2023

DrewKimball commented Apr 12, 2023

rafiss commented Apr 12, 2023

rafiss commented Apr 14, 2023

DrewKimball commented Apr 17, 2023

fqazi commented Apr 17, 2023

sql: Django introspection query performance regression (still) #100871

sql: Django introspection query performance regression (still) #100871

Comments

timgraham commented Apr 6, 2023 • edited by rafiss Loading

blathers-crl bot commented Apr 6, 2023

yuzefovich commented Apr 6, 2023

rafiss commented Apr 7, 2023

DrewKimball commented Apr 12, 2023

rafiss commented Apr 12, 2023

rafiss commented Apr 14, 2023

DrewKimball commented Apr 17, 2023

fqazi commented Apr 17, 2023

timgraham commented Apr 6, 2023 •

edited by rafiss

Loading