Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Crash in YbPgInheritsCacheDelete: Assert(entry->refcount == 1); #19807

Closed
1 task done
jasonyb opened this issue Nov 2, 2023 · 0 comments
Closed
1 task done
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@jasonyb
Copy link
Contributor

jasonyb commented Nov 2, 2023

Jira Link: DB-8741

Description

./yb_build.sh fastdebug --gcc11 --java-test TestPgRegressIndex crashes on yb_pg_indexing test. Analyzing the coredump,

(gdb) bt
#0  0x00007f80a9770acf in raise () from /lib64/libc.so.6
#1  0x00007f80a9743ea5 in abort () from /lib64/libc.so.6
#2  0x0000000000a4344f in ExceptionalCondition (conditionName=conditionName@entry=0xd197fc "!(entry->refcount == 1)", errorType=errorType@entry=0xb48880 "FailedAssertion",
    fileName=fileName@entry=0xd193f8 "../../../../../../../src/postgres/src/backend/utils/cache/yb_inheritscache.c", lineNumber=lineNumber@entry=317)
    at ../../../../../../../src/postgres/src/backend/utils/error/assert.c:54
#3  0x0000000000a4311a in YbPgInheritsCacheDelete (entry=entry@entry=0x678ff8f1728) at ../../../../../../../src/postgres/src/backend/utils/cache/yb_inheritscache.c:317
#4  0x0000000000a4330a in YbPgInheritsCacheInvalidate (relid=relid@entry=16456) at ../../../../../../../src/postgres/src/backend/utils/cache/yb_inheritscache.c:364
#5  0x0000000000a433ab in YbPgInheritsCacheRelCallback (arg=<optimized out>, relid=16456) at ../../../../../../../src/postgres/src/backend/utils/cache/yb_inheritscache.c:80
#6  0x0000000000a22e5d in LocalExecuteInvalidationMessage (msg=0x678ff342880) at ../../../../../../../src/postgres/src/backend/utils/cache/inval.c:603
#7  0x0000000000a21c59 in ProcessInvalidationMessages (hdr=0x678ff4ea200, func=func@entry=0xa22d6d <LocalExecuteInvalidationMessage>)
    at ../../../../../../../src/postgres/src/backend/utils/cache/inval.c:464
#8  0x0000000000a22537 in CommandEndInvalidationMessages () at ../../../../../../../src/postgres/src/backend/utils/cache/inval.c:1136
#9  0x000000000055c05e in AtCCI_LocalCache () at ../../../../../../../src/postgres/src/backend/access/transam/xact.c:1470
#10 CommandCounterIncrement () at ../../../../../../../src/postgres/src/backend/access/transam/xact.c:1002
#11 0x00000000005a0024 in deleteOneObject (object=<optimized out>, depRel=depRel@entry=0x7fff79653338, flags=flags@entry=0)
    at ../../../../../../src/postgres/src/backend/catalog/dependency.c:1127
#12 0x00000000005a00dc in deleteObjectsInList (targetObjects=targetObjects@entry=0x678ff5ca4e0, depRel=depRel@entry=0x7fff79653338, flags=flags@entry=0)
    at ../../../../../../src/postgres/src/backend/catalog/dependency.c:273
#13 0x00000000005a02a4 in performMultipleDeletions (objects=objects@entry=0x678ff5ca238, behavior=DROP_RESTRICT, flags=flags@entry=0)
    at ../../../../../../src/postgres/src/backend/catalog/dependency.c:430
#14 0x00000000006c5604 in RemoveRelations (drop=drop@entry=0x678ffdd3b90) at ../../../../../../src/postgres/src/backend/commands/tablecmds.c:1430
#15 0x00000000008f95ef in ExecDropStmt (stmt=stmt@entry=0x678ffdd3b90, isTopLevel=isTopLevel@entry=true) at ../../../../../../src/postgres/src/backend/tcop/utility.c:1835
#16 0x00000000008fcb98 in ProcessUtilitySlow (pstate=pstate@entry=0x678ff5ca120, pstmt=pstmt@entry=0x678ffdd3c60, queryString=queryString@entry=0x678ffdd2120 "drop table idxpart;",
    context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=0x678ffdd3f50, completionTag=0x7fff79654030 "")
    at ../../../../../../src/postgres/src/backend/tcop/utility.c:1680
#17 0x00000000008fb41d in standard_ProcessUtility (pstmt=pstmt@entry=0x678ffdd3c60, queryString=queryString@entry=0x678ffdd2120 "drop table idxpart;",
    context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x678ffdd3f50, completionTag=0x7fff79654030 "")
    at ../../../../../../src/postgres/src/backend/tcop/utility.c:878
#18 0x00000000008fb6db in YBProcessUtilityDefaultHook (pstmt=0x678ffdd3c60, queryString=0x678ffdd2120 "drop table idxpart;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x678ffdd3f50, completionTag=0x7fff79654030 "") at ../../../../../../src/postgres/src/backend/tcop/utility.c:3612
#19 0x00007f80981fb681 in pgss_ProcessUtility (pstmt=0x678ffdd3c60, queryString=0x678ffdd2120 "drop table idxpart;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x678ffdd3f50, completionTag=0x7fff79654030 "") at ../../../../../src/postgres/contrib/pg_stat_statements/pg_stat_statements.c:1405
#20 0x00007f80ab900540 in ybpgm_ProcessUtility (pstmt=0x678ffdd3c60, queryString=0x678ffdd2120 "drop table idxpart;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x678ffdd3f50, completionTag=0x7fff79654030 "") at ../../../../../src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c:694
#21 0x00007f80981d4755 in pgaudit_NextProcessUtility_hook (pstmt=pstmt@entry=0x678ffdd3c60, queryString=queryString@entry=0x678ffdd2120 "drop table idxpart;",
    context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x678ffdd3f50, completionTag=0x7fff79654030 "")
    at ../../../../../src/postgres/contrib/pgaudit/pgaudit.c:1192
#22 0x00007f80981d5e7a in pgaudit_ProcessUtility_hook (pstmt=0x678ffdd3c60, queryString=0x678ffdd2120 "drop table idxpart;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x678ffdd3f50, completionTag=0x7fff79654030 "") at ../../../../../src/postgres/contrib/pgaudit/pgaudit.c:1220
#23 0x00007f80981c0b07 in pg_hint_plan_ProcessUtility (pstmt=<optimized out>, queryString=<optimized out>, context=<optimized out>, params=<optimized out>,
    queryEnv=<optimized out>, dest=<optimized out>, completionTag=0x7fff79654030 "") at ../../../../../src/postgres/third-party-extensions/pg_hint_plan/pg_hint_plan.c:3056
#24 0x0000000000a7e7a6 in YBTxnDdlProcessUtility (pstmt=0x678ffdd3c60, queryString=0x678ffdd2120 "drop table idxpart;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x678ffdd3f50, completionTag=0x7fff79654030 "") at ../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:1996
#25 0x00000000008fb71a in ProcessUtility (pstmt=pstmt@entry=0x678ffdd3c60, queryString=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=dest@entry=0x678ffdd3f50, completionTag=0x7fff79654030 "") at ../../../../../../src/postgres/src/backend/tcop/utility.c:377
#26 0x00000000008f72ec in PortalRunUtility (portal=portal@entry=0x678ff8f2120, pstmt=pstmt@entry=0x678ffdd3c60, isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x678ffdd3f50, completionTag=completionTag@entry=0x7fff79654030 "")
    at ../../../../../../src/postgres/src/backend/tcop/pquery.c:1206
#27 0x00000000008f7d6e in PortalRunMulti (portal=portal@entry=0x678ff8f2120, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x678ffdd3f50, altdest=altdest@entry=0x678ffdd3f50, completionTag=completionTag@entry=0x7fff79654030 "")
    at ../../../../../../src/postgres/src/backend/tcop/pquery.c:1366
#28 0x00000000008f8bde in PortalRun (portal=portal@entry=0x678ff8f2120, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
    dest=dest@entry=0x678ffdd3f50, altdest=altdest@entry=0x678ffdd3f50, completionTag=0x7fff79654030 "") at ../../../../../../src/postgres/src/backend/tcop/pquery.c:815
#29 0x00000000008f34eb in exec_simple_query (query_string=0x678ffdd2120 "drop table idxpart;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1220
#30 yb_exec_simple_query_impl (query_string=query_string@entry=0x678ffdd2120) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4767
#31 0x00000000008f075a in yb_exec_query_wrapper_one_attempt (exec_context=exec_context@entry=0x678ffdd2000, restart_data=restart_data@entry=0x7fff796542d0,
    functor=functor@entry=0x8f2f04 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x678ffdd2120, attempt=attempt@entry=0, retry=retry@entry=0x7fff7965428f)
    at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4735
#32 0x00000000008f1f74 in yb_exec_query_wrapper (exec_context=exec_context@entry=0x678ffdd2000, restart_data=restart_data@entry=0x7fff796542d0,
    functor=functor@entry=0x8f2f04 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x678ffdd2120)
    at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4759
#33 0x00000000008f1fc8 in yb_exec_simple_query (query_string=query_string@entry=0x678ffdd2120 "drop table idxpart;", exec_context=exec_context@entry=0x678ffdd2000)
    at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4782
#34 0x00000000008f49a7 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x678ffdc96d0, dbname=<optimized out>, username=<optimized out>)
    at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5407
#35 0x0000000000854a2b in BackendRun (port=0x678ff8643c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4674
#36 BackendStartup (port=0x678ff8643c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4312
#37 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775
#38 0x0000000000856eb3 in PostmasterMain (argc=argc@entry=21, argv=argv@entry=0x678ffc1c210) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431
#39 0x000000000079fb34 in PostgresServerProcessMain (argc=21, argv=0x678ffc1c210) at ../../../../../../src/postgres/src/backend/main/main.c:234
#40 0x000000000079fb54 in main ()
(gdb) frame 3
#3  0x0000000000a4311a in YbPgInheritsCacheDelete (entry=entry@entry=0x678ff8f1728) at ../../../../../../../src/postgres/src/backend/utils/cache/yb_inheritscache.c:317
317             Assert(entry->refcount == 1);
(gdb) p entry
$1 = (YbPgInheritsCacheEntry) 0x678ff8f1728
(gdb) p *entry
$2 = {parentOid = 16455, childTuples = 0x678ff2d9220, refcount = 2}

See #16408 (comment).

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@jasonyb jasonyb added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Nov 2, 2023
@yugabyte-ci yugabyte-ci added the priority/medium Medium priority issue label Nov 2, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Nov 14, 2023
jasonyb pushed a commit that referenced this issue Dec 7, 2023
Summary:
Test TestPgRegressIndex has been often failing since commit
465ee2c, titled

    [#18082] YSQL: Stop using ForeignScan for YB relations

That commit changes user-initiated system table requests such as

    SELECT distinct(current_version = last_breaking_version) from pg_yb_catalog_version;

to use YbSeqScan rather than ForeignScan.  A difference between the two
is that YbSeqScan does not set catalog_version in the request to DocDB.
The function ybcBeginScan, which is shared by both YbSeqScan and
internal system table scans, has logic to avoid setting catalog_version
for system tables.  This logic has been in place so that the internal
system table scans don't hit catalog version mismatch error (which is
actually less correct, but that is a discussion for another day).

Therefore, on server-side (in this case, master), the catalog version
check does not happen for this query, so catalog version mismatch is not
detected.  Then, the subsequent queries almost always run off an
outdated catalog because the true catalog version doesn't propagate fast
enough. The test relies on these queries to execute with an up-to-date
catalog, so this results in failure.

A simple fix for the test is to add sleeps so that the catalog version
can propagate.  This strategy is already being used by other tests, and
this situation could be treated as not much different.

Instead, bring back the old behavior of sending ysql_catalog_version in
user-initiated system table requests.  Do so by further requiring the
scans to be internally-generated in order to skip setting
catalog_version.  The IsSystemRelation condition is still needed because
internally-generated scans can scan user tables, such as in index build.

Add the assumptions about queries causing catalog version mismatch and
cache refresh in the test.

Leave issue #16408 open since the overall java test is still failing due
to issue #19807 among other reasons.

Jira: DB-8984

Test Plan:
TestPgRegressIndex is flaky for multiple reasons, such as

- issue #19807
- DropTable RPC timed out
- expired or aborted by a conflict: 40001
- Restart read required at
- could not serialize access due to concurrent update

Manually check the following:

    ./yb_build.sh fastdebug --gcc11 \
      --java-test TestPgRegressIndex -n 10

Backport-through: 2.20
Close: #20017

Reviewers: myang

Reviewed By: myang

Subscribers: kfranz, amartsinchyk, yql

Differential Revision: https://phorge.dev.yugabyte.com/D30412
jasonyb pushed a commit that referenced this issue Dec 8, 2023
Summary:
Test TestPgRegressIndex has been often failing since commit
465ee2c, titled

    [#18082] YSQL: Stop using ForeignScan for YB relations

That commit changes user-initiated system table requests such as

    SELECT distinct(current_version = last_breaking_version) from pg_yb_catalog_version;

to use YbSeqScan rather than ForeignScan.  A difference between the two
is that YbSeqScan does not set catalog_version in the request to DocDB.
The function ybcBeginScan, which is shared by both YbSeqScan and
internal system table scans, has logic to avoid setting catalog_version
for system tables.  This logic has been in place so that the internal
system table scans don't hit catalog version mismatch error (which is
actually less correct, but that is a discussion for another day).

Therefore, on server-side (in this case, master), the catalog version
check does not happen for this query, so catalog version mismatch is not
detected.  Then, the subsequent queries almost always run off an
outdated catalog because the true catalog version doesn't propagate fast
enough. The test relies on these queries to execute with an up-to-date
catalog, so this results in failure.

A simple fix for the test is to add sleeps so that the catalog version
can propagate.  This strategy is already being used by other tests, and
this situation could be treated as not much different.

Instead, bring back the old behavior of sending ysql_catalog_version in
user-initiated system table requests.  Do so by further requiring the
scans to be internally-generated in order to skip setting
catalog_version.  The IsSystemRelation condition is still needed because
internally-generated scans can scan user tables, such as in index build.

Add the assumptions about queries causing catalog version mismatch and
cache refresh in the test.

Leave issue #16408 open since the overall java test is still failing due
to issue #19807 among other reasons.

Jira: DB-8984

Test Plan:
TestPgRegressIndex is flaky for multiple reasons, such as

- issue #19807
- DropTable RPC timed out
- expired or aborted by a conflict: 40001
- Restart read required at
- could not serialize access due to concurrent update

Manually check the following:

    ./yb_build.sh fastdebug --gcc11 \
      --java-test TestPgRegressIndex -n 10

Backport-through: 2.20
Original commit: 1350cef / D30412

Reviewers: myang

Reviewed By: myang

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30851
deeps1991 pushed a commit that referenced this issue Dec 18, 2023
…nt == 1)

Summary:
Java test TestPgRegressIndex was failing because the assert mentioned
in the title was firing. This assert ensures that, when a cached entry
in the YbPgInheritsCache is being removed, there are no references
pointing to it at this point.

In normal scenarios, the above assumption is true, because every
code path accessing the cache carefully releases any references to
this cache at the end of its use.

However, in some cases, it is possible for a query to lookup the cache
and fail with error before it could release its reference to the cache.
In this case, the assertion fires and causes PG backend to crash.

To fix this issue, this cache should be cleaned up the same way the
relcache is cleaned. Relcache is associated with a ResourceOwner.
When a transaction or subtransaction ends, the resource owners
perform cleanup. At this point, if the resource owner sees objects with
unreleased references, they are released silently if the transaction was
aborted. However, a WARNING is printed if the transaction was committed
as we expect all resources to be cleaned up neatly in case of committed
transaction/subtransactions.

The same reasoning is applied for the YbPgInheritsCache, and is also
associated with a ResourceOwner ensuring that references here are
cleaned up in the same way

Since yb_pg_indexing test is used to validate the above change, added a
minor unrelated fix to the yb_pg_indexing.out file. The test was failing because
the out file did not contain the NOTICE that is printed when table rewrite
operations are carried out.
Jira: DB-8741

Test Plan:
./yb_build.sh --java-test TestPgRegressIndex : specifically yb_pg_indexing file
./yb_build.sh --java-test org.yb.pgsql.TestPgCacheConsistency#testPgInheritsCacheConsistency
./yb_build.sh --java-test org.yb.pgsql.TestPgRegressPartitions

Minimal test to reproduce this issue (fetched from yb_pg_indexing)
```
create table idxpart (a int, b int, PRIMARY KEY(a,b)) partition by range (a, b);
create table idxpart1 partition of idxpart for values from (0, 0) to (10, 10);
create index idxpart_a_b_idx on only idxpart (a, b);
create index nonconcurrently idxpart1_a_b_idx on idxpart1 (a, b);
alter index idxpart_a_b_idx attach partition idxpart1_a_b_idx;
create index nonconcurrently idxpart1_2_a_b on idxpart1 (a, b);

-- This statement fails due to duplicate index already attached to the
-- partitioned index. The failure occurs when a reference to the parent
-- index cache entry is taken but not yet released.
alter index idxpart_a_b_idx attach partition idxpart1_2_a_b;

-- Without this fix, this statement was failing as we try to invalidate
-- the cache entry and notice that the refcount != 1. With the fix, this
-- statement completes successfully
drop table idxpart;
```

Reviewers: dmitry

Reviewed By: dmitry

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D30004
@github-project-automation github-project-automation bot moved this to Done in YSQL Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue
Projects
Status: Done
Development

No branches or pull requests

3 participants