[DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data #17342

amitanandaiyer · 2023-05-15T01:47:49Z

Jira Link: DB-6543

Description

Master stale reads could also happen after fixing the RequestScope in issue #14926

ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498386 1860923392 transaction.cc:1148] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Commit, seal_only: 0, tablets: [{00000000000000000000000000000000, {          num_batches: 2 num_completed_batches: 6 has_metadata: 1 }}], status: OK
 ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498435 1860923392 transaction.cc:1248] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Transaction wasn't promoted, not sending abort to old status tablet
 ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498775 1859776512 transaction.cc:1343] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Committed: OK
 ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498783 1859776512 transaction.cc:1362] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Commit done: OK
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499019 1843458048 ysql_backends_manager.cc:107] WaitForYsqlBackendsCatalogVersion: database_oid: 13297 catalog_version: 323
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499033 1843458048 sys_catalog.cc:907] Start ReadYsqlDBCatalogVersionImpl
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499042 1843458048 tablet.cc:1230] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Created new Iterator reading at { read: { physical:             1684114861498432 } local_limit: { physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499100 1843458048 running_transaction.cc:134] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7: Existing    status knowledge (CREATED, <min>) does not satisfy requested: { id: bc216574-24d0-4e03-9645-f9762465dea7 read_ht: { physical: 1684114861498432 } global_limit_ht: { physical: 1684114861498432 } serial_no: 0 reason: get commit    time flags: [kCleanup]}, sending: 10955
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499215 1827975168 running_transaction.cc:302] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7:             DoStatusReceived(OK, status: COMMITTED status_hybrid_time: 6898134472697745408 propagated_hybrid_time: 6898134472700612608 aborted_subtxn_set { }, 10955)
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499238 1843458048 transaction_status_cache.cc:163] Transaction_id bc216574-24d0-4e03-9645-f9762465dea7 at { read: { physical: 1684114861498432 } local_limit: {          physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }: status: PENDING, status_time: { physical: 1684114861498473 }
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499258 1843458048 sys_catalog.cc:1000] done ReadYsqlDBCatalogVersionImpl returning 322
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499265 1843458048 transaction_participant.cc:464] Tablet peer checkpoint is expired with the current time: 254465 expiration time: 268838 checkpoint op_id: -1.-1
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499269 1843458048 transaction_participant.cc:464] Tablet peer checkpoint is expired with the current time: 254465 expiration time: 268838 checkpoint op_id: -1.-1
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499271 1843458048 catalog_manager.cc:9914] GetYsqlDBCatalogVersion version 322

seems like transaction_status_cache is returning PENDING for a transaction that was committed by the client/pg.

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

The text was updated successfully, but these errors were encountered:

amitanandaiyer · 2023-05-15T01:50:56Z

declare -x YB_EXTRA_MASTER_FLAGS="--vmodule=sys_catalog=2,remove_intents_task=2,*transaction*=4 --log_ysql_catalog_versions=true --TEST_docdb_log_write_batches=true"
declare -x YB_EXTRA_TSERVER_FLAGS="--vmodule=transaction=4"

ybd release --java-test 'org.yb.pgsql.TestPgRegressAuthorization#testPgRegressAuthorization'  -n 100 --tp 1

amitanandaiyer · 2023-05-15T02:26:14Z

m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499100 1843458048 running_transaction.cc:134] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7: Existing    status knowledge (CREATED, <min>) does not satisfy requested: { id: bc216574-24d0-4e03-9645-f9762465dea7 read_ht: { physical: 1684114861498432 } global_limit_ht: { physical: 1684114861498432 } serial_no: 0 reason: get commit    time flags: [kCleanup]}, sending: 10955



 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499215 1827975168 running_transaction.cc:302] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7:             DoStatusReceived(OK, status: COMMITTED status_hybrid_time: 6898134472697745408 propagated_hybrid_time: 6898134472700612608 aborted_subtxn_set { }, 10955)

 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499238 1843458048 transaction_status_cache.cc:163] Transaction_id bc216574-24d0-4e03-9645-f9762465dea7 at { read: { physical: 1684114861498432 } local_limit: {          physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }: status: PENDING, status_time: { physical: 1684114861498473 }

 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499341 1824534528 transaction_participant.cc:676] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Apply: { leader_term: 1 transaction_id:         bc216574-24d0-4e03-9645-f9762465dea7 op_id: 1.2720 commit_ht: { physical: 1684114861498473 } log_ht: { physical: 1684114861498867 } sealed: 0 status_tablet: 86547f7b3809402e8919eef0affb9466 apply_state: <NULL> is_external: 0 }

 compare: read-time vs commit_ht
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499042 1843458048 tablet.cc:1230] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Created new Iterator reading at
 { read: {
     physical: 1684114861498432 } local_limit: { physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499341 1824534528 transaction_participant.cc:676] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Apply: { leader_term: 1 transaction_id:         bc216574-24d0-4e03-9645-f9762465dea7 op_id: 1.2720
 commit_ht: {
     physical: 1684114861498473 } log_ht: { physical: 1684114861498867 } sealed: 0 status_tablet: 86547f7b3809402e8919eef0affb9466 apply_state: <NULL> is_external: 0 }


     looks like indeed commit_ht > read-time.

     was the read-time chosen in the past? instead of choosing something >= now?
     ```

bmatican · 2023-05-15T04:03:01Z

@amitanandaiyer I think in the last debugging I did on this, indeed, the iterator would choose a time slightly in the past, due to some safetime logic / iterator creation method argument

bmatican · 2023-05-15T04:25:58Z

From Sergei on Slack previously (https://yugabyte.slack.com/archives/C03ULJYE0KG/p1663614223304439?thread_ts=1663356281.778909&cid=C03ULJYE0KG)

I think the issue that we use single time here:
      : ReadHybridTime::SingleTime(VERIFY_RESULT(SafeTime(RequireLease::kFalse)));

we should use read time range and restart read when necessary

amitanandaiyer · 2023-05-15T07:21:36Z

Yes. you're right. This is the same issue you/sergei have been commenting on.

SysCatalog/CatalogMaster doesn't specify the read time, so this internally gets translated to
: ReadHybridTime::SingleTime(VERIFY_RESULT(SafeTime(RequireLease::kFalse)));

https://yugabyte.slack.com/archives/C03ULJYE0KG/p1663614223304439?thread_ts=1663356281.778909&cid=C03ULJYE0KG

However, IMO, the issue is not with using SingleTime, but with getting the safe time with RequireLease::kFalse -- this essentially is a "read from follower" like semantics where we may not see the latest update.

using RequireLease::kTrue should fix the issue.

I also see a few places in SysCatalog use ReadHybridTime::Max() .. this might work too, in terms of not missing out on the latest write -- however, it seems risky to do this as I think the iterator may see other "future" writes and not get a consistent view. Probably worth revisiting its usage to see if we are ok with this?

The reason these "fixes" didn't quite work when you tried is because of the RequestScope issue, which also contributes to stale reads from master.

rthallamko3 · 2023-06-15T03:43:49Z

@deeps1991, Do you think this should be backported? If so upto what stable branch? cc @amitanandaiyer

deeps1991 · 2023-06-15T16:00:40Z

Based on #5030 the issue of stale master reads has been affecting us ever since we had online index backfill. I believe we've had online index backfill open for customers for a very long time, since 2.2. So I would recommend back porting to all stable branches that are open.
cc @jaki

…pecify the ReadHybridTime Summary: Syscatalog currently does not specify a read_hybrid_time when reading the catalog version. This causes the docdb layer to pick up a read-hybrid time without a lease -- which is ok for consistent prefix/stale-reads, however it may not be the most up to date. This can cause stale data to be returned while reading the sys catalog, and may cause issues like in GH 5030. ``` docdb::ReadOperationData read_operation_data = { .deadline = deadline, .read_time = read_hybrid_time ? read_hybrid_time : ReadHybridTime::SingleTime(VERIFY_RESULT(SafeTime(RequireLease::kFalse))), }; ``` Ideally, we'd want the pg layer to specify the read-hybrid-timestamp at which the read needs to be done. Choosing the read-hybrid-time based on SingleTime(SafeTime) at the master, even if done with a lease, could cause a stale read. A transaction that has already been committed on a different node/status-tablet with a future timestamp, could go unseen. An alternative to this is to use ReadHybridTime::kMax, however this may expose future writes and give an inconsistent view of the system. This should generally be avoided. This issue is avoided in the DML case as the client specifies the read-hybrid-time accounting for clock skew `NowWithRange` and the reads are restarted. This isn't the case with sys-catalog reads as the read-timestamp is chosen at the master itself, and there are no read restarts being done. Choosing the "current time at master" as the read time has an additional complication that it may not yet be safe to read at -- thus requiring a wait for the safe-time to catch up. This can delay the read(s) by up to a heart-beat interval (500ms) if there are no active writes. -- This can be alleviated by choosing the current SafeTime as the read time, which should avoid the wait in the common case. However, if there are future intents for a committed transaction within the clock-skew window, we may still have to wait. Thus to summarize: (1) ReadHybridTime() => SafeTime(kFalse) => Consistent but possibly stale results. Safest to use and should be preferred if the clients are ok with possibly stale reads. (2) Use RHT::Max() => Possibly inconsistent if there are concurrent writes. However, should be ok if there are no concurrent writes. (3) Use NowWithRange() => Consistent and safe to use, however may cause the rpc to wait. Also, callers need to handle read-restarts. (4) Use SafeTimeWithRange => Consistent and safe to use. No waits in the common case. Callers need to handle read-restarts. Can lead to waiting for safe time in the uncommon case. Given that PG ensures that until DDL transaction verification is completed, no other DDL can occur on the table, we this diff updates them to use (4) to read the latest written values in the context of catalog reads which need the latest value.. The following clients require up-to-date state of PG catalog and have been switched to do reads with restarts: 1. YSQL catalog version reads 2. ysql_transaction_ddl These clients do not require the absolute latest state, but need consistent reads. 1. CDC 2. Tablespace refresh task 3. Backup-restore that tries to determine the tablegroup version. They have been updated to explicitly specify ReadHybridTime() as the timestamp used. This shouldn't result in any behavioral change for such use cases, as this is the same timestamp that was used prior to this diff. Test Plan: ybd release --cxx-test pg_backends-test --gtest_filter PgBackendsTest.Stress --tp 1 -n 10 Reviewers: dsrinivasan, jason, pjain, sergei, rsami, rthallam, kpopali Reviewed By: dsrinivasan, sergei Subscribers: hsunder, bogdan, jason, ybase Differential Revision: https://phorge.dev.yugabyte.com/D25342

… from ReadYsqlDBCatalogVersionImpl Summary: b7fadbe / D25342 introduced updating RestartReadHt to prevent stale reads. However, this wasn't being updated when control exits early from ReadYsqlDBCatalogVersionImpl. Update the code path to ensure that RestartReadHt is updated. Jira: DB-6543, DB-7433 Test Plan: Jenkins Reviewers: dsrinivasan Reviewed By: dsrinivasan Subscribers: bogdan, ybase Differential Revision: https://phorge.dev.yugabyte.com/D27366

amitanandaiyer · 2023-08-01T01:45:10Z

a1729c3 is not present in 2.18 and prior branches.

Based on chatting with @deeps1991 and @rthallamko3, we shouldn't want to backport this diff to older branches.

amitanandaiyer added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels May 15, 2023

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels May 15, 2023

amitanandaiyer self-assigned this May 15, 2023

amitanandaiyer changed the title ~~[DocDB] Master stale reads~~ [DocDB] Master stale reads: Use RequireLease::kTrue to ensure that SysCatalog reads the latest state May 15, 2023

yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label May 15, 2023

amitanandaiyer changed the title ~~[DocDB] Master stale reads: Use RequireLease::kTrue to ensure that SysCatalog reads the latest state~~ [DocDB] Master stale reads: Not specifying read-hybrid-time may cause SysCatalog to read stale data May 17, 2023

amitanandaiyer changed the title ~~[DocDB] Master stale reads: Not specifying read-hybrid-time may cause SysCatalog to read stale data~~ [DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data May 17, 2023

jasonyb mentioned this issue Jul 6, 2023

[YSQL][Colocation]: NOTICE: Retrying wait for backends catalog version: Requested catalog version is too high: req version 158, master version 157 #18112

Closed

1 task

m-iancu mentioned this issue Jul 13, 2023

[YSQL] flaky test: org.yb.pgsql.TestPgRegressIndex.testPgRegressIndex #16408

Closed

1 task

rthallamko3 added 2.14 Backport Required 2.16 Backport Required 2.18 Backport Required labels Jul 17, 2023

amitanandaiyer mentioned this issue Jul 28, 2023

[YSQL] Ensure that RestartReadHt is updated when control exits early in ReadYsqlDBCatalogVersionImpl #18456

Closed

1 task

amitanandaiyer closed this as completed Aug 1, 2023

amitanandaiyer removed 2.14 Backport Required 2.16 Backport Required 2.18 Backport Required labels Aug 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data #17342

[DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data #17342

amitanandaiyer commented May 15, 2023 •

edited by jira bot

Loading

amitanandaiyer commented May 15, 2023

amitanandaiyer commented May 15, 2023

bmatican commented May 15, 2023

bmatican commented May 15, 2023

amitanandaiyer commented May 15, 2023 •

edited

Loading

rthallamko3 commented Jun 15, 2023

deeps1991 commented Jun 15, 2023

amitanandaiyer commented Aug 1, 2023

[DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data #17342

[DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data #17342

Comments

amitanandaiyer commented May 15, 2023 • edited by jira bot Loading

Description

Warning: Please confirm that this issue does not contain any sensitive information

amitanandaiyer commented May 15, 2023

amitanandaiyer commented May 15, 2023

bmatican commented May 15, 2023

bmatican commented May 15, 2023

amitanandaiyer commented May 15, 2023 • edited Loading

rthallamko3 commented Jun 15, 2023

deeps1991 commented Jun 15, 2023

amitanandaiyer commented Aug 1, 2023

amitanandaiyer commented May 15, 2023 •

edited by jira bot

Loading

amitanandaiyer commented May 15, 2023 •

edited

Loading