Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data #17342

Closed
1 task done
amitanandaiyer opened this issue May 15, 2023 · 8 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@amitanandaiyer
Copy link
Contributor

amitanandaiyer commented May 15, 2023

Jira Link: DB-6543

Description

Master stale reads could also happen after fixing the RequestScope in issue #14926

ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498386 1860923392 transaction.cc:1148] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Commit, seal_only: 0, tablets: [{00000000000000000000000000000000, {          num_batches: 2 num_completed_batches: 6 has_metadata: 1 }}], status: OK
 ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498435 1860923392 transaction.cc:1248] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Transaction wasn't promoted, not sending abort to old status tablet
 ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498775 1859776512 transaction.cc:1343] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Committed: OK
 ts1|pid63586|:14506|http://127.0.0.205:27627 I0515 09:41:01.498783 1859776512 transaction.cc:1362] bc216574-24d0-4e03-9645-f9762465dea7 [Session 24]: Commit done: OK
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499019 1843458048 ysql_backends_manager.cc:107] WaitForYsqlBackendsCatalogVersion: database_oid: 13297 catalog_version: 323
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499033 1843458048 sys_catalog.cc:907] Start ReadYsqlDBCatalogVersionImpl
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499042 1843458048 tablet.cc:1230] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Created new Iterator reading at { read: { physical:             1684114861498432 } local_limit: { physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499100 1843458048 running_transaction.cc:134] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7: Existing    status knowledge (CREATED, <min>) does not satisfy requested: { id: bc216574-24d0-4e03-9645-f9762465dea7 read_ht: { physical: 1684114861498432 } global_limit_ht: { physical: 1684114861498432 } serial_no: 0 reason: get commit    time flags: [kCleanup]}, sending: 10955
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499215 1827975168 running_transaction.cc:302] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7:             DoStatusReceived(OK, status: COMMITTED status_hybrid_time: 6898134472697745408 propagated_hybrid_time: 6898134472700612608 aborted_subtxn_set { }, 10955)
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499238 1843458048 transaction_status_cache.cc:163] Transaction_id bc216574-24d0-4e03-9645-f9762465dea7 at { read: { physical: 1684114861498432 } local_limit: {          physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }: status: PENDING, status_time: { physical: 1684114861498473 }
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499258 1843458048 sys_catalog.cc:1000] done ReadYsqlDBCatalogVersionImpl returning 322
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499265 1843458048 transaction_participant.cc:464] Tablet peer checkpoint is expired with the current time: 254465 expiration time: 268838 checkpoint op_id: -1.-1
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499269 1843458048 transaction_participant.cc:464] Tablet peer checkpoint is expired with the current time: 254465 expiration time: 268838 checkpoint op_id: -1.-1
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499271 1843458048 catalog_manager.cc:9914] GetYsqlDBCatalogVersion version 322

seems like transaction_status_cache is returning PENDING for a transaction that was committed by the client/pg.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@amitanandaiyer amitanandaiyer added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels May 15, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels May 15, 2023
@amitanandaiyer amitanandaiyer self-assigned this May 15, 2023
@amitanandaiyer
Copy link
Contributor Author

declare -x YB_EXTRA_MASTER_FLAGS="--vmodule=sys_catalog=2,remove_intents_task=2,*transaction*=4 --log_ysql_catalog_versions=true --TEST_docdb_log_write_batches=true"
declare -x YB_EXTRA_TSERVER_FLAGS="--vmodule=transaction=4"

ybd release --java-test 'org.yb.pgsql.TestPgRegressAuthorization#testPgRegressAuthorization'  -n 100 --tp 1

@amitanandaiyer
Copy link
Contributor Author

m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499100 1843458048 running_transaction.cc:134] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7: Existing    status knowledge (CREATED, <min>) does not satisfy requested: { id: bc216574-24d0-4e03-9645-f9762465dea7 read_ht: { physical: 1684114861498432 } global_limit_ht: { physical: 1684114861498432 } serial_no: 0 reason: get commit    time flags: [kCleanup]}, sending: 10955



 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499215 1827975168 running_transaction.cc:302] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3 ID bc216574-24d0-4e03-9645-f9762465dea7:             DoStatusReceived(OK, status: COMMITTED status_hybrid_time: 6898134472697745408 propagated_hybrid_time: 6898134472700612608 aborted_subtxn_set { }, 10955)

 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499238 1843458048 transaction_status_cache.cc:163] Transaction_id bc216574-24d0-4e03-9645-f9762465dea7 at { read: { physical: 1684114861498432 } local_limit: {          physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }: status: PENDING, status_time: { physical: 1684114861498473 }

 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499341 1824534528 transaction_participant.cc:676] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Apply: { leader_term: 1 transaction_id:         bc216574-24d0-4e03-9645-f9762465dea7 op_id: 1.2720 commit_ht: { physical: 1684114861498473 } log_ht: { physical: 1684114861498867 } sealed: 0 status_tablet: 86547f7b3809402e8919eef0affb9466 apply_state: <NULL> is_external: 0 }

 compare: read-time vs commit_ht
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499042 1843458048 tablet.cc:1230] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Created new Iterator reading at
 { read: {
     physical: 1684114861498432 } local_limit: { physical: 1684114861498432 } global_limit: { physical: 1684114861498432 } in_txn_limit: <max> serial_no: 0 }
 m3|pid63573|:32151|http://127.0.2.230:18973 I0515 09:41:01.499341 1824534528 transaction_participant.cc:676] T 00000000000000000000000000000000 P b029c749f7f84bdaac51ef0114eefbe3: Apply: { leader_term: 1 transaction_id:         bc216574-24d0-4e03-9645-f9762465dea7 op_id: 1.2720
 commit_ht: {
     physical: 1684114861498473 } log_ht: { physical: 1684114861498867 } sealed: 0 status_tablet: 86547f7b3809402e8919eef0affb9466 apply_state: <NULL> is_external: 0 }


     looks like indeed commit_ht > read-time.

     was the read-time chosen in the past? instead of choosing something >= now?
     ```

@bmatican
Copy link
Contributor

@amitanandaiyer I think in the last debugging I did on this, indeed, the iterator would choose a time slightly in the past, due to some safetime logic / iterator creation method argument

@bmatican
Copy link
Contributor

From Sergei on Slack previously (https://yugabyte.slack.com/archives/C03ULJYE0KG/p1663614223304439?thread_ts=1663356281.778909&cid=C03ULJYE0KG)

I think the issue that we use single time here:
      : ReadHybridTime::SingleTime(VERIFY_RESULT(SafeTime(RequireLease::kFalse)));

we should use read time range and restart read when necessary

@amitanandaiyer amitanandaiyer changed the title [DocDB] Master stale reads [DocDB] Master stale reads: Use RequireLease::kTrue to ensure that SysCatalog reads the latest state May 15, 2023
@amitanandaiyer
Copy link
Contributor Author

amitanandaiyer commented May 15, 2023

Yes. you're right. This is the same issue you/sergei have been commenting on.

SysCatalog/CatalogMaster doesn't specify the read time, so this internally gets translated to
: ReadHybridTime::SingleTime(VERIFY_RESULT(SafeTime(RequireLease::kFalse)));

https://yugabyte.slack.com/archives/C03ULJYE0KG/p1663614223304439?thread_ts=1663356281.778909&cid=C03ULJYE0KG

However, IMO, the issue is not with using SingleTime, but with getting the safe time with RequireLease::kFalse -- this essentially is a "read from follower" like semantics where we may not see the latest update.

using RequireLease::kTrue should fix the issue.

I also see a few places in SysCatalog use ReadHybridTime::Max() .. this might work too, in terms of not missing out on the latest write -- however, it seems risky to do this as I think the iterator may see other "future" writes and not get a consistent view. Probably worth revisiting its usage to see if we are ok with this?

The reason these "fixes" didn't quite work when you tried is because of the RequestScope issue, which also contributes to stale reads from master.

@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label May 15, 2023
@amitanandaiyer amitanandaiyer changed the title [DocDB] Master stale reads: Use RequireLease::kTrue to ensure that SysCatalog reads the latest state [DocDB] Master stale reads: Not specifying read-hybrid-time may cause SysCatalog to read stale data May 17, 2023
@amitanandaiyer amitanandaiyer changed the title [DocDB] Master stale reads: Not specifying read-hybrid-time may cause SysCatalog to read stale data [DocDB] Master stale reads: Not specifying read-hybrid-time causes SysCatalog to read stale data May 17, 2023
@rthallamko3
Copy link
Contributor

@deeps1991, Do you think this should be backported? If so upto what stable branch? cc @amitanandaiyer

@deeps1991
Copy link
Contributor

Based on #5030 the issue of stale master reads has been affecting us ever since we had online index backfill. I believe we've had online index backfill open for customers for a very long time, since 2.2. So I would recommend back porting to all stable branches that are open.
cc @jaki

amitanandaiyer added a commit that referenced this issue Jul 17, 2023
…pecify the ReadHybridTime

Summary:
Syscatalog currently does not specify a read_hybrid_time when reading the catalog version.

This causes the docdb layer to pick up a read-hybrid time without a lease -- which is ok for
consistent prefix/stale-reads, however it may not be the most up to date.
This can cause stale data to be returned while reading the sys catalog, and may cause issues like in GH 5030.

```
 docdb::ReadOperationData read_operation_data = {
    .deadline = deadline,
    .read_time = read_hybrid_time
        ? read_hybrid_time
        : ReadHybridTime::SingleTime(VERIFY_RESULT(SafeTime(RequireLease::kFalse))),
  };
```

Ideally, we'd want the pg layer to specify the read-hybrid-timestamp at which the read needs to be done.
Choosing the read-hybrid-time based on SingleTime(SafeTime) at the master, even if done with a lease, could cause a stale read. A transaction that has already been committed on a different node/status-tablet with a future timestamp, could go unseen.

An alternative to this is to use ReadHybridTime::kMax, however this may expose future writes and give an inconsistent
view of the system. This should generally be avoided.

This issue is avoided in the DML case as the client specifies the read-hybrid-time accounting for clock skew `NowWithRange` and the reads are restarted. This isn't the case with sys-catalog reads as the read-timestamp is chosen at the master itself, and there are no read restarts being done.

Choosing the "current time at master" as the read time has an additional complication that it may not yet be safe to read at -- thus requiring a wait for the safe-time to catch up. This can delay the read(s) by up to a heart-beat interval (500ms) if there are no active writes.  -- This can be alleviated by choosing the current SafeTime as the read time, which should avoid the wait in the common case. However, if there are future intents for a committed transaction within the clock-skew
window, we may still have to wait.

 Thus to summarize:
   (1) ReadHybridTime() => SafeTime(kFalse) => Consistent but possibly stale results.
     Safest to use and should be preferred if the clients are ok with possibly stale reads.
   (2) Use RHT::Max() => Possibly inconsistent if there are concurrent writes. However, should be ok if there are no concurrent writes.
   (3) Use NowWithRange() => Consistent and safe to use, however may cause the rpc to wait. Also, callers need to handle read-restarts.
   (4) Use SafeTimeWithRange => Consistent and safe to use. No waits in the common case. Callers need to handle read-restarts. Can lead to waiting for safe time in the uncommon case.

Given that PG ensures that until DDL transaction verification is completed, no other DDL can occur on
the table, we this diff updates them to use (4) to read the latest written values in the context of catalog reads which need the latest value..

The following clients require up-to-date state of PG catalog and have been switched to do reads with restarts:
1. YSQL catalog version reads
2. ysql_transaction_ddl

These clients do not require the absolute latest state, but need consistent reads.
1. CDC
2. Tablespace refresh task
3. Backup-restore that tries to determine the tablegroup version.
They have been updated to explicitly specify ReadHybridTime() as the timestamp used. This shouldn't result in any behavioral change for such use cases, as this is the same timestamp that was used prior to this diff.

Test Plan: ybd release --cxx-test pg_backends-test --gtest_filter PgBackendsTest.Stress --tp 1 -n 10

Reviewers: dsrinivasan, jason, pjain, sergei, rsami, rthallam, kpopali

Reviewed By: dsrinivasan, sergei

Subscribers: hsunder, bogdan, jason, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D25342
amitanandaiyer added a commit that referenced this issue Jul 31, 2023
… from ReadYsqlDBCatalogVersionImpl

Summary:
b7fadbe / D25342 introduced updating RestartReadHt to prevent
stale reads. However, this wasn't being updated when control exits early from ReadYsqlDBCatalogVersionImpl.

Update the code path to ensure that RestartReadHt is updated.
Jira: DB-6543, DB-7433

Test Plan: Jenkins

Reviewers: dsrinivasan

Reviewed By: dsrinivasan

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27366
@amitanandaiyer
Copy link
Contributor Author

a1729c3 is not present in 2.18 and prior branches.

Based on chatting with @deeps1991 and @rthallamko3, we shouldn't want to backport this diff to older branches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

5 participants