
[YSQL] [Index Backfill] Committing a pg_index row is not immediately visible to yb-master #5030

Open
frozenspider opened this issue Jul 9, 2020 · 6 comments
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue

@frozenspider
Contributor

frozenspider commented Jul 9, 2020

Jira Link: DB-2503
During the YSQL index backfill process, in indexcmds.c we do the following:

// DefineIndex
...
elog(LOG, "committing pg_index tuple with indislive=true");
CommitTransactionCommand();
YBDecrementDdlNestingLevel(true /* success */);
YBIncrementDdlNestingLevel();
StartTransactionCommand();

HandleYBStatus(YBCPgAsyncUpdateIndexPermissions(MyDatabaseId, relationId));

elog(LOG, "waiting for YB_INDEX_PERM_WRITE_AND_DELETE");
HandleYBStatus(YBCPgWaitUntilIndexPermissionsAtLeast(MyDatabaseId,
                                                     relationId,
                                                     indexRelationId,
                                                     YB_INDEX_PERM_WRITE_AND_DELETE,
                                                     &actual_index_permissions));
...

We expect YBDecrementDdlNestingLevel to make the pg_index row (with indislive=true) visible to the master for YBCPgAsyncUpdateIndexPermissions, but in the logs we see:

ts1|pid95670|:17795|http://127.0.0.52:18103 2020-07-09 22:09:47.135 +05 [95726] LOG:  committing pg_index tuple with indislive=true
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.139787 27136000 catalog_manager.cc:3607] Servicing AlterTable request from 127.0.0.52:57697: table { table_id: "000030a9000030008000000000004001" } force_send_alter_request: true
ts1|pid95670|:17795|http://127.0.0.52:18103 2020-07-09 22:09:47.140 +05 [95726] LOG:  waiting for YB_INDEX_PERM_WRITE_AND_DELETE
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.141741 36257792 backfill_index.cc:448] Locking indexed table
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.143888 36257792 backfill_index.cc:221] Index 000030a900003000800000000000400b is not yet live, skipping permission update
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.143927 36257792 backfill_index.cc:336] Index permissions update skipped, leaving schema_version at 1

(i.e. ShouldProceedWithPgsqlIndexPermissionUpdate in backfill_index.cc did not find the row in pg_index)
To reproduce, enable YSQL backfill and run

ybd --java-test 'org.yb.pgsql.TestPgSelect#testNullPushdown' --extra-daemon-flags "--vmodule=backfill_index=3"

(logs)

UPD:
The same issue occurs with TestPgRegressMisc (logs), TestPgRegressPlanner (logs), and many other TestPgRegress* tests

@frozenspider frozenspider added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) labels Jul 9, 2020
@frozenspider frozenspider changed the title [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible on master [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible on yb-master Jul 9, 2020
@frozenspider frozenspider changed the title [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible on yb-master [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible to yb-master Jul 9, 2020
@m-iancu
Contributor

m-iancu commented Jul 9, 2020

For context, the pg_index read during the backfill permissions check is done in this function: https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/master/backfill_index.cc#L154.

frozenspider added a commit that referenced this issue Jul 10, 2020
Summary:
`org.yb.pgsql.TestIndexBackfill` test has been failing for the following reasons:
* Error message it was expecting to catch was changed in 8246ca4#diff-35b7aae825e891e35b810e7bc838ffe1L159
* For the "schema version mismatch" transparent restart, when aborting a transaction we weren't cleaning up `debug_query_string`, which was left pointing at the query allocated in the destroyed memory context. It was only briefly stale, but it was subsequently used when logging, and ASAN caught it

It might still sometimes fail in ASAN though:
* Because of #5030
* Weird case when test succeeds but our test framework reports it as a failure, possibly because of stray `postgres` processes left over after the test
(Resolves #5044)

Test Plan:
ybd --java-test org.yb.pgsql.TestIndexBackfill
(and ASAN as well)

Reviewers: jason, mihnea

Reviewed By: mihnea

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D8878
frozenspider added a commit that referenced this issue Jul 10, 2020
(Same summary, test plan, and review metadata as the commit above.)
(cherry picked from commit 91733d4)
@ndeodhar ndeodhar assigned jaki and unassigned mbautin, frozenspider and m-iancu Aug 5, 2020
@jaki
Contributor

jaki commented Aug 13, 2020

According to BasePgSQLTest.java, the replication factor for those tests is 3. It would be nice to know whether this issue occurs with rf 1 as well.

@jaki
Contributor

jaki commented Aug 14, 2020

I confirm that this happens even with rf 1.

@jaki
Contributor

jaki commented Aug 17, 2020

I think I've narrowed down an explanation of what's going on. It looks like we commit the pg_index row but may not apply it before the master reads it. Here's a concrete example:

Apply this patch on commit 0542f16:

diff --git i/src/yb/master/backfill_index.cc w/src/yb/master/backfill_index.cc
index d63999fe4..89d30db59 100644
--- i/src/yb/master/backfill_index.cc
+++ w/src/yb/master/backfill_index.cc
@@ -155,6 +155,7 @@ Result<bool> GetPgIndexStatus(
     CatalogManager* catalog_manager,
     const TableId& idx_id,
     const std::string& status_col_name) {
+  LOG(WARNING) << __func__ << " ENTR";
   const auto pg_index_id =
       GetPgsqlTableId(VERIFY_RESULT(GetPgsqlDatabaseOid(idx_id)), kPgIndexTableOid);
 
@@ -199,10 +200,12 @@ Result<bool> GetPgIndexStatus(
   QLTableRow row;
   if (VERIFY_RESULT(iter->HasNext())) {
     RETURN_NOT_OK(iter->NextRow(&row));
+    LOG(WARNING) << __func__ << " EXIT";
     return row.GetColumn(status_col_id)->bool_value();
   }
 
   // For practical purposes, an absent index is the same as having false status column value.
+  LOG(WARNING) << __func__ << " EXIT";
   return false;
 }
 
@@ -217,6 +220,12 @@ Result<bool> ShouldProceedWithPgsqlIndexPermissionUpdate(
   switch (new_perm) {
     case INDEX_PERM_WRITE_AND_DELETE: {
       auto live = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indislive"));
+      if (!live && catalog_manager->doaretry) {
+        LOG(WARNING) << "TRY AGAIN";
+        SleepFor(MonoDelta::FromMilliseconds(5000));
+        LOG(WARNING) << "DONE SLEEP";
+        live = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indislive"));
+      }
       if (!live) {
         VLOG(1) << "Index " << idx_id << " is not yet live, skipping permission update";
       }
@@ -224,6 +233,12 @@ Result<bool> ShouldProceedWithPgsqlIndexPermissionUpdate(
     }
     case INDEX_PERM_DO_BACKFILL: {
       auto ready = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indisready"));
+      if (!ready && catalog_manager->doaretry) {
+        LOG(WARNING) << "TRY AGAIN R";
+        SleepFor(MonoDelta::FromMilliseconds(5000));
+        LOG(WARNING) << "DONE SLEEP R";
+        ready = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indisready"));
+      }
       if (!ready) {
         VLOG(1) << "Index " << idx_id << " is not yet ready, skipping permission update";
       }
@@ -911,7 +926,7 @@ BackfillTablet::BackfillTablet(
   } else if (done()) {
     VLOG(1) << tablet_->ToString() << " backfill already done";
   } else {
-    VLOG(1) << tablet_->ToString() << " begining backfill from "
+    VLOG(1) << tablet_->ToString() << " beginning backfill from "
             << "<start-of-the-tablet>";
   }
 }
diff --git i/src/yb/master/catalog_manager.cc w/src/yb/master/catalog_manager.cc
index b3b1dd0f6..3602b53e2 100644
--- i/src/yb/master/catalog_manager.cc
+++ w/src/yb/master/catalog_manager.cc
@@ -6057,10 +6057,12 @@ Result<uint64_t> CatalogManager::IncrementYsqlCatalogVersion() {
   auto l = CHECK_NOTNULL(ysql_catalog_config_.get())->LockForWrite();
   uint64_t new_version = l->data().pb.ysql_catalog_config().version() + 1;
   l->mutable_data()->pb.mutable_ysql_catalog_config()->set_version(new_version);
+  LOG(WARNING) << "Incremented to " << new_version;
 
   // Write to sys_catalog and in memory.
   RETURN_NOT_OK(sys_catalog_->UpdateItem(ysql_catalog_config_.get(), leader_ready_term()));
   l->Commit();
+  LOG(WARNING) << "COMMITTT";
 
   return new_version;
 }
@@ -6378,6 +6380,7 @@ Status CatalogManager::StartRemoteBootstrap(const StartRemoteBootstrapRequestPB&
 
 void CatalogManager::SendAlterTableRequest(const scoped_refptr<TableInfo>& table,
                                            const AlterTableRequestPB* req) {
+  if (req == nullptr) { doaretry = false; } else { doaretry = true; }
   vector<scoped_refptr<TabletInfo>> tablets;
   table->GetAllTablets(&tablets);
 
diff --git i/src/yb/master/catalog_manager.h w/src/yb/master/catalog_manager.h
index c4424fa6d..e14b46f01 100644
--- i/src/yb/master/catalog_manager.h
+++ w/src/yb/master/catalog_manager.h
@@ -164,6 +164,7 @@ class CatalogManager : public tserver::TabletPeerLookupIf {
   };
 
  public:
+  volatile bool doaretry;
   // Some code refers to ScopedLeaderSharedLock as CatalogManager::ScopedLeaderSharedLock.
   using ScopedLeaderSharedLock = ::yb::master::ScopedLeaderSharedLock;
 
diff --git i/src/yb/master/master_tablet_service.cc w/src/yb/master/master_tablet_service.cc
index dbef676bf..1229907e1 100644
--- i/src/yb/master/master_tablet_service.cc
+++ w/src/yb/master/master_tablet_service.cc
@@ -71,6 +71,7 @@ void MasterTabletServiceImpl::Write(const tserver::WriteRequestPB* req,
   }
 
   tserver::TabletServiceImpl::Write(req, resp, std::move(context));
+  LOG(WARNING) << "write done";
 }
 
 void MasterTabletServiceImpl::IsTabletServerReady(

run:

./bin/yb-ctl create --master_flags "vmodule='backfill_index=3,tablet_service=2',TEST_docdb_log_write_batches=true,ysql_disable_index_backfill=false" --tserver_flags "vmodule='tablet_service=2',TEST_docdb_log_write_batches=true,ysql_disable_index_backfill=false"
./bin/ysqlsh
create table t1 (a int, b int, primary key (a asc)); insert into t1 values (1, 1), (2, 2), (3, 3); insert into t1 values (4, null), (5, null); create index on t1 (b asc); drop table t1; -- repeat until it gets stuck for 5 seconds (without patch, it would have timed out after 60 seconds)

postgres log:

I0816 15:22:30.120858  7337 table_creator.cc:307] Created index yugabyte.t1_b_idx of type PGSQL_TABLE_TYPE                                                                                                                                                                                                               
2020-08-16 15:22:30.157 PDT [7337] LOG:  waiting for YB_INDEX_PERM_DELETE_ONLY                                                                                                                                                                                                                                           
2020-08-16 15:22:30.157 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:30.193 PDT [7337] LOG:  committing pg_index tuple with indislive=true                                                                                                                                                                                                                                   
2020-08-16 15:22:30.193 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:30.196 PDT [7337] LOG:  waiting for YB_INDEX_PERM_WRITE_AND_DELETE                                                                                                                                                                                                                                      
2020-08-16 15:22:30.196 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:35.314 PDT [7337] LOG:  committing pg_index tuple with indisready=true                                                                                                                                                                                                                                  
2020-08-16 15:22:35.314 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:35.325 PDT [7337] LOG:  waiting for Yugabyte index read permission                                                                                                                                                                                                                                      
2020-08-16 15:22:35.325 PDT [7337] STATEMENT:  create index on t1 (b asc);

tserver log:

I0816 15:22:30.193893  7276 tablet_service.cc:912] UpdateTransaction: tablet_id: "29678050581f49e487c23eecfdf77d02" state { transaction_id: "\337\206\026\264x#D\313\215D\267\246\021Dv\252" status: COMMITTED tablets: "00000000000000000000000000000000" } propagated_hybrid_time: 6543837389592064000, context: Call yb.tserver.TabletServerService.UpdateTransaction 127.0.0.51:53777 => 127.0.0.51:9100 (request call id 1721)
I0816 15:22:30.196712  7275 tablet_service.cc:796] Received Change Metadata RPC: tablet_id: "6b36498ad35f4fd995e06ed4cdbeefb4"
schema {
  columns {
    id: 10
    name: "a"
    type {
      main: INT32
    }
    is_key: true
    is_nullable: false
    is_static: false
    is_counter: false
    sorting_type: 3
    order: 1
  }
  columns {
    id: 11
    name: "b"
    type {
      main: INT32
    }
    is_key: false
    is_nullable: true
    is_static: false
    is_counter: false
    sorting_type: 0
    order: 2
  }
  table_properties {
    contain_counters: false
    is_transactional: true
    consistency_level: STRONG
    use_mangled_column_name: false
    num_tablets: 2
    is_ysql_catalog_table: false
    is_backfilling: false
  }
  colocated_table_id {
  }
}
schema_version: 1
new_table_name: "t1"
dest_uuid: "f7caffe546774086aea5eec920d4e1ab"
propagated_hybrid_time: 6543837389603315712
indexes {
  table_id: "000030a9000030008000000000004023"
  indexed_table_id: "000030a900003000800000000000401e"
  index_permissions: INDEX_PERM_DELETE_ONLY
}
alter_table_id: "000030a900003000800000000000401e"
I0816 15:22:30.200268 10244 version_set.cc:2839] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab [I]: Recovered from manifest file:/path/to/build/debug-gcc-dynamic-ninja/cluster_data/node-1/disk-1/yb-data/tserver/data/rocksdb/table-000030a9000030008000000000004023/tablet-e6332c0eabcf4b559605d77bdc2f8ae9.intents/MANIFEST-000001 succeeded,manifest_file_number is 1, next_file_number is 3, last_sequence is 1125899906842624, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is <NULL>
I0816 15:22:30.200300 10244 version_set.cc:2847] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab [I]: Column family [default] (ID 0), log number is 0
I0816 15:22:30.202481 10244 tablet.cc:741] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab: Successfully opened a RocksDB database at /path/to/build/debug-gcc-dynamic-ninja/cluster_data/node-1/disk-1/yb-data/tserver/data/rocksdb/table-000030a9000030008000000000004023/tablet-e6332c0eabcf4b559605d77bdc2f8ae9, obj: 0x3b7c000
I0816 15:22:30.202522 10244 tablet_bootstrap.cc:539] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab: Time spent opening tablet: real 0.061s   user 0.005s     sys 0.004s
I0816 15:22:30.202754  7692 tablet_service.cc:912] UpdateTransaction: tablet_id: "29678050581f49e487c23eecfdf77d02" state { transaction_id: "\337\206\026\264x#D\313\215D\267\246\021Dv\252" status: APPLIED_IN_ONE_OF_INVOLVED_TABLETS tablets: "00000000000000000000000000000000" }, context: Call yb.tserver.TabletServerService.UpdateTransaction 127.0.0.51:60301 => 127.0.0.51:9100 (request call id 130)

master log:

W0816 15:22:30.198561 10245 backfill_index.cc:158] GetPgIndexStatus ENTR
I0816 15:22:30.201946 10243 tablet.cc:1158] T 00000000000000000000000000000000 P 91a68dce44604096a04f7a4cfa683336: Wrote 93 key/value pairs to kRegular RocksDB:
Frontiers: { smallest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } largest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } }
1. PutCF(SubDocKey(DocKey(CoTableId=eb040000-0000-0080-0030-0000a9300000, [], [16419]), [SystemColumnId(0); HT{ physical: 1597616550194232 }]), '#\x80\x01\xa77\xe04\xa0\x96\x80J$' (238001A737E034A096804A24))
...
72. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(19); HT{ physical: 1597616550194232 w: 71 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\xebF' (238001A737E034A0968036EB46))
73. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(20); HT{ physical: 1597616550194232 w: 72 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\xcbF' (238001A737E034A0968036CB46))
74. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(21); HT{ physical: 1597616550194232 w: 73 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\xabF' (238001A737E034A0968036AB46))
75. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(22); HT{ physical: 1597616550194232 w: 74 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\x8bT' (238001A737E034A09680368B54))
...
I0816 15:22:30.203896  7697 tablet.cc:1158] T 00000000000000000000000000000000 P 91a68dce44604096a04f7a4cfa683336: Wrote 227 key/value pairs to kIntents RocksDB:
Frontiers: { smallest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } largest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } }
1. SingleDeleteCF(TXN META df8616b4-7823-44cb-8d44-b7a6114476aa)
...
144. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(19)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 71 })
145. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 71 })
146. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(20)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 72 })
147. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 72 })
148. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(21)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 73 })
149. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 73 })
150. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(22)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 74 })
151. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 74 })
...
W0816 15:22:30.206459 10245 backfill_index.cc:208] GetPgIndexStatus EXIT
W0816 15:22:30.206537 10245 backfill_index.cc:224] TRY AGAIN
  • CoTableId=320a0000-0000-0080-0030-0000a9300000 refers to table with bytes reversed: UUID 000030a9000030008000000000000a32 = yugabyte.pg_catalog.pg_index

  • transaction_id: "\337\206\026\264x#D\313\215D\267\246\021Dv\252" is in octal, so show it in hex using

    printf "\337\206\026\264x#D\313\215D\267\246\021Dv\252" | xxd

    and find it matches df8616b4-7823-44cb-8d44-b7a6114476aa

  • ColumnId(19) refers to indisvalid; ColumnId(21) refers to indisready; ColumnId(22) refers to indislive. (This I know because of the order of non-pk columns in pg_index, where the first non-pk column is ColumnId(11).) Then, looking at the value (for PutCFs), there's first a hybrid time (value type char #), then other stuff, and the last character is value type kFalse or kTrue appearing as F or T, respectively.

  • The pg_index write txn committed on 15:22:30.193893, was written to regular db on 15:22:30.201946, and was marked as applied on 15:22:30.202754. Meanwhile, GetPgIndexStatus was called from 15:22:30.198561 to 15:22:30.206459, which includes the pg_index write/apply.
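As a sanity check on the octal-to-hex decoding above, the same conversion can be done in a few lines of Python (the byte string is copied verbatim from the log; Python and C share the same octal escape syntax):

```python
# transaction_id bytes copied from the log line.
raw = b"\337\206\026\264x#D\313\215D\267\246\021Dv\252"
h = raw.hex()

# Group the 32 hex digits into the standard 8-4-4-4-12 UUID layout.
uuid = "-".join([h[:8], h[8:12], h[12:16], h[16:20], h[20:]])
print(uuid)  # df8616b4-7823-44cb-8d44-b7a6114476aa
```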

I later did more debugging to see that the problem seems to occur when RETURN_NOT_OK(doc_iter->Init(spec)); in GetPgIndexStatus happens before the pg_index row is written to regular db.

  • When waiting for indislive (i.e. waiting for the row to first appear), the new pg_index row is never written to intents db. So we have writes to regular db and deletes to intents db. That's strange since the deletes to intents db are then pointless. cc: @mbautin
  • When waiting for indisready, the pg_index row update is written to intents db, as expected. So we have writes to intents db, writes to regular db, then deletes to intents db.

For indislive, since the row isn't in regular or intents db, GetPgIndexStatus returns false (through the absent path).
For indisready, I have sometimes seen it fail there, too. But it is rare, so I haven't gotten much of a chance to debug it.
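The sleep-and-retry hack in the debugging patch above can be sketched abstractly. This is a toy model, not YugabyteDB code: commit acknowledgment and apply are modeled as separate steps, and a reader that races with apply misses the row on the first attempt.

```python
import time

# Toy model of the race: a transaction commit is acknowledged before the
# write reaches the regular store that GetPgIndexStatus-style readers scan.
class ToyStore:
    def __init__(self):
        self.regular = {}   # what readers see ("regular db")
        self.pending = {}   # committed but not yet applied

    def commit(self, key, value):
        # Commit is acknowledged here; apply happens later.
        self.pending[key] = value

    def apply_pending(self):
        self.regular.update(self.pending)
        self.pending.clear()

def read_with_retry(store, key, attempts=3, delay_s=0.005):
    # Analogue of the TRY AGAIN / DONE SLEEP hack: sleep, then re-read.
    for _ in range(attempts):
        if key in store.regular:
            return store.regular[key]
        time.sleep(delay_s)
        store.apply_pending()  # stand-in for apply catching up over time
    return None

store = ToyStore()
store.commit("indislive", True)
assert store.regular == {}                          # a plain read misses it
assert read_with_retry(store, "indislive") is True  # the retry sees the row
```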

@jaki
Contributor

jaki commented Sep 9, 2020

It appears there may be more than one thing going on here. I have collected logs from several failures and grouped them into two main types:

  • intent_aware_iterator.cc:153] Intent for transaction w/o metadata: 74b5c32d-3059-497c-a288-3e39b60ad9c6 (indislive and indisready)
  • doc_reader.cc:547] projection subkey: ColumnId(22) (indislive)

The first causes the newly written row not to be found. In the case of indislive, it returns nothing, even though it clearly sees the row in the iterator. In the case of indisready, it returns the old value from before the update, even though it clearly sees the updated version in the iterator. Both of these may be because of intent_aware_iterator.cc:711] decode_result->value_time <= resolved_intent_time.

The second simply reads the wrong row. ColumnId(22) refers to the indisvalid column. It's really bizarre that it attempts to read that column rather than ColumnId(19) (indislive).

@robertsami robertsami self-assigned this Sep 30, 2020
@jaki
Contributor

jaki commented Sep 30, 2020

@tedyu pointed out an inaccuracy in my second issue in the previous comment. I had confused column ids 19, 21, and 22 as indislive, indisready, and indisvalid, respectively. ColumnId(22) refers to the indislive column. The second issue is probably no longer relevant.

jasonyb pushed a commit that referenced this issue Apr 13, 2023
Summary:
Online schema changes don't properly wait for all clients (postgres
backends) to be at an up-to-date schema (system catalog, catalog
version) before moving on to the next schema.  Right now, there is only
one implementation of online schema change: CREATE INDEX.  It does a
default 1s wait between index state flag changes.  This can be
problematic if the schema was not fully propagated within 1s.

Fix to actually wait for all clients to be at an up-to-date schema.  Do
so by waiting on all clients (on the same database) to have a requested
catalog version.  The implementation details of this waiting are in a
code comment of ysql_backends_manager.h.

In order to get back the old behavior, set tserver flags
--ysql_yb_index_state_flags_update_delay=1000
--ysql_yb_disable_wait_for_backends_catalog_version=true.

Client-master RPC timeout is controllable by tserver flag
--wait_for_ysql_backends_catalog_version_client_master_rpc_timeout_ms
and common flag (meaning master and tserver flag expected to be set to
the same value)
--wait_for_ysql_backends_catalog_version_client_master_rpc_margin_ms.
The first flag defaults to 20s and determines the max time spent waiting
on master.  The second flag defaults to 5s and accounts for any
additional time it takes for RPC overhead.  The two flags are combined
to make the official RPC deadline.
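With the flag defaults stated above, and assuming "combined" means a simple sum (the summary does not spell out the exact formula), the effective deadline would be:

```python
# Flag defaults from the summary above, in milliseconds.
# Assumption: the combined deadline is timeout + margin.
rpc_timeout_ms = 20_000  # ..._client_master_rpc_timeout_ms
rpc_margin_ms = 5_000    # ..._client_master_rpc_margin_ms

rpc_deadline_ms = rpc_timeout_ms + rpc_margin_ms
print(rpc_deadline_ms)  # 25000
```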

The master-tserver RPC timeout is controllable by the master flag
--master_ts_rpc_timeout_ms, which is also used for other master-tserver
RPCs.

A WaitForYsqlBackendsCatalogVersion request typically creates a
BackendsCatalogVersionJob, viewable in master UI /tasks page.  This job
launches the BackendsCatalogVersionTS master-tserver RPC tasks, also
viewable in master UI /tasks page.  The job details tell how many
backends are still lagging behind, so this page can be consulted in case
of a long-running CREATE INDEX (that is stuck before backfill).  In the
future, capabilities could be added to expose which backends are behind,
but for now, this information can be manually collected by running the
same SELECT query as in tablet_service.cc:

    SELECT count(*) FROM pg_stat_activity WHERE catalog_version < $0 AND datid = $1

The master flag --ysql_backends_catalog_version_job_expiration_sec
determines how long jobs can be untouched by clients before being
candidates for expiration.

Close: #7376

Test Plan:
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_backends-test
    ./yb_build.sh release \
      --cxx-test pgwrapper_pg_backends-test \
      --gtest_filter PgBackendsTest.Stress
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_index_backfill-test \
      --gtest_filter PgIndexBackfillTest.InsertsWhileCreatingIndex
    ./yb_build.sh fastdebug --gcc11 -n 100 --tp 1 --java-test \
      'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex'
    ./yb_build.sh asan -n 10 --tp 1 --java-test \
      'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex'

There could be some flaky failures with error

    Requested catalog version is too high:

due to issue #5030 still not resolved.  This is mainly hit by the
PgBackendsTest.CachedJob test.

Reviewers: amitanand, skedia, myang

Reviewed By: skedia, myang

Subscribers: tverona, yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19621
@sushantrmishra sushantrmishra assigned pkj415 and unassigned jasonyb Nov 20, 2024
jasonyb pushed a commit that referenced this issue Nov 23, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

Problem is, the catalog version we wait on does not necessarily
correspond to the actual state change.  The catalog version used for
waiting uses the local catalog version, which is just +1 of the previous
version.  That is not accurate when other DDLs bump up the catalog
version simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.
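Part 2 can be illustrated with a small model (all names here are invented for the sketch; the real implementation plumbs the tserver UUID and backend PID down through the wait-on-backends request):

```python
# Illustrative model: wait until every backend on the database has caught
# up to a target catalog version, except the backend running the CREATE
# INDEX itself, identified by (tserver UUID, pid).
def lagging_backends(backends, target_version, exempt):
    """backends: {(tserver_uuid, pid): catalog_version}"""
    return [
        key for key, version in backends.items()
        if version < target_version and key != exempt
    ]

backends = {
    ("ts-1", 101): 7,   # the CREATE INDEX backend itself
    ("ts-1", 102): 8,
    ("ts-2", 201): 8,
}
# Without the exemption, the CREATE INDEX backend (still at version 7)
# would make its own wait on version 8 never finish.
assert lagging_backends(backends, 8, exempt=("ts-1", 101)) == []
assert lagging_backends(backends, 8, exempt=None) == [("ts-1", 101)]
```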

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- two callers of wait on backends catalog version for the same
  db+version is no longer as shareable because only one backend is
  excepted.  Whichever caller was first will register their backend
  tserver/pid for exemption from the check.  If a second caller has the
  same db+version, it will have to wait behind the existing information.
  It is not safe to add this second caller's tserver/pid for additional
  exemption because both callers can actually be waiting on different
  catalog versions, but they happened to pick up a later catalog version
  since the picking is not perfect right now.  In that case, it is
  incorrect to ignore both backends in case the first caller was
  corresponding to an actual version later than the second caller's
  actual version.  So such two-caller cases might end up conflicting on
  each other.  Even if these two callers were not combined and instead
  got their separate jobs, they could still be waiting on each other if
  their CREATE INDEXes started on old versions.
Jira: DB-13866

Test Plan:
On Almalinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh tsan \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 300
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20

Reviewers: myang, amartsinchyk

Reviewed By: amartsinchyk

Subscribers: amartsinchyk, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D39780
jasonyb pushed a commit that referenced this issue Dec 6, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

The problem is that the catalog version we wait on does not necessarily
correspond to the actual state change.  The version used for waiting is
the local catalog version, which is simply the previous version plus
one.  That is not accurate when other DDLs bump the catalog version
simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- a wait on backends catalog version request for the same db+version is
  no longer shareable between two callers, because only one backend can
  be exempted.  Whichever caller was first registers its backend
  tserver/pid for exemption from the check.  A second caller with the
  same db+version has to wait behind the existing request.  It is not
  safe to add the second caller's tserver/pid as an additional
  exemption, because the two callers may actually be waiting on
  different catalog versions that merely happened to resolve to the
  same later version, since the version picking is not precise right
  now.  In that case, exempting both backends would be incorrect if the
  first caller's actual version is later than the second caller's
  actual version.  So such two-caller cases might end up conflicting
  with each other.  Even if the two callers were not combined and
  instead got separate jobs, they could still wait on each other if
  their CREATE INDEXes started on old versions.

Backport-note: resolve merge conflict in pg_index_backfill-test.cc due
to lack of 77958e8.

Jira: DB-13866

Test Plan:
On AlmaLinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh tsan \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 300
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20
Original commit: 29076dd / D39780

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: myang

Differential Revision: https://phorge.dev.yugabyte.com/D40400
jasonyb pushed a commit that referenced this issue Dec 20, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

The problem is that the catalog version we wait on does not necessarily
correspond to the actual state change.  The version used for waiting is
the local catalog version, which is simply the previous version plus
one.  That is not accurate when other DDLs bump the catalog version
simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- a wait on backends catalog version request for the same db+version is
  no longer shareable between two callers, because only one backend can
  be exempted.  Whichever caller was first registers its backend
  tserver/pid for exemption from the check.  A second caller with the
  same db+version has to wait behind the existing request.  It is not
  safe to add the second caller's tserver/pid as an additional
  exemption, because the two callers may actually be waiting on
  different catalog versions that merely happened to resolve to the
  same later version, since the version picking is not precise right
  now.  In that case, exempting both backends would be incorrect if the
  first caller's actual version is later than the second caller's
  actual version.  So such two-caller cases might end up conflicting
  with each other.  Even if the two callers were not combined and
  instead got separate jobs, they could still wait on each other if
  their CREATE INDEXes started on old versions.
Jira: DB-13866

Merge notes:
- Remove ASSERT_OK around BumpCatalogVersion because of lack of commit
  0f875f5.
- Bring missing test class PgIndexBackfillReadCommittedBlockIndislive
  from commit 68f7de8 which was not
  backported here.
- Bring missing indislive blocking (indexcmds.c, guc.c) from commit
  68f7de8 which was not backported
  here.
- Disable test in sanitizers because lack of
  0f875f5 occasionally causes

      Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:432): Fetch 'SELECT yb_increment_all_db_catalog_versions(true)' failed: 7, message: ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA (pgsql error XX000) (aux msg ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA)

Test Plan:
On AlmaLinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20
Original commit: 29076dd / D39780

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: pjain, myang

Differential Revision: https://phorge.dev.yugabyte.com/D40815
jasonyb pushed a commit that referenced this issue Dec 23, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

The problem is that the catalog version we wait on does not necessarily
correspond to the actual state change.  The version used for waiting is
the local catalog version, which is simply the previous version plus
one.  That is not accurate when other DDLs bump the catalog version
simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- a wait on backends catalog version request for the same db+version is
  no longer shareable between two callers, because only one backend can
  be exempted.  Whichever caller was first registers its backend
  tserver/pid for exemption from the check.  A second caller with the
  same db+version has to wait behind the existing request.  It is not
  safe to add the second caller's tserver/pid as an additional
  exemption, because the two callers may actually be waiting on
  different catalog versions that merely happened to resolve to the
  same later version, since the version picking is not precise right
  now.  In that case, exempting both backends would be incorrect if the
  first caller's actual version is later than the second caller's
  actual version.  So such two-caller cases might end up conflicting
  with each other.  Even if the two callers were not combined and
  instead got separate jobs, they could still wait on each other if
  their CREATE INDEXes started on old versions.
Jira: DB-13866

Merge notes:
- Remove ASSERT_OK around BumpCatalogVersion because of lack of commit
  0f875f5.
- Bring missing test class PgIndexBackfillReadCommittedBlockIndislive
  from commit 68f7de8 which was not
  backported here.
- Bring missing indislive blocking (indexcmds.c, guc.c) from commit
  68f7de8 which was not backported
  here.
- Disable test in sanitizers because lack of
  0f875f5 occasionally causes

      Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:432): Fetch 'SELECT yb_increment_all_db_catalog_versions(true)' failed: 7, message: ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA (pgsql error XX000) (aux msg ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA)

- Use PgIndexBackfillTestEnableWait subclass to enable backends manager
  (this should be the case in master for future-proofing, but it is
  minor and not worth the hassle)
- Use FetchValue instead of FetchRow (see commit
  4ec5f1c).

Test Plan:
On AlmaLinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20
Original commit: 29076dd / D39780

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: myang, pjain

Differential Revision: https://phorge.dev.yugabyte.com/D40838