
[YSQL] [Index Backfill] Committing a pg_index row is not immediately visible to yb-master #5030

Open
frozenspider opened this issue Jul 9, 2020 · 6 comments
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue

@frozenspider
Contributor

frozenspider commented Jul 9, 2020

Jira Link: DB-2503
During the YSQL index backfill process, in indexcmds.c we do the following:

// DefineIndex
...
elog(LOG, "committing pg_index tuple with indislive=true");
CommitTransactionCommand();
YBDecrementDdlNestingLevel(true /* success */);
YBIncrementDdlNestingLevel();
StartTransactionCommand();

HandleYBStatus(YBCPgAsyncUpdateIndexPermissions(MyDatabaseId, relationId));

elog(LOG, "waiting for YB_INDEX_PERM_WRITE_AND_DELETE");
HandleYBStatus(YBCPgWaitUntilIndexPermissionsAtLeast(MyDatabaseId,
                                                     relationId,
                                                     indexRelationId,
                                                     YB_INDEX_PERM_WRITE_AND_DELETE,
                                                     &actual_index_permissions));
...

We expect YBDecrementDdlNestingLevel to make the pg_index row (with indislive=true) visible to the master for YBCPgAsyncUpdateIndexPermissions, but in the logs we see:

ts1|pid95670|:17795|http://127.0.0.52:18103 2020-07-09 22:09:47.135 +05 [95726] LOG:  committing pg_index tuple with indislive=true
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.139787 27136000 catalog_manager.cc:3607] Servicing AlterTable request from 127.0.0.52:57697: table { table_id: "000030a9000030008000000000004001" } force_send_alter_request: true
ts1|pid95670|:17795|http://127.0.0.52:18103 2020-07-09 22:09:47.140 +05 [95726] LOG:  waiting for YB_INDEX_PERM_WRITE_AND_DELETE
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.141741 36257792 backfill_index.cc:448] Locking indexed table
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.143888 36257792 backfill_index.cc:221] Index 000030a900003000800000000000400b is not yet live, skipping permission update
m1|pid95665|:22577|http://127.0.0.158:13695 I0709 22:09:47.143927 36257792 backfill_index.cc:336] Index permissions update skipped, leaving schema_version at 1

(i.e. ShouldProceedWithPgsqlIndexPermissionUpdate in backfill_index.cc did not find the row in pg_index)
To reproduce, enable YSQL backfill and run

ybd --java-test 'org.yb.pgsql.TestPgSelect#testNullPushdown' --extra-daemon-flags "--vmodule=backfill_index=3"

(logs)

UPD:
The same issue occurs with TestPgRegressMisc (logs), TestPgRegressPlanner (logs), and many other TestPgRegress* tests

@frozenspider frozenspider added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) labels Jul 9, 2020
@frozenspider frozenspider changed the title [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible on master [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible on yb-master Jul 9, 2020
@frozenspider frozenspider changed the title [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible on yb-master [YSQL] [Index Backfill] Committing a pg_index row is not immediately visible to yb-master Jul 9, 2020
@m-iancu
Contributor

m-iancu commented Jul 9, 2020

For context, the pg_index read during the backfill permissions check is done in this function: https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/master/backfill_index.cc#L154.

frozenspider added a commit that referenced this issue Jul 10, 2020
Summary:
`org.yb.pgsql.TestIndexBackfill` test has been failing for the following reasons:
* Error message it was expecting to catch was changed in 8246ca4#diff-35b7aae825e891e35b810e7bc838ffe1L159
* For the "schema version mismatch" transparent restart, when aborting a transaction we weren't cleaning up `debug_query_string`, which was left pointing at the query allocated in the destroyed memory context. It was only briefly stale, but it was subsequently used when logging, and ASAN caught it

It might still sometimes fail in ASAN though:
* Because of #5030
* Weird case when test succeeds but our test framework reports it as a failure, possibly because of stray `postgres` processes left over after the test
(Resolves #5044)

Test Plan:
ybd --java-test org.yb.pgsql.TestIndexBackfill
(and ASAN as well)

Reviewers: jason, mihnea

Reviewed By: mihnea

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D8878
frozenspider added a commit that referenced this issue Jul 10, 2020
(Same summary, test plan, and review metadata as the commit above.)
(cherry picked from commit 91733d4)
@ndeodhar ndeodhar assigned jaki and unassigned mbautin, frozenspider and m-iancu Aug 5, 2020
@jaki
Contributor

jaki commented Aug 13, 2020

According to BasePgSQLTest.java, the replication factor for those tests is 3. It would be nice to know whether this issue occurs with rf 1 as well.

@jaki
Contributor

jaki commented Aug 14, 2020

I confirm that this happens even with rf 1.

@jaki
Contributor

jaki commented Aug 17, 2020

I think I've narrowed down an explanation of what's going on. It looks like we commit the pg_index row but may not apply it before the master reads it. Here's a concrete example:

Apply this patch on commit 0542f16:

diff --git i/src/yb/master/backfill_index.cc w/src/yb/master/backfill_index.cc
index d63999fe4..89d30db59 100644
--- i/src/yb/master/backfill_index.cc
+++ w/src/yb/master/backfill_index.cc
@@ -155,6 +155,7 @@ Result<bool> GetPgIndexStatus(
     CatalogManager* catalog_manager,
     const TableId& idx_id,
     const std::string& status_col_name) {
+  LOG(WARNING) << __func__ << " ENTR";
   const auto pg_index_id =
       GetPgsqlTableId(VERIFY_RESULT(GetPgsqlDatabaseOid(idx_id)), kPgIndexTableOid);
 
@@ -199,10 +200,12 @@ Result<bool> GetPgIndexStatus(
   QLTableRow row;
   if (VERIFY_RESULT(iter->HasNext())) {
     RETURN_NOT_OK(iter->NextRow(&row));
+    LOG(WARNING) << __func__ << " EXIT";
     return row.GetColumn(status_col_id)->bool_value();
   }
 
   // For practical purposes, an absent index is the same as having false status column value.
+  LOG(WARNING) << __func__ << " EXIT";
   return false;
 }
 
@@ -217,6 +220,12 @@ Result<bool> ShouldProceedWithPgsqlIndexPermissionUpdate(
   switch (new_perm) {
     case INDEX_PERM_WRITE_AND_DELETE: {
       auto live = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indislive"));
+      if (!live && catalog_manager->doaretry) {
+        LOG(WARNING) << "TRY AGAIN";
+        SleepFor(MonoDelta::FromMilliseconds(5000));
+        LOG(WARNING) << "DONE SLEEP";
+        live = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indislive"));
+      }
       if (!live) {
         VLOG(1) << "Index " << idx_id << " is not yet live, skipping permission update";
       }
@@ -224,6 +233,12 @@ Result<bool> ShouldProceedWithPgsqlIndexPermissionUpdate(
     }
     case INDEX_PERM_DO_BACKFILL: {
       auto ready = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indisready"));
+      if (!ready && catalog_manager->doaretry) {
+        LOG(WARNING) << "TRY AGAIN R";
+        SleepFor(MonoDelta::FromMilliseconds(5000));
+        LOG(WARNING) << "DONE SLEEP R";
+        ready = VERIFY_RESULT(GetPgIndexStatus(catalog_manager, idx_id, "indisready"));
+      }
       if (!ready) {
         VLOG(1) << "Index " << idx_id << " is not yet ready, skipping permission update";
       }
@@ -911,7 +926,7 @@ BackfillTablet::BackfillTablet(
   } else if (done()) {
     VLOG(1) << tablet_->ToString() << " backfill already done";
   } else {
-    VLOG(1) << tablet_->ToString() << " begining backfill from "
+    VLOG(1) << tablet_->ToString() << " beginning backfill from "
             << "<start-of-the-tablet>";
   }
 }
diff --git i/src/yb/master/catalog_manager.cc w/src/yb/master/catalog_manager.cc
index b3b1dd0f6..3602b53e2 100644
--- i/src/yb/master/catalog_manager.cc
+++ w/src/yb/master/catalog_manager.cc
@@ -6057,10 +6057,12 @@ Result<uint64_t> CatalogManager::IncrementYsqlCatalogVersion() {
   auto l = CHECK_NOTNULL(ysql_catalog_config_.get())->LockForWrite();
   uint64_t new_version = l->data().pb.ysql_catalog_config().version() + 1;
   l->mutable_data()->pb.mutable_ysql_catalog_config()->set_version(new_version);
+  LOG(WARNING) << "Incremented to " << new_version;
 
   // Write to sys_catalog and in memory.
   RETURN_NOT_OK(sys_catalog_->UpdateItem(ysql_catalog_config_.get(), leader_ready_term()));
   l->Commit();
+  LOG(WARNING) << "COMMITTT";
 
   return new_version;
 }
@@ -6378,6 +6380,7 @@ Status CatalogManager::StartRemoteBootstrap(const StartRemoteBootstrapRequestPB&
 
 void CatalogManager::SendAlterTableRequest(const scoped_refptr<TableInfo>& table,
                                            const AlterTableRequestPB* req) {
+  if (req == nullptr) { doaretry = false; } else { doaretry = true; }
   vector<scoped_refptr<TabletInfo>> tablets;
   table->GetAllTablets(&tablets);
 
diff --git i/src/yb/master/catalog_manager.h w/src/yb/master/catalog_manager.h
index c4424fa6d..e14b46f01 100644
--- i/src/yb/master/catalog_manager.h
+++ w/src/yb/master/catalog_manager.h
@@ -164,6 +164,7 @@ class CatalogManager : public tserver::TabletPeerLookupIf {
   };
 
  public:
+  volatile bool doaretry;
   // Some code refers to ScopedLeaderSharedLock as CatalogManager::ScopedLeaderSharedLock.
   using ScopedLeaderSharedLock = ::yb::master::ScopedLeaderSharedLock;
 
diff --git i/src/yb/master/master_tablet_service.cc w/src/yb/master/master_tablet_service.cc
index dbef676bf..1229907e1 100644
--- i/src/yb/master/master_tablet_service.cc
+++ w/src/yb/master/master_tablet_service.cc
@@ -71,6 +71,7 @@ void MasterTabletServiceImpl::Write(const tserver::WriteRequestPB* req,
   }
 
   tserver::TabletServiceImpl::Write(req, resp, std::move(context));
+  LOG(WARNING) << "write done";
 }
 
 void MasterTabletServiceImpl::IsTabletServerReady(

run:

./bin/yb-ctl create --master_flags "vmodule='backfill_index=3,tablet_service=2',TEST_docdb_log_write_batches=true,ysql_disable_index_backfill=false" --tserver_flags "vmodule='tablet_service=2',TEST_docdb_log_write_batches=true,ysql_disable_index_backfill=false"
./bin/ysqlsh
create table t1 (a int, b int, primary key (a asc)); insert into t1 values (1, 1), (2, 2), (3, 3); insert into t1 values (4, null), (5, null); create index on t1 (b asc); drop table t1; -- repeat until it gets stuck for 5 seconds (without patch, it would have timed out after 60 seconds)

postgres log:

I0816 15:22:30.120858  7337 table_creator.cc:307] Created index yugabyte.t1_b_idx of type PGSQL_TABLE_TYPE                                                                                                                                                                                                               
2020-08-16 15:22:30.157 PDT [7337] LOG:  waiting for YB_INDEX_PERM_DELETE_ONLY                                                                                                                                                                                                                                           
2020-08-16 15:22:30.157 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:30.193 PDT [7337] LOG:  committing pg_index tuple with indislive=true                                                                                                                                                                                                                                   
2020-08-16 15:22:30.193 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:30.196 PDT [7337] LOG:  waiting for YB_INDEX_PERM_WRITE_AND_DELETE                                                                                                                                                                                                                                      
2020-08-16 15:22:30.196 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:35.314 PDT [7337] LOG:  committing pg_index tuple with indisready=true                                                                                                                                                                                                                                  
2020-08-16 15:22:35.314 PDT [7337] STATEMENT:  create index on t1 (b asc);                                                                                                                                                                                                                                               
2020-08-16 15:22:35.325 PDT [7337] LOG:  waiting for Yugabyte index read permission                                                                                                                                                                                                                                      
2020-08-16 15:22:35.325 PDT [7337] STATEMENT:  create index on t1 (b asc);

tserver log:

I0816 15:22:30.193893  7276 tablet_service.cc:912] UpdateTransaction: tablet_id: "29678050581f49e487c23eecfdf77d02" state { transaction_id: "\337\206\026\264x#D\313\215D\267\246\021Dv\252" status: COMMITTED tablets: "00000000000000000000000000000000" } propagated_hybrid_time: 6543837389592064000, context: Call yb.tserver.TabletServerService.UpdateTransaction 127.0.0.51:53777 => 127.0.0.51:9100 (request call id 1721)
I0816 15:22:30.196712  7275 tablet_service.cc:796] Received Change Metadata RPC: tablet_id: "6b36498ad35f4fd995e06ed4cdbeefb4"
schema {
  columns {
    id: 10
    name: "a"
    type {
      main: INT32
    }
    is_key: true
    is_nullable: false
    is_static: false
    is_counter: false
    sorting_type: 3
    order: 1
  }
  columns {
    id: 11
    name: "b"
    type {
      main: INT32
    }
    is_key: false
    is_nullable: true
    is_static: false
    is_counter: false
    sorting_type: 0
    order: 2
  }
  table_properties {
    contain_counters: false
    is_transactional: true
    consistency_level: STRONG
    use_mangled_column_name: false
    num_tablets: 2
    is_ysql_catalog_table: false
    is_backfilling: false
  }
  colocated_table_id {
  }
}
schema_version: 1
new_table_name: "t1"
dest_uuid: "f7caffe546774086aea5eec920d4e1ab"
propagated_hybrid_time: 6543837389603315712
indexes {
  table_id: "000030a9000030008000000000004023"
  indexed_table_id: "000030a900003000800000000000401e"
  index_permissions: INDEX_PERM_DELETE_ONLY
}
alter_table_id: "000030a900003000800000000000401e"
I0816 15:22:30.200268 10244 version_set.cc:2839] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab [I]: Recovered from manifest file:/path/to/build/debug-gcc-dynamic-ninja/cluster_data/node-1/disk-1/yb-data/tserver/data/rocksdb/table-000030a9000030008000000000004023/tablet-e6332c0eabcf4b559605d77bdc2f8ae9.intents/MANIFEST-000001 succeeded,manifest_file_number is 1, next_file_number is 3, last_sequence is 1125899906842624, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is <NULL>
I0816 15:22:30.200300 10244 version_set.cc:2847] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab [I]: Column family [default] (ID 0), log number is 0
I0816 15:22:30.202481 10244 tablet.cc:741] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab: Successfully opened a RocksDB database at /path/to/build/debug-gcc-dynamic-ninja/cluster_data/node-1/disk-1/yb-data/tserver/data/rocksdb/table-000030a9000030008000000000004023/tablet-e6332c0eabcf4b559605d77bdc2f8ae9, obj: 0x3b7c000
I0816 15:22:30.202522 10244 tablet_bootstrap.cc:539] T e6332c0eabcf4b559605d77bdc2f8ae9 P f7caffe546774086aea5eec920d4e1ab: Time spent opening tablet: real 0.061s   user 0.005s     sys 0.004s
I0816 15:22:30.202754  7692 tablet_service.cc:912] UpdateTransaction: tablet_id: "29678050581f49e487c23eecfdf77d02" state { transaction_id: "\337\206\026\264x#D\313\215D\267\246\021Dv\252" status: APPLIED_IN_ONE_OF_INVOLVED_TABLETS tablets: "00000000000000000000000000000000" }, context: Call yb.tserver.TabletServerService.UpdateTransaction 127.0.0.51:60301 => 127.0.0.51:9100 (request call id 130)

master log:

W0816 15:22:30.198561 10245 backfill_index.cc:158] GetPgIndexStatus ENTR
I0816 15:22:30.201946 10243 tablet.cc:1158] T 00000000000000000000000000000000 P 91a68dce44604096a04f7a4cfa683336: Wrote 93 key/value pairs to kRegular RocksDB:
Frontiers: { smallest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } largest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } }
1. PutCF(SubDocKey(DocKey(CoTableId=eb040000-0000-0080-0030-0000a9300000, [], [16419]), [SystemColumnId(0); HT{ physical: 1597616550194232 }]), '#\x80\x01\xa77\xe04\xa0\x96\x80J$' (238001A737E034A096804A24))
...
72. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(19); HT{ physical: 1597616550194232 w: 71 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\xebF' (238001A737E034A0968036EB46))
73. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(20); HT{ physical: 1597616550194232 w: 72 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\xcbF' (238001A737E034A0968036CB46))
74. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(21); HT{ physical: 1597616550194232 w: 73 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\xabF' (238001A737E034A0968036AB46))
75. PutCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(22); HT{ physical: 1597616550194232 w: 74 }]), '#\x80\x01\xa77\xe04\xa0\x96\x806\x8bT' (238001A737E034A09680368B54))
...
I0816 15:22:30.203896  7697 tablet.cc:1158] T 00000000000000000000000000000000 P 91a68dce44604096a04f7a4cfa683336: Wrote 227 key/value pairs to kIntents RocksDB:
Frontiers: { smallest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } largest: { op_id: 1.1252 hybrid_time: { physical: 1597616550195960 } history_cutoff: <invalid> hybrid_time_filter: <invalid> } }
1. SingleDeleteCF(TXN META df8616b4-7823-44cb-8d44-b7a6114476aa)
...
144. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(19)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 71 })
145. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 71 })
146. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(20)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 72 })
147. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 72 })
148. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(21)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 73 })
149. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 73 })
150. SingleDeleteCF(SubDocKey(DocKey(CoTableId=320a0000-0000-0080-0030-0000a9300000, [], [16419]), [ColumnId(22)]) [kStrongRead, kStrongWrite] HT{ physical: 1597616550125417 w: 74 })
151. SingleDeleteCF(TXN REV df8616b4-7823-44cb-8d44-b7a6114476aa HT{ physical: 1597616550125417 w: 74 })
...
W0816 15:22:30.206459 10245 backfill_index.cc:208] GetPgIndexStatus EXIT
W0816 15:22:30.206537 10245 backfill_index.cc:224] TRY AGAIN
  • CoTableId=320a0000-0000-0080-0030-0000a9300000 refers to table with bytes reversed: UUID 000030a9000030008000000000000a32 = yugabyte.pg_catalog.pg_index

  • transaction_id: "\337\206\026\264x#D\313\215D\267\246\021Dv\252" is in octal, so show it in hex using

    printf "\337\206\026\264x#D\313\215D\267\246\021Dv\252" | xxd

    and find it matches df8616b4-7823-44cb-8d44-b7a6114476aa

  • ColumnId(19) refers to indisvalid; ColumnId(21) refers to indisready; ColumnId(22) refers to indislive. (This I know because of the order of non-pk columns in pg_index, where the first non-pk column is ColumnId(11).) Then, looking at the value (for PutCFs), there's first a hybrid time (value type char #), then other stuff, and the last character is value type kFalse or kTrue appearing as F or T, respectively.

  • The pg_index write txn committed on 15:22:30.193893, was written to regular db on 15:22:30.201946, and was marked as applied on 15:22:30.202754. Meanwhile, GetPgIndexStatus was called from 15:22:30.198561 to 15:22:30.206459, which includes the pg_index write/apply.
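As a sanity check on the octal-to-hex decoding above, the same conversion can be done in a few lines of Python (the byte string is copied verbatim from the log; Python and C share the same octal escape syntax):

```python
# transaction_id bytes copied from the log line.
raw = b"\337\206\026\264x#D\313\215D\267\246\021Dv\252"
h = raw.hex()

# Group the 32 hex digits into the standard 8-4-4-4-12 UUID layout.
uuid = "-".join([h[:8], h[8:12], h[12:16], h[16:20], h[20:]])
print(uuid)  # df8616b4-7823-44cb-8d44-b7a6114476aa
```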

I later did more debugging to see that the problem seems to occur when RETURN_NOT_OK(doc_iter->Init(spec)); in GetPgIndexStatus happens before the pg_index row is written to regular db.

  • When waiting for indislive (i.e. waiting for the row to first appear), the new pg_index row is never written to intents db. So we have writes to regular db and deletes to intents db. That's strange since the deletes to intents db are then pointless. cc: @mbautin
  • When waiting for indisready, the pg_index row update is written to intents db, as expected. So we have writes to intents db, writes to regular db, then deletes to intents db.

For indislive, since the row isn't in regular or intents db, GetPgIndexStatus returns false (through the absent path).
For indisready, I have sometimes seen it fail there, too. But it is rare, so I haven't gotten much of a chance to debug it.
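The sleep-and-retry hack in the debugging patch above can be sketched abstractly. This is a toy model, not YugabyteDB code: commit acknowledgment and apply are modeled as separate steps, and a reader that races with apply misses the row on the first attempt.

```python
import time

# Toy model of the race: a transaction commit is acknowledged before the
# write reaches the regular store that GetPgIndexStatus-style readers scan.
class ToyStore:
    def __init__(self):
        self.regular = {}   # what readers see ("regular db")
        self.pending = {}   # committed but not yet applied

    def commit(self, key, value):
        # Commit is acknowledged here; apply happens later.
        self.pending[key] = value

    def apply_pending(self):
        self.regular.update(self.pending)
        self.pending.clear()

def read_with_retry(store, key, attempts=3, delay_s=0.005):
    # Analogue of the TRY AGAIN / DONE SLEEP hack: sleep, then re-read.
    for _ in range(attempts):
        if key in store.regular:
            return store.regular[key]
        time.sleep(delay_s)
        store.apply_pending()  # stand-in for apply catching up over time
    return None

store = ToyStore()
store.commit("indislive", True)
assert store.regular == {}                          # a plain read misses it
assert read_with_retry(store, "indislive") is True  # the retry sees the row
```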

@jaki
Contributor

jaki commented Sep 9, 2020

It appears there may be more than one thing going on here. I have collected logs from several failures and grouped them into two main types:

  • intent_aware_iterator.cc:153] Intent for transaction w/o metadata: 74b5c32d-3059-497c-a288-3e39b60ad9c6 (indislive and indisready)
  • doc_reader.cc:547] projection subkey: ColumnId(22) (indislive)

The first causes the newly written row not to be found. In the case of indislive, it returns nothing, even though it clearly sees the row in the iterator. In the case of indisready, it returns the old value from before the update, even though it clearly sees the updated version in the iterator. Both of these may be because of intent_aware_iterator.cc:711] decode_result->value_time <= resolved_intent_time.

The second simply reads the wrong row. ColumnId(22) refers to the indisvalid column. It's really bizarre that it attempts to read that column rather than ColumnId(19) (indislive).

@robertsami robertsami self-assigned this Sep 30, 2020
@jaki
Contributor

jaki commented Sep 30, 2020

@tedyu pointed out an inaccuracy in my second issue in the previous comment. I had confused column ids 19, 21, and 22 as indislive, indisready, and indisvalid, respectively. ColumnId(22) refers to the indislive column. The second issue is probably no longer relevant.

jasonyb pushed a commit that referenced this issue Apr 13, 2023
Summary:
Online schema changes don't properly wait for all clients (postgres
backends) to be at an up-to-date schema (system catalog, catalog
version) before moving on to the next schema.  Right now, there is only
one implementation of online schema change: CREATE INDEX.  It does a
default 1s wait between index state flag changes.  This can be
problematic if the schema was not fully propagated within 1s.

Fix to actually wait for all clients to be at an up-to-date schema.  Do
so by waiting on all clients (on the same database) to have a requested
catalog version.  The implementation details of this waiting are in a
code comment of ysql_backends_manager.h.

In order to get back the old behavior, set tserver flags
--ysql_yb_index_state_flags_update_delay=1000
--ysql_yb_disable_wait_for_backends_catalog_version=true.

Client-master RPC timeout is controllable by tserver flag
--wait_for_ysql_backends_catalog_version_client_master_rpc_timeout_ms
and common flag (meaning master and tserver flag expected to be set to
the same value)
--wait_for_ysql_backends_catalog_version_client_master_rpc_margin_ms.
The first flag defaults to 20s and determines the max time spent waiting
on master.  The second flag defaults to 5s and accounts for any
additional time it takes for RPC overhead.  The two flags are combined
to make the official RPC deadline.
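With the flag defaults stated above, and assuming "combined" means a simple sum (the summary does not spell out the exact formula), the effective deadline would be:

```python
# Flag defaults from the summary above, in milliseconds.
# Assumption: the combined deadline is timeout + margin.
rpc_timeout_ms = 20_000  # ..._client_master_rpc_timeout_ms
rpc_margin_ms = 5_000    # ..._client_master_rpc_margin_ms

rpc_deadline_ms = rpc_timeout_ms + rpc_margin_ms
print(rpc_deadline_ms)  # 25000
```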

The master-tserver RPC timeout is controllable by the master flag
--master_ts_rpc_timeout_ms, which is also used for other master-tserver
RPCs.

A WaitForYsqlBackendsCatalogVersion request typically creates a
BackendsCatalogVersionJob, viewable in master UI /tasks page.  This job
launches the BackendsCatalogVersionTS master-tserver RPC tasks, also
viewable in master UI /tasks page.  The job details tell how many
backends are still lagging behind, so this page can be consulted in case
of a long-running CREATE INDEX (that is stuck before backfill).  In the
future, capabilities could be added to expose which backends are behind,
but for now, this information can be manually collected by running the
same SELECT query as in tablet_service.cc:

    SELECT count(*) FROM pg_stat_activity WHERE catalog_version < $0 AND datid = $1

The master flag --ysql_backends_catalog_version_job_expiration_sec
determines how long jobs can be untouched by clients before being
candidates for expiration.

Close: #7376

Test Plan:
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_backends-test
    ./yb_build.sh release \
      --cxx-test pgwrapper_pg_backends-test \
      --gtest_filter PgBackendsTest.Stress
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_index_backfill-test \
      --gtest_filter PgIndexBackfillTest.InsertsWhileCreatingIndex
    ./yb_build.sh fastdebug --gcc11 -n 100 --tp 1 --java-test \
      'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex'
    ./yb_build.sh asan -n 10 --tp 1 --java-test \
      'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex'

There could be some flaky failures with error

    Requested catalog version is too high:

due to issue #5030 still not resolved.  This is mainly hit by the
PgBackendsTest.CachedJob test.

Reviewers: amitanand, skedia, myang

Reviewed By: skedia, myang

Subscribers: tverona, yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19621
@sushantrmishra sushantrmishra assigned pkj415 and unassigned jasonyb Nov 20, 2024
jasonyb pushed a commit that referenced this issue Nov 23, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

Problem is, the catalog version we wait on does not necessarily
correspond to the actual state change.  The catalog version used for
waiting uses the local catalog version, which is just +1 of the previous
version.  That is not accurate when other DDLs bump up the catalog
version simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.
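Part 2 can be illustrated with a small model (all names here are invented for the sketch; the real implementation plumbs the tserver UUID and backend PID down through the wait-on-backends request):

```python
# Illustrative model: wait until every backend on the database has caught
# up to a target catalog version, except the backend running the CREATE
# INDEX itself, identified by (tserver UUID, pid).
def lagging_backends(backends, target_version, exempt):
    """backends: {(tserver_uuid, pid): catalog_version}"""
    return [
        key for key, version in backends.items()
        if version < target_version and key != exempt
    ]

backends = {
    ("ts-1", 101): 7,   # the CREATE INDEX backend itself
    ("ts-1", 102): 8,
    ("ts-2", 201): 8,
}
# Without the exemption, the CREATE INDEX backend (still at version 7)
# would make its own wait on version 8 never finish.
assert lagging_backends(backends, 8, exempt=("ts-1", 101)) == []
assert lagging_backends(backends, 8, exempt=None) == [("ts-1", 101)]
```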

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- two callers of wait on backends catalog version for the same
  db+version is no longer as shareable because only one backend is
  excepted.  Whichever caller was first will register their backend
  tserver/pid for exemption from the check.  If a second caller has the
  same db+version, it will have to wait behind the existing information.
  It is not safe to add this second caller's tserver/pid for additional
  exemption because both callers can actually be waiting on different
  catalog versions, but they happened to pick up a later catalog version
  since the picking is not perfect right now.  In that case, it is
  incorrect to ignore both backends in case the first caller was
  corresponding to an actual version later than the second caller's
  actual version.  So such two-caller cases might end up conflicting on
  each other.  Even if these two callers were not combined and instead
  got their separate jobs, they could still be waiting on each other if
  their CREATE INDEXes started on old versions.
Jira: DB-13866

Test Plan:
On Almalinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh tsan \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 300
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20

Reviewers: myang, amartsinchyk

Reviewed By: amartsinchyk

Subscribers: amartsinchyk, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D39780
jasonyb pushed a commit that referenced this issue Dec 6, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

The problem is that the catalog version we wait on does not necessarily
correspond to the actual state change.  The version used for waiting is
the local catalog version, which is simply the previous version plus
one.  That is not accurate when other DDLs bump the catalog version
simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- a wait on backends catalog version request for the same db+version is
  no longer shareable between two callers, because only one backend can
  be exempted.  Whichever caller was first registers its backend
  tserver/pid for exemption from the check.  A second caller with the
  same db+version has to wait behind the existing request.  It is not
  safe to add the second caller's tserver/pid as an additional
  exemption, because the two callers may actually be waiting on
  different catalog versions that merely happened to resolve to the
  same later version, since the version picking is not precise right
  now.  In that case, exempting both backends would be incorrect if the
  first caller's actual version is later than the second caller's
  actual version.  So such two-caller cases might end up conflicting
  with each other.  Even if the two callers were not combined and
  instead got separate jobs, they could still wait on each other if
  their CREATE INDEXes started on old versions.

Backport-note: resolve merge conflict in pg_index_backfill-test.cc due
to lack of 77958e8.

Jira: DB-13866

Test Plan:
On AlmaLinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh tsan \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 300
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20
Original commit: 29076dd / D39780

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: myang

Differential Revision: https://phorge.dev.yugabyte.com/D40400
jasonyb pushed a commit that referenced this issue Dec 20, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

The problem is that the catalog version we wait on does not necessarily
correspond to the actual state change.  The version used for waiting is
the local catalog version, which is simply the previous version plus
one.  That is not accurate when other DDLs bump the catalog version
simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- a wait on backends catalog version request for the same db+version is
  no longer shareable between two callers, because only one backend can
  be exempted.  Whichever caller was first registers its backend
  tserver/pid for exemption from the check.  A second caller with the
  same db+version has to wait behind the existing request.  It is not
  safe to add the second caller's tserver/pid as an additional
  exemption, because the two callers may actually be waiting on
  different catalog versions that merely happened to resolve to the
  same later version, since the version picking is not precise right
  now.  In that case, exempting both backends would be incorrect if the
  first caller's actual version is later than the second caller's
  actual version.  So such two-caller cases might end up conflicting
  with each other.  Even if the two callers were not combined and
  instead got separate jobs, they could still wait on each other if
  their CREATE INDEXes started on old versions.
Jira: DB-13866

Merge notes:
- Remove ASSERT_OK around BumpCatalogVersion because of lack of commit
  0f875f5.
- Bring missing test class PgIndexBackfillReadCommittedBlockIndislive
  from commit 68f7de8 which was not
  backported here.
- Bring missing indislive blocking (indexcmds.c, guc.c) from commit
  68f7de8 which was not backported
  here.
- Disable test in sanitizers because lack of
  0f875f5 occasionally causes

      Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:432): Fetch 'SELECT yb_increment_all_db_catalog_versions(true)' failed: 7, message: ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA (pgsql error XX000) (aux msg ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA)

Test Plan:
On AlmaLinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20
Original commit: 29076dd / D39780

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: pjain, myang

Differential Revision: https://phorge.dev.yugabyte.com/D40815
jasonyb pushed a commit that referenced this issue Dec 23, 2024
Summary:
Concurrent index creation enforces state changes where clients cannot be
more than one state apart.  The state is tracked by the catalog version,
and it is enforced by making sure all clients are at the catalog version
corresponding to the state change before moving on to the next state.

The problem is that the catalog version we wait on does not necessarily
correspond to the actual state change.  The version used for waiting is
the local catalog version, which is simply the previous version plus
one.  That is not accurate when other DDLs bump the catalog version
simultaneously.

Fix with two parts:

1. wait on a proper catalog version: instead of using the local catalog
   version, fetch the master's catalog version after the previous state
   change commit, then wait on that version.  The downside is that there
   is a time window between commit and fetching of master's catalog
   version such that we may end up waiting on a later version than
   necessary, but I didn't find an easy way to get the actual version
   corresponding to the commit, especially considering issue #5030.
2. avoid waiting on the backend that is running the CREATE INDEX: since
   we are now possibly waiting on catalog versions that are beyond the
   version that the CREATE INDEX is at, it becomes necessary to
   whitelist the backend running CREATE INDEX from being taken into
   consideration during the wait on backends catalog version request.
   This involves a lot of plumbing to send down that information.  The
   info to identify the backend is the tserver UUID and backend PID.

Some considerations:

- setting the local catalog version to +1 is an existing optimization
  that should be reconsidered.  If a concurrent DDL happened such that
  the actual catalog version is higher, this backend may pass breaking
  catalog version checks even if it doesn't deserve to.  Do not deal
  with this issue in this commit.  Filed #25068.
- a wait on backends catalog version request for the same db+version is
  no longer shareable between two callers, because only one backend can
  be exempted.  Whichever caller was first registers its backend
  tserver/pid for exemption from the check.  A second caller with the
  same db+version has to wait behind the existing request.  It is not
  safe to add the second caller's tserver/pid as an additional
  exemption, because the two callers may actually be waiting on
  different catalog versions that merely happened to resolve to the
  same later version, since the version picking is not precise right
  now.  In that case, exempting both backends would be incorrect if the
  first caller's actual version is later than the second caller's
  actual version.  So such two-caller cases might end up conflicting
  with each other.  Even if the two callers were not combined and
  instead got separate jobs, they could still wait on each other if
  their CREATE INDEXes started on old versions.
Jira: DB-13866

Merge notes:
- Remove ASSERT_OK around BumpCatalogVersion because of lack of commit
  0f875f5.
- Bring missing test class PgIndexBackfillReadCommittedBlockIndislive
  from commit 68f7de8 which was not
  backported here.
- Bring missing indislive blocking (indexcmds.c, guc.c) from commit
  68f7de8 which was not backported
  here.
- Disable test in sanitizers because lack of
  0f875f5 occasionally causes

      Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:432): Fetch 'SELECT yb_increment_all_db_catalog_versions(true)' failed: 7, message: ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA (pgsql error XX000) (aux msg ERROR:  The catalog snapshot used for this transaction has been invalidated: expected: 2, got: 1: MISMATCHED_SCHEMA)

- Use PgIndexBackfillTestEnableWait subclass to enable backends manager
  (this should be the case in master for future-proofing, but it is
  minor and not worth the hassle)
- Use FetchValue instead of FetchRow (see commit
  4ec5f1c).

Test Plan:
On AlmaLinux 8:

    ./yb_build.sh fastdebug --gcc11 \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh release \
      --gtest_filter PgIndexBackfillTest.CatVerBumps \
      -n 100
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pg_backends-test

Backport-through: 2.20
Original commit: 29076dd / D39780

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: myang, pjain

Differential Revision: https://phorge.dev.yugabyte.com/D40838