
Adding TiFlash node to a cluster with 20k tables is slow #1423

Closed

JaySon-Huang opened this issue Feb 5, 2021 · 7 comments · Fixed by #1751

JaySon-Huang (Contributor) commented Feb 5, 2021

If we add a TiFlash node to a TiDB cluster that already has 20k tables, TiFlash may be slow to start.
We do a full schema sync when starting the new TiFlash node, and we create an empty table on the TiFlash node for every table in the cluster, whether or not it has a TiFlash replica.

During this period, the TiFlash node

  • does not register itself as a store on PD
  • does not start reporting metrics
  • does not listen on the coprocessor port

This period can be quite long, especially when the user enables "Encryption at Rest" (in DBaaS, for example); in that case it takes even more time for TiFlash to become ready on PD.

We actually hit this situation when adding a TiFlash node (v4.0.9) to a cluster with more than 20k tables: it took about 15~20 minutes for the node to become "normal" on the DBaaS dashboard. The user was confused about why TiFlash was "Unavailable" for 15 minutes after they created the node. (DBaaS shows "unavailable" if a TiFlash node is UP but not found in the PD store list.)
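To make the ordering problem concrete, here is a minimal sketch of the startup sequence described above. The function names are hypothetical stand-ins, not the actual TiFlash bootstrap code:

```cpp
// Hypothetical sketch of the startup ordering described above.
// The function names are illustrative, not TiFlash's real bootstrap code.
void syncAllSchemas() { /* create an empty table for every table in the cluster */ }
void registerStoreToPD() { /* make the store visible to PD */ }
void startMetricsReporter() { /* start exposing metrics */ }
void listenOnCoprocessorPort() { /* start serving coprocessor requests */ }

int main()
{
    // With 20k+ tables the full schema sync alone can take tens of minutes,
    // and every step below it is delayed until the sync finishes.
    syncAllSchemas();

    registerStoreToPD();
    startMetricsReporter();
    listenOnCoprocessorPort();
}
```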

JaySon-Huang (Contributor, Author) commented Feb 5, 2021

We should break this problem down into different aspects:

  1. Can we shorten the period between showing "unavailable" and showing "normal"?
    Here we should think about how to let users add a TiFlash node with "no surprise", even while the TiFlash node cannot serve requests during its initialization.
  2. Is a full schema sync really a necessary step before the node can serve as a "RaftStore", serve as a "Coprocessor", and report "metrics"?
    a. If we register the store in PD before the full schema sync, I think there will be an issue: PD starts adding Region peers on the TiFlash store, and TiFlash then triggers a schema sync while applying Raft logs. If the schema sync takes a long time, it blocks all the Raft threads. Will this cause some "surprise" for TiFlash/TiKV/PD?
  3. Can we shorten the schema sync itself?
    a. Can we avoid creating tables that have no TiFlash replica? (See the sketch after this list.)
    b. Can we create fewer files when creating an empty table? (An empty table creates at least 17 files in v4.0.9.)
    c. Can we create fewer encryption metadata entries when creating an empty table?
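As a rough illustration of point 3.a, the schema sync could skip tables whose replica count is zero. This is only a sketch with hypothetical types (`TableInfo`, `replica_count`, `createEmptyTableIfNotExists`), not the actual TiFlash schema-sync code:

```cpp
#include <vector>

// Hypothetical types; the real schema-sync code in TiFlash is more involved.
struct TableInfo
{
    long table_id = 0;
    int replica_count = 0; // 0 means no TiFlash replica is configured for this table
};

void createEmptyTableIfNotExists(const TableInfo &) { /* the expensive file-creating path */ }

void fullSchemaSync(const std::vector<TableInfo> & all_tables)
{
    for (const auto & table : all_tables)
    {
        // Skip tables that need no TiFlash replica; they can be created later,
        // when a replica is actually added for them.
        if (table.replica_count == 0)
            continue;
        createEmptyTableIfNotExists(table);
    }
}
```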

JaySon-Huang (Contributor, Author):

cc @zanmato1984

flowbehappy (Contributor) commented Feb 6, 2021

> We should break this problem down into different aspects:
>
>   1. Can we shorten the period between showing "unavailable" and showing "normal"?
>     Here we should think about how to let users add a TiFlash node with "no surprise", even while the TiFlash node cannot serve requests during its initialization.
>   2. Is a full schema sync really a necessary step before the node can serve as a "RaftStore", serve as a "Coprocessor", and report "metrics"?

I would rather keep the users waiting than have them use an "unhealthy" service. But we can definitely shorten the waiting time.

> a. If we register the store in PD before the full schema sync, I think there will be an issue: PD starts adding Region peers on the TiFlash store, and TiFlash then triggers a schema sync while applying Raft logs. If the schema sync takes a long time, it blocks all the Raft threads. Will this cause some "surprise" for TiFlash/TiKV/PD?

Can we sync only the schema of the specific table involved, instead of ALL schemas, in this scenario?

> 3. Can we shorten the schema sync itself?
>     a. Can we avoid creating tables that have no TiFlash replica?

Definitely, we should not create tables with no TiFlash replica. I can't see any reason this would be blocked.

> b. Can we create fewer files when creating an empty table? (An empty table creates at least 17 files in v4.0.9.)

There are several possible improvements here IMO (see the sketch after this list):

  • Avoid creating a real IStorage instance up front; create a placeholder instead and only materialize it when data actually comes in.
  • Only create the files for PageStorage and the DTFile on the first write request, and only create the first Segment in DeltaMergeStore when the first rows are written.
  • Only open additional writable files in PageStorage when they are needed.
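The first two bullets amount to lazy materialization: keep only a cheap placeholder at table-creation time and build the real store (files, segments, PageStorage entries) on the first write. A minimal sketch of that idea, with stand-in types rather than the real StorageDeltaMerge/DeltaMergeStore classes:

```cpp
#include <memory>
#include <mutex>

struct Block {};            // stand-in for a block of rows to be written
struct DeltaMergeStore      // stand-in for the real store; constructing it creates files on disk
{
    void write(const Block &) {}
};

class LazyStorageSketch
{
public:
    void write(const Block & block)
    {
        // Materialize the heavy store only once, on the first write.
        std::call_once(init_flag, [this] { store = std::make_shared<DeltaMergeStore>(); });
        store->write(block);
    }

private:
    std::once_flag init_flag;
    std::shared_ptr<DeltaMergeStore> store; // stays null for tables that never receive data
};
```

Tables that never receive data would then cost almost nothing at schema-sync time.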

> c. Can we create fewer encryption metadata entries when creating an empty table?

After the "log style improvement" for encryption-at-rest, I think this issue can be considered fixed. Besides, we are going to use EBS encryption by default in the future.

JaySon-Huang (Contributor, Author) commented Feb 6, 2021

It took about 20 minutes to add a new TiFlash node to a cluster with 20k tables even with the "log style improvement for encryption-at-rest" enabled.
Without that fix, it would take hours or even tens of hours 😓😓 @flowbehappy

flowbehappy (Contributor) commented Feb 7, 2021

> It took about 20 minutes with the "log style improvement for encryption-at-rest" enabled.

By calculation, an empty table takes ~60 ms to create (20.0 * 60 * 1000 / 20000), which I don't think is surprising given the current design. But we still need to improve it.

JaySon-Huang (Contributor, Author):
In a QA test, we added a TiFlash node to a cluster with 80k tables, with "Encryption at Rest" disabled. It took 20 minutes for TiFlash to be ready; creating an empty table took about 15 ms in that scenario.
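For reference, both per-table figures fall out of a simple division of the total startup time by the table count (a sanity-check snippet, not TiFlash code):

```cpp
#include <cstdio>

int main()
{
    // 20 minutes for 20k tables with encryption-at-rest enabled, and
    // 20 minutes for 80k tables with encryption disabled.
    double ms_per_table_20k = 20.0 * 60 * 1000 / 20000; // ~60 ms per empty table
    double ms_per_table_80k = 20.0 * 60 * 1000 / 80000; // ~15 ms per empty table
    std::printf("%.0f ms, %.0f ms\n", ms_per_table_20k, ms_per_table_80k);
    return 0;
}
```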

JinheLin (Contributor) commented Apr 9, 2021
