Support pre-split index regions before creating index #57552

Closed
tangenta opened this issue Nov 20, 2024 · 0 comments · Fixed by #57553
Labels
type/feature-request Categorizes issue or PR as related to a new feature.

tangenta commented Nov 20, 2024

Feature Request

Is your feature request related to a problem? Please describe:

CREATE TABLE `test` (
    `a` bigint NOT NULL,
    `b` bigint NOT NULL,
    `c` bigint DEFAULT NULL,
    PRIMARY KEY (`a`, `b`)
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4 COLLATE = utf8mb4_bin;
CREATE INDEX `idx` ON `test` (`a`, `c`, `b`);

Right after the "CREATE INDEX" statement was sent, the cluster latency increased significantly because one TiKV instance was overloaded by the index write hotspot. The latency dropped again once PD split the hotspot regions and scheduled them to other TiKV instances.

The write hotspot can be caused by any of the following:

  • Sequential Inserts: When new rows are inserted with sequential or monotonically increasing values for the indexed column, such as timestamps or auto-incrementing primary keys.
  • Skewed Data Distribution: When the data distribution is heavily skewed, causing a disproportionate number of writes to a specific range of index keys.
  • High Write Frequency: When there is a high frequency of write operations (inserts, updates, deletes) targeting the same index keys or a small range of keys.
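The first cause can be illustrated with a small simulation. This is only a sketch: the region boundaries, the key values, and the helper `writes_per_region` are hypothetical, and real TiKV regions are bounded by encoded byte keys rather than plain integers.

```python
import bisect
import random
from collections import Counter


def writes_per_region(split_keys, keys):
    """Count how many keys fall into each region.

    Region i covers [split_keys[i-1], split_keys[i]); region 0 has no lower
    bound and the last region has no upper bound.
    """
    counts = Counter(bisect.bisect_right(split_keys, k) for k in keys)
    return [counts.get(i, 0) for i in range(len(split_keys) + 1)]


# Hypothetical boundaries: three split keys give four regions.
split_keys = [250, 500, 750]

# Monotonically increasing keys (for example auto-increment IDs) that are
# all larger than the last boundary.
sequential = list(range(1000, 1100))

# Keys scattered over the whole key range.
random.seed(0)
scattered = [random.randrange(0, 1000) for _ in range(100)]

sequential_counts = writes_per_region(split_keys, sequential)
scattered_counts = writes_per_region(split_keys, scattered)

print(sequential_counts)  # [0, 0, 0, 100]: every write hits the last region
print(scattered_counts)   # writes are spread over several regions
```

With sequential keys, all writes go to the region that owns the tail of the key space, which is the situation described above until PD splits and reschedules that region.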

Describe the feature you'd like:

Split temp index regions (or index regions in "txn" mode) before the index state becomes "delete-only". Because TiDB has no information about the workload, it has to guess the lower and upper bounds of the index values in the incoming traffic and split that range into several regions.
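As a rough illustration of splitting a guessed range into several regions, the sketch below computes evenly spaced split keys for an integer column. The function name `even_split_points` and the integer key model are assumptions made for illustration; this issue does not specify how the range would be calculated, and an actual implementation would work on encoded index keys. It also assumes the range is much wider than the number of regions, so the split keys are distinct.

```python
def even_split_points(lower, upper, regions):
    """Return regions - 1 integer split keys evenly spaced between lower and upper."""
    if regions < 1:
        raise ValueError("regions must be at least 1")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    width = upper - lower
    # Integer arithmetic avoids floating-point rounding on large bigint ranges.
    return [lower + width * i // regions for i in range(1, regions)]


# Guessed bounds 0 and 1000, split into 4 regions.
print(even_split_points(0, 1000, 4))  # [250, 500, 750]
```

If the guessed bounds are far from the real traffic, most writes still land in one region, which is why letting users supply the bounds is attractive.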

A possible extension is to introduce a pre-split "index_option" that lets users provide more information about the workload.

-- pre-split into 4 regions. The range is calculated automatically.
ALTER TABLE t ADD INDEX idx(col1, col2) PRE_SPLIT_REGIONS=4;
CREATE INDEX idx on t (col1, col2) PRE_SPLIT_REGIONS=4;

-- pre-split into 4 regions and specify lower and upper bound.
ALTER TABLE t ADD INDEX idx(col1, col2) PRE_SPLIT_REGIONS = (BETWEEN ('a', 10) AND ('z', 100) REGIONS 4);

-- pre-split on specified index keys.
ALTER TABLE t ADD INDEX idx(col1, col2) PRE_SPLIT_REGIONS = (BY ('a', 10), ('b', 20), ('c', 30));

PRE_SPLIT_REGIONS ... BETWEEN mirrors the behavior of the BETWEEN ... AND clause of SPLIT TABLE:

SPLIT TABLE ... INDEX ... BETWEEN (...) AND (...) REGIONS ...;

And PRE_SPLIT_REGIONS ... BY mirrors the behavior of the BY clause of SPLIT TABLE:

SPLIT TABLE ... INDEX ... BY (...), (...), ...

TiDB already supports PRE_SPLIT_REGIONS as a table attribute for CREATE TABLE statements, but there is no table attribute similar to PRE_SPLIT_REGIONS ... BETWEEN. This is because, for a new table, users control when write traffic is switched over, so they can run SPLIT TABLE before switching. Adding an index is a different use case, since the table is usually already serving write traffic.

Describe alternatives you've considered:

  • Use pd-ctl or the TiDB HTTP API, but they don't support splitting a region.
  • Use SPLIT TABLE, but it cannot split a region that does not exist yet. We would have to block the DDL so that the index state stays at "delete-only".

Teachability, Documentation, Adoption, Migration Strategy:
