storage: Support vector index and ANN hint #9261

Merged
merged 12 commits into from
Aug 12, 2024

Contributor

@Lloyd-Pottiger Lloyd-Pottiger commented Jul 29, 2024

What problem does this PR solve?

Issue Number: ref #9032

Problem Summary:

What is changed and how it works?

Pick https://github.com/tidbcloud/tiflash-cse/pull/156, https://github.com/tidbcloud/tiflash-cse/pull/162, https://github.com/tidbcloud/tiflash-cse/pull/163, https://github.com/tidbcloud/tiflash-cse/pull/164


Changes:

  • Support at most one vector index per vector column
  • Enable ExtendColumnStat in the DMFile meta to store the vector index metadata
  • Write: When a vector index exists, DMFileWriter::addStreams generates the index while writing a new DMFile
    • class VectorIndex builds the index with the HNSW algorithm
  • Read: DMFileBlockInputStreamBuilder tries to create a DMFileWithVectorIndexBlockInputStream. If the vector index is unavailable, the read falls back to the normal read-and-filter path. DMFileWithVectorIndexBlockInputStream:
    • loads the vector index and runs the ANN vector search, which produces a row ID bitmap for the subsequent read
    • loads the vector column data (from the vector index) and the other column data from disk
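
The read path described above can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: `ToyVectorIndex` and `selectRows` are hypothetical names, and a brute-force scan stands in for HNSW so the example has no dependencies. It only shows the shape of the flow: search the index, turn the top-k hits into a row ID bitmap, and select every row when no index is available so that the caller can filter afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

using Vector = std::vector<float>;

// Hypothetical stand-in for a per-DMFile vector index.
// A real index would use HNSW; this one scans every vector.
struct ToyVectorIndex
{
    std::vector<Vector> rows; // one vector per row ID

    static float squaredL2(const Vector & a, const Vector & b)
    {
        assert(a.size() == b.size());
        float sum = 0;
        for (std::size_t i = 0; i < a.size(); ++i)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    // Returns one bit per row; set bits mark the top_k rows nearest to the query.
    std::vector<bool> search(const Vector & query, std::size_t top_k) const
    {
        std::vector<std::pair<float, std::size_t>> dist;
        dist.reserve(rows.size());
        for (std::size_t row_id = 0; row_id < rows.size(); ++row_id)
            dist.emplace_back(squaredL2(rows[row_id], query), row_id);

        top_k = std::min(top_k, dist.size());
        std::partial_sort(dist.begin(), dist.begin() + static_cast<std::ptrdiff_t>(top_k), dist.end());

        std::vector<bool> bitmap(rows.size(), false);
        for (std::size_t i = 0; i < top_k; ++i)
            bitmap[dist[i].second] = true;
        return bitmap;
    }
};

// Returns the row IDs that should be read from disk.
// Without an index, all rows are selected (the fallback: read everything, filter later).
std::vector<std::size_t> selectRows(
    const std::optional<ToyVectorIndex> & index,
    std::size_t total_rows,
    const Vector & query,
    std::size_t top_k)
{
    std::vector<std::size_t> selected;
    if (!index)
    {
        for (std::size_t row_id = 0; row_id < total_rows; ++row_id)
            selected.push_back(row_id);
        return selected;
    }

    const std::vector<bool> bitmap = index->search(query, top_k);
    for (std::size_t row_id = 0; row_id < bitmap.size(); ++row_id)
        if (bitmap[row_id])
            selected.push_back(row_id);
    return selected;
}
```

The bitmap keeps the selected rows in row ID order, so the subsequent column reads can proceed sequentially. That appears to be the motivation for producing a bitmap rather than a distance-ordered list, though the PR description does not state it.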

TODO:

  • Support building vector index in background pool
  • Support building multiple vector indexes on the same vector column
  • Support adding/dropping vector index following TiDB DDL
  • Confirm the following logic
    // Fast Scan and Clean Read does not affect our behavior. (TODO: Confirm?)
    // if (is_fast_scan || enable_del_clean_read || enable_handle_clean_read)
    // return fallback();

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 29, 2024
@Lloyd-Pottiger Lloyd-Pottiger changed the base branch from master to feature/vector-index July 29, 2024 06:06
Lloyd-Pottiger and others added 4 commits August 1, 2024 10:14
Signed-off-by: Lloyd-Pottiger <[email protected]>
Signed-off-by: Lloyd-Pottiger <[email protected]>
Signed-off-by: Lloyd-Pottiger <[email protected]>
@Lloyd-Pottiger
Contributor Author

/build

Signed-off-by: Lloyd-Pottiger <[email protected]>
dbms/src/Flash/Coprocessor/DAGQueryInfo.h Show resolved Hide resolved
dbms/src/TiDB/Schema/TiDB.h Show resolved Hide resolved
dbms/src/TiDB/Schema/TiDB.cpp Outdated Show resolved Hide resolved
Signed-off-by: Lloyd-Pottiger <[email protected]>
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Aug 7, 2024
Comment on lines -183 to -189
#if 1
writeColumnStatToBuffer(tmp_buffer),
#else
// ExtendColumnStat is not enabled yet because it cause downgrade compatibility, wait
// to be released with other binary format changes.
writeExtendColumnStatToBuffer(tmp_buffer),
#endif
Contributor Author

@JaySon-Huang Please confirm those changes.

Contributor

Confirmed. This is OK because we need to bump STORAGE_FORMAT_CURRENT in a later PR.

@ti-chi-bot ti-chi-bot bot added the lgtm label Aug 12, 2024
Contributor

ti-chi-bot bot commented Aug 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: breezewish, JaySon-Huang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JaySon-Huang,breezewish]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Aug 12, 2024
Contributor

ti-chi-bot bot commented Aug 12, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-08-07 06:57:58.525180002 +0000 UTC m=+421608.392279075: ☑️ agreed by breezewish.
  • 2024-08-12 03:40:53.066618827 +0000 UTC m=+152937.770088471: ☑️ agreed by JaySon-Huang.

@ti-chi-bot ti-chi-bot bot merged commit 32d911b into pingcap:feature/vector-index Aug 12, 2024
5 of 7 checks passed
@Lloyd-Pottiger Lloyd-Pottiger deleted the cherry-pick-1 branch August 12, 2024 06:49
@JaySon-Huang JaySon-Huang mentioned this pull request Sep 30, 2024
12 tasks