diff --git a/TOC-tidb-cloud.md b/TOC-tidb-cloud.md index 4e6ef3365db0d..144979bdde275 100644 --- a/TOC-tidb-cloud.md +++ b/TOC-tidb-cloud.md @@ -10,7 +10,7 @@ - [Roadmap](/tidb-cloud/tidb-cloud-roadmap.md) - Get Started - [Try Out TiDB Cloud](/tidb-cloud/tidb-cloud-quickstart.md) - - [Try Out TiDB + AI](/tidb-cloud/vector-search-get-started-using-python.md) + - [Try Out TiDB + AI](/vector-search-get-started-using-python.md) - [Try Out HTAP](/tidb-cloud/tidb-cloud-htap-quickstart.md) - [Try Out TiDB Cloud CLI](/tidb-cloud/get-started-with-cli.md) - [Perform a PoC](/tidb-cloud/tidb-cloud-poc.md) @@ -241,27 +241,27 @@ - Explore Data - [Chat2Query (Beta) in SQL Editor](/tidb-cloud/explore-data-with-chat2query.md) - Vector Search (Beta) - - [Overview](/tidb-cloud/vector-search-overview.md) + - [Overview](/vector-search-overview.md) - Get Started - - [Get Started with SQL](/tidb-cloud/vector-search-get-started-using-sql.md) - - [Get Started with Python](/tidb-cloud/vector-search-get-started-using-python.md) + - [Get Started with SQL](/vector-search-get-started-using-sql.md) + - [Get Started with Python](/vector-search-get-started-using-python.md) - Integrations - - [Overview](/tidb-cloud/vector-search-integration-overview.md) + - [Overview](/vector-search-integration-overview.md) - AI Frameworks - - [LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md) - - [Langchain](/tidb-cloud/vector-search-integrate-with-langchain.md) + - [LlamaIndex](/vector-search-integrate-with-llamaindex.md) + - [Langchain](/vector-search-integrate-with-langchain.md) - Embedding Models/Services - - [Jina AI](/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md) + - [Jina AI](/vector-search-integrate-with-jinaai-embedding.md) - ORM Libraries - - [SQLAlchemy](/tidb-cloud/vector-search-integrate-with-sqlalchemy.md) - - [peewee](/tidb-cloud/vector-search-integrate-with-peewee.md) - - [Django ORM](/tidb-cloud/vector-search-integrate-with-django-orm.md) + - 
[SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md) + - [peewee](/vector-search-integrate-with-peewee.md) + - [Django ORM](/vector-search-integrate-with-django-orm.md) - Reference - - [Vector Data Types](/tidb-cloud/vector-search-data-types.md) - - [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md) - - [Vector Index](/tidb-cloud/vector-search-index.md) - - [Improve Performance](/tidb-cloud/vector-search-improve-performance.md) - - [Limitations](/tidb-cloud/vector-search-limitations.md) + - [Vector Data Types](/vector-search-data-types.md) + - [Vector Functions and Operators](/vector-search-functions-and-operators.md) + - [Vector Index](/vector-search-index.md) + - [Improve Performance](/vector-search-improve-performance.md) + - [Limitations](/vector-search-limitations.md) - [Changelogs](/tidb-cloud/vector-search-changelogs.md) - Data Service (Beta) - [Overview](/tidb-cloud/data-service-overview.md) diff --git a/TOC.md b/TOC.md index 83bc18ed6e2ff..dadb23849064b 100644 --- a/TOC.md +++ b/TOC.md @@ -81,6 +81,24 @@ - [Follower Read](/develop/dev-guide-use-follower-read.md) - [Stale Read](/develop/dev-guide-use-stale-read.md) - [HTAP Queries](/develop/dev-guide-hybrid-oltp-and-olap-queries.md) + - Vector Search + - [Overview](/vector-search-overview.md) + - Get Started + - [Get Started with SQL](/vector-search-get-started-using-sql.md) + - [Get Started with Python](/vector-search-get-started-using-python.md) + - Integrations + - [Overview](/vector-search-integration-overview.md) + - AI Frameworks + - [LlamaIndex](/vector-search-integrate-with-llamaindex.md) + - [Langchain](/vector-search-integrate-with-langchain.md) + - Embedding Models/Services + - [Jina AI](/vector-search-integrate-with-jinaai-embedding.md) + - ORM Libraries + - [SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md) + - [peewee](/vector-search-integrate-with-peewee.md) + - [Django](/vector-search-integrate-with-django-orm.md) + - [Improve 
Performance](/vector-search-improve-performance.md) + - [Limitations](/vector-search-limitations.md) - Transaction - [Overview](/develop/dev-guide-transaction-overview.md) - [Optimistic and Pessimistic Transactions](/develop/dev-guide-optimistic-and-pessimistic-transaction.md) @@ -894,6 +912,7 @@ - [Date and Time Types](/data-type-date-and-time.md) - [String Types](/data-type-string.md) - [JSON Type](/data-type-json.md) + - [Vector Types](/vector-search-data-types.md) - Functions and Operators - [Overview](/functions-and-operators/functions-and-operators-overview.md) - [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) @@ -907,6 +926,7 @@ - [Encryption and Compression Functions](/functions-and-operators/encryption-and-compression-functions.md) - [Locking Functions](/functions-and-operators/locking-functions.md) - [Information Functions](/functions-and-operators/information-functions.md) + - [Vector Functions and Operators](/vector-search-functions-and-operators.md) - JSON Functions - [Overview](/functions-and-operators/json-functions.md) - [Functions That Create JSON](/functions-and-operators/json-functions/json-functions-create.md) @@ -927,6 +947,7 @@ - [TiDB Specific Functions](/functions-and-operators/tidb-functions.md) - [Comparisons between Functions and Syntax of Oracle and TiDB](/oracle-functions-to-tidb.md) - [Clustered Indexes](/clustered-indexes.md) + - [Vector Index](/vector-search-index.md) - [Constraints](/constraints.md) - [Generated Columns](/generated-columns.md) - [SQL Mode](/sql-mode.md) diff --git a/br/backup-and-restore-overview.md b/br/backup-and-restore-overview.md index 12b32ef9bd65a..516a0f04129d6 100644 --- a/br/backup-and-restore-overview.md +++ b/br/backup-and-restore-overview.md @@ -118,6 +118,7 @@ Backup and restore might go wrong when some TiDB features are enabled or disable | Global temporary tables | | Make sure that you are using v5.3.0 or a later version of BR to back 
up and restore data. Otherwise, an error occurs in the definition of the backed global temporary tables. | | TiDB Lightning Physical Import| | If the upstream database uses the physical import mode of TiDB Lightning, data cannot be backed up in log backup. It is recommended to perform a full backup after the data import. For more information, see [When the upstream database imports data using TiDB Lightning in the physical import mode, the log backup feature becomes unavailable. Why?](/faq/backup-and-restore-faq.md#when-the-upstream-database-imports-data-using-tidb-lightning-in-the-physical-import-mode-the-log-backup-feature-becomes-unavailable-why).| | TiCDC | | BR v8.2.0 and later: if the target cluster to be restored has a changefeed and the changefeed [CheckpointTS](/ticdc/ticdc-architecture.md#checkpointts) is earlier than the BackupTS, BR does not perform the restoration. BR versions before v8.2.0: if the target cluster to be restored has any active TiCDC changefeeds, BR does not perform the restoration. | +| Vector search | | Make sure that you are using v8.4.0 or a later version of BR to back up and restore data. Restoring tables with [vector data types](/vector-search-data-types.md) to TiDB clusters earlier than v8.4.0 is not supported. | ### Version compatibility diff --git a/dm/dm-overview.md b/dm/dm-overview.md index a11d048cd5a51..54dccba33861f 100644 --- a/dm/dm-overview.md +++ b/dm/dm-overview.md @@ -67,6 +67,10 @@ Before using the DM tool, note the following restrictions: - DM does not support the MySQL 8.0 new feature binlog [Transaction_payload_event](https://dev.mysql.com/doc/refman/8.0/en/binary-log-transaction-compression.html). Using binlog Transaction_payload_event might result in data inconsistency between upstream and downstream. ++ Vector data type replication + + - DM does not support migrating or replicating MySQL 9.0 vector data types to TiDB. + ## Contributing You are welcome to participate in the DM open sourcing project. 
Your contribution would be highly appreciated. For more details, see [CONTRIBUTING.md](https://github.com/pingcap/tiflow/blob/master/dm/CONTRIBUTING.md). diff --git a/ticdc/ticdc-compatibility.md b/ticdc/ticdc-compatibility.md index f5e37e75a20f7..c1278398ffbaa 100644 --- a/ticdc/ticdc-compatibility.md +++ b/ticdc/ticdc-compatibility.md @@ -64,3 +64,11 @@ The `sort-dir` configuration is used to specify the temporary file directory for Since v5.3.0, TiCDC supports [global temporary tables](/temporary-tables.md#global-temporary-tables). Replicating global temporary tables to the downstream using TiCDC of a version earlier than v5.3.0 causes table definition error. If the upstream cluster contains a global temporary table, the downstream TiDB cluster is expected to be v5.3.0 or a later version. Otherwise, an error occurs during the replication process. + +### Compatibility with vector data types + +Starting from v8.4.0, TiCDC supports replicating tables with [vector data types](/vector-search-data-types.md) to downstream (experimental). + +When the downstream is Kafka or a storage service (such as Amazon S3, GCS, Azure Blob Storage, or NFS), TiCDC converts vector data types into string types before writing to the downstream. + +When the downstream is a MySQL-compatible database that does not support vector data types, TiCDC fails to write DDL events involving vector types to the downstream. In this case, add the `has-vector-type=true` parameter to `sink-url`, which allows TiCDC to convert vector data types into the `LONGTEXT` type before writing. 
\ No newline at end of file diff --git a/tidb-cloud/data-service-manage-endpoint.md b/tidb-cloud/data-service-manage-endpoint.md index b0e154c8e38ea..0814155c75fc6 100644 --- a/tidb-cloud/data-service-manage-endpoint.md +++ b/tidb-cloud/data-service-manage-endpoint.md @@ -44,7 +44,7 @@ In TiDB Cloud Data Service, you can generate one or multiple endpoints automatic For each operation you select, TiDB Cloud Data Service will generate a corresponding endpoint. If you select a batch operation (such as `POST (Batch Create)`), the generated endpoint lets you operate on multiple rows in a single request. - If the table you selected contains [vector data types](/tidb-cloud/vector-search-data-types.md), you can enable the **Vector Search Operations** option and select a vector distance function to generate a vector search endpoint that automatically calculates vector distances based on your selected distance function. The supported [vector distance functions](/tidb-cloud/vector-search-functions-and-operators.md) include the following: + If the table you selected contains [vector data types](/vector-search-data-types.md), you can enable the **Vector Search Operations** option and select a vector distance function to generate a vector search endpoint that automatically calculates vector distances based on your selected distance function. The supported [vector distance functions](/vector-search-functions-and-operators.md) include the following: - `VEC_L2_DISTANCE` (default): calculates the L2 distance (Euclidean distance) between two vectors. - `VEC_COSINE_DISTANCE`: calculates the cosine distance between two vectors. 
diff --git a/tidb-cloud/tidb-cloud-release-notes.md b/tidb-cloud/tidb-cloud-release-notes.md index df891da2ff499..aba9c9fd1fe8b 100644 --- a/tidb-cloud/tidb-cloud-release-notes.md +++ b/tidb-cloud/tidb-cloud-release-notes.md @@ -76,7 +76,7 @@ This page lists the release notes of [TiDB Cloud](https://www.pingcap.com/tidb-c - [Data Service (beta)](https://tidbcloud.com/console/data-service) supports automatically generating vector search endpoints. - If your table contains [vector data types](/tidb-cloud/vector-search-data-types.md), you can automatically generate a vector search endpoint that calculates vector distances based on your selected distance function. + If your table contains [vector data types](/vector-search-data-types.md), you can automatically generate a vector search endpoint that calculates vector distances based on your selected distance function. This feature enables seamless integration with AI platforms such as [Dify](https://docs.dify.ai/guides/tools) and [GPTs](https://openai.com/blog/introducing-gpts), enhancing your applications with advanced natural language processing and AI capabilities for more complex tasks and intelligent solutions. @@ -122,12 +122,12 @@ This page lists the release notes of [TiDB Cloud](https://www.pingcap.com/tidb-c The vector search (beta) feature provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. This feature enables developers to easily build scalable applications with generative artificial intelligence (AI) capabilities using familiar MySQL skills. Key features include: - - [Vector data types](/tidb-cloud/vector-search-data-types.md), [vector index](/tidb-cloud/vector-search-index.md), and [vector functions and operators](/tidb-cloud/vector-search-functions-and-operators.md). 
- - Ecosystem integrations with [LangChain](/tidb-cloud/vector-search-integrate-with-langchain.md), [LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md), and [JinaAI](/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md). - - Programming language support for Python: [SQLAlchemy](/tidb-cloud/vector-search-integrate-with-sqlalchemy.md), [Peewee](/tidb-cloud/vector-search-integrate-with-peewee.md), and [Django ORM](/tidb-cloud/vector-search-integrate-with-django-orm.md). - - Sample applications and tutorials: perform semantic searches for documents using [Python](/tidb-cloud/vector-search-get-started-using-python.md) or [SQL](/tidb-cloud/vector-search-get-started-using-sql.md). + - [Vector data types](/vector-search-data-types.md), [vector index](/vector-search-index.md), and [vector functions and operators](/vector-search-functions-and-operators.md). + - Ecosystem integrations with [LangChain](/vector-search-integrate-with-langchain.md), [LlamaIndex](/vector-search-integrate-with-llamaindex.md), and [JinaAI](/vector-search-integrate-with-jinaai-embedding.md). + - Programming language support for Python: [SQLAlchemy](/vector-search-integrate-with-sqlalchemy.md), [Peewee](/vector-search-integrate-with-peewee.md), and [Django ORM](/vector-search-integrate-with-django-orm.md). + - Sample applications and tutorials: perform semantic searches for documents using [Python](/vector-search-get-started-using-python.md) or [SQL](/vector-search-get-started-using-sql.md). - For more information, see [Vector search (beta) overview](/tidb-cloud/vector-search-overview.md). + For more information, see [Vector search (beta) overview](/vector-search-overview.md). - [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) now offers weekly email reports for organization owners. 
diff --git a/tidb-cloud/vector-search-get-started-using-sql.md b/tidb-cloud/vector-search-get-started-using-sql.md deleted file mode 100644 index 91cd5a93bfc4f..0000000000000 --- a/tidb-cloud/vector-search-get-started-using-sql.md +++ /dev/null @@ -1,148 +0,0 @@ ---- -title: Get Started with Vector Search via SQL -summary: Learn how to quickly get started with Vector Search in TiDB Cloud using SQL statements and power the generative AI application. ---- - -# Get Started with Vector Search via SQL - -TiDB extends MySQL syntax to support [Vector Search](/tidb-cloud/vector-search-overview.md) and introduce new [Vector data types](/tidb-cloud/vector-search-data-types.md) and several [vector functions](/tidb-cloud/vector-search-functions-and-operators.md). - -This tutorial demonstrates how to get started with TiDB Vector Search just using SQL statements. You will learn how to use the [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) to: - -- Connect to your TiDB cluster. -- Create a vector table. -- Store vector embeddings. -- Perform vector search queries. - -> **Note** -> -> TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. - -## Prerequisites - -To complete this tutorial, you need: - -- [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) (MySQL CLI) installed on your machine. -- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. - -## Get started - -### Step 1. Connect to the TiDB cluster - -1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. - -2. Click **Connect** in the upper-right corner. A connection dialog is displayed. - -3. 
In the connection dialog, select **MySQL CLI** from the **Connect With** drop-down list and keep the default setting of the **Connection Type** as **Public**. - -4. If you have not set a password yet, click **Generate Password** to generate a random password. - -5. Copy the connection command and paste it into your terminal. The following is an example for macOS: - - ```bash - mysql -u '.root' -h '' -P 4000 -D 'test' --ssl-mode=VERIFY_IDENTITY --ssl-ca=/etc/ssl/cert.pem -p'' - ``` - -### Step 2. Create a vector table - -With vector search support, you can use the `VECTOR` type column to store [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) in TiDB. - -To create a table with a three-dimensional `VECTOR` column, execute the following SQL statements using your MySQL CLI: - -```sql -USE test; -CREATE TABLE embedded_documents ( - id INT PRIMARY KEY, - -- Column to store the original content of the document. - document TEXT, - -- Column to store the vector representation of the document. - embedding VECTOR(3) -); -``` - -The expected output is as follows: - -```text -Query OK, 0 rows affected (0.27 sec) -``` - -### Step 3. Store the vector embeddings - -Insert three documents with their [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) into the `embedded_documents` table: - -```sql -INSERT INTO embedded_documents -VALUES - (1, 'dog', '[1,2,1]'), - (2, 'fish', '[1,2,4]'), - (3, 'tree', '[1,0,0]'); -``` - -The expected output is as follows: - -``` -Query OK, 3 rows affected (0.15 sec) -Records: 3 Duplicates: 0 Warnings: 0 -``` - -> **Note** -> -> This example simplifies the dimensions of the vector embeddings and uses only 3-dimensional vectors for demonstration purposes. -> -> In real-world applications, [embedding models](/tidb-cloud/vector-search-overview.md#embedding-model) often produce vector embeddings with hundreds or thousands of dimensions. - -### Step 4. 
Query the vector table - -To verify that the documents have been inserted correctly, query the `embedded_documents` table: - -```sql -SELECT * FROM embedded_documents; -``` - -The expected output is as follows: - -```sql -+----+----------+-----------+ -| id | document | embedding | -+----+----------+-----------+ -| 1 | dog | [1,2,1] | -| 2 | fish | [1,2,4] | -| 3 | tree | [1,0,0] | -+----+----------+-----------+ -3 rows in set (0.15 sec) -``` - -### Step 5. Perform a vector search query - -Similar to full-text search, users provide search terms to the application when using vector search. - -In this example, the search term is "a swimming animal", and its corresponding vector embedding is `[1,2,3]`. In practical applications, you need to use an embedding model to convert the user's search term into a vector embedding. - -Execute the following SQL statement and TiDB will identify the top three documents closest to the search term by calculating and sorting the cosine distances (`vec_cosine_distance`) between the vector embeddings. - -```sql -SELECT id, document, vec_cosine_distance(embedding, '[1,2,3]') AS distance -FROM embedded_documents -ORDER BY distance -LIMIT 3; -``` - -The expected output is as follows: - -```plain -+----+----------+---------------------+ -| id | document | distance | -+----+----------+---------------------+ -| 2 | fish | 0.00853986601633272 | -| 1 | dog | 0.12712843905603044 | -| 3 | tree | 0.7327387580875756 | -+----+----------+---------------------+ -3 rows in set (0.15 sec) -``` - -From the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. 
- -## See also - -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/tidb-cloud/vector-search-improve-performance.md b/tidb-cloud/vector-search-improve-performance.md deleted file mode 100644 index 651bc94251370..0000000000000 --- a/tidb-cloud/vector-search-improve-performance.md +++ /dev/null @@ -1,36 +0,0 @@ ---- -title: Improve Vector Search Performance -summary: Learn best practices for improving the performance of TiDB Vector Search. ---- - -# Improve Vector Search Performance - -TiDB Vector Search allows you to perform ANN queries that search for results similar to an image, document and so on. To improve the query performance, review the following best practices. - -## Add vector search index for vector columns - -The [vector search index](/tidb-cloud/vector-search-index.md) dramatically improves the performance of vector search queries, usually by 10x or more, with a trade-off of only a small decrease of recall rate. - -## Ensure vector indexes are fully built - -Vector indexes are built asynchronously. Until all vector data is indexed, vector search performance is suboptimal. To check the index build progress, see [View index build progress](/tidb-cloud/vector-search-index.md#view-index-build-progress). - -## Reduce vector dimensions or shorten embeddings - -The computational complexity of vector search indexing and queries increases significantly as the size of vectors grows, necessitating more floating point comparisons. - -To optimize performance, consider reducing the vector dimensions whenever feasible. This usually needs switching to another embedding model. Make sure to measure the impact of changing embedding models on the accuracy of your vector queries. 
- -Certain embedding models like OpenAI `text-embedding-3-large` support [shortening embeddings](https://openai.com/index/new-embedding-models-and-api-updates/), which removes some numbers from the end of vector sequences without losing the embedding's concept-representing properties. You can also use such an embedding model to reduce the vector dimensions. - -## Exclude vector columns from the results - -Vector embedding data are usually large and only used during the search process. By excluding vector columns from the query results, you can greatly reduce the amount of data transferred between the TiDB server and your SQL client, thereby improving query performance. - -To exclude vector columns, explicitly list the columns you want to retrieve in the `SELECT` clause, instead of using `SELECT *`. - -## Warm up the index - -When an index is cold accessed, it takes time to load the whole index from S3, or load from disk (instead of from memory). Such processes usually result in high tail latency. Additionally, if no SQL queries exist on a cluster for a long time (e.g. hours), the compute resource is reclaimed and will result in cold access next time. - -To avoid such tail latencies, warm up your index before actual workload by using similar vector search queries that hit the vector index. diff --git a/tidb-cloud/vector-search-index.md b/tidb-cloud/vector-search-index.md deleted file mode 100644 index e834a6c8ee797..0000000000000 --- a/tidb-cloud/vector-search-index.md +++ /dev/null @@ -1,274 +0,0 @@ ---- -title: Vector Search Index -summary: Learn how to build and use the vector search index to accelerate K-Nearest neighbors (KNN) queries in TiDB. ---- - -# Vector Search Index - -K-nearest neighbors (KNN) search is the problem of finding the K closest points for a given point in a vector space. 
The most straightforward approach to solving this problem is a brute force search, where the distance between all points in the vector space and the reference point is computed. This method guarantees perfect accuracy, but it is usually too slow for practical applications. Thus, nearest neighbors search problems are often solved with approximate algorithms. - -In TiDB, you can create and utilize vector search indexes for such approximate nearest neighbor (ANN) searches over columns with [vector data types](/tidb-cloud/vector-search-data-types.md). By using vector search indexes, vector search queries could be finished in milliseconds. - -TiDB currently supports the following vector search index algorithms: - -- HNSW - -> **Note:** -> -> Vector search index is only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. - -## Create the HNSW vector index - -[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy (> 98% in typical cases). - -To create an HNSW vector index, specify the index definition in the comment of a column with a [vector data type](/tidb-cloud/vector-search-data-types.md) when creating the table: - -```sql -CREATE TABLE vector_table_with_index ( - id INT PRIMARY KEY, doc TEXT, - embedding VECTOR(3) COMMENT "hnsw(distance=cosine)" -); -``` - -> **Note:** -> -> The syntax to create a vector index might change in future releases. - -You must specify the distance metric via the `distance=` configuration when creating the vector index: - -- Cosine Distance: `COMMENT "hnsw(distance=cosine)"` -- L2 Distance: `COMMENT "hnsw(distance=l2)"` - -The vector index can only be created for fixed-dimensional vector columns like `VECTOR(3)`. 
It cannot be created for mixed-dimensional vector columns like `VECTOR` because vector distances can only be calculated between vectors with the same dimensions. - -If you are using programming language SDKs or ORMs, refer to the following documentation for creating vector indexes: - -- Python: [TiDB Vector SDK for Python](https://github.com/pingcap/tidb-vector-python) -- Python: [SQLAlchemy](/tidb-cloud/vector-search-integrate-with-sqlalchemy.md) -- Python: [Peewee](/tidb-cloud/vector-search-integrate-with-peewee.md) -- Python: [Django](/tidb-cloud/vector-search-integrate-with-django-orm.md) - -Be aware of the following limitations when creating the vector index. These limitations might be removed in future releases: - -- L1 distance and inner product are not supported for the vector index yet. - -- You can only define and create a vector index when the table is created. You cannot create the vector index on demand using DDL statements after the table is created. You cannot drop the vector index using DDL statements as well. - -## Use the vector index - -The vector search index can be used in K-nearest neighbor search queries by using the `ORDER BY ... LIMIT` form like below: - -```sql -SELECT * -FROM vector_table_with_index -ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]') -LIMIT 10 -``` - -You must use the same distance metric as you have defined when creating the vector index if you want to utilize the index in vector search. - -## Use the vector index with filters - -Queries that contain a pre-filter (using the `WHERE` clause) cannot utilize the vector index because they are not querying for K-Nearest neighborss according to the SQL semantics. 
For example: - -```sql --- Filter is performed before kNN, so Vector Index cannot be used: - -SELECT * FROM vec_table -WHERE category = "document" -ORDER BY Vec_Cosine_distance(embedding, '[1, 2, 3]') -LIMIT 5; -``` - -Several workarounds are as follows: - -**Post-Filter after Vector Search:** Query for the K-Nearest neighbors first, then filter out unwanted results: - -```sql --- The filter is performed after kNN for these queries, so Vector Index can be used: - -SELECT * FROM -( - SELECT * FROM vec_table - ORDER BY Vec_Cosine_distance(embedding, '[1, 2, 3]') - LIMIT 5 -) t -WHERE category = "document"; - --- Note that this query may return less than 5 results if some are filtered out. -``` - -**Use Table Partitioning**: Queries within the [table partition](/partitioned-table.md) can fully utilize the vector index. This can be useful if you want to perform equality filters, as equality filters can be turned into accessing specified partitions. - -Example: Suppose you want to find the closest documentation for a specific product version. - -```sql --- Filter is performed before kNN, so Vector Index cannot be used: -SELECT * FROM docs -WHERE ver = "v2.0" -ORDER BY Vec_Cosine_distance(embedding, '[1, 2, 3]') -LIMIT 5; -``` - -Instead of writing a query using the `WHERE` clause, you can partition the table and then query within the partition using the [`PARTITION` keyword](/partitioned-table.md#partition-selection): - -```sql -CREATE TABLE docs ( - id INT, - ver VARCHAR(10), - doc TEXT, - embedding VECTOR(3) COMMENT "hnsw(distance=cosine)" -) PARTITION BY LIST COLUMNS (ver) ( - PARTITION p_v1_0 VALUES IN ('v1.0'), - PARTITION p_v1_1 VALUES IN ('v1.1'), - PARTITION p_v1_2 VALUES IN ('v1.2'), - PARTITION p_v2_0 VALUES IN ('v2.0') -); - -SELECT * FROM docs -PARTITION (p_v2_0) -ORDER BY Vec_Cosine_distance(embedding, '[1, 2, 3]') -LIMIT 5; -``` - -See [Table Partitioning](/partitioned-table.md) for more information. 
- -## View index build progress - -Unlike other indexes, vector indexes are built asynchronously. Therefore, vector indexes might not be immediately available after bulk data insertion. This does not affect data correctness or consistency, and you can perform vector searches at any time and get complete results. However, performance will be suboptimal until vector indexes are fully built. - -To view the index build progress, you can query the `INFORMATION_SCHEMA.TIFLASH_INDEXES` table as follows: - -```sql -SELECT * FROM INFORMATION_SCHEMA.TIFLASH_INDEXES; -+---------------+------------+----------------+----------+--------------------+-------------+-----------+------------+---------------------+-------------------------+--------------------+------------------------+------------------+ -| TIDB_DATABASE | TIDB_TABLE | TIDB_PARTITION | TABLE_ID | BELONGING_TABLE_ID | COLUMN_NAME | COLUMN_ID | INDEX_KIND | ROWS_STABLE_INDEXED | ROWS_STABLE_NOT_INDEXED | ROWS_DELTA_INDEXED | ROWS_DELTA_NOT_INDEXED | TIFLASH_INSTANCE | -+---------------+------------+----------------+----------+--------------------+-------------+-----------+------------+---------------------+-------------------------+--------------------+------------------------+------------------+ -| test | sample | NULL | 106 | -1 | vec | 2 | HNSW | 0 | 13000 | 0 | 2000 | store-6ba728d2 | -| test | sample | NULL | 106 | -1 | vec | 2 | HNSW | 10500 | 0 | 0 | 4500 | store-7000164f | -+---------------+------------+----------------+----------+--------------------+-------------+-----------+------------+---------------------+-------------------------+--------------------+------------------------+------------------+ -``` - -- The `ROWS_STABLE_INDEXED` and `ROWS_STABLE_NOT_INDEXED` columns show the index build progress. When `ROWS_STABLE_NOT_INDEXED` becomes 0, the index build is complete. - - As a reference, indexing a 500 MiB vector dataset might take up to 20 minutes. The indexer can run in parallel for multiple tables. 
Currently, adjusting the indexer priority or speed is not supported. - -- The `ROWS_DELTA_NOT_INDEXED` column shows the number of rows in the Delta layer. The Delta layer stores _recently_ inserted or updated rows and is periodically merged into the Stable layer according to the write workload. This merge process is called Compaction. - - The Delta layer is always not indexed. To achieve optimal performance, you can force the merge of the Delta layer into the Stable layer so that all data can be indexed: - - ```sql - ALTER TABLE COMPACT; - ``` - - For more information, see [`ALTER TABLE ... COMPACT`](/sql-statements/sql-statement-alter-table-compact.md). - -## Check whether the vector index is used - -Use the [`EXPLAIN`](/sql-statements/sql-statement-explain.md) or [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement to check whether this query is using the vector index. When `annIndex:` is presented in the `operator info` column for the `TableFullScan` executor, it means this table scan is utilizing the vector index. - -**Example: the vector index is used** - -```sql -[tidb]> EXPLAIN SELECT * FROM vector_table_with_index -ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]') -LIMIT 10; -+-----+-------------------------------------------------------------------------------------+ -| ... | operator info | -+-----+-------------------------------------------------------------------------------------+ -| ... | ... | -| ... | Column#5, offset:0, count:10 | -| ... | ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#5 | -| ... | MppVersion: 1, data:ExchangeSender_16 | -| ... | ExchangeType: PassThrough | -| ... | ... | -| ... | Column#4, offset:0, count:10 | -| ... | ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#4 | -| ... | annIndex:COSINE(test.vector_table_with_index.embedding..[1,2,3], limit:10), ... 
| -+-----+-------------------------------------------------------------------------------------+ -9 rows in set (0.01 sec) -``` - -**Example: The vector index is not used because of not specifying a Top K** - -```sql -[tidb]> EXPLAIN SELECT * FROM vector_table_with_index - -> ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]'); -+--------------------------------+-----+--------------------------------------------------+ -| id | ... | operator info | -+--------------------------------+-----+--------------------------------------------------+ -| Projection_15 | ... | ... | -| └─Sort_4 | ... | Column#4 | -| └─Projection_16 | ... | ..., vec_cosine_distance(..., [1,2,3])->Column#4 | -| └─TableReader_14 | ... | MppVersion: 1, data:ExchangeSender_13 | -| └─ExchangeSender_13 | ... | ExchangeType: PassThrough | -| └─TableFullScan_12 | ... | keep order:false, stats:pseudo | -+--------------------------------+-----+--------------------------------------------------+ -6 rows in set, 1 warning (0.01 sec) -``` - -When the vector index cannot be used, a warning occurs in some cases to help you learn the cause: - -```sql --- Using a wrong distance metric: -[tidb]> EXPLAIN SELECT * FROM vector_table_with_index -ORDER BY Vec_l2_Distance(embedding, '[1, 2, 3]') -LIMIT 10; - -[tidb]> SHOW WARNINGS; -ANN index not used: not ordering by COSINE distance - --- Using a wrong order: -[tidb]> EXPLAIN SELECT * FROM vector_table_with_index -ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]') DESC -LIMIT 10; - -[tidb]> SHOW WARNINGS; -ANN index not used: index can be used only when ordering by vec_cosine_distance() in ASC order -``` - -## Analyze vector search performance - -The [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement contains detailed information about how the vector index is used in the `execution info` column: - -```sql -[tidb]> EXPLAIN ANALYZE SELECT * FROM vector_table_with_index -ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]') -LIMIT 10; 
-+-----+--------------------------------------------------------+-----+ -| | execution info | | -+-----+--------------------------------------------------------+-----+ -| ... | time:339.1ms, loops:2, RU:0.000000, Concurrency:OFF | ... | -| ... | time:339ms, loops:2 | ... | -| ... | time:339ms, loops:3, Concurrency:OFF | ... | -| ... | time:339ms, loops:3, cop_task: {...} | ... | -| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | -| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | -| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | -| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | -| ... | tiflash_task:{...}, vector_idx:{ | ... | -| | load:{total:68ms,from_s3:1,from_disk:0,from_cache:0},| | -| | search:{total:0ms,visited_nodes:2,discarded_nodes:0},| | -| | read:{vec_total:0ms,others_total:0ms}},...} | | -+-----+--------------------------------------------------------+-----+ -``` - -> **Note:** -> -> The execution information is internal. Fields and formats are subject to change without any notification. Do not rely on them. - -Explanation of some important fields: - -- `vector_index.load.total`: The total duration of loading index. This field could be larger than actual query time because multiple vector indexes may be loaded in parallel. -- `vector_index.load.from_s3`: Number of indexes loaded from S3. -- `vector_index.load.from_disk`: Number of indexes loaded from disk. The index was already downloaded from S3 previously. -- `vector_index.load.from_cache`: Number of indexes loaded from cache. The index was already downloaded from S3 previously. -- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there is heavy IO when searching through the index. This field could be larger than actual query time because multiple vector indexes may be searched in parallel. 
-- `vector_index.search.discarded_nodes`: Number of vector rows visited but discarded during the search. These discarded vectors are not considered in the search result. Large values usually indicate that there are many stale rows caused by UPDATE or DELETE statements. - -See [`EXPLAIN`](/sql-statements/sql-statement-explain.md), [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md), and [EXPLAIN Walkthrough](/explain-walkthrough.md) for interpreting the output. - -## See also - -- [Improve Vector Search Performance](/tidb-cloud/vector-search-improve-performance.md) -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) diff --git a/tidb-cloud/vector-search-limitations.md b/tidb-cloud/vector-search-limitations.md deleted file mode 100644 index 077c89348ce29..0000000000000 --- a/tidb-cloud/vector-search-limitations.md +++ /dev/null @@ -1,23 +0,0 @@ ---- -title: Vector Search Limitations -summary: Learn the limitations of the TiDB Vector Search. ---- - -# Vector Search Limitations - -This document describes the known limitations of TiDB Vector Search. We are continuously working to enhance your experience by adding more features. - -- TiDB Vector Search is only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. It is not available for TiDB Cloud Dedicated or TiDB Self-Managed. - -- Each [vector](/tidb-cloud/vector-search-data-types.md) supports up to 16,000 dimensions. - -- Vector data supports only single-precision floating-point numbers (Float32). - -- Only cosine distance and L2 distance are supported when you create a [vector search index](/tidb-cloud/vector-search-index.md). 
- -## Feedback - -We value your feedback and are always here to help: - -- [Join our Discord](https://discord.gg/zcqexutz2R) -- [Visit our Support Portal](https://tidb.support.pingcap.com/) diff --git a/tidb-cloud/vector-search-overview.md b/tidb-cloud/vector-search-overview.md deleted file mode 100644 index 3cf2655388b12..0000000000000 --- a/tidb-cloud/vector-search-overview.md +++ /dev/null @@ -1,65 +0,0 @@ ---- -title: Vector Search (Beta) Overview -summary: Learn about Vector Search in TiDB Cloud. This feature provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. ---- - -# Vector Search (Beta) Overview - -TiDB Vector Search (beta) provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. This feature enables developers to easily build scalable applications with generative artificial intelligence (AI) capabilities using familiar MySQL skills. - -> **Note** -> -> TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. - -## Concepts - -Vector search is a search method that prioritizes the meaning of your data to deliver relevant results. This differs from traditional full-text search, which relies primarily on exact keyword matches and word frequency. - -For example, a full-text search for "a swimming animal" only returns results with those exact keywords. In contrast, vector search can return results for other swimming animals, such as fish or ducks, even if the exact keywords are not present. - -### Vector embedding - -A vector embedding, also known as an embedding, is a sequence of numbers that represents real-world objects in a high-dimensional space. It captures the meaning and context of unstructured data, such as documents, images, audio, and videos. 
- -Vector embeddings are essential in machine learning and serve as the foundation for semantic similarity searches. - -TiDB introduces [Vector data types](/tidb-cloud/vector-search-data-types.md) designed to optimize the storage and retrieval of vector embeddings, enhancing their use in AI applications. You can store vector embeddings in TiDB and perform vector search queries to find the most relevant data using these data types. - -### Embedding model - -Embedding models are algorithms that transform data into [vector embeddings](#vector-embedding). - -Selecting an appropriate embedding model is crucial for ensuring the accuracy and relevance of semantic search results. For unstructured text data, you can find top-performing text embedding models on the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). - -To learn how to generate vector embeddings for your specific data types, refer to the embedding provider integration tutorials or examples. - -## How vector search works - -After converting raw data into vector embeddings and storing them in TiDB, your application can execute vector search queries to find the data most semantically or contextually relevant to a user's query. - -Vector Search in TiDB Cloud identifies the top-k nearest neighbor (KNN) vectors by using a [distance function](/tidb-cloud/vector-search-functions-and-operators.md) to calculate the distance between the given vector and vectors stored in the database. The vectors closest to the query represent the most similar data in meaning. - -![The Schematic TiDB Vector Search](/media/vector-search/embedding-search.png) - -As a relational database with integrated vector search capabilities, TiDB enables you to store data and their corresponding vector embeddings together in one database. You can store them in the same table using different columns, or separate them into different tables and combine them using `JOIN` queries when retrieving. 
- -## Use cases - -### Retrieval-Augmented Generation (RAG) - -Retrieval-Augmented Generation (RAG) is an architecture designed to optimize the output of Large Language Models (LLMs). By using vector search, RAG applications can store vector embeddings in the database and retrieve relevant documents as additional context when the LLM generates responses, thereby improving the quality and relevance of the answers. - -### Semantic search - -Semantic search is a search technology that returns results based on the meaning of a query, rather than simply matching keywords. It interprets the meaning across different languages and various types of data (such as text, images, and audio) using embeddings. Vector search algorithms then use these embeddings to find the most relevant data that satisfies the user's query. - -### Recommendation engine - -A recommendation engine is a system that proactively suggests content, products, or services that are relevant and personalized to users. It accomplishes this by creating embeddings that represent user behavior and preferences. These embeddings help the system identify similar items that other users have interacted with or shown interest in. This increases the likelihood that the recommendations will be both relevant and appealing to the user. 
- -## See also - -To get started with TiDB Vector Search, see the following documents: - -- [Get started with vector search using Python](/tidb-cloud/vector-search-get-started-using-python.md) -- [Get started with vector search using SQL](/tidb-cloud/vector-search-get-started-using-sql.md) diff --git a/tiflash-upgrade-guide.md b/tiflash-upgrade-guide.md index f36051c145166..8492418624f7e 100644 --- a/tiflash-upgrade-guide.md +++ b/tiflash-upgrade-guide.md @@ -122,8 +122,12 @@ After upgrading TiFlash to v7.3 and configuring TiFlash to use V3 DTFiles, if yo ## From v6.x or v7.x to v7.4 or a later version -Starting from v7.4, to reduce the read and write amplification generated during data compaction, TiFlash optimizes the data compaction logic of PageStorage V3, which leads to changes to some of the underlying storage file names. Therefore, after the upgrade to v7.4 or a later version, in-place downgrading to the original version is not supported. +Starting from v7.4, to reduce the read and write amplification generated during data compaction, TiFlash optimizes the data compaction logic of PageStorage V3, which leads to changes to some of the underlying storage file names. Therefore, after TiFlash is upgraded to v7.4 or a later version, in-place downgrading to the original version is not supported. + +## From v7.x to v8.4 or a later version + +Starting from v8.4, the underlying storage format of TiFlash is updated to support [vector search](/vector-search-overview.md). Therefore, after TiFlash is upgraded to v8.4 or a later version, in-place downgrading to the original version is not supported. **Workaround for downgrading TiFlash in testing or other special scenarios** -To downgrade TiFlash in testing or other special scenarios, you can forcibly scale in the target TiFlash node and then replicate data from TiKV again. For detailed steps, see [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster). 
\ No newline at end of file +To downgrade TiFlash in testing or other special scenarios, you can forcibly scale in the target TiFlash node and then replicate data from TiKV again. For detailed steps, see [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster). diff --git a/tiflash/tiflash-configuration.md b/tiflash/tiflash-configuration.md index f5f4278f271c5..73f29a300cc5f 100644 --- a/tiflash/tiflash-configuration.md +++ b/tiflash/tiflash-configuration.md @@ -77,8 +77,11 @@ delta_index_cache_size = 0 ## * format_version = 2, the default format for versions < v6.0.0. ## * format_version = 3, the default format for v6.0.0 and v6.1.x, which provides more data validation features. ## * format_version = 4, the default format for versions from v6.2.0 to v7.3.0, which reduces write amplification and background task resource consumption - ## * format_version = 5, the default format for v7.4.0 and later versions (introduced in v7.3.0), which reduces the number of physical files by merging smaller files. + ## * format_version = 5, introduced in v7.3.0, the default format for versions from v7.4.0 to v8.3.0, which reduces the number of physical files by merging smaller files. # format_version = 5 + ## * format_version = 6, introduced in v8.4.0, which partially supports the building and storage of vector indexes. + ## * format_version = 7, introduced in v8.4.0, the default format for v8.4.0 and later versions, which supports the build and storage of vector indexes + # format_version = 7 [storage.main] ## The list of directories to store the main data.
More than 90% of the total data is stored in diff --git a/tidb-cloud/vector-search-data-types.md b/vector-search-data-types.md similarity index 54% rename from tidb-cloud/vector-search-data-types.md rename to vector-search-data-types.md index 626533cc84c87..62031f506629d 100644 --- a/tidb-cloud/vector-search-data-types.md +++ b/vector-search-data-types.md @@ -5,26 +5,34 @@ summary: Learn about the Vector data types in TiDB. # Vector Data Types -TiDB provides Vector data type specifically optimized for AI Vector Embedding use cases. By using the Vector data type, you can store and query a sequence of floating numbers efficiently, such as `[0.3, 0.5, -0.1, ...]`. +A vector is a sequence of floating-point numbers, such as `[0.3, 0.5, -0.1, ...]`. TiDB offers Vector data types, specifically optimized for efficiently storing and querying vector embeddings widely used in AI applications. -The following Vector data type is currently available: + -- `VECTOR`: A sequence of single-precision floating numbers. The dimensions can be different for each row. -- `VECTOR(D)`: A sequence of single-precision floating numbers with a fixed dimension `D`. - -The Vector data type provides these advantages over storing in a `JSON` column: +> **Warning:** +> +> This feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. -- Vector Index support. A [Vector Search Index](/tidb-cloud/vector-search-index.md) can be built to speed up vector searching. -- Dimension enforcement. A dimension can be specified to forbid inserting vectors with different dimensions. -- Optimized storage format. Vector data types are stored even more space-efficient than `JSON` data type. 
+ > **Note:** > -> Vector data types are only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> Vector data types are only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +The following Vector data types are currently available: + +- `VECTOR`: A sequence of single-precision floating-point numbers with any dimension. +- `VECTOR(D)`: A sequence of single-precision floating-point numbers with a fixed dimension `D`. + +Using vector data types provides the following advantages over using the [`JSON`](/data-type-json.md) type: + +- Vector index support: You can build a [vector search index](/vector-search-index.md) to speed up vector searching. +- Dimension enforcement: You can specify a dimension to forbid inserting vectors with different dimensions. +- Optimized storage format: Vector data types are optimized for handling vector data, offering better space efficiency and performance compared to `JSON` types. -## Value syntax +## Syntax -A Vector value contains an arbitrary number of floating numbers. 
You can use a string in the following syntax to represent a Vector value: +You can use a string in the following syntax to represent a Vector value: ```sql '[<float>, <float>, ...]' @@ -50,18 +58,18 @@ Inserting vector values with invalid syntax will result in an error: ERROR 1105 (HY000): Invalid vector text: [5, ] ``` -As dimension 3 is enforced for the `embedding` column in the preceding example, inserting a vector with a different dimension will result in an error: +In the following example, because dimension `3` is enforced for the `embedding` column when the table is created, inserting a vector with a different dimension will result in an error: ```sql [tidb]> INSERT INTO vector_table VALUES (4, '[0.3, 0.5]'); ERROR 1105 (HY000): vector has 2 dimensions, does not fit VECTOR(3) ``` -See [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md) for available functions and operators over the Vector data type. +For available functions and operators over the vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md). -See [Vector Search Index](/tidb-cloud/vector-search-index.md) for building and using a vector search index. +For more information about building and using a vector search index, see [Vector Search Index](/vector-search-index.md). -## Vectors with different dimensions +## Store vectors with different dimensions You can store vectors with different dimensions in the same column by omitting the dimension parameter in the `VECTOR` type: @@ -75,33 +83,28 @@ INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); -- 3 dimensions vector, INSERT INTO vector_table VALUES (2, '[0.3, 0.5]'); -- 2 dimensions vector, OK ``` -However you cannot build a [Vector Search Index](/tidb-cloud/vector-search-index.md) for this column, as vector distances can be only calculated between vectors with the same dimensions.
+However, note that you cannot build a [vector search index](/vector-search-index.md) for this column, as vector distances can be only calculated between vectors with the same dimensions. ## Comparison -You can compare vector data types using [comparison operators](/functions-and-operators/operators.md) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For a complete list of comparison operators and functions for vector data types, see [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md). +You can compare vector data types using [comparison operators](/functions-and-operators/operators.md) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For a complete list of comparison operators and functions for vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md). -Vector data types are compared element-wise numerically. Examples: +Vector data types are compared element-wise numerically. For example: - `[1] < [12]` - `[1,2,3] < [1,2,5]` - `[1,2,3] = [1,2,3]` - `[2,2,3] > [1,2,3]` -Vectors with different dimensions are compared using lexicographical comparison, with the following properties: +Two vectors with different dimensions are compared using lexicographical comparison, with the following rules: -- Two vectors are compared element by element, and each element is compared numerically. +- Two vectors are compared element by element from the start, and each element is compared numerically. - The first mismatching element determines which vector is lexicographically _less_ or _greater_ than the other. -- If one vector is a prefix of another, the shorter vector is lexicographically _less_ than the other. +- If one vector is a prefix of another, the shorter vector is lexicographically _less_ than the other. For example, `[1,2,3] < [1,2,3,0]`. - Vectors of the same length with identical elements are lexicographically _equal_. -- An empty vector is lexicographically _less_ than any non-empty vector. 
+- An empty vector is lexicographically _less_ than any non-empty vector. For example, `[] < [1]`. - Two empty vectors are lexicographically _equal_. -Examples: - -- `[] < [1]` -- `[1,2,3] < [1,2,3,0]` - When comparing vector constants, consider performing an [explicit cast](#cast) from string to vector to avoid comparisons based on string values: ```sql @@ -126,7 +129,7 @@ When comparing vector constants, consider performing an [explicit cast](#cast) f ## Arithmetic -Vector data types support element-wise arithmetic operations `+` (addition) and `-` (subtraction). However, performing arithmetic operations between vectors with different dimensions results in an error. +Vector data types support arithmetic operations `+` (addition) and `-` (subtraction). However, arithmetic operations between vectors with different dimensions are not supported and will result in an error. Examples: @@ -139,7 +142,7 @@ Examples: +---------------------------------------------+ 1 row in set (0.01 sec) -mysql> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]'); +[tidb]> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]'); +-----------------------------------------------------+ | VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]') | +-----------------------------------------------------+ @@ -162,10 +165,10 @@ To cast between Vector and String, use the following functions: - `VEC_FROM_TEXT`: String ⇒ Vector - `VEC_AS_TEXT`: Vector ⇒ String -There are implicit casts when calling functions receiving vector data types: +To improve usability, if you call a function that only supports vector data types, such as a vector correlation distance function, you can also just pass in a format-compliant string. TiDB automatically performs an implicit cast in this case. ```sql --- There is an implicit cast here, since VEC_DIMS only accepts VECTOR arguments: +-- The VEC_DIMS function only accepts VECTOR arguments, so you can directly pass in a string for an implicit cast. 
[tidb]> SELECT VEC_DIMS('[0.3, 0.5, -0.1]'); +------------------------------+ | VEC_DIMS('[0.3, 0.5, -0.1]') | @@ -174,7 +177,7 @@ There are implicit casts when calling functions receiving vector data types: +------------------------------+ 1 row in set (0.01 sec) --- Cast explicitly using VEC_FROM_TEXT: +-- You can also explicitly cast a string to a vector using VEC_FROM_TEXT and then pass the vector to the VEC_DIMS function. [tidb]> SELECT VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')); +---------------------------------------------+ | VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) | @@ -183,7 +186,7 @@ There are implicit casts when calling functions receiving vector data types: +---------------------------------------------+ 1 row in set (0.01 sec) --- Cast explicitly using CAST(... AS VECTOR): +-- You can also cast explicitly using CAST(... AS VECTOR): [tidb]> SELECT VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)); +----------------------------------------------+ | VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)) | @@ -193,7 +196,7 @@ There are implicit casts when calling functions receiving vector data types: 1 row in set (0.01 sec) ``` -Use explicit casts when operators or functions accept multiple data types. For example, in comparisons, use explicit casts to compare vector numeric values instead of string values: +When using an operator or function that accepts multiple data types, you need to explicitly cast the string type to the vector type before passing the string to that operator or function, because TiDB does not perform implicit casts in this case. For example, before performing comparison operations, you need to explicitly cast strings to vectors; otherwise, TiDB compares them as string values rather than as vector numeric values: ```sql -- Because string is given, TiDB is comparing strings: @@ -215,10 +218,10 @@ Use explicit casts when operators or functions accept multiple data types. 
For e 1 row in set (0.01 sec) ``` -To cast vector into its string representation explicitly, use the `VEC_AS_TEXT()` function: +You can also explicitly cast a vector to its string representation. Take using the `VEC_AS_TEXT()` function as an example: ```sql --- String representation is normalized: +-- The string is first implicitly cast to a vector, and then the vector is explicitly cast to a string, thus returning a string in the normalized format: [tidb]> SELECT VEC_AS_TEXT('[0.3, 0.5, -0.1]'); +--------------------------------------+ | VEC_AS_TEXT('[0.3, 0.5, -0.1]') | @@ -228,19 +231,17 @@ To cast vector into its string representation explicitly, use the `VEC_AS_TEXT() 1 row in set (0.01 sec) ``` -For additional cast functions, see [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md). +For additional cast functions, see [Vector Functions and Operators](/vector-search-functions-and-operators.md). ### Cast between Vector ⇔ other data types -It is currently not possible to cast between Vector and other data types (like `JSON`) directly. You need to use String as an intermediate type. +Currently, direct casting between Vector and other data types (such as `JSON`) is not supported. To work around this limitation, use String as an intermediate data type for casting in your SQL statement. -## Restrictions +Note that vector data type columns stored in a table cannot be converted to other data types using `ALTER TABLE ... MODIFY COLUMN ...`. -- The maximum supported Vector dimension is 16000. -- You cannot store `NaN`, `Infinity`, or `-Infinity` values in the vector data type. -- Currently, Vector data types cannot store double-precision floating point numbers. This is planned to be supported in a future release. In the meantime, if you import double-precision floating point numbers for Vector data types, they are converted to single-precision numbers. 
+## Restrictions -For other limitations, see [Vector Search Limitations](/tidb-cloud/vector-search-limitations.md). +For restrictions on vector data types, see [Vector search limitations](/vector-search-limitations.md) and [Vector index restrictions](/vector-search-index.md#restrictions). ## MySQL compatibility @@ -248,6 +249,6 @@ Vector data types are TiDB specific, and are not supported in MySQL. ## See also -- [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) -- [Improve Vector Search Performance](/tidb-cloud/vector-search-improve-performance.md) +- [Vector Functions and Operators](/vector-search-functions-and-operators.md) +- [Vector Search Index](/vector-search-index.md) +- [Improve Vector Search Performance](/vector-search-improve-performance.md) \ No newline at end of file diff --git a/tidb-cloud/vector-search-functions-and-operators.md b/vector-search-functions-and-operators.md similarity index 68% rename from tidb-cloud/vector-search-functions-and-operators.md rename to vector-search-functions-and-operators.md index 9966898ae380b..f6ed6449e9567 100644 --- a/tidb-cloud/vector-search-functions-and-operators.md +++ b/vector-search-functions-and-operators.md @@ -1,39 +1,49 @@ --- title: Vector Functions and Operators -summary: Learn about functions and operators available for Vector Data Types. +summary: Learn about functions and operators available for Vector data types. --- # Vector Functions and Operators +This document lists the functions and operators available for Vector data types. + + + +> **Warning:** +> +> This feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. 
+ + + > **Note:** > -> Vector data types and these vector functions are only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> Vector data types and these vector functions are only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. ## Vector functions -The following functions are designed specifically for [Vector Data Types](/tidb-cloud/vector-search-data-types.md). +The following functions are designed specifically for [Vector data types](/vector-search-data-types.md). -**Vector Distance Functions:** +**Vector distance functions:** | Function Name | Description | | --------------------------------------------------------- | ---------------------------------------------------------------- | -| [VEC_L2_DISTANCE](#vec_l2_distance) | Calculates L2 distance (Euclidean distance) between two vectors | -| [VEC_COSINE_DISTANCE](#vec_cosine_distance) | Calculates the cosine distance between two vectors | -| [VEC_NEGATIVE_INNER_PRODUCT](#vec_negative_inner_product) | Calculates the negative of the inner product between two vectors | -| [VEC_L1_DISTANCE](#vec_l1_distance) | Calculates L1 distance (Manhattan distance) between two vectors | +| [`VEC_L2_DISTANCE`](#vec_l2_distance) | Calculates L2 distance (Euclidean distance) between two vectors | +| [`VEC_COSINE_DISTANCE`](#vec_cosine_distance) | Calculates the cosine distance between two vectors | +| [`VEC_NEGATIVE_INNER_PRODUCT`](#vec_negative_inner_product) | Calculates the negative of the inner product between two vectors | +| [`VEC_L1_DISTANCE`](#vec_l1_distance) | Calculates L1 distance (Manhattan distance) between two vectors | -**Other Vector Functions:** +**Other vector functions:** | Function Name | Description | | ------------------------------- | --------------------------------------------------- | -| [VEC_DIMS](#vec_dims) | Returns the dimension of a 
vector | -| [VEC_L2_NORM](#vec_l2_norm) | Calculates the L2 norm (Euclidean norm) of a vector | -| [VEC_FROM_TEXT](#vec_from_text) | Converts a string into a vector | -| [VEC_AS_TEXT](#vec_as_text) | Converts a vector into a string | +| [`VEC_DIMS`](#vec_dims) | Returns the dimension of a vector | +| [`VEC_L2_NORM`](#vec_l2_norm) | Calculates the L2 norm (Euclidean norm) of a vector | +| [`VEC_FROM_TEXT`](#vec_from_text) | Converts a string into a vector | +| [`VEC_AS_TEXT`](#vec_as_text) | Converts a vector into a string | ## Extended built-in functions and operators -The following built-in functions and operators are extended, supporting operating on [Vector Data Types](/tidb-cloud/vector-search-data-types.md). +The following built-in functions and operators are extended to support operations on [Vector data types](/vector-search-data-types.md). **Arithmetic operators:** @@ -42,12 +52,12 @@ The following built-in functions and operators are extended, supporting operatin | [`+`](https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_plus) | Vector element-wise addition operator | | [`-`](https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_minus) | Vector element-wise subtraction operator | -For more information about how vector arithmetic works, see [Vector Data Type | Arithmetic](/tidb-cloud/vector-search-data-types.md#arithmetic). +For more information about how vector arithmetic works, see [Vector Data Type | Arithmetic](/vector-search-data-types.md#arithmetic). 
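As a rough illustration of what "element-wise" means for the `+` and `-` operators, the following Python sketch models them on plain lists. It is not TiDB's implementation: the helper names `vec_add` and `vec_sub` are hypothetical, and the `ValueError` (with a made-up message) only stands in for the error that the vector data types document says is raised when the two vectors have different dimensions.

```python
from typing import List

Vector = List[float]


def _check_same_dims(a: Vector, b: Vector) -> None:
    # Stand-in for the dimension-mismatch error; the message text is illustrative only.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")


def vec_add(a: Vector, b: Vector) -> Vector:
    """Element-wise addition (hypothetical helper modeling the `+` operator)."""
    _check_same_dims(a, b)
    return [x + y for x, y in zip(a, b)]


def vec_sub(a: Vector, b: Vector) -> Vector:
    """Element-wise subtraction (hypothetical helper modeling the `-` operator)."""
    _check_same_dims(a, b)
    return [x - y for x, y in zip(a, b)]


if __name__ == "__main__":
    print(vec_add([4, 5, 6], [1, 2, 3]))  # [5, 7, 9]
    print(vec_sub([2, 3, 4], [1, 2, 3]))  # [1, 1, 1]
```

Each output element depends only on the two input elements at the same position, which is why both operands must have the same number of dimensions.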
**Aggregate (GROUP BY) functions:** -| Name | Description | -| :------------------------------------------------------------------------------------------------------------ | :----------------------------------------------- | +| Name | Description | +| :----------------------- | :----------------------------------------------- | | [`COUNT()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_count) | Return a count of the number of rows returned | | [`COUNT(DISTINCT)`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_count-distinct) | Return the count of a number of different values | | [`MAX()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_max) | Return the maximum value | @@ -55,8 +65,8 @@ For more information about how vector arithmetic works, see [Vector Data Type | **Comparison functions and operators:** -| Name | Description | -| ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- | +| Name | Description | +| ---------------------------------------- | ----------------------------------------------------- | | [`BETWEEN ... 
AND ...`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_between) | Check whether a value is within a range of values | | [`COALESCE()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_coalesce) | Return the first non-NULL argument | | [`=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal) | Equal operator | @@ -65,8 +75,8 @@ For more information about how vector arithmetic works, see [Vector Data Type | | [`>=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_greater-than-or-equal) | Greater than or equal operator | | [`GREATEST()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_greatest) | Return the largest argument | | [`IN()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_in) | Check whether a value is within a set of values | -| [`IS NULL`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_is-null) | NULL value test | -| [`ISNULL()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_isnull) | Test whether the argument is NULL | +| [`IS NULL`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_is-null) | Test whether a value is `NULL` | +| [`ISNULL()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_isnull) | Test whether the argument is `NULL` | | [`LEAST()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_least) | Return the smallest argument | | [`<`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_less-than) | Less than operator | | [`<=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_less-than-or-equal) | Less than or equal operator | @@ -74,7 +84,7 @@ For more information about how vector arithmetic works, see [Vector Data Type | | [`!=`, 
`<>`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-equal) | Not equal operator | | [`NOT IN()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-in) | Check whether a value is not within a set of values | -For more information about how vectors are compared, see [Vector Data Type | Comparison](/tidb-cloud/vector-search-data-types.md#comparison). +For more information about how vectors are compared, see [Vector Data Type | Comparison](/vector-search-data-types.md#comparison). **Control flow functions:** @@ -83,16 +93,16 @@ For more information about how vectors are compared, see [Vector Data Type | Com | [`CASE`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#operator_case) | Case operator | | [`IF()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_if) | If/else construct | | [`IFNULL()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_ifnull) | Null if/else construct | -| [`NULLIF()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_nullif) | Return NULL if expr1 = expr2 | +| [`NULLIF()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_nullif) | Return `NULL` if expr1 = expr2 | **Cast functions:** | Name | Description | | :------------------------------------------------------------------------------------------ | :----------------------------- | -| [`CAST()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_cast) | Cast a value as a certain type | -| [`CONVERT()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_convert) | Cast a value as a certain type | +| [`CAST()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_cast) | Cast a value as a string or vector | +| [`CONVERT()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_convert) | Cast a value as a string | -For more information 
about how to use `CAST()`, see [Vector Data Type | Cast](/tidb-cloud/vector-search-data-types.md#cast). +For more information about how to use `CAST()`, see [Vector Data Type | Cast](/vector-search-data-types.md#cast). ## Full references @@ -102,16 +112,16 @@ For more information about how to use `CAST()`, see [Vector Data Type | Cast](/t VEC_L2_DISTANCE(vector1, vector2) ``` -Calculates the L2 distance (Euclidean distance) between two vectors using the following formula: +Calculates the [L2 distance](https://en.wikipedia.org/wiki/Euclidean_distance) (Euclidean distance) between two vectors using the following formula: $DISTANCE(p,q)=\sqrt {\sum \limits _{i=1}^{n}{(p_{i}-q_{i})^{2}}}$ -The two vectors must have the same dimension. Otherwise an error is returned. +The two vectors must have the same dimension. Otherwise, an error is returned. -Examples: +Example: ```sql -[tidb]> select VEC_L2_DISTANCE('[0,3]', '[4,0]'); +[tidb]> SELECT VEC_L2_DISTANCE('[0,3]', '[4,0]'); +-----------------------------------+ | VEC_L2_DISTANCE('[0,3]', '[4,0]') | +-----------------------------------+ @@ -125,16 +135,16 @@ Examples: VEC_COSINE_DISTANCE(vector1, vector2) ``` -Calculates the cosine distance between two vectors using the following formula: +Calculates the [cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity) between two vectors using the following formula: $DISTANCE(p,q)=1.0 - {\frac {\sum \limits _{i=1}^{n}{p_{i}q_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{p_{i}^{2}}}}\cdot {\sqrt {\sum \limits _{i=1}^{n}{q_{i}^{2}}}}}}$ -The two vectors must have the same dimension. Otherwise an error is returned. +The two vectors must have the same dimension. Otherwise, an error is returned. 
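The L2 and cosine distance formulas above can be checked by hand with a few lines of plain Python. This is only a sketch of the math, not TiDB's implementation: the function names are hypothetical, and zero-length vectors (which make the cosine denominator zero) are not handled.

```python
import math
from typing import Sequence


def _check_dims(p: Sequence[float], q: Sequence[float]) -> None:
    # Mirrors the documented rule: both vectors must have the same dimension.
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")


def l2_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the sum of squared component differences."""
    _check_dims(p, q)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """1.0 minus the cosine of the angle between p and q."""
    _check_dims(p, q)
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / (norm_p * norm_q)


print(l2_distance([0, 3], [4, 0]))        # 5.0
print(cosine_distance([1, 1], [-1, -1]))  # approximately 2 (opposite directions)
```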
-Examples: +Example: ```sql -[tidb]> select VEC_COSINE_DISTANCE('[1, 1]', '[-1, -1]'); +[tidb]> SELECT VEC_COSINE_DISTANCE('[1, 1]', '[-1, -1]'); +-------------------------------------------+ | VEC_COSINE_DISTANCE('[1, 1]', '[-1, -1]') | +-------------------------------------------+ @@ -148,16 +158,16 @@ Examples: VEC_NEGATIVE_INNER_PRODUCT(vector1, vector2) ``` -Calculates the distance by using the negative of the inner product between two vectors, using the following formula: +Calculates the distance by using the negative of the [inner product](https://en.wikipedia.org/wiki/Dot_product) between two vectors, using the following formula: $DISTANCE(p,q)=- INNER\_PROD(p,q)=-\sum \limits _{i=1}^{n}{p_{i}q_{i}}$ -The two vectors must have the same dimension. Otherwise an error is returned. +The two vectors must have the same dimension. Otherwise, an error is returned. -Examples: +Example: ```sql -[tidb]> select VEC_NEGATIVE_INNER_PRODUCT('[1,2]', '[3,4]'); +[tidb]> SELECT VEC_NEGATIVE_INNER_PRODUCT('[1,2]', '[3,4]'); +----------------------------------------------+ | VEC_NEGATIVE_INNER_PRODUCT('[1,2]', '[3,4]') | +----------------------------------------------+ @@ -171,16 +181,16 @@ Examples: VEC_L1_DISTANCE(vector1, vector2) ``` -Calculates the L1 distance (Manhattan distance) between two vectors using the following formula: +Calculates the [L1 distance](https://en.wikipedia.org/wiki/Taxicab_geometry) (Manhattan distance) between two vectors using the following formula: $DISTANCE(p,q)=\sum \limits _{i=1}^{n}{|p_{i}-q_{i}|}$ -The two vectors must have the same dimension. Otherwise an error is returned. +The two vectors must have the same dimension. Otherwise, an error is returned. 
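The negative inner product and L1 distance formulas above are simple enough to verify by hand. The following Python sketch does so with hypothetical helper names; it illustrates the arithmetic only and makes no claim about how TiDB computes these values internally.

```python
from typing import Sequence


def _check_dims(p: Sequence[float], q: Sequence[float]) -> None:
    # Mirrors the documented rule: both vectors must have the same dimension.
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")


def negative_inner_product(p: Sequence[float], q: Sequence[float]) -> float:
    """Negated sum of the component-wise products."""
    _check_dims(p, q)
    return -sum(pi * qi for pi, qi in zip(p, q))


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of the absolute component differences."""
    _check_dims(p, q)
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


print(negative_inner_product([1, 2], [3, 4]))  # -11, that is -(1*3 + 2*4)
print(l1_distance([0, 0], [3, 4]))             # 7, that is |0-3| + |0-4|
```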
-Examples: +Example: ```sql -[tidb]> select VEC_L1_DISTANCE('[0,0]', '[3,4]'); +[tidb]> SELECT VEC_L1_DISTANCE('[0,0]', '[3,4]'); +-----------------------------------+ | VEC_L1_DISTANCE('[0,0]', '[3,4]') | +-----------------------------------+ @@ -199,14 +209,14 @@ Returns the dimension of a vector. Examples: ```sql -[tidb]> select VEC_DIMS('[1,2,3]'); +[tidb]> SELECT VEC_DIMS('[1,2,3]'); +---------------------+ | VEC_DIMS('[1,2,3]') | +---------------------+ | 3 | +---------------------+ -[tidb]> select VEC_DIMS('[]'); +[tidb]> SELECT VEC_DIMS('[]'); +----------------+ | VEC_DIMS('[]') | +----------------+ @@ -220,14 +230,14 @@ Examples: VEC_L2_NORM(vector) ``` -Calculates the L2 norm (Euclidean norm) of a vector using the following formula: +Calculates the [L2 norm](https://en.wikipedia.org/wiki/Norm_(mathematics)) (Euclidean norm) of a vector using the following formula: $NORM(p)=\sqrt {\sum \limits _{i=1}^{n}{p_{i}^{2}}}$ -Examples: +Example: ```sql -[tidb]> select VEC_L2_NORM('[3,4]'); +[tidb]> SELECT VEC_L2_NORM('[3,4]'); +----------------------+ | VEC_L2_NORM('[3,4]') | +----------------------+ @@ -243,10 +253,10 @@ VEC_FROM_TEXT(string) Converts a string into a vector. -Examples: +Example: ```sql -[tidb]> select VEC_FROM_TEXT('[1,2]') + VEC_FROM_TEXT('[3,4]'); +[tidb]> SELECT VEC_FROM_TEXT('[1,2]') + VEC_FROM_TEXT('[3,4]'); +-------------------------------------------------+ | VEC_FROM_TEXT('[1,2]') + VEC_FROM_TEXT('[3,4]') | +-------------------------------------------------+ @@ -262,10 +272,10 @@ VEC_AS_TEXT(vector) Converts a vector into a string. 
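To make the dimension and norm helpers concrete, the following Python sketch parses a bracketed text literal, counts its components, and computes the L2 norm. The function names are hypothetical stand-ins, and parsing with `json.loads` is an assumption that only covers simple literals such as `'[1,2,3]'`; the exact text format that TiDB accepts and produces is not reproduced here.

```python
import json
import math
from typing import List


def vec_from_text(text: str) -> List[float]:
    """Parse a literal such as '[1,2,3]' into a list of floats."""
    values = json.loads(text)
    if not isinstance(values, list):
        raise ValueError("expected a bracketed list of numbers")
    return [float(v) for v in values]


def vec_dims(vector: List[float]) -> int:
    """Number of components in the vector."""
    return len(vector)


def l2_norm(vector: List[float]) -> float:
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


print(vec_dims(vec_from_text("[1,2,3]")))  # 3
print(vec_dims(vec_from_text("[]")))       # 0
print(l2_norm(vec_from_text("[3,4]")))     # 5.0
```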
-Examples: +Example: ```sql -[tidb]> select VEC_AS_TEXT('[1.000, 2.5]'); +[tidb]> SELECT VEC_AS_TEXT('[1.000, 2.5]'); +-------------------------------+ | VEC_AS_TEXT('[1.000, 2.5]') | +-------------------------------+ @@ -279,4 +289,4 @@ The vector functions and the extended usage of built-in functions and operators ## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Data Types](/vector-search-data-types.md) diff --git a/tidb-cloud/vector-search-get-started-using-python.md b/vector-search-get-started-using-python.md similarity index 54% rename from tidb-cloud/vector-search-get-started-using-python.md rename to vector-search-get-started-using-python.md index bc3ad8b893bc8..0a39d65a28fc1 100644 --- a/tidb-cloud/vector-search-get-started-using-python.md +++ b/vector-search-get-started-using-python.md @@ -5,13 +5,21 @@ summary: Learn how to quickly develop an AI application that performs semantic s # Get Started with TiDB + AI via Python -This tutorial demonstrates how to develop a simple AI application that provides **semantic search** features. Unlike traditional keyword search, semantic search intelligently understands the meaning behind your query. For example, if you have documents titled "dog", "fish", and "tree", and you search for "a swimming animal", the application would identify "fish" as the most relevant result. +This tutorial demonstrates how to develop a simple AI application that provides **semantic search** features. Unlike traditional keyword search, semantic search intelligently understands the meaning behind your query and returns the most relevant result. For example, if you have documents titled "dog", "fish", and "tree", and you search for "a swimming animal", the application would identify "fish" as the most relevant result. 
-Throughout this tutorial, you will develop this AI application using [TiDB Vector Search](/tidb-cloud/vector-search-overview.md), Python, [TiDB Vector SDK for Python](https://github.com/pingcap/tidb-vector-python), and AI models. +Throughout this tutorial, you will develop this AI application using [TiDB Vector Search](/vector-search-overview.md), Python, [TiDB Vector SDK for Python](https://github.com/pingcap/tidb-vector-python), and AI models. -> **Note** + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** > -> TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. ## Prerequisites @@ -19,11 +27,28 @@ To complete this tutorial, you need: - [Python 3.8 or higher](https://www.python.org/downloads/) installed. - [Git](https://git-scm.com/downloads) installed. -- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. 
+ + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + ## Get started -To run the demo directly, check out the sample code in the [pingcap/tidb-vector-python](https://github.com/pingcap/tidb-vector-python/blob/main/examples/python-client-quickstart) repository. +The following steps show how to develop the application from scratch. To run the demo directly, you can check out the sample code in the [pingcap/tidb-vector-python](https://github.com/pingcap/tidb-vector-python/blob/main/examples/python-client-quickstart) repository. ### Step 1. Create a new Python project @@ -43,11 +68,18 @@ In your project directory, run the following command to install the required pac pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv ``` -- `tidb-vector`: the Python client for interacting with Vector Search in TiDB Cloud. -- [`sentence-transformers`](https://sbert.net): a Python library that provides pre-trained models for generating [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) from text. +- `tidb-vector`: the Python client for interacting with TiDB vector search. +- [`sentence-transformers`](https://sbert.net): a Python library that provides pre-trained models for generating [vector embeddings](/vector-search-overview.md#vector-embedding) from text. ### Step 3. Configure the connection string to the TiDB cluster +Configure the cluster connection string depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + 1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. 2. Click **Connect** in the upper-right corner. A connection dialog is displayed. @@ -77,9 +109,33 @@ pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" ``` +
+
+ +For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster: + +```dotenv +TIDB_DATABASE_URL="mysql+pymysql://<USER>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>" +# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test" +``` + +If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `<USER>`: The username to connect to the TiDB cluster. +- `<PASSWORD>`: The password to connect to the TiDB cluster. +- `<HOST>`: The host of the TiDB cluster. +- `<PORT>`: The port of the TiDB cluster. +- `<DATABASE>`: The name of the database you want to connect to. + +
+ +
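If you want to double-check which part of the connection string is the username, password, host, port, and database, a small sketch like the following can help. It uses only the Python standard library and the example URL from this step; the `describe_connection_url` helper is hypothetical, and SQLAlchemy performs its own URL parsing, so treat this purely as an illustration.

```python
from urllib.parse import urlsplit


def describe_connection_url(url: str) -> dict:
    """Split a database URL into its named parts."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,
        "username": parts.username,
        "password": parts.password,  # None when the URL omits a password
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_connection_url("mysql+pymysql://root@127.0.0.1:4000/test")
for key, value in info.items():
    print(f"{key}: {value}")
```

For the local example URL, the password comes back as `None`, which matches the empty initial password described above.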
+ ### Step 4. Initialize the embedding model -An [embedding model](/tidb-cloud/vector-search-overview.md#embedding-model) transforms data into [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding). This example uses the pre-trained model [**msmarco-MiniLM-L12-cos-v5**](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings. +An [embedding model](/vector-search-overview.md#embedding-model) transforms data into [vector embeddings](/vector-search-overview.md#vector-embedding). This example uses the pre-trained model [**msmarco-MiniLM-L12-cos-v5**](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings. To set up the model, copy the following code into the `example.py` file. This code initializes a `SentenceTransformer` instance and defines a `text_to_embedding()` function for later use. @@ -98,11 +154,11 @@ def text_to_embedding(text): ### Step 5. Connect to the TiDB cluster -Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table `embedded_documents` with a vector column to serve as the vector store. +Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table `embedded_documents` with a vector column. > **Note** > -> Ensure the dimension of your vector column matches the dimension of the vectors produced by your embedding model. For example, the **msmarco-MiniLM-L12-cos-v5** model generates vectors with 384 dimensions. +> Make sure the dimension of your vector column in the table matches the dimension of the vectors generated by your embedding model. 
For example, the **msmarco-MiniLM-L12-cos-v5** model generates vectors with 384 dimensions, so the dimension of your vector columns in `embedded_documents` should be 384 as well. ```python import os @@ -113,13 +169,13 @@ from dotenv import load_dotenv load_dotenv() vector_store = TiDBVectorClient( - # The table which will store the vector data. + # The 'embedded_documents' table will store the vector data. table_name='embedded_documents', # The connection string to the TiDB cluster. connection_string=os.environ.get('TIDB_DATABASE_URL'), # The dimension of the vector generated by the embedding model. vector_dimension=embed_model_dims, - # Determine whether to recreate the table if it already exists. + # Recreate the table if it already exists. drop_existing_table=True, ) ``` @@ -185,11 +241,11 @@ Search result ("a swimming animal"): - text: "tree", distance: 0.798545178640937 ``` -From the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. +The three terms in the search results are sorted by their respective distance from the queried vector: the smaller the distance, the more relevant the corresponding `document`. -This demonstration shows how vector search can efficiently locate the most relevant documents, with search results organized by the proximity of the vectors: the smaller the distance, the more relevant the document. +Therefore, according to the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. 
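The ranking idea can be sketched without a database or an embedding model. In the following Python snippet, made-up three-dimensional vectors stand in for the 384-dimensional embeddings, and the query vector is likewise invented for illustration; only the sort-by-distance logic is the point.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """1.0 minus the cosine of the angle between p and q."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / (norm_p * norm_q)


# Hypothetical toy embeddings, not produced by any real model.
documents: Dict[str, List[float]] = {
    "dog": [1, 2, 1],
    "fish": [1, 2, 4],
    "tree": [1, 0, 0],
}
query_embedding = [1, 2, 3]  # pretend embedding of "a swimming animal"

# Smaller distance means more relevant, so sort in ascending order.
ranked: List[Tuple[float, str]] = sorted(
    (cosine_distance(vector, query_embedding), text)
    for text, vector in documents.items()
)

for distance, text in ranked:
    print(f'- text: "{text}", distance: {distance}')
```

With these toy values, "fish" ranks first, followed by "dog" and then "tree".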
## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) \ No newline at end of file diff --git a/vector-search-get-started-using-sql.md b/vector-search-get-started-using-sql.md new file mode 100644 index 0000000000000..9c3d493647daa --- /dev/null +++ b/vector-search-get-started-using-sql.md @@ -0,0 +1,195 @@ +--- +title: Get Started with Vector Search via SQL +summary: Learn how to quickly get started with Vector Search in TiDB using SQL statements to power your generative AI applications. +--- + +# Get Started with Vector Search via SQL + +TiDB extends MySQL syntax to support [Vector Search](/vector-search-overview.md) and introduce new [Vector data types](/vector-search-data-types.md) and several [vector functions](/vector-search-functions-and-operators.md). + +This tutorial demonstrates how to get started with TiDB Vector Search just using SQL statements. You will learn how to use the [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) to complete the following operations: + +- Connect to your TiDB cluster. +- Create a vector table. +- Store vector embeddings. +- Perform vector search queries. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. 
+ +## Prerequisites + +To complete this tutorial, you need: + +- [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) (MySQL CLI) installed on your machine. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + + +## Get started + +### Step 1. Connect to the TiDB cluster + +Connect to your TiDB cluster depending on the TiDB deployment option you've selected. + + +
+ +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. In the connection dialog, select **MySQL CLI** from the **Connect With** drop-down list and keep the default setting of the **Connection Type** as **Public**. + +4. If you have not set a password yet, click **Generate Password** to generate a random password. + +5. Copy the connection command and paste it into your terminal. The following is an example for macOS: + + ```bash + mysql -u '.root' -h '' -P 4000 -D 'test' --ssl-mode=VERIFY_IDENTITY --ssl-ca=/etc/ssl/cert.pem -p'' + ``` + +
+
+ +After your TiDB Self-Managed cluster is started, execute your cluster connection command in the terminal. + +The following is an example connection command for macOS: + +```bash +mysql --comments --host 127.0.0.1 --port 4000 -u root +``` + +
+ +
+ +### Step 2. Create a vector table + +When creating a table, you can define a column as a [vector](/vector-search-overview.md#vector-embedding) column by specifying the `VECTOR` data type. + +For example, to create a table `embedded_documents` with a three-dimensional `VECTOR` column, execute the following SQL statements using your MySQL CLI: + +```sql +USE test; +CREATE TABLE embedded_documents ( + id INT PRIMARY KEY, + -- Column to store the original content of the document. + document TEXT, + -- Column to store the vector representation of the document. + embedding VECTOR(3) +); +``` + +The expected output is as follows: + +```text +Query OK, 0 rows affected (0.27 sec) +``` + +### Step 3. Insert vector embeddings to the table + +Insert three documents with their [vector embeddings](/vector-search-overview.md#vector-embedding) into the `embedded_documents` table: + +```sql +INSERT INTO embedded_documents +VALUES + (1, 'dog', '[1,2,1]'), + (2, 'fish', '[1,2,4]'), + (3, 'tree', '[1,0,0]'); +``` + +The expected output is as follows: + +``` +Query OK, 3 rows affected (0.15 sec) +Records: 3 Duplicates: 0 Warnings: 0 +``` + +> **Note** +> +> This example simplifies the dimensions of the vector embeddings and uses only 3-dimensional vectors for demonstration purposes. +> +> In real-world applications, [embedding models](/vector-search-overview.md#embedding-model) often produce vector embeddings with hundreds or thousands of dimensions. + +### Step 4. Query the vector table + +To verify that the documents have been inserted correctly, query the `embedded_documents` table: + +```sql +SELECT * FROM embedded_documents; +``` + +The expected output is as follows: + +```sql ++----+----------+-----------+ +| id | document | embedding | ++----+----------+-----------+ +| 1 | dog | [1,2,1] | +| 2 | fish | [1,2,4] | +| 3 | tree | [1,0,0] | ++----+----------+-----------+ +3 rows in set (0.15 sec) +``` + +### Step 5. 
Perform a vector search query + +Similar to full-text search, users provide search terms to the application when using vector search. + +In this example, the search term is "a swimming animal", and its corresponding vector embedding is assumed to be `[1,2,3]`. In practical applications, you need to use an embedding model to convert the user's search term into a vector embedding. + +Execute the following SQL statement, and TiDB will identify the top three documents closest to `[1,2,3]` by calculating and sorting the cosine distances (`vec_cosine_distance`) between the vector embeddings in the table. + +```sql +SELECT id, document, vec_cosine_distance(embedding, '[1,2,3]') AS distance +FROM embedded_documents +ORDER BY distance +LIMIT 3; +``` + +The expected output is as follows: + +```plain ++----+----------+---------------------+ +| id | document | distance | ++----+----------+---------------------+ +| 2 | fish | 0.00853986601633272 | +| 1 | dog | 0.12712843905603044 | +| 3 | tree | 0.7327387580875756 | ++----+----------+---------------------+ +3 rows in set (0.15 sec) +``` + +The three terms in the search results are sorted by their respective distance from the queried vector: the smaller the distance, the more relevant the corresponding `document`. + +Therefore, according to the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. + +## See also + +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/vector-search-improve-performance.md b/vector-search-improve-performance.md new file mode 100644 index 0000000000000..a723a4af95927 --- /dev/null +++ b/vector-search-improve-performance.md @@ -0,0 +1,48 @@ +--- +title: Improve Vector Search Performance +summary: Learn best practices for improving the performance of TiDB Vector Search. 
+--- + +# Improve Vector Search Performance + +TiDB Vector Search enables you to perform Approximate Nearest Neighbor (ANN) queries that search for results similar to an image, document, or other input. To improve the query performance, review the following best practices. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Add vector search index for vector columns + +The [vector search index](/vector-search-index.md) dramatically improves the performance of vector search queries, usually by 10x or more, with a trade-off of only a small decrease of recall rate. + +## Ensure vector indexes are fully built + +After you insert a large volume of vector data, some of it might be in the Delta layer waiting for persistence. The vector index for such data will be built after the data is persisted. Until all vector data is indexed, vector search performance is suboptimal. To check the index build progress, see [View index build progress](/vector-search-index.md#view-index-build-progress). + +## Reduce vector dimensions or shorten embeddings + +The computational complexity of vector search indexing and queries increases significantly as the dimension of vectors grows, requiring more floating-point comparisons. + +To optimize performance, consider reducing vector dimensions whenever feasible. This usually needs switching to another embedding model. When switching models, you need to evaluate the impact of the model change on the accuracy of vector queries. 
+ +Certain embedding models like OpenAI `text-embedding-3-large` support [shortening embeddings](https://openai.com/index/new-embedding-models-and-api-updates/), which removes some numbers from the end of vector sequences without losing the embedding's concept-representing properties. You can also use such an embedding model to reduce the vector dimensions. + +## Exclude vector columns from the results + +Vector embedding data is usually large and only used during the search process. By excluding vector columns from query results, you can greatly reduce the data transferred between the TiDB server and your SQL client, thereby improving query performance. + +To exclude vector columns, explicitly list the columns you want to retrieve in the `SELECT` clause, instead of using `SELECT *` to retrieve all columns. + +## Warm up the index + +When accessing an index that has never been used or has not been accessed for a long time (cold access), TiDB needs to load the entire index from cloud storage or disk (instead of from memory). This process takes time and often results in higher query latency. Additionally, if there are no SQL queries for an extended period (for example, several hours), computing resources are reclaimed, causing subsequent access to become cold access. + +To avoid such query latency, warm up your index before actual workload by running similar vector search queries that hit the vector index. \ No newline at end of file diff --git a/vector-search-index.md b/vector-search-index.md new file mode 100644 index 0000000000000..828cd2accf3d2 --- /dev/null +++ b/vector-search-index.md @@ -0,0 +1,263 @@ +--- +title: Vector Search Index +summary: Learn how to build and use the vector search index to accelerate K-Nearest neighbors (KNN) queries in TiDB. +--- + +# Vector Search Index + +K-nearest neighbors (KNN) search is the method for finding the K closest points to a given point in a vector space. 
The most straightforward approach to perform KNN search is a brute force search, which calculates the distance between the given vector and all other vectors in the space. This approach guarantees perfect accuracy, but it is usually too slow for real-world use. Therefore, approximate algorithms are commonly used in KNN search to enhance speed and efficiency. + +In TiDB, you can create and use vector search indexes for such approximate nearest neighbor (ANN) searches over columns with [vector data types](/vector-search-data-types.md). By using vector search indexes, vector search queries could be finished in milliseconds. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +Currently, TiDB supports the [HNSW (Hierarchical Navigable Small World)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) vector search index algorithm. + +## Restrictions + +- TiFlash nodes must be deployed in your cluster in advance. +- Vector search indexes cannot be used as primary keys or unique indexes. +- Vector search indexes can only be created on a single vector column and cannot be combined with other columns (such as integers or strings) to form composite indexes. +- A distance function must be specified when creating and using vector search indexes. Currently, only cosine distance `VEC_COSINE_DISTANCE()` and L2 distance `VEC_L2_DISTANCE()` functions are supported. +- For the same column, creating multiple vector search indexes using the same distance function is not supported. 
+- Directly dropping columns with vector search indexes is not supported. You can drop such a column by first dropping the vector search index on that column and then dropping the column itself. +- Modifying the type of a column with a vector index is not supported. +- Setting vector search indexes as [invisible](/sql-statements/sql-statement-alter-index.md) is not supported. +- Building vector search indexes on TiFlash nodes with [encryption at rest](https://docs.pingcap.com/tidb/stable/encryption-at-rest) enabled is not supported. + +## Create the HNSW vector index + +[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy, up to 98% in specific cases. + +In TiDB, you can create an HNSW index for a column with a [vector data type](/vector-search-data-types.md) in either of the following ways: + +- When creating a table, use the following syntax to specify the vector column for the HNSW index: + + ```sql + CREATE TABLE foo ( + id INT PRIMARY KEY, + embedding VECTOR(5), + VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding))) + ); + ``` + +- For an existing table that already contains a vector column, use the following syntax to create an HNSW index for the vector column: + + ```sql + CREATE VECTOR INDEX idx_embedding ON foo ((VEC_COSINE_DISTANCE(embedding))); + ALTER TABLE foo ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding))); + + -- You can also explicitly specify "USING HNSW" to build the vector search index. + CREATE VECTOR INDEX idx_embedding ON foo ((VEC_COSINE_DISTANCE(embedding))) USING HNSW; + ALTER TABLE foo ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding))) USING HNSW; + ``` + +> **Note:** +> +> The vector search index feature relies on TiFlash replicas for tables. 
+> +> - If a vector search index is defined when a table is created, TiDB automatically creates a TiFlash replica for the table. +> - If no vector search index is defined when a table is created, and the table currently does not have a TiFlash replica, you need to manually create a TiFlash replica before adding a vector search index to the table. For example: `ALTER TABLE table_name SET TIFLASH REPLICA 1;`. + +When creating an HNSW vector index, you need to specify the distance function for the vector: + +- Cosine Distance: `((VEC_COSINE_DISTANCE(embedding)))` +- L2 Distance: `((VEC_L2_DISTANCE(embedding)))` + +The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for non-fixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimension. + +For restrictions and limitations of vector search indexes, see [Restrictions](#restrictions). + +## Use the vector index + +The vector search index can be used in K-nearest neighbor search queries by using the `ORDER BY ... LIMIT` clause as follows: + +```sql +SELECT * +FROM foo +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3, 4, 5]') +LIMIT 10; +``` + +To use an index in a vector search, make sure that the `ORDER BY ... LIMIT` clause uses the same distance function as the one specified when creating the vector index. + +## Use the vector index with filters + +Queries that contain a pre-filter (using the `WHERE` clause) cannot utilize the vector index because they are not querying for K-nearest neighbors according to the SQL semantics.
For example: + +```sql +-- For the following query, the `WHERE` filter is performed before KNN, so the vector index cannot be used: + +SELECT * FROM vec_table +WHERE category = "document" +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 5; +``` + +To use the vector index with filters, query for the K-nearest neighbors first using vector search, and then filter out unwanted results: + +```sql +-- For the following query, the `WHERE` filter is performed after KNN, so the vector index can be used: + +SELECT * FROM +( + SELECT * FROM vec_table + ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') + LIMIT 5 +) t +WHERE category = "document"; + +-- Note that this query might return fewer than 5 results if some are filtered out. +``` + +## View index build progress + +After you insert a large volume of data, some of it might not be instantly persisted to TiFlash. For vector data that has already been persisted, the vector search index is built synchronously. For data that has not yet been persisted, the index will be built once the data is persisted. This process does not affect the accuracy and consistency of the data. You can still perform vector searches at any time and get complete results. However, performance will be suboptimal until vector indexes are fully built.
+ +To view the index build progress, you can query the `INFORMATION_SCHEMA.TIFLASH_INDEXES` table as follows: + +```sql +SELECT * FROM INFORMATION_SCHEMA.TIFLASH_INDEXES; ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +| TIDB_DATABASE | TIDB_TABLE | TABLE_ID | COLUMN_NAME | INDEX_NAME | COLUMN_ID | INDEX_ID | INDEX_KIND | ROWS_STABLE_INDEXED | ROWS_STABLE_NOT_INDEXED | ROWS_DELTA_INDEXED | ROWS_DELTA_NOT_INDEXED | ERROR_MESSAGE | TIFLASH_INSTANCE | ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +| test | tcff1d827 | 219 | col1fff | 0a452311 | 7 | 1 | HNSW | 29646 | 0 | 0 | 0 | | 127.0.0.1:3930 | +| test | foo | 717 | embedding | idx_embedding | 2 | 1 | HNSW | 0 | 0 | 0 | 3 | | 127.0.0.1:3930 | ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +``` + +- You can check the `ROWS_STABLE_INDEXED` and `ROWS_STABLE_NOT_INDEXED` columns for the index build progress. When `ROWS_STABLE_NOT_INDEXED` becomes 0, the index build is complete. + + As a reference, indexing a 500 MiB vector dataset with 768 dimensions might take up to 20 minutes. The indexer can run in parallel for multiple tables. Currently, adjusting the indexer priority or speed is not supported. + +- You can check the `ROWS_DELTA_NOT_INDEXED` column for the number of rows in the Delta layer. Data in the storage layer of TiFlash is stored in two layers: Delta layer and Stable layer. 
The Delta layer stores recently inserted or updated rows and is periodically merged into the Stable layer according to the write workload. This merge process is called Compaction. + + Data in the Delta layer is never indexed. To achieve optimal performance, you can force the merge of the Delta layer into the Stable layer so that all data can be indexed: + + ```sql + ALTER TABLE table_name COMPACT; + ``` + + For more information, see [`ALTER TABLE ... COMPACT`](/sql-statements/sql-statement-alter-table-compact.md). + +In addition, you can monitor the execution progress of the DDL job by executing `ADMIN SHOW DDL JOBS;` and checking the `row count`. However, this method is not fully accurate, because the `row count` value is obtained from the `rows_stable_indexed` field in `TIFLASH_INDEXES`. You can use this approach as a reference for tracking the progress of indexing. + +## Check whether the vector index is used + +Use the [`EXPLAIN`](/sql-statements/sql-statement-explain.md) or [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement to check whether a query is using the vector index. When `annIndex:` is present in the `operator info` column for the `TableFullScan` executor, the table scan is using the vector index. + +**Example: the vector index is used** + +```sql +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; ++-----+-------------------------------------------------------------------------------------+ +| ... | operator info | ++-----+-------------------------------------------------------------------------------------+ +| ... | ... | +| ... | Column#5, offset:0, count:10 | +| ... | ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#5 | +| ... | MppVersion: 1, data:ExchangeSender_16 | +| ... | ExchangeType: PassThrough | +| ... | ... | +| ... | Column#4, offset:0, count:10 | +| ... 
| ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#4 | +| ... | annIndex:COSINE(test.vector_table_with_index.embedding..[1,2,3], limit:10), ... | ++-----+-------------------------------------------------------------------------------------+ +9 rows in set (0.01 sec) +``` + +**Example: The vector index is not used because of not specifying a Top K** + +```sql +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index + -> ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]'); ++--------------------------------+-----+--------------------------------------------------+ +| id | ... | operator info | ++--------------------------------+-----+--------------------------------------------------+ +| Projection_15 | ... | ... | +| └─Sort_4 | ... | Column#4 | +| └─Projection_16 | ... | ..., vec_cosine_distance(..., [1,2,3])->Column#4 | +| └─TableReader_14 | ... | MppVersion: 1, data:ExchangeSender_13 | +| └─ExchangeSender_13 | ... | ExchangeType: PassThrough | +| └─TableFullScan_12 | ... 
| keep order:false, stats:pseudo | ++--------------------------------+-----+--------------------------------------------------+ +6 rows in set, 1 warning (0.01 sec) +``` + +When the vector index cannot be used, a warning occurs in some cases to help you learn the cause: + +```sql +-- Using a wrong distance function: +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_L2_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; + +[tidb]> SHOW WARNINGS; +ANN index not used: not ordering by COSINE distance + +-- Using a wrong order: +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') DESC +LIMIT 10; + +[tidb]> SHOW WARNINGS; +ANN index not used: index can be used only when ordering by vec_cosine_distance() in ASC order +``` + +## Analyze vector search performance + +To learn detailed information about how a vector index is used, you can execute the [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement and check the `execution info` column in the output: + +```sql +[tidb]> EXPLAIN ANALYZE SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; ++-----+--------------------------------------------------------+-----+ +| | execution info | | ++-----+--------------------------------------------------------+-----+ +| ... | time:339.1ms, loops:2, RU:0.000000, Concurrency:OFF | ... | +| ... | time:339ms, loops:2 | ... | +| ... | time:339ms, loops:3, Concurrency:OFF | ... | +| ... | time:339ms, loops:3, cop_task: {...} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{...}, vector_idx:{ | ... 
| +| | load:{total:68ms,from_s3:1,from_disk:0,from_cache:0},| | +| | search:{total:0ms,visited_nodes:2,discarded_nodes:0},| | +| | read:{vec_total:0ms,others_total:0ms}},...} | | ++-----+--------------------------------------------------------+-----+ +``` + +> **Note:** +> +> The execution information is internal. Fields and formats are subject to change without any notification. Do not rely on them. + +Explanation of some important fields: + +- `vector_index.load.total`: The total duration of loading index. This field might be larger than the actual query time because multiple vector indexes might be loaded in parallel. +- `vector_index.load.from_s3`: Number of indexes loaded from S3. +- `vector_index.load.from_disk`: Number of indexes loaded from disk. The index was already downloaded from S3 previously. +- `vector_index.load.from_cache`: Number of indexes loaded from cache. The index was already downloaded from S3 previously. +- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field might be larger than the actual query time because multiple vector indexes might be searched in parallel. +- `vector_index.search.discarded_nodes`: Number of vector rows visited but discarded during the search. These discarded vectors are not considered in the search result. Large values usually indicate that there are many stale rows caused by `UPDATE` or `DELETE` statements. + +See [`EXPLAIN`](/sql-statements/sql-statement-explain.md), [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md), and [EXPLAIN Walkthrough](/explain-walkthrough.md) for interpreting the output. 
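
The `annIndex:` check described in "Check whether the vector index is used" can be automated in a client script. The following is a minimal sketch, not part of TiDB or any driver: it assumes you have already run `EXPLAIN` through your own SQL client and collected the `operator info` strings of the plan rows, and the function name is hypothetical.

```python
from typing import Iterable


def plan_uses_vector_index(operator_infos: Iterable[str]) -> bool:
    """Return True if any plan row's operator info mentions `annIndex:`."""
    return any("annIndex:" in info for info in operator_infos)


if __name__ == "__main__":
    # Operator info strings abbreviated from the EXPLAIN examples in this document.
    with_index = [
        "Column#5, offset:0, count:10",
        "annIndex:COSINE(test.vector_table_with_index.embedding..[1,2,3], limit:10), ...",
    ]
    without_index = ["Column#4", "keep order:false, stats:pseudo"]
    print(plan_uses_vector_index(with_index), plan_uses_vector_index(without_index))
```

Because the plan output format can change between versions, treat a check like this as a convenience for testing rather than something to depend on in production code.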
+ +## See also + +- [Improve Vector Search Performance](/vector-search-improve-performance.md) +- [Vector Data Types](/vector-search-data-types.md) diff --git a/tidb-cloud/vector-search-integrate-with-django-orm.md b/vector-search-integrate-with-django-orm.md similarity index 65% rename from tidb-cloud/vector-search-integrate-with-django-orm.md rename to vector-search-integrate-with-django-orm.md index cf270d5552b8f..5cbdaea4f893b 100644 --- a/tidb-cloud/vector-search-integrate-with-django-orm.md +++ b/vector-search-integrate-with-django-orm.md @@ -5,11 +5,19 @@ summary: Learn how to integrate TiDB Vector Search with Django ORM to store embe # Integrate TiDB Vector Search with Django ORM -This tutorial walks you through how to use [Django](https://www.djangoproject.com/) ORM to interact with the TiDB Vector Search, store embeddings, and perform vector search queries. +This tutorial walks you through how to use [Django](https://www.djangoproject.com/) ORM to interact with the [TiDB Vector Search](/vector-search-overview.md), store embeddings, and perform vector search queries. -> **Note** + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** > -> TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. ## Prerequisites @@ -17,7 +25,24 @@ To complete this tutorial, you need: - [Python 3.8 or higher](https://www.python.org/downloads/) installed. - [Git](https://git-scm.com/downloads) installed. -- A TiDB Cloud Serverless cluster. 
Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + ## Run the sample app @@ -49,7 +74,7 @@ Install the required dependencies for the demo project: pip install -r requirements.txt ``` -For your existing project, you can install the following packages: +Alternatively, you can install the following packages for your project: ```bash pip install Django django-tidb mysqlclient numpy python-dotenv @@ -59,7 +84,7 @@ If you encounter installation issues with mysqlclient, refer to the mysqlclient #### What is `django-tidb` -`django-tidb` is a TiDB dialect for Django that enhances the Django ORM to support TiDB-specific features (For example, Vector Search) and resolves compatibility issues between TiDB and Django. +`django-tidb` is a TiDB dialect for Django, which enhances the Django ORM to support TiDB-specific features (for example, Vector Search) and resolves compatibility issues between TiDB and Django. 
To install `django-tidb`, choose a version that matches your Django version. For example, if you are using `django==4.2.*`, install `django-tidb==4.2.*`. The minor version does not need to be the same. It is recommended to use the latest minor version. @@ -67,6 +92,13 @@ For more information, refer to [django-tidb repository](https://github.com/pingc ### Step 4. Configure the environment variables +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + 1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. 2. Click **Connect** in the upper-right corner. A connection dialog is displayed. @@ -108,6 +140,33 @@ For more information, refer to [django-tidb repository](https://github.com/pingc TIDB_CA_PATH=/etc/ssl/cert.pem ``` +
+
+ +For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster: + +```dotenv +TIDB_HOST=127.0.0.1 +TIDB_PORT=4000 +TIDB_USERNAME=root +TIDB_PASSWORD= +TIDB_DATABASE=test +``` + +If you are running TiDB on your local machine, `TIDB_HOST` is `127.0.0.1` by default. The initial `TIDB_PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `TIDB_HOST`: The host of the TiDB cluster. +- `TIDB_PORT`: The port of the TiDB cluster. +- `TIDB_USERNAME`: The username to connect to the TiDB cluster. +- `TIDB_PASSWORD`: The password to connect to the TiDB cluster. +- `TIDB_DATABASE`: The name of the database you want to connect to. + +
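
As an illustration of how the variables above could be consumed, the following sketch maps them onto a Django `DATABASES`-style dictionary. It is an assumption-laden example rather than the demo project's actual settings: the helper name is hypothetical, the `django_tidb` engine name is taken from the django-tidb README, and the fallback values simply mirror the local defaults listed above.

```python
import os
from typing import Mapping, Optional


def build_tidb_database_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build a Django DATABASES-style dict from the TIDB_* environment variables."""
    env = os.environ if env is None else env
    return {
        "default": {
            "ENGINE": "django_tidb",  # engine name assumed from the django-tidb README
            "HOST": env.get("TIDB_HOST", "127.0.0.1"),
            "PORT": int(env.get("TIDB_PORT", "4000")),
            "USER": env.get("TIDB_USERNAME", "root"),
            "PASSWORD": env.get("TIDB_PASSWORD", ""),  # empty on a fresh local cluster
            "NAME": env.get("TIDB_DATABASE", "test"),
        }
    }


if __name__ == "__main__":
    print(build_tidb_database_settings())
```

In a real project, the returned dictionary would be assigned to `DATABASES` in `settings.py` after the `.env` file has been loaded, for example with `python-dotenv`.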
+ +
+ ### Step 5. Run the demo Migrate the database schema: @@ -180,22 +239,6 @@ class Document(models.Model): embedding = VectorField(dimensions=3) ``` -#### Define a vector column optimized with index - -Define a 3-dimensional vector column and optimize it with a [vector search index](/tidb-cloud/vector-search-index.md) (HNSW index). - -```python -class DocumentWithIndex(models.Model): - content = models.TextField() - # Note: - # - Using comment to add hnsw index is a temporary solution. In the future it will use `CREATE INDEX` syntax. - # - Currently the HNSW index cannot be changed after the table has been created. - # - Only Django >= 4.2 supports `db_comment`. - embedding = VectorField(dimensions=3, db_comment="hnsw(distance=cosine)") -``` - -TiDB will use this index to speed up vector search queries based on the cosine distance function. - ### Store documents with embeddings ```python @@ -206,7 +249,7 @@ Document.objects.create(content="tree", embedding=[1, 0, 0]) ### Search the nearest neighbor documents -TiDB Vector support below distance functions: +TiDB Vector Search supports the following distance functions: - `L1Distance` - `L2Distance` @@ -233,5 +276,5 @@ results = Document.objects.annotate( ## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md b/vector-search-integrate-with-jinaai-embedding.md similarity index 61% rename from tidb-cloud/vector-search-integrate-with-jinaai-embedding.md rename to vector-search-integrate-with-jinaai-embedding.md index 9a580c450e4c0..5a3b1abd4d96e 100644 --- a/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md +++ b/vector-search-integrate-with-jinaai-embedding.md @@ -5,11 +5,19 @@ summary: Learn how to integrate TiDB Vector Search with Jina AI Embeddings API t # 
Integrate TiDB Vector Search with Jina AI Embeddings API -This tutorial walks you through how to use [Jina AI](https://jina.ai/) to generate embeddings for text data, and then store the embeddings in TiDB Vector Storage and search similar texts based on embeddings. +This tutorial walks you through how to use [Jina AI](https://jina.ai/) to generate embeddings for text data, and then store the embeddings in TiDB vector storage and search similar texts based on embeddings. -> **Note** + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** > -> TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. ## Prerequisites @@ -17,7 +25,24 @@ To complete this tutorial, you need: - [Python 3.8 or higher](https://www.python.org/downloads/) installed. - [Git](https://git-scm.com/downloads) installed. -- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. 
+ + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + ## Run the sample app @@ -51,11 +76,12 @@ pip install -r requirements.txt ### Step 4. Configure the environment variables -#### 4.1 Get the Jina AI API key +Get the Jina AI API key from the [Jina AI Embeddings API](https://jina.ai/embeddings/) page, and then configure the environment variables depending on the TiDB deployment option you've selected. -Get the Jina AI API key from the [Jina AI Embeddings API](https://jina.ai/embeddings/) page. + +
-#### 4.2 Get the TiDB connection parameters +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: 1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. @@ -78,21 +104,44 @@ Get the Jina AI API key from the [Jina AI Embeddings API](https://jina.ai/embedd > > If you have not set a password yet, click **Create password** to generate a random password. -#### 4.3 Set the environment variables +5. Set the Jina AI API key and the TiDB connection string as environment variables in your terminal, or create a `.env` file with the following environment variables: -Set the environment variables in your terminal, or create a `.env` file with the above environment variables. + ```dotenv + JINAAI_API_KEY="****" + TIDB_DATABASE_URL="{tidb_connection_string}" + ``` -```dotenv -JINAAI_API_KEY="****" -TIDB_DATABASE_URL="{tidb_connection_string}" -``` + The following is an example connection string for macOS: + + ```dotenv + TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + ``` -For example, the connection string on macOS looks like: +
+
-```dotenv -TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" +For a TiDB Self-Managed cluster, set the environment variables for connecting to your TiDB cluster in your terminal as follows: + +```shell +export JINAAI_API_KEY="****" +export TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>" +# For example: export TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test" ``` +You need to replace parameters in the preceding command according to your TiDB cluster. If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `<USERNAME>`: The username to connect to the TiDB cluster. +- `<PASSWORD>`: The password to connect to the TiDB cluster. +- `<HOST>`: The host of the TiDB cluster. +- `<PORT>`: The port of the TiDB cluster. +- `<DATABASE>`: The name of the database you want to connect to. + +
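
If you prefer to assemble the connection string in code, the five parameters described above (username, password, host, port, and database) can be combined as in the following sketch. The helper name is hypothetical and is not part of the sample app. It URL-encodes the credentials with the standard library's `quote_plus` so that special characters in a password do not break the URL, and it omits the password part when the password is empty.

```python
from urllib.parse import quote_plus


def build_tidb_url(username: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a mysql+pymysql connection URL from its five parts."""
    credentials = quote_plus(username)
    if password:
        # Escape characters such as "@" or "/" that would otherwise be read as URL syntax.
        credentials += ":" + quote_plus(password)
    return f"mysql+pymysql://{credentials}@{host}:{port}/{database}"


if __name__ == "__main__":
    # A fresh local cluster: user root, empty password.
    print(build_tidb_url("root", "", "127.0.0.1", 4000, "test"))
```

The resulting string can then be exported as `TIDB_DATABASE_URL` or passed to your database client directly.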
+ +
+ ### Step 5. Run the demo ```bash @@ -117,7 +166,7 @@ Example output: ## Sample code snippets -### Get Embeddings from Jina AI +### Get embeddings from Jina AI Define a `generate_embeddings` helper function to call Jina AI embeddings API: @@ -144,9 +193,9 @@ def generate_embeddings(text: str): return response.json()['data'][0]['embedding'] ``` -### Connect to TiDB Cloud Serverless +### Connect to the TiDB cluster -Connect to TiDB Cloud Serverless through SQLAlchemy: +Connect to the TiDB cluster through SQLAlchemy: ```python import os @@ -189,7 +238,7 @@ class Document(Base): > - The dimension of the vector column must match the dimension of the embeddings generated by the embedding model. > - In this example, the dimension of embeddings generated by the `jina-embeddings-v2-base-en` model is `768`. -### Create embeddings with Jina AI embeddings and TiDB +### Create embeddings with Jina AI and store in TiDB Use the Jina AI Embeddings API to generate embeddings for each piece of text and store the embeddings in TiDB: @@ -219,13 +268,13 @@ with Session(engine) as session: session.commit() ``` -### Perform semantic search with Jina AI embeddings and TiDB +### Perform semantic search with Jina AI embeddings in TiDB -Generate embeddings for the query text via Jina AI embeddings API, and then search for the most relevant document based on the cosine distance between the query embedding and the document embeddings: +Generate the embedding for the query text via Jina AI embeddings API, and then search for the most relevant document based on the cosine distance between **the embedding of the query text** and **each embedding in the vector table**: ```python query = 'What is TiDB?' -# Generate embeddings for the query via Jina AI API. +# Generate the embedding for the query via Jina AI API. 
query_embedding = generate_embeddings(query) with Session(engine) as session: @@ -242,5 +291,5 @@ with Session(engine) as session: ## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/tidb-cloud/vector-search-integrate-with-langchain.md b/vector-search-integrate-with-langchain.md similarity index 80% rename from tidb-cloud/vector-search-integrate-with-langchain.md rename to vector-search-integrate-with-langchain.md index 22c3f17283f5b..f710e80ec7d6e 100644 --- a/tidb-cloud/vector-search-integrate-with-langchain.md +++ b/vector-search-integrate-with-langchain.md @@ -5,12 +5,23 @@ summary: Learn how to integrate Vector Search in TiDB Cloud with LangChain. # Integrate Vector Search with LangChain -This tutorial demonstrates how to integrate the [vector search](/tidb-cloud/vector-search-overview.md) feature in TiDB Cloud with [LangChain](https://python.langchain.com/). +This tutorial demonstrates how to integrate the [vector search](/vector-search-overview.md) feature in TiDB Cloud with [LangChain](https://python.langchain.com/). -> **Note** + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** > -> - TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. 
-> - You can view the complete [sample code](https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) online environment. +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +> **Tip** +> +> You can view the complete [sample code](https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) online environment. ## Prerequisites @@ -19,7 +30,24 @@ To complete this tutorial, you need: - [Python 3.8 or higher](https://www.python.org/downloads/) installed. - [Jupyter Notebook](https://jupyter.org/install) installed. - [Git](https://git-scm.com/downloads) installed. -- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. 
+ + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + ## Get started @@ -44,7 +72,7 @@ In your project directory, run the following command to install the required pac !pip install tidb-vector ``` -Open the `integrate_with_langchain.ipynb` file in Jupyter Notebook and add the following code to import the required packages: +Open the `integrate_with_langchain.ipynb` file in Jupyter Notebook, and then add the following code to import the required packages: ```python from langchain_community.document_loaders import TextLoader @@ -55,7 +83,12 @@ from langchain_text_splitters import CharacterTextSplitter ### Step 3. Set up your environment -#### Step 3.1 Obtain the connection string to the TiDB cluster +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: 1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. @@ -74,11 +107,27 @@ from langchain_text_splitters import CharacterTextSplitter > > If you have not set a password yet, click **Generate Password** to generate a random password. -#### Step 3.2 Configure environment variables +5. Configure environment variables. + + This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). + + To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key: + + ```python + # Use getpass to securely prompt for environment variables in your terminal. + import getpass + import os -To establish a secure and efficient database connection, use the standard connection method provided by TiDB Cloud. + # Copy your connection string from the TiDB Cloud console. + # Connection string format: "mysql+pymysql://:@:4000/?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + tidb_connection_string = getpass.getpass("TiDB Connection String:") + os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:") + ``` -This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from step 3.1 and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). +
+
+ +This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key: @@ -87,12 +136,32 @@ To configure the environment variables, run the following code. You will be prom import getpass import os -# Copy your connection string from the TiDB Cloud console. # Connection string format: "mysql+pymysql://:@:4000/?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" tidb_connection_string = getpass.getpass("TiDB Connection String:") os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:") ``` +Taking macOS as an example, the cluster connection string is as follows: + +```dotenv +TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>" +# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test" +``` + +You need to modify the values of the connection parameters according to your TiDB cluster. If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `<USERNAME>`: The username to connect to the TiDB cluster. +- `<PASSWORD>`: The password to connect to the TiDB cluster. +- `<HOST>`: The host of the TiDB cluster. +- `<PORT>`: The port of the TiDB cluster. +- `<DATABASE>`: The name of the database you want to connect to. + +
+ +
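The self-managed connection string described above can also be assembled in code rather than typed by hand. The following is a minimal sketch, not part of the sample notebook: `build_tidb_url` is a hypothetical helper, and it assumes that special characters in the credentials should be percent-encoded before they are placed in a SQLAlchemy-style URL.

```python
from urllib.parse import quote_plus


def build_tidb_url(username, password, host, port, database):
    """Assemble a mysql+pymysql connection string (hypothetical helper)."""
    # Percent-encode the credentials so that characters such as "@" or ":"
    # in a password are not mistaken for URL delimiters.
    credentials = quote_plus(username)
    if password:
        # The password part is omitted entirely when it is empty,
        # as with a freshly started local cluster.
        credentials += ":" + quote_plus(password)
    return f"mysql+pymysql://{credentials}@{host}:{port}/{database}"


print(build_tidb_url("root", "", "127.0.0.1", 4000, "test"))
```

With an empty password, the helper produces the same shape as the local example given above. Adjust the arguments to match your own cluster.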
+ ### Step 4. Load the sample document #### Step 4.1 Download the sample document @@ -362,7 +431,7 @@ The following metadata filters can match this document: } ``` -Each key-value pair in the metadata filters is treated as a separate filter clause, and these clauses are combined using the `AND` logical operator. +In a metadata filter, TiDB treats each key-value pair as a separate filter clause and combines these clauses using the `AND` logical operator. ### Example @@ -412,19 +481,19 @@ TiDB Vector offers advanced, high-speed vector processing capabilities, enhancin ## Advanced usage example: travel agent -This section demonstrates an advanced use case of integrating vector search with Langchain for a travel agent. The goal is to create personalized travel reports for clients seeking airports with specific amenities, such as clean lounges and vegetarian options. +This section demonstrates a use case of integrating vector search with LangChain for a travel agent. The goal is to create personalized travel reports for clients, helping them find airports with specific amenities, such as clean lounges and vegetarian options. The process involves two main steps: 1. Perform a semantic search across airport reviews to identify airport codes that match the desired amenities. -2. Execute an SQL query to merge these codes with route information, highlighting airlines and destinations that align with user's preferences. +2. Execute a SQL query to merge these codes with route information, highlighting airlines and destinations that align with the user's preferences. ### Prepare data First, create a table to store airport route data: ```python -# Create table to store airplan data. +# Create a table to store flight plan data. 
vector_store.tidb_vector_client.execute( """CREATE TABLE airplan_routes ( id INT AUTO_INCREMENT PRIMARY KEY, @@ -565,7 +634,7 @@ The expected output is as follows: (0.19840519342700513, 3, 'EFGH', 'UA', 'SEA', 'Daily flights from SFO to SEA.', datetime.timedelta(seconds=9000), 7, 'Boeing 737', Decimal('129.99'), 'None', 'Small airport with basic facilities.')] ``` -### Clean up +### Clean up data Finally, clean up the resources by dropping the created table: @@ -581,5 +650,5 @@ The expected output is as follows: ## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/tidb-cloud/vector-search-integrate-with-llamaindex.md b/vector-search-integrate-with-llamaindex.md similarity index 59% rename from tidb-cloud/vector-search-integrate-with-llamaindex.md rename to vector-search-integrate-with-llamaindex.md index 5e670df04d6eb..1aaff4c6a4f2d 100644 --- a/tidb-cloud/vector-search-integrate-with-llamaindex.md +++ b/vector-search-integrate-with-llamaindex.md @@ -1,16 +1,27 @@ --- title: Integrate Vector Search with LlamaIndex -summary: Learn how to integrate Vector Search in TiDB Cloud with LlamaIndex. +summary: Learn how to integrate TiDB Vector Search with LlamaIndex. --- # Integrate Vector Search with LlamaIndex -This tutorial demonstrates how to integrate the [vector search](/tidb-cloud/vector-search-overview.md) feature in TiDB Cloud with [LlamaIndex](https://www.llamaindex.ai). +This tutorial demonstrates how to integrate the [vector search](/vector-search-overview.md) feature of TiDB with [LlamaIndex](https://www.llamaindex.ai). -> **Note** + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. 
If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +> **Tip** > -> - TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. -> - You can view the complete [sample code](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) online environment. +> You can view the complete [sample code](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) online environment. ## Prerequisites @@ -19,7 +30,24 @@ To complete this tutorial, you need: - [Python 3.8 or higher](https://www.python.org/downloads/) installed. - [Jupyter Notebook](https://jupyter.org/install) installed. - [Git](https://git-scm.com/downloads) installed. -- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. 
+- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + ## Get started @@ -27,7 +55,7 @@ This section provides step-by-step instructions for integrating TiDB Vector Sear ### Step 1. Create a new Jupyter Notebook file -In your preferred directory, create a new Jupyter Notebook file named `integrate_with_llamaindex.ipynb`: +In the root directory, create a new Jupyter Notebook file named `integrate_with_llamaindex.ipynb`: ```shell touch integrate_with_llamaindex.ipynb @@ -52,9 +80,14 @@ from llama_index.core import VectorStoreIndex from llama_index.vector_stores.tidbvector import TiDBVectorStore ``` -### Step 3. Set up your environment +### Step 3. Configure environment variables + +Configure the environment variables depending on the TiDB deployment option you've selected. -#### Step 3.1 Obtain the connection string to the TiDB cluster + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: 1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. @@ -73,24 +106,61 @@ from llama_index.vector_stores.tidbvector import TiDBVectorStore > > If you have not set a password yet, click **Generate Password** to generate a random password. -#### Step 3.2 Configure environment variables +5. Configure environment variables. + + This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). + + To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key: + + ```python + # Use getpass to securely prompt for environment variables in your terminal. + import getpass + import os -To establish a secure and efficient database connection, use the standard connection method provided by TiDB Cloud. + # Copy your connection string from the TiDB Cloud console. + # Connection string format: "mysql+pymysql://:@:4000/?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + tidb_connection_string = getpass.getpass("TiDB Connection String:") + os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:") + ``` -This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from step 3.1 and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). +
+
+ +This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string of your TiDB cluster and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key: ```python +# Use getpass to securely prompt for environment variables in your terminal. import getpass import os -tidb_connection_url = getpass.getpass( - "TiDB connection URL (format - mysql+pymysql://root@127.0.0.1:4000/test): " -) +# Connection string format: "mysql+pymysql://:@:4000/?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" +tidb_connection_string = getpass.getpass("TiDB Connection String:") os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:") ``` +Taking macOS as an example, the cluster connection string is as follows: + +```dotenv +TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>" +# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test" +``` + +You need to modify the parameters in the connection string according to your TiDB cluster. If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `<USERNAME>`: The username to connect to the TiDB cluster. +- `<PASSWORD>`: The password to connect to the TiDB cluster. +- `<HOST>`: The host of the TiDB cluster. +- `<PORT>`: The port of the TiDB cluster. +- `<DATABASE>`: The name of the database you want to connect to. + +
+ +
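Before passing a connection string on, it can be worth confirming that it splits into the parts you expect. The sketch below uses only the Python standard library. `describe_connection_string` is a hypothetical helper, not something the tutorial or LlamaIndex provides, and it deliberately reports only whether a password is present rather than returning the password itself.

```python
from urllib.parse import urlsplit


def describe_connection_string(url):
    """Split a connection string into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        # Report only whether a password is set, never its value.
        "has_password": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        # The database name is the URL path without its leading slash.
        "database": parts.path.lstrip("/"),
    }


info = describe_connection_string("mysql+pymysql://root@127.0.0.1:4000/test")
print(info)
```

A mistyped host or a missing port shows up in the returned dictionary before any connection attempt is made.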
+ ### Step 4. Load the sample document #### Step 4.1 Download the sample document @@ -260,5 +330,5 @@ Empty Response ## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/tidb-cloud/vector-search-integrate-with-peewee.md b/vector-search-integrate-with-peewee.md similarity index 64% rename from tidb-cloud/vector-search-integrate-with-peewee.md rename to vector-search-integrate-with-peewee.md index 9f418c539c837..8842ca2e68269 100644 --- a/tidb-cloud/vector-search-integrate-with-peewee.md +++ b/vector-search-integrate-with-peewee.md @@ -5,11 +5,19 @@ summary: Learn how to integrate TiDB Vector Search with peewee to store embeddin # Integrate TiDB Vector Search with peewee -This tutorial walks you through how to use [peewee](https://docs.peewee-orm.com/) to interact with the [TiDB Vector Search](/tidb-cloud/vector-search-overview.md), store embeddings, and perform vector search queries. +This tutorial walks you through how to use [peewee](https://docs.peewee-orm.com/) to interact with the [TiDB Vector Search](/vector-search-overview.md), store embeddings, and perform vector search queries. -> **Note** + + +> **Warning:** > -> TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. 
## Prerequisites @@ -17,7 +25,24 @@ To complete this tutorial, you need: - [Python 3.8 or higher](https://www.python.org/downloads/) installed. - [Git](https://git-scm.com/downloads) installed. -- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. +- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + ## Run the sample app @@ -49,7 +74,7 @@ Install the required dependencies for the demo project: pip install -r requirements.txt ``` -For your existing project, you can install the following packages: +Alternatively, you can install the following packages for your project: ```bash pip install peewee pymysql python-dotenv tidb-vector @@ -57,6 +82,13 @@ pip install peewee pymysql python-dotenv tidb-vector ### Step 4. Configure the environment variables +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + 1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. 2. Click **Connect** in the upper-right corner. A connection dialog is displayed. @@ -98,6 +130,33 @@ pip install peewee pymysql python-dotenv tidb-vector TIDB_CA_PATH=/etc/ssl/cert.pem ``` +
+
+ +For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster: + +```dotenv +TIDB_HOST=127.0.0.1 +TIDB_PORT=4000 +TIDB_USERNAME=root +TIDB_PASSWORD= +TIDB_DATABASE=test +``` + +If you are running TiDB on your local machine, `TIDB_HOST` is `127.0.0.1` by default. The initial `TIDB_PASSWORD` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `TIDB_HOST`: The host of the TiDB cluster. +- `TIDB_PORT`: The port of the TiDB cluster. +- `TIDB_USERNAME`: The username to connect to the TiDB cluster. +- `TIDB_PASSWORD`: The password to connect to the TiDB cluster. +- `TIDB_DATABASE`: The name of the database you want to connect to. + +
+ +
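As a rough illustration of how the `TIDB_*` variables above might be consumed, the following sketch collects them into a dictionary with local-cluster defaults. `load_tidb_settings` is a hypothetical helper rather than code from the demo project. It assumes the variables are already present in the process environment, for example because they were exported in the shell or loaded from the `.env` file by `python-dotenv`.

```python
import os


def load_tidb_settings(env=None):
    """Collect TIDB_* settings from a mapping (hypothetical helper)."""
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    return {
        "host": env.get("TIDB_HOST", "127.0.0.1"),
        # Environment values are strings, so convert the port explicitly.
        "port": int(env.get("TIDB_PORT", "4000")),
        "user": env.get("TIDB_USERNAME", "root"),
        "password": env.get("TIDB_PASSWORD", ""),
        "database": env.get("TIDB_DATABASE", "test"),
    }


settings = load_tidb_settings({"TIDB_HOST": "127.0.0.1", "TIDB_PORT": "4000"})
print(settings["host"], settings["port"])
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.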
+ ### Step 5. Run the demo ```bash @@ -178,22 +237,6 @@ class Document(Model): embedding = VectorField(3) ``` -#### Define a vector column optimized with index - -Define a 3-dimensional vector column and optimize it with a [vector search index](/tidb-cloud/vector-search-index.md) (HNSW index). - -```python -class DocumentWithIndex(Model): - class Meta: - database = db - table_name = 'peewee_demo_documents_with_index' - - content = TextField() - embedding = VectorField(3, constraints=[SQL("COMMENT 'hnsw(distance=cosine)'")]) -``` - -TiDB will use this index to accelerate vector search queries based on the cosine distance function. - ### Store documents with embeddings ```python @@ -223,5 +266,5 @@ results = Document.select(Document, distance).where(distance_expression < 0.2).o ## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/tidb-cloud/vector-search-integrate-with-sqlalchemy.md b/vector-search-integrate-with-sqlalchemy.md similarity index 60% rename from tidb-cloud/vector-search-integrate-with-sqlalchemy.md rename to vector-search-integrate-with-sqlalchemy.md index 5e59cfdcf5895..93965e454c6d7 100644 --- a/tidb-cloud/vector-search-integrate-with-sqlalchemy.md +++ b/vector-search-integrate-with-sqlalchemy.md @@ -5,11 +5,19 @@ summary: Learn how to integrate TiDB Vector Search with SQLAlchemy to store embe # Integrate TiDB Vector Search with SQLAlchemy -This tutorial walks you through how to use [SQLAlchemy](https://www.sqlalchemy.org/) to interact with [TiDB Vector Search](/tidb-cloud/vector-search-overview.md), store embeddings, and perform vector search queries. +This tutorial walks you through how to use [SQLAlchemy](https://www.sqlalchemy.org/) to interact with [TiDB Vector Search](/vector-search-overview.md), store embeddings, and perform vector search queries. 
-> **Note** + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** > -> TiDB Vector Search is currently in beta and only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. ## Prerequisites @@ -17,7 +25,24 @@ To complete this tutorial, you need: - [Python 3.8 or higher](https://www.python.org/downloads/) installed. - [Git](https://git-scm.com/downloads) installed. -- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. +- A TiDB cluster. + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- Follow [Deploy a local test TiDB cluster](/quick-start-with-tidb.md#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](/production-deployment-using-tiup.md) to create a local cluster. +- Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. + + + + +**If you don't have a TiDB cluster, you can create one as follows:** + +- (Recommended) Follow [Creating a TiDB Cloud Serverless cluster](/develop/dev-guide-build-cluster-in-cloud.md) to create your own TiDB Cloud cluster. 
+- Follow [Deploy a local test TiDB cluster](https://docs.pingcap.com/tidb/stable/quick-start-with-tidb#deploy-a-local-test-cluster) or [Deploy a production TiDB cluster](https://docs.pingcap.com/tidb/stable/production-deployment-using-tiup) to create a local cluster of v8.4.0 or a later version. + + ## Run the sample app @@ -49,7 +74,7 @@ Install the required dependencies for the demo project: pip install -r requirements.txt ``` -For your existing project, you can install the following packages: +Alternatively, you can install the following packages for your project: ```bash pip install pymysql python-dotenv sqlalchemy tidb-vector @@ -57,6 +82,13 @@ pip install pymysql python-dotenv sqlalchemy tidb-vector ### Step 4. Configure the environment variables +Configure the environment variables depending on the TiDB deployment option you've selected. + + +
+ +For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables: + 1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. 2. Click **Connect** in the upper-right corner. A connection dialog is displayed. @@ -86,6 +118,30 @@ pip install pymysql python-dotenv sqlalchemy tidb-vector TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" ``` +
+
+ +For a TiDB Self-Managed cluster, create a `.env` file in the root directory of your Python project. Copy the following content into the `.env` file, and modify the environment variable values according to the connection parameters of your TiDB cluster: + +```dotenv +TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>" +# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test" +``` + +If you are running TiDB on your local machine, `<HOST>` is `127.0.0.1` by default. The initial `<PASSWORD>` is empty, so if you are starting the cluster for the first time, you can omit this field. + +The following are descriptions for each parameter: + +- `<USERNAME>`: The username to connect to the TiDB cluster. +- `<PASSWORD>`: The password to connect to the TiDB cluster. +- `<HOST>`: The host of the TiDB cluster. +- `<PORT>`: The port of the TiDB cluster. +- `<DATABASE>`: The name of the database you want to connect to. + +
+ +
+ ### Step 5. Run the demo ```bash @@ -145,20 +201,6 @@ class Document(Base): embedding = Column(VectorType(3)) ``` -#### Define a vector column optimized with index - -Define a 3-dimensional vector column and optimize it with a [vector search index](/tidb-cloud/vector-search-index.md) (HNSW index). - -```python -class DocumentWithIndex(Base): - __tablename__ = 'sqlalchemy_demo_documents_with_index' - id = Column(Integer, primary_key=True) - content = Column(Text) - embedding = Column(VectorType(3), comment="hnsw(distance=cosine)") -``` - -TiDB will use this index to accelerate vector search queries based on the cosine distance function. - ### Store documents with embeddings ```python @@ -195,5 +237,5 @@ with Session(engine) as session: ## See also -- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) -- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Vector Data Types](/vector-search-data-types.md) +- [Vector Search Index](/vector-search-index.md) diff --git a/tidb-cloud/vector-search-integration-overview.md b/vector-search-integration-overview.md similarity index 62% rename from tidb-cloud/vector-search-integration-overview.md rename to vector-search-integration-overview.md index d55a1c4aab837..d0f4e51c9cff1 100644 --- a/tidb-cloud/vector-search-integration-overview.md +++ b/vector-search-integration-overview.md @@ -7,9 +7,17 @@ summary: An overview of TiDB vector search integration, including supported AI f This document provides an overview of TiDB vector search integration, including supported AI frameworks, embedding models, and Object Relational Mapping (ORM) libraries. -> **Note** + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. 
+ + + +> **Note:** > -> TiDB Vector Search is currently in beta and is only available for [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless) clusters. +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. ## AI frameworks @@ -17,14 +25,14 @@ TiDB provides official support for the following AI frameworks, enabling you to | AI frameworks | Tutorial | |---------------|---------------------------------------------------------------------------------------------------| -| Langchain | [Integrate Vector Search with LangChain](/tidb-cloud/vector-search-integrate-with-langchain.md) | -| LlamaIndex | [Integrate Vector Search with LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md) | +| Langchain | [Integrate Vector Search with LangChain](/vector-search-integrate-with-langchain.md) | +| LlamaIndex | [Integrate Vector Search with LlamaIndex](/vector-search-integrate-with-llamaindex.md) | Moreover, you can also use TiDB for various purposes, such as document storage and knowledge graph storage for AI applications. ## Embedding models and services -TiDB Vector Search supports storing vectors of up to 16,000 dimensions, which accommodates most embedding models. +TiDB Vector Search supports storing vectors of up to 16383 dimensions, which accommodates most embedding models. You can either use self-deployed open-source embedding models or third-party embedding APIs provided by third-party embedding providers to generate vectors. 
@@ -32,7 +40,7 @@ The following table lists some mainstream embedding service providers and the co | Embedding service providers | Tutorial | |-----------------------------|---------------------------------------------------------------------------------------------------------------------| -| Jina AI | [Integrate Vector Search with Jina AI Embeddings API](/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md) | +| Jina AI | [Integrate Vector Search with Jina AI Embeddings API](/vector-search-integrate-with-jinaai-embedding.md) | ## Object Relational Mapping (ORM) libraries @@ -51,21 +59,21 @@ The following table lists the supported ORM libraries and the corresponding inte Python TiDB Vector Client pip install tidb-vector[client] - Get Started with Vector Search Using Python + Get Started with Vector Search Using Python SQLAlchemy pip install tidb-vector - Integrate TiDB Vector Search with SQLAlchemy + Integrate TiDB Vector Search with SQLAlchemy peewee pip install tidb-vector - Integrate TiDB Vector Search with peewee + Integrate TiDB Vector Search with peewee Django pip install django-tidb[vector] - Integrate TiDB Vector Search with Django + Integrate TiDB Vector Search with Django diff --git a/vector-search-limitations.md b/vector-search-limitations.md new file mode 100644 index 0000000000000..063ddcdd4186c --- /dev/null +++ b/vector-search-limitations.md @@ -0,0 +1,67 @@ +--- +title: Vector Search Limitations +summary: Learn the limitations of the TiDB vector search. +--- + +# Vector Search Limitations + +This document describes the known limitations of TiDB vector search. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. 
+ + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Vector data type limitations + +- Each [vector](/vector-search-data-types.md) supports up to 16383 dimensions. +- Vector data types cannot store `NaN`, `Infinity`, or `-Infinity` values. +- Vector data types cannot store double-precision floating-point numbers. If you insert or store double-precision floating-point numbers in vector columns, TiDB converts them to single-precision floating-point numbers. +- Vector columns cannot be used as primary keys or as part of a primary key. +- Vector columns cannot be used as unique indexes or as part of a unique index. +- Vector columns cannot be used as partition keys or as part of a partition key. +- Currently, TiDB does not support modifying a vector column to other data types (such as `JSON` and `VARCHAR`). + +## Vector index limitations + +See [Vector search restrictions](/vector-search-index.md#restrictions). + +## Compatibility with TiDB tools + + + +- Make sure that you are using v8.4.0 or a later version of BR to back up and restore data. Restoring tables with vector data types to TiDB clusters earlier than v8.4.0 is not supported. +- TiDB Data Migration (DM) does not support migrating or replicating MySQL 9.0 vector data types to TiDB. +- When TiCDC replicates vector data to a downstream that does not support vector data types, it will change the vector data types to another type. For more information, see [Compatibility with vector data types](/ticdc/ticdc-compatibility.md#compatibility-with-vector-data-types). + + + + + +- The Data Migration feature in the TiDB Cloud console does not support migrating or replicating MySQL 9.0 vector data types to TiDB Cloud. 
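
The vector data type limitations listed earlier in this document (at most 16383 dimensions, no `NaN` or `Infinity` values, and storage in single precision) can be checked on the client side before inserting data. The following is a minimal sketch under stated assumptions, not part of TiDB or any TiDB client library: the names `MAX_DIMENSIONS`, `check_vector`, and `to_single_precision` are hypothetical, and the rounding shown is plain IEEE 754 single-precision rounding, which is assumed here to approximate what happens when a double-precision value is stored.

```python
import math
import struct

MAX_DIMENSIONS = 16383  # limit stated in this document


def check_vector(values):
    """Raise ValueError if the vector breaks a documented limitation."""
    if len(values) > MAX_DIMENSIONS:
        raise ValueError("vector has more than 16383 dimensions")
    if any(math.isnan(v) or math.isinf(v) for v in values):
        raise ValueError("NaN, Infinity, and -Infinity cannot be stored")


def to_single_precision(values):
    """Round each value to single precision (assumed to mirror float32 storage)."""
    return [struct.unpack("f", struct.pack("f", v))[0] for v in values]


check_vector([0.1, 0.2, 0.3])          # passes silently
stored = to_single_precision([0.1])    # hypothetical "stored" value
print(stored[0] == 0.1)                # the double 0.1 is not exactly representable
```

A check like this only catches problems early on the application side. The database remains the source of truth for what is accepted.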
+ + + +## Feedback + +We value your feedback and are always here to help: + + + +- [Join our Discord](https://discord.gg/zcqexutz2R) + + + + + +- [Join our Discord](https://discord.gg/zcqexutz2R) +- [Visit our Support Portal](https://tidb.support.pingcap.com/) + + \ No newline at end of file diff --git a/vector-search-overview.md b/vector-search-overview.md new file mode 100644 index 0000000000000..9d149fbb159ff --- /dev/null +++ b/vector-search-overview.md @@ -0,0 +1,88 @@ +--- +title: Vector Search Overview +summary: Learn about Vector Search in TiDB. This feature provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. +--- + +# Vector Search Overview + +TiDB Vector Search provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. This feature enables developers to easily build scalable applications with generative artificial intelligence (AI) capabilities using familiar MySQL skills. + + + +> **Warning:** +> +> The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + + + +> **Warning:** +> +> The vector search feature is in beta. It might be changed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub. + + + +> **Note:** +> +> The vector search feature is only available for TiDB Self-Managed clusters and [TiDB Cloud Serverless](https://docs.pingcap.com/tidbcloud/select-cluster-tier#tidb-cloud-serverless) clusters. + +## Concepts + +Vector search is a search method that prioritizes the meaning of your data to deliver relevant results. 
+ +Unlike traditional full-text search, which relies on exact keyword matching and word frequency, vector search converts various data types (such as text, images, or audio) into high-dimensional vectors and queries based on the similarity between these vectors. This search method captures the semantic meaning and contextual information of the data, leading to a more precise understanding of user intent. + +Even when the search terms do not exactly match the content in the database, vector search can still provide results that align with the user's intent by analyzing the semantics of the data. + +For example, a full-text search for "a swimming animal" only returns results containing these exact keywords. In contrast, vector search can return results for other swimming animals, such as fish or ducks, even if these results do not contain the exact keywords. + +### Vector embedding + +A vector embedding, also known as an embedding, is a sequence of numbers that represents real-world objects in a high-dimensional space. It captures the meaning and context of unstructured data, such as documents, images, audio, and videos. + +Vector embeddings are essential in machine learning and serve as the foundation for semantic similarity searches. + +TiDB introduces [Vector data types](/vector-search-data-types.md) and [Vector search index](/vector-search-index.md) designed to optimize the storage and retrieval of vector embeddings, enhancing their use in AI applications. You can store vector embeddings in TiDB and perform vector search queries to find the most relevant data using these data types. + +### Embedding model + +Embedding models are algorithms that transform data into [vector embeddings](#vector-embedding). + +Choosing an appropriate embedding model is crucial for ensuring the accuracy and relevance of semantic search results. 
For unstructured text data, you can find top-performing text embedding models on the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). + +To learn how to generate vector embeddings for your specific data types, refer to integration tutorials or examples of embedding models. + +## How vector search works + +After converting raw data into vector embeddings and storing them in TiDB, your application can execute vector search queries to find the data most semantically or contextually relevant to a user's query. + +TiDB vector search identifies the top-k nearest neighbor (KNN) vectors by using a [distance function](/vector-search-functions-and-operators.md) to calculate the distance between the given vector and vectors stored in the database. The vectors closest to the given vector in the query represent the most similar data in meaning. + +![The Schematic TiDB Vector Search](/media/vector-search/embedding-search.png) + +As a relational database with integrated vector search capabilities, TiDB enables you to store data and their corresponding vector representations (that is, vector embeddings) together in one database. You can choose either of the following storage approaches: + +- Store data and their corresponding vector representations in different columns of the same table. +- Store data and their corresponding vector representations in different tables. In this case, you need to use `JOIN` queries to combine the tables when retrieving data. + +## Use cases + +### Retrieval-Augmented Generation (RAG) + +Retrieval-Augmented Generation (RAG) is an architecture designed to optimize the output of Large Language Models (LLMs). By using vector search, RAG applications can store vector embeddings in the database and retrieve relevant documents as additional context when the LLM generates responses, thereby improving the quality and relevance of the answers.
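
As a rough illustration of the top-k nearest neighbor retrieval described under "How vector search works", which is also the retrieval step of a RAG application, the following pure-Python sketch ranks stored embeddings by cosine distance to a query vector. The names `cosine_distance` and `top_k` and the 3-dimensional sample embeddings are hypothetical and made up for illustration. In TiDB, the distance is computed in SQL by a distance function, and this sketch is not a description of how TiDB implements the search internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k):
    """Return the k (content, embedding) rows closest to the query vector."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:k]


# Hypothetical documents with made-up 3-dimensional embeddings.
documents = [
    ("fish", [0.9, 0.1, 0.0]),
    ("duck", [0.8, 0.3, 0.1]),
    ("tree", [0.0, 0.2, 0.9]),
]

# Stands in for the embedding of a query such as "a swimming animal".
query_embedding = [1.0, 0.2, 0.0]

nearest = [content for content, _ in top_k(query_embedding, documents, k=2)]
print(nearest)
```

Sorting every row works for a handful of vectors. For larger tables, this exhaustive scan is the kind of work that a vector search index is intended to avoid.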
+ +### Semantic search + +Semantic search is a search technology that returns results based on the meaning of a query, rather than simply matching keywords. It interprets the meaning across different languages and various types of data (such as text, images, and audio) using embeddings. Vector search algorithms then use these embeddings to find the most relevant data that satisfies the user's query. + +### Recommendation engine + +A recommendation engine is a system that proactively suggests content, products, or services that are relevant and personalized to users. It accomplishes this by creating embeddings that represent user behavior and preferences. These embeddings help the system identify similar items that other users have interacted with or shown interest in. This increases the likelihood that the recommendations will be both relevant and appealing to the user. + +## See also + +To get started with TiDB Vector Search, see the following documents: + +- [Get started with vector search using Python](/vector-search-get-started-using-python.md) +- [Get started with vector search using SQL](/vector-search-get-started-using-sql.md)