From f01be889d635a6beefe9c3e425626b9cd5cd997c Mon Sep 17 00:00:00 2001 From: Premkumar Date: Wed, 30 Oct 2024 11:16:08 -0700 Subject: [PATCH 01/15] Adding CBO --- .../statements/cmd_analyze.md | 4 +- .../preview/architecture/docdb/lsm-sst.md | 28 +++++- .../architecture/query-layer/_index.md | 21 +---- .../query-layer/join-strategies.md | 3 +- .../query-layer/planner-optimizer.md | 91 +++++++++++++++++++ .../develop/postgresql-compatibility.md | 8 +- 6 files changed, 130 insertions(+), 25 deletions(-) create mode 100644 docs/content/preview/architecture/query-layer/planner-optimizer.md diff --git a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md index f643324da053..88a10cf695cf 100644 --- a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -14,9 +14,9 @@ type: docs ## Synopsis -ANALYZE collects statistics about the contents of tables in the database, and stores the results in the `pg_statistic` system catalog. These statistics help the query planner to determine the most efficient execution plans for queries. +ANALYZE collects statistics about the contents of tables in the database, and stores the results in the [pg_statistic](../../../../../architecture/system-catalog/#data-statistics), [pg_class](../../../../../architecture/system-catalog/#schema), and [pg_stat_all_tables](../../../../../architecture/system-catalog/#table-activity) system catalogs. These statistics help the query planner to determine the most efficient execution plans for queries. -The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) (CBO) to create optimal execution plans for queries. 
When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. +The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. {{< warning title="Run ANALYZE manually" >}} Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. diff --git a/docs/content/preview/architecture/docdb/lsm-sst.md b/docs/content/preview/architecture/docdb/lsm-sst.md index c0af4ca56d7f..62aae45e1e6f 100644 --- a/docs/content/preview/architecture/docdb/lsm-sst.md +++ b/docs/content/preview/architecture/docdb/lsm-sst.md @@ -12,11 +12,11 @@ menu: type: docs --- -A log-structured merge-tree (LSM tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads. +A [log-structured merge-tree (LSM tree)](https://en.wikipedia.org/wiki/Log-structured_merge-tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads. 
The core idea behind an LSM tree is to separate the write and read paths, allowing writes to be sequential and buffered in memory making them faster than random writes, while reads can still access data efficiently through a hierarchical structure of sorted files on disk. -An LSM tree has 2 primary components - Memtable and SSTs. Let's look into each of them in detail and understand how they work during writes and reads. +An LSM tree has 2 primary components - [Memtable](#memtable) and [SSTs](#sst). Let's look into each of them in detail and understand how they work during writes and reads. {{}} Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses the Raft logs for this purpose. For more details, see [Raft log vs LSM WAL](../performance/#raft-vs-rocksdb-wal-logs). @@ -33,7 +33,9 @@ Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tr All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a Memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. When the Memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that Memtable. -The immutable Memtable is then flushed to disk as an SST (Sorted String Table) file. This process involves writing the key-value pairs from the Memtable to disk in a sorted order, creating an SST file. DocDB maintains one active Memtable, and utmost one immutable Memtable at any point in time. This ensures that write operations can continue to be processed in the active Memtable, when the immutable memtable is being flushed to disk. +## Flush to SST + +The immutable [Memtable](#memtable) is then flushed to disk as an [SST (Sorted String Table)](#sst) file. This process involves writing the key-value pairs from the Memtable to disk in a sorted order, creating an SST file. 
DocDB maintains one active Memtable, and utmost one immutable Memtable at any point in time. This ensures that write operations can continue to be processed in the active Memtable, when the immutable memtable is being flushed to disk. ## SST @@ -45,6 +47,20 @@ Each SST file contains a bloom filter, which is a space-efficient data structure Most LSMs organize SSTS into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0). {{}} +There are 3 core low-level operations that are used to iterate through the data in SST files. + +### Seek + +The **seek** operation is used to locate a specific key or position in an [SST](#sst) file or [Memtable](#memtable). When a seek operation is performed, the system attempts to jump directly to the position of the specified key. If the exact key is not found, seek will position the iterator at the closest key that is greater than or equal to the specified key, enabling efficient range scans or prefix matching. + +### Next + +The **next** operation moves the iterator to the following key in sorted order. It is typically used for sequential reads or scans, where a query iterates over multiple keys, such as retrieving a range of rows. After a [seek](#seek), a sequence of `next` operations can scan through keys in ascending order. + +### Previous + +The **previous** operation moves the iterator to the preceding key in sorted order. It is useful for reverse scans or for reading records in descending order. This is important for cases where backward traversal is required, such as reverse range queries. For example, after [seeking](#seek) to a key near the end of a range, `previous` can be used to iterate through keys in descending order, often needed in order-by-descending queries. + ## Write path When new data is written to the LSM system, it is first inserted into the active Memtable. As the Memtable fills up, it is made immutable and written to disk as an SST file. 
Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups. @@ -53,10 +69,16 @@ When new data is written to the LSM system, it is first inserted into the active To read a key, the LSM tree first checks the Memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs. +## Delete path + +Rather than immediately removing the key from SSTs, the delete operations marks a key as deleted using a tombstone marker, indicating that the key should be ignored in future reads. The actual deletion happens during [compaction](#compaction) when tombstones are removed along with the data they mark as deleted. + ## Compaction As data accumulates in SSTs, a process called compaction merges and sorts the SST files with overlapping key ranges producing a new set of SST files. The merge process during compaction helps to organize and sort the data, maintaining a consistent on-disk format and reclaiming space from obsolete data versions. +The [YB-TServer](../../yb-tserver/) manages multiple [compaction queues](../../yb-tserver/#compaction-queues) and enforces [throttling](../../yb-tserver/#throttled-compactions) to avoid compaction storms. Although full compactions can be [scheduled](../../yb-tserver/#scheduled-full-compactions), they can also be triggered [manually](../../yb-tserver/#manual-compactions). Full compactions are also triggered automatically if the system detects [tombstones and obsolete keys affecting read performance](../../yb-tserver/#statistics-based-full-compactions-to-improve-read-performance). 
+ ## Learn more - [Blog: Background Compactions in YugabyteDB](https://www.yugabyte.com/blog/background-data-compaction/#what-is-a-data-compaction) diff --git a/docs/content/preview/architecture/query-layer/_index.md b/docs/content/preview/architecture/query-layer/_index.md index 686e09b61d8b..6341c09b277f 100644 --- a/docs/content/preview/architecture/query-layer/_index.md +++ b/docs/content/preview/architecture/query-layer/_index.md @@ -58,24 +58,13 @@ Views are realized during this phase. Whenever a query against a view (that is, ### Planner -YugabyteDB needs to determine the most efficient way to execute a query and return the results. This process is handled by the query planner/optimizer component. +The query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution. The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. -The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. +After the optimal plan is determined, the planner generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. -If the query involves joining multiple tables, the planner evaluates different techniques to combine the data: - -- Nested loop join: Scanning one table for each row in the other table. This can be efficient if one table is small or has a good index. 
-- Merge join: Sorting both tables by the join columns and then merging them in parallel. This works well when the tables are already sorted or can be efficiently sorted. -- Hash join: Building a hash table from one table and then scanning the other table to find matches in the hash table. -For queries involving more than two tables, the planner considers different sequences of joining the tables to find the most efficient approach. - -The planner estimates the cost of each possible execution plan and chooses the one expected to be the fastest, taking into account factors like table sizes, indexes, sorting requirements, and so on. - -After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. - -{{}} -The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements. 
-{{}} +{{}} +To know how exactly the query planner decides the optimal path for query execution, see [Query Planner](./planner-optimizer/) +{{}} ### Executor diff --git a/docs/content/preview/architecture/query-layer/join-strategies.md b/docs/content/preview/architecture/query-layer/join-strategies.md index ec8b7461ba65..c3bf97964bcf 100644 --- a/docs/content/preview/architecture/query-layer/join-strategies.md +++ b/docs/content/preview/architecture/query-layer/join-strategies.md @@ -8,10 +8,9 @@ aliases: - /preview/explore/ysql-language-features/join-strategies/ menu: preview: - name: Join strategies identifier: joins-strategies-ysql parent: architecture-query-layer - weight: 100 + weight: 200 type: docs --- diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md new file mode 100644 index 000000000000..cc5030594d5a --- /dev/null +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -0,0 +1,91 @@ +--- +title: Query Planner +headerTitle: Query Planner / CBO +linkTitle: Query Planner +description: Understand the various methodologies used for joining multiple tables +headcontent: Understand how the planner choses the optimal path for query execution +menu: + preview: + identifier: query-planner + parent: architecture-query-layer + weight: 100 +type: docs +--- + +The query planner is responsible for determining the most efficient way to execute a given SQL query. It generates various plans of exection and determines the optimal path by taking into consideration the costs associated various factors like index lookups, scans, CPU usage, network latency, and so on. The primary component that calculates these values is the Cost Based optimizer (CBO). + +{{}} +The Cost-based optimizer is a [YSQL](../../../api/ysql/) only feature. +{{}} + +{{}} +The CBO is disabled by default. 
To enable it, turn ON the [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) flag as: + +```sql +-- Enable for current session +SET yb_enable_base_scans_cost_model = TRUE; + +-- Enable for all new sessions of a user +ALTER USER user SET yb_enable_base_scans_cost_model = TRUE; + +-- Enable for all new sessions on a database +ALTER DATABASE database SET yb_enable_base_scans_cost_model = TRUE; +``` + +{{}} + +Let us understand how this works. + +## Plan search algorithm + +To optimize the search for the best plan, CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the most optimal sub-plans for each piece of the query. The sub-plans are then combined to find the best overall plan. + +## Statistics gathering + +The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data within columns, and the cardinality of results from operations. These statistics are essential in estimating the costs of various query plans accurately. These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. + +{{}} +Currently the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command has to be triggered manually. Multiple projects are in progress to trigger this automatically. +{{}} + +## Cost estimation + +For each potential execution plan, the optimizer calculates costs in terms of I/O, CPU usage, and memory consumption. These costs help the optimizer pragmatically compare which plan would likely be the most efficient to execute given the current database state and query context. 
Some of the factors included in the cost estimation are: + +{{}} +These estimates can be seen when using the `DEBUG` option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command as `EXPLAIN (ANALYZE, DEBUG)`. +{{}} + +### Cost of data fetch + +To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed within the LSM subsystem, are taken into account. + +### Index scan + +As the primary key is part of the base table and that each [SST](../../docdb/lsm-sst) of the base table is sorted in the order of the primary key the primary index lookup cheaper compared to secondary index lookup. Depending on the type of query this distintion is conidered. + +### Pushdown to storage layer + +CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters and distinct clauses. This can considerably reduce the data transer over network. + +### Join strategies + +For queries involving multiple tables the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join) or [Hash](../join-strategies/#hash-join) join and various join orders are evaluated. + +### Data transfer costs + +The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). 
Since each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. _Note_ that the time spent transferring the data will also depend on the network bandwidth. + +## Plan selection + +The CBO evaluates each candidate plan's estimated costs to determine the plan with the lowest cost, which is then selected for execution. This ensures the optimal usage of system resources and improved query performance. + +After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. + +## Plan caching + +The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements. + +## Learn more + +- [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) \ No newline at end of file diff --git a/docs/content/preview/develop/postgresql-compatibility.md b/docs/content/preview/develop/postgresql-compatibility.md index 545f8671be28..50dba2804747 100644 --- a/docs/content/preview/develop/postgresql-compatibility.md +++ b/docs/content/preview/develop/postgresql-compatibility.md @@ -63,7 +63,7 @@ To learn about read committed isolation, see [Read Committed](../../architecture Configuration parameter: `yb_enable_base_scans_cost_model=true` -Cost-based optimizer (CBO) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. 
+[Cost-based optimizer (CBO)](../../../architecture/query-layer/planner-optimizer/) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. {{}} When enabling this parameter, you must run `ANALYZE` on user tables to maintain up-to-date statistics. @@ -72,7 +72,11 @@ When enabling the cost models, ensure that packed row for colocated tables is en {{}} -### Wait-on-conflict concurrency +{{}} +To learn about how the Cost-based optimizer works, see [Query Planner / CBO](../../../architecture/query-layer/planner-optimizer/) +{{}} + +#### Wait-on-conflict concurrency Flag: `enable_wait_queues=true` From a61a0684090648705922417997695502f1ba0856 Mon Sep 17 00:00:00 2001 From: Dwight Hodge Date: Wed, 30 Oct 2024 17:19:59 -0400 Subject: [PATCH 02/15] tidyups --- .../preview/architecture/docdb/lsm-sst.md | 34 +++++++++-------- .../architecture/query-layer/_index.md | 8 ++-- .../query-layer/planner-optimizer.md | 37 ++++++++++--------- 3 files changed, 42 insertions(+), 37 deletions(-) diff --git a/docs/content/preview/architecture/docdb/lsm-sst.md b/docs/content/preview/architecture/docdb/lsm-sst.md index 62aae45e1e6f..5d0e0ee10690 100644 --- a/docs/content/preview/architecture/docdb/lsm-sst.md +++ b/docs/content/preview/architecture/docdb/lsm-sst.md @@ -16,7 +16,7 @@ A [log-structured merge-tree (LSM tree)](https://en.wikipedia.org/wiki/Log-struc The core idea behind an LSM tree is to separate the write and read paths, allowing writes to be sequential and buffered in memory making them faster than random writes, while reads can still access data efficiently through a hierarchical structure of sorted files on disk. -An LSM tree has 2 primary components - [Memtable](#memtable) and [SSTs](#sst). 
Let's look into each of them in detail and understand how they work during writes and reads. +An LSM tree has 2 primary components - [Memtable](#memtable) and [Sorted String Tables (SSTs)](#sst). Let's look into each of them in detail and understand how they work during writes and reads. {{}} Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses the Raft logs for this purpose. For more details, see [Raft log vs LSM WAL](../performance/#raft-vs-rocksdb-wal-logs). @@ -24,60 +24,64 @@ Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses ## Comparison to B-tree -Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree) based storage system. But YugabyteDB had to chose an LSM based storage to build a highly scalable database for of the following reasons. +Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree)-based storage system. But Yugabyte chose LSM-based storage to build a highly scalable database for the following reasons: -- Write operations (insert, update, delete) are more expensive in a B-tree. As it involves random writes and in place node splitting and rebalancing. In an LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch. +- Write operations (insert, update, delete) are more expensive in a B-tree, requiring random writes and in-place node splitting and rebalancing. In LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch. - The append-only nature of LSM makes it more efficient for concurrent write operations. ## Memtable -All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a Memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. 
When the Memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that Memtable. +All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. When the memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that memtable. ## Flush to SST -The immutable [Memtable](#memtable) is then flushed to disk as an [SST (Sorted String Table)](#sst) file. This process involves writing the key-value pairs from the Memtable to disk in a sorted order, creating an SST file. DocDB maintains one active Memtable, and utmost one immutable Memtable at any point in time. This ensures that write operations can continue to be processed in the active Memtable, when the immutable memtable is being flushed to disk. +The immutable [memtable](#memtable) is then flushed to disk as an [SST (Sorted String Table)](#sst) file. This process involves writing the key-value pairs from the memtable to disk in a sorted order, creating an SST file. DocDB maintains one active memtable, and at most one immutable memtable at any point in time. This ensures that write operations can continue to be processed in the active memtable while the immutable memtable is being flushed to disk. ## SST -Each SST (Sorted String Table) file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. 
The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks. +Each SST file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks. Each SST file contains a bloom filter, which is a space-efficient data structure that helps quickly determine whether a key might exist in that file or not, avoiding unnecessary disk reads. {{}} -Most LSMs organize SSTS into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0). +Most LSMs organize SSTs into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0). {{}} -There are 3 core low-level operations that are used to iterate through the data in SST files. +Three core low-level operations are used to iterate through the data in SST files. ### Seek -The **seek** operation is used to locate a specific key or position in an [SST](#sst) file or [Memtable](#memtable). When a seek operation is performed, the system attempts to jump directly to the position of the specified key. If the exact key is not found, seek will position the iterator at the closest key that is greater than or equal to the specified key, enabling efficient range scans or prefix matching. 
+The _seek_ operation is used to locate a specific key or position in an SST file or memtable. When performing a seek, the system attempts to jump directly to the position of the specified key. If the exact key is not found, seek positions the iterator at the closest key that is greater than or equal to the specified key, enabling efficient range scans or prefix matching. ### Next -The **next** operation moves the iterator to the following key in sorted order. It is typically used for sequential reads or scans, where a query iterates over multiple keys, such as retrieving a range of rows. After a [seek](#seek), a sequence of `next` operations can scan through keys in ascending order. +The _next_ operation moves the iterator to the following key in sorted order. It is typically used for sequential reads or scans, where a query iterates over multiple keys, such as retrieving a range of rows. After a seek, a sequence of next operations can scan through keys in ascending order. ### Previous -The **previous** operation moves the iterator to the preceding key in sorted order. It is useful for reverse scans or for reading records in descending order. This is important for cases where backward traversal is required, such as reverse range queries. For example, after [seeking](#seek) to a key near the end of a range, `previous` can be used to iterate through keys in descending order, often needed in order-by-descending queries. +The _previous_ operation moves the iterator to the preceding key in sorted order. It is useful for reverse scans or for reading records in descending order. This is important for cases where backward traversal is required, such as reverse range queries. For example, after seeking to a key near the end of a range, previous can be used to iterate through keys in descending order, often needed in order-by-descending queries. ## Write path -When new data is written to the LSM system, it is first inserted into the active Memtable. 
As the Memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups. +When new data is written to the LSM system, it is first inserted into the active memtable. As the memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups. ## Read Path -To read a key, the LSM tree first checks the Memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs. +To read a key, the LSM tree first checks the memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs. ## Delete path -Rather than immediately removing the key from SSTs, the delete operations marks a key as deleted using a tombstone marker, indicating that the key should be ignored in future reads. The actual deletion happens during [compaction](#compaction) when tombstones are removed along with the data they mark as deleted. +Rather than immediately removing the key from SSTs, the delete operation marks a key as deleted using a tombstone marker, indicating that the key should be ignored in future reads. The actual deletion happens during [compaction](#compaction), when tombstones are removed along with the data they mark as deleted. 
## Compaction As data accumulates in SSTs, a process called compaction merges and sorts the SST files with overlapping key ranges producing a new set of SST files. The merge process during compaction helps to organize and sort the data, maintaining a consistent on-disk format and reclaiming space from obsolete data versions. -The [YB-TServer](../../yb-tserver/) manages multiple [compaction queues](../../yb-tserver/#compaction-queues) and enforces [throttling](../../yb-tserver/#throttled-compactions) to avoid compaction storms. Although full compactions can be [scheduled](../../yb-tserver/#scheduled-full-compactions), they can also be triggered [manually](../../yb-tserver/#manual-compactions). Full compactions are also triggered automatically if the system detects [tombstones and obsolete keys affecting read performance](../../yb-tserver/#statistics-based-full-compactions-to-improve-read-performance). +The [YB-TServer](../../yb-tserver/) manages multiple compaction queues and enforces throttling to avoid compaction storms. Although full compactions can be scheduled, they can also be triggered manually. Full compactions are also triggered automatically if the system detects tombstones and obsolete keys affecting read performance. + +{{}} +To learn more about YB-TServer compaction operations, refer to [YB-TServer](../../yb-tserver/) +{{}} ## Learn more diff --git a/docs/content/preview/architecture/query-layer/_index.md b/docs/content/preview/architecture/query-layer/_index.md index 6341c09b277f..1d5e98839371 100644 --- a/docs/content/preview/architecture/query-layer/_index.md +++ b/docs/content/preview/architecture/query-layer/_index.md @@ -58,21 +58,21 @@ Views are realized during this phase. Whenever a query against a view (that is, ### Planner -The query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution. 
The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. +The YugabyteDB query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution. The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. After the optimal plan is determined, the planner generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. {{}} -To know how exactly the query planner decides the optimal path for query execution, see [Query Planner](./planner-optimizer/) +To learn how the query planner decides the optimal path for query execution, see [Query Planner](./planner-optimizer/) {{}} ### Executor -After the query planner determines the optimal execution plan, the query executor component runs the plan and retrieves the required data. The executor sends appropriate requests to the other YB-TServers that hold the needed data to performs sorts, joins, aggregations, and then evaluates qualifications and finally returns the derived rows. +After the query planner determines the optimal execution plan, the executor runs the plan and retrieves the required data. The executor sends requests to the other YB-TServers that hold the data needed to perform sorts, joins, and aggregations, then evaluates qualifications, and finally returns the derived rows. 
The executor works in a step-by-step fashion, recursively processing the plan from top to bottom. Each node in the plan tree is responsible for fetching or computing rows of data as requested by its parent node. -For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to get rows from them. +For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to retrieve rows. A child node may be a "Sort" node, which requests rows from its child, sorts them, and returns the sorted rows. The bottom-most child could be a "Sequential Scan" node that reads rows directly from a table. diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index cc5030594d5a..eb44d33f50eb 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -3,7 +3,9 @@ title: Query Planner headerTitle: Query Planner / CBO linkTitle: Query Planner description: Understand the various methodologies used for joining multiple tables -headcontent: Understand how the planner choses the optimal path for query execution +headcontent: Understand how the planner chooses the optimal path for query execution +tags: + feature: early-access menu: preview: identifier: query-planner @@ -12,14 +14,15 @@ menu: type: docs --- -The query planner is responsible for determining the most efficient way to execute a given SQL query. It generates various plans of exection and determines the optimal path by taking into consideration the costs associated various factors like index lookups, scans, CPU usage, network latency, and so on. 
The primary component that calculates these values is the Cost Based optimizer (CBO). +The query planner is responsible for determining the most efficient way to execute a given SQL query. It generates various plans of execution and determines the optimal path by taking into consideration the costs associated various factors like index lookups, scans, CPU usage, network latency, and so on. The primary component that calculates these values is the cost-based optimizer (CBO). {{}} -The Cost-based optimizer is a [YSQL](../../../api/ysql/) only feature. +CBO is [YSQL](../../../api/ysql/) only. {{}} {{}} -The CBO is disabled by default. To enable it, turn ON the [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) flag as: + +The CBO is {{}} and disabled by default. To enable it, turn ON the [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) flag as follows: ```sql -- Enable for current session @@ -34,51 +37,49 @@ ALTER DATABASE database SET yb_enable_base_scans_cost_model = TRUE; {{}} -Let us understand how this works. - ## Plan search algorithm -To optimize the search for the best plan, CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the most optimal sub-plans for each piece of the query. The sub-plans are then combined to find the best overall plan. +To optimize the search for the best plan, CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the most optimal sub-plans for each part of the query. The sub-plans are then combined to find the best overall plan. 
## Statistics gathering -The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data within columns, and the cardinality of results from operations. These statistics are essential in estimating the costs of various query plans accurately. These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. +The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data in columns, and the cardinality of results from operations. These statistics are essential for estimating the costs of various query plans accurately. These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display-friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. -{{}} -Currently the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command has to be triggered manually. Multiple projects are in progress to trigger this automatically. +{{< note title="Run ANALYZE manually" >}} +Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. If you have enabled CBO, you must run ANALYZE on user tables after data load for the CBO to create optimal execution plans. Multiple projects are in progress to trigger this automatically. {{}} ## Cost estimation -For each potential execution plan, the optimizer calculates costs in terms of I/O, CPU usage, and memory consumption. These costs help the optimizer pragmatically compare which plan would likely be the most efficient to execute given the current database state and query context. 
Some of the factors included in the cost estimation are: +For each potential execution plan, the optimizer calculates costs in terms of I/O, CPU usage, and memory consumption. These costs help the optimizer compare which plan would likely be the most efficient to execute given the current database state and query context. Some of the factors included in the cost estimation are: {{}} -These estimates can be seen when using the `DEBUG` option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command as `EXPLAIN (ANALYZE, DEBUG)`. +These estimates can be seen when using the DEBUG option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command as EXPLAIN (ANALYZE, DEBUG). {{}} ### Cost of data fetch -To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed within the LSM subsystem, are taken into account. +To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. ### Index scan -As the primary key is part of the base table and that each [SST](../../docdb/lsm-sst) of the base table is sorted in the order of the primary key the primary index lookup cheaper compared to secondary index lookup. Depending on the type of query this distintion is conidered. +As the primary key is part of the base table and that each [SST](../../docdb/lsm-sst) of the base table is sorted in the order of the primary key the primary index lookup cheaper compared to secondary index lookup. 
Depending on the type of query this distinction is considered. ### Pushdown to storage layer -CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters and distinct clauses. This can considerably reduce the data transer over network. +CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. ### Join strategies -For queries involving multiple tables the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join) or [Hash](../join-strategies/#hash-join) join and various join orders are evaluated. +For queries involving multiple tables, CBO evaluates the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join), or [Hash](../join-strategies/#hash-join) join, as well as various join orders. ### Data transfer costs -The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). Since each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. _Note_ that the time spent transferring the data will also depend on the network bandwidth. +The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. 
The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). Because each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. Note that the time spent transferring the data also depends on the network bandwidth. ## Plan selection -The CBO evaluates each candidate plan's estimated costs to determine the plan with the lowest cost, which is then selected for execution. This ensures the optimal usage of system resources and improved query performance. +The CBO evaluates each candidate plan's estimated costs to determine the plan with the lowest cost, which is then selected for execution. This ensures the optimal use of system resources and improved query performance. After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. From 4ee6937ee2ebbb331e6461fe45a1dd7c05925b2f Mon Sep 17 00:00:00 2001 From: Dwight Hodge Date: Wed, 30 Oct 2024 17:24:43 -0400 Subject: [PATCH 03/15] edit --- docs/content/preview/architecture/query-layer/_index.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/content/preview/architecture/query-layer/_index.md b/docs/content/preview/architecture/query-layer/_index.md index 1d5e98839371..ba6852705c86 100644 --- a/docs/content/preview/architecture/query-layer/_index.md +++ b/docs/content/preview/architecture/query-layer/_index.md @@ -58,9 +58,13 @@ Views are realized during this phase. 
Whenever a query against a view (that is, ### Planner -The YugabyteDB query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution. The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. +The YugabyteDB query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution. -After the optimal plan is determined, the planner generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. +The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. + +After determining the optimal plan, the planner generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. + +The execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. 
{{}} To learn how the query planner decides the optimal path for query execution, see [Query Planner](./planner-optimizer/) From be2756dc64d618e4f72eb77a46f45466863114c1 Mon Sep 17 00:00:00 2001 From: Premkumar Date: Thu, 31 Oct 2024 09:10:03 -0700 Subject: [PATCH 04/15] fixes from review --- .../preview/architecture/query-layer/planner-optimizer.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index eb44d33f50eb..8b6c2fcc6aec 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -51,19 +51,21 @@ Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to ## Cost estimation -For each potential execution plan, the optimizer calculates costs in terms of I/O, CPU usage, and memory consumption. These costs help the optimizer compare which plan would likely be the most efficient to execute given the current database state and query context. Some of the factors included in the cost estimation are: +For each potential execution plan, the optimizer calculates costs in terms of I/O, CPU usage, and memory consumption. These costs help the optimizer compare which plan would likely be the most efficient to execute given the current database state and query context. {{}} These estimates can be seen when using the DEBUG option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command as EXPLAIN (ANALYZE, DEBUG). {{}} +Some of the factors included in the cost estimation are discussed below. 
+ ### Cost of data fetch To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. ### Index scan -As the primary key is part of the base table and that each [SST](../../docdb/lsm-sst) of the base table is sorted in the order of the primary key the primary index lookup cheaper compared to secondary index lookup. Depending on the type of query this distinction is considered. +When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. However, this isn’t an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. 
### Pushdown to storage layer From ca8024c834aae7ec9f6110df4ed8f808647af5ce Mon Sep 17 00:00:00 2001 From: Dwight Hodge <79169168+ddhodge@users.noreply.github.com> Date: Thu, 31 Oct 2024 12:15:44 -0400 Subject: [PATCH 05/15] Update docs/content/preview/architecture/query-layer/planner-optimizer.md --- .../preview/architecture/query-layer/planner-optimizer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index 8b6c2fcc6aec..9090f8fd7faf 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -65,7 +65,7 @@ To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors suc ### Index scan -When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. However, this isn’t an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. +When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. However, this isn't an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. 
### Pushdown to storage layer From 443332cf34d30d150087c4f6c9485b027dacb7f9 Mon Sep 17 00:00:00 2001 From: Premkumar Date: Thu, 31 Oct 2024 15:27:04 -0700 Subject: [PATCH 06/15] added RBO and other details --- .../query-layer/planner-optimizer.md | 84 ++++++++++++++----- 1 file changed, 62 insertions(+), 22 deletions(-) diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index 9090f8fd7faf..32c83ee9e2b6 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -12,36 +12,57 @@ menu: parent: architecture-query-layer weight: 100 type: docs +rightnav: + hideH4: true --- -The query planner is responsible for determining the most efficient way to execute a given SQL query. It generates various plans of execution and determines the optimal path by taking into consideration the costs associated various factors like index lookups, scans, CPU usage, network latency, and so on. The primary component that calculates these values is the cost-based optimizer (CBO). +The query planner is responsible for determining the most efficient way to execute a given SQL query. The optimizer component of the planner generates various plans of execution and determines the optimal path by taking into consideration the costs associated with various factors like index lookups, scans, CPU usage, network latency, and so on. YugabyteDB supports three different types of optimizers. The primary component that calculates these values is the cost-based optimizer (CBO). + +## Rule-based optimizer + +This is the basic, default optimizer in YugabyteDB. It operates by applying a predefined set of rules to optimize queries, such as reordering joins to minimize the number of rows processed, pushing selection conditions down the query tree, and utilizing indexes and views to enhance performance. 
While the RBO is effective for simpler queries, it faces challenges with more complex queries because it does not account for the actual costs of execution plans, like I/O and CPU costs. + +## Default Cost based optimizer {{}} CBO is [YSQL](../../../api/ysql/) only. {{}} -{{}} +The Cost-Based Optimizer (CBO) selects the most efficient execution plan for a query by estimating the "cost" of different plan options. It evaluates factors such as disk I/O, CPU, and memory usage to assign a cost to each possible execution path. The optimizer relies on configurable cost parameters and table statistics, including row counts and data distribution, to estimate how selective each query condition is, which helps minimize data scanned and reduce resource usage. + +The default cost model for evaluating execution path costs in YugabyteDB is based on PostgreSQL's model. It relies on basic statistics, such as the number of rows in tables and whether an index can be utilized for a specific query, which works well for most queries. However, since this model was originally designed for a single-node database (PostgreSQL), it doesn’t account for YugabyteDB’s distributed nature or leverage cluster topology in plan generation. + +{{}} -The CBO is {{}} and disabled by default. To enable it, turn ON the [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) flag as follows: +The default CBO is {{}} and disabled by default. 
To enable it, turn ON the [yb_enable_optimizer_statistics](../../../reference/configuration/yb-tserver/#yb-enable-optimizer-statistics) configuration parameter as follows: ```sql -- Enable for current session -SET yb_enable_base_scans_cost_model = TRUE; +SET yb_enable_optimizer_statistics = TRUE; +``` + +{{}} + +## YugabyteDB Cost model + +To account for the distributed nature of the data, YugabyteDB introduces an advanced cost model that takes into consideration the cost of network requests, operations on the lower-level storage layer, and the cluster topology. Let us see in detail how this works. --- Enable for all new sessions of a user -ALTER USER user SET yb_enable_base_scans_cost_model = TRUE; +{{}} + +The YugabyteDB CBO is {{}} and disabled by default. To enable it, turn ON the [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) configuration parameter as follows: --- Enable for all new sessions on a database -ALTER DATABASE database SET yb_enable_base_scans_cost_model = TRUE; +```sql +-- Enable for current session +SET yb_enable_base_scans_cost_model = TRUE; ``` {{}} -## Plan search algorithm +### Plan search algorithm To optimize the search for the best plan, CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the most optimal sub-plans for each part of the query. The sub-plans are then combined to find the best overall plan. -## Statistics gathering +### Statistics gathering The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data in columns, and the cardinality of results from operations. These statistics are essential for estimating the costs of various query plans accurately. 
These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display-friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. @@ -49,7 +70,7 @@ The optimizer relies on accurate statistics about the tables, including the numb Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. If you have enabled CBO, you must run ANALYZE on user tables after data load for the CBO to create optimal execution plans. Multiple projects are in progress to trigger this automatically. {{}} -## Cost estimation +### Cost estimation For each potential execution plan, the optimizer calculates costs in terms of I/O, CPU usage, and memory consumption. These costs help the optimizer compare which plan would likely be the most efficient to execute given the current database state and query context. @@ -59,25 +80,25 @@ These estimates can be seen when using the DEBUG option in the [EXPLAIN](../../. Some of the factors included in the cost estimation are discussed below. -### Cost of data fetch +1. **Cost of data fetch** -To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. + To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. -### Index scan +1. 
**Index scan** -When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. However, this isn't an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. + When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. However, this isn't an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. -### Pushdown to storage layer +1. Pushdown to storage layer -CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. + CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. -### Join strategies +1. Join strategies -For queries involving multiple tables, CBO evaluates the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join), or [Hash](../join-strategies/#hash-join) join, as well as various join orders. + For queries involving multiple tables, CBO evaluates the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join), or [Hash](../join-strategies/#hash-join) join, as well as various join orders. -### Data transfer costs +1. 
Data transfer costs -The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). Because each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. Note that the time spent transferring the data also depends on the network bandwidth. + The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). Because each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. Note that the time spent transferring the data also depends on the network bandwidth. ## Plan selection @@ -89,6 +110,25 @@ After the optimal plan is determined, YugabyteDB generates a detailed execution The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements. +## Switching to the default CBO + +To switch back to the default cost model after trying out the YugabyteDB cost model, follow these instructions. + +1. Turn off the base scans cost model as follows: + + ```sql + SET yb_enable_base_scans_cost_model = FALSE; + ``` + +1. Reset statistics collected with the ANALYZE command as follows: + + ```sql + SELECT yb_reset_analyze_statistics ( table_oid ); + ``` + + If table_oid is NULL, this function resets the statistics for all the tables in the current database that the user can analyze. 
+ ## Learn more -- [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) \ No newline at end of file +- [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) +- [YugabyteDB Cost-Based Optimizer](https://dev.to/yugabyte/yugabytedb-cost-based-optimizer-and-cost-model-for-distributed-lsm-tree-1hb4) \ No newline at end of file From 6a22efbed7218635aab0e7ed47ad1668b61e885b Mon Sep 17 00:00:00 2001 From: Premkumar Date: Mon, 4 Nov 2024 10:53:06 -0800 Subject: [PATCH 07/15] minor fixes --- .../architecture/query-layer/planner-optimizer.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index 32c83ee9e2b6..a1b09d6d3cda 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -2,7 +2,6 @@ title: Query Planner headerTitle: Query Planner / CBO linkTitle: Query Planner -description: Understand the various methodologies used for joining multiple tables headcontent: Understand how the planner chooses the optimal path for query execution tags: feature: early-access @@ -80,7 +79,7 @@ These estimates can be seen when using the DEBUG option in the [EXPLAIN](../../. Some of the factors included in the cost estimation are discussed below. -1. **Cost of data fetch** +1. **Data fetch** To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. @@ -88,15 +87,15 @@ Some of the factors included in the cost estimation are discussed below. 
When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. However, this isn't an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. -1. Pushdown to storage layer +1. **Pushdown to storage layer** CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. -1. Join strategies +1. **Join strategies** For queries involving multiple tables, CBO evaluates the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join), or [Hash](../join-strategies/#hash-join) join, as well as various join orders. -1. Data transfer costs +1. **Data transfer** The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). Because each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. Note that the time spent transferring the data also depends on the network bandwidth. 
From 2141862a30ca1b0517c8a3e5253545f86887d2fe Mon Sep 17 00:00:00 2001 From: Premkumar Date: Tue, 5 Nov 2024 11:52:01 -0800 Subject: [PATCH 08/15] Adding yb_reset_analyze_statistics correctly --- .../statements/cmd_analyze.md | 12 ++++++++++ .../query-layer/planner-optimizer.md | 24 +++---------------- 2 files changed, 15 insertions(+), 21 deletions(-) diff --git a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md index 88a10cf695cf..754f2b7a06ce 100644 --- a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -51,6 +51,18 @@ Table name to be analyzed; may be schema-qualified. Optional. Omit to analyze al List of columns to be analyzed. Optional. Omit to analyze all columns of the table. +## Resetting statistics + +Over time, statistics can reach a point where they no longer represent the current workload accurately. Resetting allows you to measure the impact of recent changes, like optimizations or new queries, without the influence of historical data. Also when diagnosing issues, fresh statistics can help pinpoint current issues more effectively, rather than having to sift through historical data that may not be relevant. + +The `yb_reset_analyze_statistics` function is a convenient helper that offers an easy way to clear statistics collected for a specific table or for all tables within a database. This function can be called as, + +```sql +SELECT yb_reset_analyze_statistics ( table_oid ); +``` + +If table_oid is NULL, this function resets the statistics for all the tables in the current database that the user can analyze. 
+ ## Examples ### Analyze a single table diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index a1b09d6d3cda..4ea81e64ba84 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -15,7 +15,7 @@ rightnav: hideH4: true --- -The query planner is responsible for determining the most efficient way to execute a given SQL query. The optimizer component of the planner generates various plans of execution and determines the optimal path by taking into consideration the costs associated various factors like index lookups, scans, CPU usage, network latency, and so on. YugabyteDB supports 3 different types of optimizers. The primary component that calculates these values is the cost-based optimizer (CBO). +The query planner is responsible for determining the most efficient way to execute a given SQL query. TThe planner's optimizer calculates the costs of different execution plans, taking into account factors like index lookups, table scans, CPU usage, and network latency. It then selects the most cost-effective path for query execution. YugabyteDB supports 3 different types of optimizers. The query planner, also known as CBO comprises primarily of selectivity estimation and cost modeling. We have implemented a new cost model for YugabyteDB which improves the accuracy of the CBO. ## Rule-based optimizer @@ -63,7 +63,7 @@ To optimize the search for the best plan, CBO uses a dynamic programming-based a ### Statistics gathering -The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data in columns, and the cardinality of results from operations. These statistics are essential for estimating the costs of various query plans accurately. 
These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display-friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. +The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data in columns, and the cardinality of results from operations. These statistics are essential for estimating the selectivity of filters and costs of various query plans accurately. These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display-friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. {{< note title="Run ANALYZE manually" >}} Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. If you have enabled CBO, you must run ANALYZE on user tables after data load for the CBO to create optimal execution plans. Multiple projects are in progress to trigger this automatically. @@ -81,7 +81,7 @@ Some of the factors included in the cost estimation are discussed below. 1. **Data fetch** - To estimate the cost of fetching a tuple from [DocDB](../../docdb/), factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. 
+ To estimate the cost of fetching a tuple from [DocDB](../../docdb/), CBO takes into account factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. 1. **Index scan** @@ -109,24 +109,6 @@ After the optimal plan is determined, YugabyteDB generates a detailed execution The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements. -## Switching to the default CBO - -In case you need to switch back to the default cost model after trying out YugabyteDB cost model, you need to follow these instructions. - -1. Turn of the base scans cost model as follows: - - ```sql - SET yb_enable_base_scans_cost_model = FALSE; - ``` - -1. Reset statistics collected with the ANALYZE command as follows: - - ```sql - SELECT yb_reset_analyze_statistics ( table_oid ); - ``` - - If table_oid is NULL, this function resets the statistics for all the tables in the current database that the user can analyze. 
- ## Learn more - [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) From 28ef592f3aa6fcea9a360c6f083205901df128b4 Mon Sep 17 00:00:00 2001 From: Premkumar Date: Thu, 5 Dec 2024 15:55:03 -0800 Subject: [PATCH 09/15] feedback from Mihnea --- .../query-layer/planner-optimizer.md | 32 ++++++++----------- 1 file changed, 13 insertions(+), 19 deletions(-) diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index 4ea81e64ba84..fcf5285ffaee 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -15,21 +15,15 @@ rightnav: hideH4: true --- -The query planner is responsible for determining the most efficient way to execute a given SQL query. TThe planner's optimizer calculates the costs of different execution plans, taking into account factors like index lookups, table scans, CPU usage, and network latency. It then selects the most cost-effective path for query execution. YugabyteDB supports 3 different types of optimizers. The query planner, also known as CBO comprises primarily of selectivity estimation and cost modeling. We have implemented a new cost model for YugabyteDB which improves the accuracy of the CBO. +The query planner is responsible for determining the most efficient way to execute a given query. The optimizer is the critical component in the planner that calculates the costs of different execution plans, taking into account factors like index lookups, table scans, network round trips and storage costs. It then selects the most cost-effective path for query execution. YugabyteDB implements completely different types of optimizers for the YSQL and YCQL APIs. -## Rule-based optimizer +## Rule based optimizer (YCQL) -This is the basic, default optimizer in YugabyteDB. 
It operates by applying a predefined set of rules to optimize queries, such as reordering joins to minimize the number of rows processed, pushing selection conditions down the query tree, and utilizing indexes and views to enhance performance. While the RBO is effective for simpler queries, it faces challenges with more complex queries because it does not account for the actual costs of execution plans, like I/O and CPU costs. +YugabyteDB implements a simple rules based optimizer (RBO) for YCQL. It operates by applying a predefined set of rules to optimize queries, such as reordering joins to minimize the number of rows processed, pushing selection conditions down the query tree, and utilizing indexes and views to enhance performance. -## Default Cost based optimizer +## Heuristics based optimizer (YSQL) -{{}} -CBO is [YSQL](../../../api/ysql/) only. -{{}} - -The Cost-Based Optimizer (CBO) selects the most efficient execution plan for a query by estimating the "cost" of different plan options. It evaluates factors such as disk I/O, CPU, and memory usage to assign a cost to each possible execution path. The optimizer relies on configurable cost parameters and table statistics, including row counts and data distribution, to estimate how selective each query condition is, which helps minimize data scanned and reduce resource usage. - -The default cost model for evaluating execution path costs in YugabyteDB is based on PostgreSQL's model. It relies on basic statistics, such as the number of rows in tables and whether an index can be utilized for a specific query, which works well for most queries. However, since this model was originally designed for a single-node database (PostgreSQL), it doesn’t account for YugabyteDB’s distributed nature or leverage cluster topology in plan generation. +YugabyteDB’s YSQL API uses a simple heuristics based optimizer to determine the most efficient execution plan for a query. 
It relies on basic statistics, like table sizes, and applies heuristics to estimate the cost of different plans. The cost model is based on PostgreSQL’s approach, using data such as row counts and index availability and assigns some heuristic costs to the number of result rows depending on the type of the scan. Although this works well for most queries, because this model was designed for single-node databases like PostgreSQL, it doesn’t account for YugabyteDB’s distributed architecture or take cluster topology into consideration during query planning. {{}} @@ -42,9 +36,9 @@ SET yb_enable_optimizer_statistics = TRUE; {{}} -## YugabyteDB Cost model +## Cost based optimizer - CBO (YSQL) -To account for the distributed nature of the data, YugabyteDB introduces an advanced cost model that takes into consideration the cost of network requests, operations on lower level storage layer and the cluster toplogy. Let us see in detail how this works. +To account for the distributed nature of the data, YugabyteDB has implemented a Cost based optimizer for YSQL that uses an advanced cost model that takes into consideration of accurate table statistics, the cost of network round trips, operations on lower level storage layer and the cluster toplogy. Let us see in detail how this works. {{}} @@ -59,19 +53,19 @@ SET yb_enable_base_scans_cost_model = TRUE; ### Plan search algorithm -To optimize the search for the best plan, CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the most optimal sub-plans for each part of the query. The sub-plans are then combined to find the best overall plan. +To optimize the search for the best plan, the CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the most optimal sub-plans for each part of the query. 
The sub-plans are then combined to find the best overall plan. ### Statistics gathering The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data in columns, and the cardinality of results from operations. These statistics are essential for estimating the selectivity of filters and costs of various query plans accurately. These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display-friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. -{{< note title="Run ANALYZE manually" >}} +{{}} Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. If you have enabled CBO, you must run ANALYZE on user tables after data load for the CBO to create optimal execution plans. Multiple projects are in progress to trigger this automatically. {{}} ### Cost estimation -For each potential execution plan, the optimizer calculates costs in terms of I/O, CPU usage, and memory consumption. These costs help the optimizer compare which plan would likely be the most efficient to execute given the current database state and query context. +For each potential execution plan, the optimizer calculates costs in terms of storage layer lookups both cache and disk, number of network round trips and other factors. These costs help the optimizer compare which plan would likely be the most efficient to execute given the current database state and query context. {{}} These estimates can be seen when using the DEBUG option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command as EXPLAIN (ANALYZE, DEBUG). @@ -81,7 +75,7 @@ Some of the factors included in the cost estimation are discussed below. 1. 
**Data fetch** - To estimate the cost of fetching a tuple from [DocDB](../../docdb/), CBO takes into account factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem, are taken into account. + To estimate the cost of fetching a tuple from [DocDB](../../docdb/), the CBO takes into account factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem. 1. **Index scan** @@ -89,11 +83,11 @@ Some of the factors included in the cost estimation are discussed below. 1. **Pushdown to storage layer** - CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. + The CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. 1. **Join strategies** - For queries involving multiple tables, CBO evaluates the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join), or [Hash](../join-strategies/#hash-join) join, as well as various join orders. + For queries involving multiple tables, the CBO evaluates the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join), or [Hash](../join-strategies/#hash-join) join, as well as various join orders. 1. 
**Data transfer** From 4b0ec3a5c066765eed0c76bc5e17c555a81fa792 Mon Sep 17 00:00:00 2001 From: Premkumar Date: Fri, 10 Jan 2025 15:30:19 -0800 Subject: [PATCH 10/15] feedback from Mihnea --- .../architecture/query-layer/planner-optimizer.md | 15 --------------- 1 file changed, 15 deletions(-) diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index fcf5285ffaee..1443756676b7 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -25,17 +25,6 @@ YugabyteDB implements a simple rules based optimizer (RBO) for YCQL. It operates YugabyteDB’s YSQL API uses a simple heuristics based optimizer to determine the most efficient execution plan for a query. It relies on basic statistics, like table sizes, and applies heuristics to estimate the cost of different plans. The cost model is based on PostgreSQL’s approach, using data such as row counts and index availability and assigns some heuristic costs to the number of result rows depending on the type of the scan. Although this works well for most queries, because this model was designed for single-node databases like PostgreSQL, it doesn’t account for YugabyteDB’s distributed architecture or take cluster topology into consideration during query planning. -{{}} - -The default CBO is {{}} and disabled by default. 
To enable it, turn ON the [yb_enable_optimizer_statistics](../../../reference/configuration/yb-tserver/#yb-enable-optimizer-statistics) configuration parameter as follows: - -```sql --- Enable for current session -SET yb_enable_optimizer_statistics = TRUE; -``` - -{{}} - ## Cost based optimizer - CBO (YSQL) To account for the distributed nature of the data, YugabyteDB has implemented a Cost based optimizer for YSQL that uses an advanced cost model that takes into consideration of accurate table statistics, the cost of network round trips, operations on lower level storage layer and the cluster toplogy. Let us see in detail how this works. @@ -99,10 +88,6 @@ The CBO evaluates each candidate plan's estimated costs to determine the plan wi After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. -## Plan caching - -The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements. 
- ## Learn more - [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) From 742269ae71bd1b0583962053fd79d20dc79ae841 Mon Sep 17 00:00:00 2001 From: Premkumar Date: Tue, 21 Jan 2025 16:10:18 -0800 Subject: [PATCH 11/15] fix links --- docs/content/preview/develop/postgresql-compatibility.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/content/preview/develop/postgresql-compatibility.md b/docs/content/preview/develop/postgresql-compatibility.md index 50dba2804747..ef4f4f362395 100644 --- a/docs/content/preview/develop/postgresql-compatibility.md +++ b/docs/content/preview/develop/postgresql-compatibility.md @@ -72,8 +72,8 @@ When enabling the cost models, ensure that packed row for colocated tables is en {{}} -{{}} -To learn about how the Cost-based optimizer works, see [Query Planner / CBO](../../../architecture/query-layer/planner-optimizer/) +{{}} +To learn about how the Cost-based optimizer works, see [Query Planner / CBO](../../architecture/query-layer/planner-optimizer/) {{}} #### Wait-on-conflict concurrency From 0d757c0b268d68455865f138cf4f7d7281f6ee98 Mon Sep 17 00:00:00 2001 From: Dwight Hodge <79169168+ddhodge@users.noreply.github.com> Date: Wed, 22 Jan 2025 10:47:28 -0500 Subject: [PATCH 12/15] Apply suggestions from code review --- .../statements/cmd_analyze.md | 4 ++-- .../query-layer/planner-optimizer.md | 20 +++++++++---------- .../develop/postgresql-compatibility.md | 4 ++-- 3 files changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md index 754f2b7a06ce..691c89ceb6ef 100644 --- a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -53,9 +53,9 @@ List of columns to be analyzed. Optional. 
Omit to analyze all columns of the tab ## Resetting statistics -Over time, statistics can reach a point where they no longer represent the current workload accurately. Resetting allows you to measure the impact of recent changes, like optimizations or new queries, without the influence of historical data. Also when diagnosing issues, fresh statistics can help pinpoint current issues more effectively, rather than having to sift through historical data that may not be relevant. +Over time, statistics can reach a point where they no longer represent the current workload accurately. Resetting allows you to measure the impact of recent changes, like optimizations or new queries, without the influence of historical data. Also, when diagnosing issues, fresh statistics can help pinpoint current issues more effectively, rather than having to sift through historical data that may not be relevant. -The `yb_reset_analyze_statistics` function is a convenient helper that offers an easy way to clear statistics collected for a specific table or for all tables within a database. This function can be called as, +The `yb_reset_analyze_statistics()` function is a convenient helper that offers an easy way to clear statistics collected for a specific table or for all tables in a database. Call this function as follows: ```sql SELECT yb_reset_analyze_statistics ( table_oid ); diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index 1443756676b7..a212509efe12 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -15,19 +15,19 @@ rightnav: hideH4: true --- -The query planner is responsible for determining the most efficient way to execute a given query. 
The optimizer is the critical component in the planner that calculates the costs of different execution plans, taking into account factors like index lookups, table scans, network round trips and storage costs. It then selects the most cost-effective path for query execution. YugabyteDB implements completely different types of optimizers for the YSQL and YCQL APIs. +The query planner is responsible for determining the most efficient way to execute a given query. The optimizer is the critical component in the planner that calculates the costs of different execution plans, taking into account factors like index lookups, table scans, network round trips, and storage costs. It then selects the most cost-effective path for query execution. YugabyteDB implements completely different types of optimizers for the YSQL and YCQL APIs. ## Rule based optimizer (YCQL) -YugabyteDB implements a simple rules based optimizer (RBO) for YCQL. It operates by applying a predefined set of rules to optimize queries, such as reordering joins to minimize the number of rows processed, pushing selection conditions down the query tree, and utilizing indexes and views to enhance performance. +YugabyteDB implements a simple rules-based optimizer (RBO) for YCQL. It operates by applying a predefined set of rules to optimize queries, such as reordering joins to minimize the number of rows processed, pushing selection conditions down the query tree, and using indexes and views to enhance performance. ## Heuristics based optimizer (YSQL) -YugabyteDB’s YSQL API uses a simple heuristics based optimizer to determine the most efficient execution plan for a query. It relies on basic statistics, like table sizes, and applies heuristics to estimate the cost of different plans. The cost model is based on PostgreSQL’s approach, using data such as row counts and index availability and assigns some heuristic costs to the number of result rows depending on the type of the scan. 
Although this works well for most queries, because this model was designed for single-node databases like PostgreSQL, it doesn’t account for YugabyteDB’s distributed architecture or take cluster topology into consideration during query planning. +YugabyteDB’s YSQL API uses a simple heuristics based optimizer to determine the most efficient execution plan for a query. It relies on basic statistics, like table sizes, and applies heuristics to estimate the cost of different plans. The cost model is based on PostgreSQL’s approach, using data such as row counts and index availability, and assigns some heuristic costs to the number of result rows depending on the type of scan. Although this works well for most queries, because this model was designed for single-node databases like PostgreSQL, it doesn’t account for YugabyteDB’s distributed architecture or take cluster topology into consideration during query planning. -## Cost based optimizer - CBO (YSQL) +## Cost based optimizer (YSQL) -To account for the distributed nature of the data, YugabyteDB has implemented a Cost based optimizer for YSQL that uses an advanced cost model that takes into consideration of accurate table statistics, the cost of network round trips, operations on lower level storage layer and the cluster toplogy. Let us see in detail how this works. +To account for the distributed nature of the data, YugabyteDB has implemented a Cost based optimizer (CBO) for YSQL that uses an advanced cost model. The model considers accurate table statistics, the cost of network round trips, operations on lower level storage layer, and the cluster topology. {{}} @@ -54,17 +54,17 @@ Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to ### Cost estimation -For each potential execution plan, the optimizer calculates costs in terms of storage layer lookups both cache and disk, number of network round trips and other factors.
These costs help the optimizer compare which plan would likely be the most efficient to execute given the current database state and query context. +For each potential execution plan, the optimizer calculates costs in terms of storage layer lookups (both cache and disk), number of network round trips, and other factors. These costs help the optimizer compare which plan is likely to be the most efficient to execute given the current database state and query context. {{}} -These estimates can be seen when using the DEBUG option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command as EXPLAIN (ANALYZE, DEBUG). +You can see these estimates when using the DEBUG option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command, as in EXPLAIN (ANALYZE, DEBUG). {{}} -Some of the factors included in the cost estimation are discussed below. +Some of the factors that the CBO considers in the cost estimation are as follows: 1. **Data fetch** - To estimate the cost of fetching a tuple from [DocDB](../../docdb/), the CBO takes into account factors such as the number of SST files that may need to be read, and the estimated number of [seeks](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem. + To estimate the cost of fetching a tuple from [DocDB](../../docdb/), the CBO takes into account factors such as the number of SST files that may need to be read, and the estimated number of [seek](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem. 1. **Index scan** @@ -76,7 +76,7 @@ Some of the factors included in the cost estimation are discussed below. 1.
**Join strategies** - For queries involving multiple tables, the CBO evaluates the cost of different join strategies like [Nested loop](../join-strategies/#nested-loop-join), [BNL](../join-strategies/#batched-nested-loop-join-bnl), [Merge](../join-strategies/#merge-join), or [Hash](../join-strategies/#hash-join) join, as well as various join orders. + For queries involving multiple tables, the CBO evaluates the cost of different join strategies like [nested loop](../join-strategies/#nested-loop-join), [batch nested loop](../join-strategies/#batched-nested-loop-join-bnl), [merge](../join-strategies/#merge-join), or [hash](../join-strategies/#hash-join) join, as well as various join orders. 1. **Data transfer** diff --git a/docs/content/preview/develop/postgresql-compatibility.md b/docs/content/preview/develop/postgresql-compatibility.md index ef4f4f362395..f63224ff3266 100644 --- a/docs/content/preview/develop/postgresql-compatibility.md +++ b/docs/content/preview/develop/postgresql-compatibility.md @@ -63,7 +63,7 @@ To learn about read committed isolation, see [Read Committed](../../architecture Configuration parameter: `yb_enable_base_scans_cost_model=true` -[Cost-based optimizer (CBO)](../../../architecture/query-layer/planner-optimizer/) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. +[Cost based optimizer (CBO)](../../architecture/query-layer/planner-optimizer/) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. 
{{}} When enabling this parameter, you must run `ANALYZE` on user tables to maintain up-to-date statistics. @@ -73,7 +73,7 @@ When enabling the cost models, ensure that packed row for colocated tables is en {{}} {{}} -To learn about how the Cost-based optimizer works, see [Query Planner / CBO](../../architecture/query-layer/planner-optimizer/) +To learn how CBO works, see [Query Planner / CBO](../../architecture/query-layer/planner-optimizer/) {{}} #### Wait-on-conflict concurrency From 045138fa08ddd99009f6852f49ce7324721e3a48 Mon Sep 17 00:00:00 2001 From: Dwight Hodge Date: Wed, 22 Jan 2025 14:31:25 -0500 Subject: [PATCH 13/15] backport --- .../statements/cmd_analyze.md | 2 +- .../develop/postgresql-compatibility.md | 2 +- .../statements/cmd_analyze.md | 16 +++- .../stable/architecture/docdb/lsm-sst.md | 44 +++++++-- .../stable/architecture/query-layer/_index.md | 55 +++++++---- .../query-layer/join-strategies.md | 2 +- .../query-layer/planner-optimizer.md | 94 +++++++++++++++++++ .../develop/postgresql-compatibility.md | 8 +- .../statements/cmd_analyze.md | 16 +++- .../v2024.1/architecture/docdb/lsm-sst.md | 44 +++++++-- .../architecture/query-layer/_index.md | 55 +++++++---- .../query-layer/join-strategies.md | 2 +- .../query-layer/planner-optimizer.md | 94 +++++++++++++++++++ .../develop/postgresql-compatibility.md | 8 +- 14 files changed, 372 insertions(+), 70 deletions(-) create mode 100644 docs/content/stable/architecture/query-layer/planner-optimizer.md create mode 100644 docs/content/v2024.1/architecture/query-layer/planner-optimizer.md diff --git a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md index 691c89ceb6ef..e61bfa36b360 100644 --- a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -51,7 +51,7 @@ Table name to be analyzed; may be 
schema-qualified. Optional. Omit to analyze al List of columns to be analyzed. Optional. Omit to analyze all columns of the table. -## Resetting statistics +## Reset statistics Over time, statistics can reach a point where they no longer represent the current workload accurately. Resetting allows you to measure the impact of recent changes, like optimizations or new queries, without the influence of historical data. Also, when diagnosing issues, fresh statistics can help pinpoint current issues more effectively, rather than having to sift through historical data that may not be relevant. diff --git a/docs/content/preview/develop/postgresql-compatibility.md b/docs/content/preview/develop/postgresql-compatibility.md index f63224ff3266..05eb2fed1cfe 100644 --- a/docs/content/preview/develop/postgresql-compatibility.md +++ b/docs/content/preview/develop/postgresql-compatibility.md @@ -66,7 +66,7 @@ Configuration parameter: `yb_enable_base_scans_cost_model=true` [Cost based optimizer (CBO)](../../architecture/query-layer/planner-optimizer/) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. {{}} -When enabling this parameter, you must run `ANALYZE` on user tables to maintain up-to-date statistics. +When enabling this parameter, you must run ANALYZE on user tables to maintain up-to-date statistics. When enabling the cost models, ensure that packed row for colocated tables is enabled by setting the `--ysql_enable_packed_row_for_colocated_table` flag to true. 
diff --git a/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md index ddf13ea3b1f2..16dd8cb3368b 100644 --- a/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -14,9 +14,9 @@ type: docs ## Synopsis -ANALYZE collects statistics about the contents of tables in the database, and stores the results in the `pg_statistic` system catalog. These statistics help the query planner to determine the most efficient execution plans for queries. +ANALYZE collects statistics about the contents of tables in the database, and stores the results in the [pg_statistic](../../../../../architecture/system-catalog/#data-statistics), [pg_class](../../../../../architecture/system-catalog/#schema), and [pg_stat_all_tables](../../../../../architecture/system-catalog/#table-activity) system catalogs. These statistics help the query planner to determine the most efficient execution plans for queries. -The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. +The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. {{< warning title="Run ANALYZE manually" >}} Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. 
To collect or update statistics, run the ANALYZE command manually. @@ -51,6 +51,18 @@ Table name to be analyzed; may be schema-qualified. Optional. Omit to analyze al List of columns to be analyzed. Optional. Omit to analyze all columns of the table. +## Reset statistics + +Over time, statistics can reach a point where they no longer represent the current workload accurately. Resetting allows you to measure the impact of recent changes, like optimizations or new queries, without the influence of historical data. Also, when diagnosing issues, fresh statistics can help pinpoint current issues more effectively, rather than having to sift through historical data that may not be relevant. + +The `yb_reset_analyze_statistics()` function is a convenient helper that offers an easy way to clear statistics collected for a specific table or for all tables in a database. Call this function as follows: + +```sql +SELECT yb_reset_analyze_statistics ( table_oid ); +``` + +If table_oid is NULL, this function resets the statistics for all the tables in the current database that the user can analyze. + ## Examples ### Analyze a single table diff --git a/docs/content/stable/architecture/docdb/lsm-sst.md b/docs/content/stable/architecture/docdb/lsm-sst.md index 2fdd6b9d9e52..2206723ec56a 100644 --- a/docs/content/stable/architecture/docdb/lsm-sst.md +++ b/docs/content/stable/architecture/docdb/lsm-sst.md @@ -12,11 +12,11 @@ menu: type: docs --- -A log-structured merge-tree (LSM tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads. +A [log-structured merge-tree (LSM tree)](https://en.wikipedia.org/wiki/Log-structured_merge-tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. 
LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads. The core idea behind an LSM tree is to separate the write and read paths, allowing writes to be sequential and buffered in memory making them faster than random writes, while reads can still access data efficiently through a hierarchical structure of sorted files on disk. -An LSM tree has 2 primary components - Memtable and SSTs. Let's look into each of them in detail and understand how they work during writes and reads. +An LSM tree has 2 primary components - [Memtable](#memtable) and [Sorted String Tables (SSTs)](#sst). Let's look into each of them in detail and understand how they work during writes and reads. {{}} Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses the Raft logs for this purpose. For more details, see [Raft log vs LSM WAL](../performance/#raft-vs-rocksdb-wal-logs). @@ -24,39 +24,63 @@ Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses ## Comparison to B-tree -Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree) based storage system. But YugabyteDB had to chose an LSM based storage to build a highly scalable database for of the following reasons. +Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree)-based storage system. But Yugabyte chose LSM-based storage to build a highly scalable database for the following reasons: -- Write operations (insert, update, delete) are more expensive in a B-tree. As it involves random writes and in place node splitting and rebalancing. In an LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch. 
+- Write operations (insert, update, delete) are more expensive in a B-tree, requiring random writes and in-place node splitting and rebalancing. In LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch. - The append-only nature of LSM makes it more efficient for concurrent write operations. ## Memtable -All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a Memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. When the Memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that Memtable. +All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. When the memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that memtable. -The immutable Memtable is then flushed to disk as an SST (Sorted String Table) file. This process involves writing the key-value pairs from the Memtable to disk in a sorted order, creating an SST file. DocDB maintains one active Memtable, and utmost one immutable Memtable at any point in time. This ensures that write operations can continue to be processed in the active Memtable, when the immutable memtable is being flushed to disk. +The immutable [memtable](#memtable) is then flushed to disk as an [SST (Sorted String Table)](#sst) file. This process involves writing the key-value pairs from the memtable to disk in a sorted order, creating an SST file. DocDB maintains one active memtable, and at most one immutable memtable at any point in time. 
This ensures that write operations can continue to be processed in the active memtable while the immutable memtable is being flushed to disk. ## SST -Each SST (Sorted String Table) file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks. +Each SST file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks. Each SST file contains a bloom filter, which is a space-efficient data structure that helps quickly determine whether a key might exist in that file or not, avoiding unnecessary disk reads. {{}} -Most LSMs organize SSTS into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0). +Most LSMs organize SSTs into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0). 
{{}} +Three core low-level operations are used to iterate through the data in SST files. + +### Seek + +The _seek_ operation is used to locate a specific key or position in an SST file or memtable. When performing a seek, the system attempts to jump directly to the position of the specified key. If the exact key is not found, seek positions the iterator at the closest key that is greater than or equal to the specified key, enabling efficient range scans or prefix matching. + +### Next + +The _next_ operation moves the iterator to the following key in sorted order. It is typically used for sequential reads or scans, where a query iterates over multiple keys, such as retrieving a range of rows. After a seek, a sequence of next operations can scan through keys in ascending order. + +### Previous + +The _previous_ operation moves the iterator to the preceding key in sorted order. It is useful for reverse scans or for reading records in descending order. This is important for cases where backward traversal is required, such as reverse range queries. For example, after seeking to a key near the end of a range, previous can be used to iterate through keys in descending order, often needed in order-by-descending queries. + ## Write path -When new data is written to the LSM system, it is first inserted into the active Memtable. As the Memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups. +When new data is written to the LSM system, it is first inserted into the active memtable. As the memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups. 
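Editorial illustration (not part of the patch): the seek, next, and previous operations described in the hunk above can be sketched over a plain sorted key list. This is a minimal sketch under stated assumptions, not DocDB's or RocksDB's iterator; the class name `SortedKeyIterator` and its methods are hypothetical names chosen for the example.

```python
import bisect


class SortedKeyIterator:
    """Toy iterator over sorted keys (hypothetical, for illustration only)."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self._pos = len(self._keys)  # start past the end (invalid position)

    def valid(self):
        return 0 <= self._pos < len(self._keys)

    def key(self):
        # Current key, or None when the iterator is out of range.
        return self._keys[self._pos] if self.valid() else None

    def seek(self, target):
        # Jump to the first key that is >= target.
        self._pos = bisect.bisect_left(self._keys, target)
        return self.key()

    def next(self):
        # Move to the following key in ascending order.
        if self._pos < len(self._keys):
            self._pos += 1
        return self.key()

    def previous(self):
        # Move to the preceding key (used for reverse scans).
        if self._pos >= 0:
            self._pos -= 1
        return self.key()


it = SortedKeyIterator(["k10", "k20", "k30", "k40"])
landed = it.seek("k25")     # no exact match, so the iterator lands on "k30"
forward = it.next()         # "k40"
backward = it.previous()    # back to "k30"
```

A range scan is then one seek to the lower bound followed by repeated next calls, which is why the number of each operation is a natural input to a cost estimate.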
## Read Path -To read a key, the LSM tree first checks the Memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs. +To read a key, the LSM tree first checks the memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs. + +## Delete path + +Rather than immediately removing the key from SSTs, the delete operation marks a key as deleted using a tombstone marker, indicating that the key should be ignored in future reads. The actual deletion happens during [compaction](#compaction), when tombstones are removed along with the data they mark as deleted. ## Compaction As data accumulates in SSTs, a process called compaction merges and sorts the SST files with overlapping key ranges producing a new set of SST files. The merge process during compaction helps to organize and sort the data, maintaining a consistent on-disk format and reclaiming space from obsolete data versions. +The [YB-TServer](../../yb-tserver/) manages multiple compaction queues and enforces throttling to avoid compaction storms. Although full compactions can be scheduled, they can also be triggered manually. Full compactions are also triggered automatically if the system detects tombstones and obsolete keys affecting read performance. 
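Editorial illustration (not part of the patch): the read, delete, and compaction behavior described above can be sketched with a toy in-memory model. This is a simplified sketch under stated assumptions, not DocDB's implementation; `ToyLSM`, `TOMBSTONE`, and `memtable_limit` are hypothetical names, SSTs are modeled as immutable tuples, and bloom filters and index blocks are omitted.

```python
TOMBSTONE = object()  # sentinel that marks a key as deleted


class ToyLSM:
    """Toy LSM store: one memtable plus a list of immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.ssts = []  # newest first; each SST is a sorted tuple of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        # A delete is just a write of a tombstone marker.
        self.put(key, TOMBSTONE)

    def _flush(self):
        # Write the memtable out as a new immutable, key-sorted SST.
        run = tuple(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.ssts.insert(0, run)
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then SSTs from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for sst in self.ssts:
            for k, v in sst:
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Full compaction: merge all SSTs, keep the newest version of each
        # key, and drop tombstones together with the data they hide.
        merged = {}
        for sst in reversed(self.ssts):  # oldest first, newer overwrite
            merged.update(sst)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.ssts = [tuple(sorted(live.items(), key=lambda kv: kv[0]))] if live else []


db = ToyLSM()
db.put("a", 1)
db.put("b", 2)      # memtable is full, so it is flushed to an SST
db.delete("a")      # tombstone shadows the older value of "a"
db.put("c", 3)      # second flush
db.compact()        # one SST remains, holding only "b" and "c"
```

Dropping tombstones is safe here only because the sketch always merges every SST at once; a partial merge would have to keep them so that older runs stay shadowed.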
+ +{{}} +To learn more about YB-TServer compaction operations, refer to [YB-TServer](../../yb-tserver/) +{{}} + ## Learn more - [Blog: Background Compactions in YugabyteDB](https://www.yugabyte.com/blog/background-data-compaction/#what-is-a-data-compaction) diff --git a/docs/content/stable/architecture/query-layer/_index.md b/docs/content/stable/architecture/query-layer/_index.md index 14986ab5ed78..2ddbab837d3a 100644 --- a/docs/content/stable/architecture/query-layer/_index.md +++ b/docs/content/stable/architecture/query-layer/_index.md @@ -19,8 +19,8 @@ The YugabyteDB Query Layer (YQL) is the primary layer that provides interfaces f Although YQL is designed with extensibility in mind, allowing for new APIs to be added, it currently supports two types of distributed SQL APIs: [YSQL](../../api/ysql/) and [YCQL](../../api/ycql/). -- [YSQL](../../api/ysql/) is a distributed SQL API that is built by reusing the PostgreSQL language layer code. It is a stateless SQL query engine that is wire-format compatible with PostgreSQL. The default port for YSQL is `5433`. -- [YCQL](../../api/ycql/) is a semi-relational language that has its roots in Cassandra Query Language. It is a SQL-like language built specifically to be aware of the clustering of data across nodes. The default port for YCQL is `9042`. +- [YSQL](../../api/ysql/) is a distributed SQL API that is built by reusing the PostgreSQL language layer code. It is a stateless SQL query engine that is wire-format compatible with PostgreSQL. The default port for YSQL is 5433. +- [YCQL](../../api/ycql/) is a semi-relational language that has its roots in Cassandra Query Language. It is a SQL-like language built specifically to be aware of the clustering of data across nodes. The default port for YCQL is 9042. ## Query processing @@ -38,7 +38,7 @@ The parser processes each query in several steps as follows: 1. 
Builds a parse tree: If the query is written correctly, the parser builds a structured representation of the query, called a parse tree. This parse tree captures the different parts of the query and how they are related. -1. Recognizes keywords and identifiers: To build the parse tree, the parser first identifies the different components of the query, such as keywords (like `SELECT`, `FROM`), table or column names, and other identifiers. +1. Recognizes keywords and identifiers: To build the parse tree, the parser first identifies the different components of the query, such as keywords (like SELECT, FROM), table or column names, and other identifiers. 1. Applies grammar rules: The parser then applies a set of predefined grammar rules to understand the structure and meaning of the query based on the identified components. @@ -56,38 +56,55 @@ Views are realized during this phase. Whenever a query against a view (that is, ### Planner -YugabyteDB needs to determine the most efficient way to execute a query and return the results. This process is handled by the query planner/optimizer component. +The YugabyteDB query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution. The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. -If the query involves joining multiple tables, the planner evaluates different techniques to combine the data: +After determining the optimal plan, the planner generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. -- Nested loop join: Scanning one table for each row in the other table. 
This can be efficient if one table is small or has a good index. -- Merge join: Sorting both tables by the join columns and then merging them in parallel. This works well when the tables are already sorted or can be efficiently sorted. -- Hash join: Building a hash table from one table and then scanning the other table to find matches in the hash table. -For queries involving more than two tables, the planner considers different sequences of joining the tables to find the most efficient approach. +The execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. -The planner estimates the cost of each possible execution plan and chooses the one expected to be the fastest, taking into account factors like table sizes, indexes, sorting requirements, and so on. - -After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. - -{{}} -The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements. -{{}} +{{}} +To learn how the query planner decides the optimal path for query execution, see [Query Planner](./planner-optimizer/) +{{}} ### Executor -After the query planner determines the optimal execution plan, the query executor component runs the plan and retrieves the required data. The executor sends appropriate requests to the other YB-TServers that hold the needed data to performs sorts, joins, aggregations, and then evaluates qualifications and finally returns the derived rows. +After the query planner determines the optimal execution plan, the executor runs the plan and retrieves the required data. 
The executor sends requests to the other YB-TServers that hold the data needed to perform sorts, joins, and aggregations, then evaluates qualifications, and finally returns the derived rows. The executor works in a step-by-step fashion, recursively processing the plan from top to bottom. Each node in the plan tree is responsible for fetching or computing rows of data as requested by its parent node. -For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to get rows from them. +For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to retrieve rows. A child node may be a "Sort" node, which requests rows from its child, sorts them, and returns the sorted rows. The bottom-most child could be a "Sequential Scan" node that reads rows directly from a table. As the executor requests rows from each node, that node fetches or computes the rows from its children, applies any filtering or data transformations specified in the query plan, and returns the requested rows up to its parent node. -This process continues recursively until the top node has received all the rows it needs to produce the final result. For a `SELECT` query, these final rows are sent to the client. For data modification queries like `INSERT`, `UPDATE`, or `DELETE`, the rows are used to make the requested changes in the database tables. +This process continues recursively until the top node has received all the rows it needs to produce the final result. For a SELECT query, these final rows are sent to the client. For data modification queries like INSERT, UPDATE, or DELETE, the rows are used to make the requested changes in the database tables. 
The executor is designed to efficiently pull rows through the pipeline defined by the plan tree, processing rows in batches where possible for better performance. +### Optimizations + +- **Incremental sort**. If an intermediate query result is known to be sorted by one or more leading keys of a required sort ordering, the additional sorting can be done considering only the remaining keys, if the rows are sorted in batches that have equal leading keys. + +- **Memoize results**. When only a small percentage of rows is checked on the inner side of a nested-loop join, the executor memoizes the results for improving performance. + +- **Disk-based hash aggregation**. Hash-based operations are generally more sensitive to memory availability and are highly efficient as long as the hash table fits within the memory specified by the work_mem parameter. When the hash table grows beyond the `work_mem` limit, the planner transitions to a disk-based hash aggregation plan. This avoids overloading memory and ensures that large datasets can be handled efficiently. + +## Query ID + +In YSQL, to provide a consistent way to track and identify specific queries across different parts of the system such as logs, performance statistics, and EXPLAIN plans, a unique identifier is generated for each query processed. The query ID is effectively a hash value based on the normalized form of the SQL query. This normalization process removes insignificant whitespace and converts literal values to placeholders, ensuring that semantically identical queries have the same ID. This provides the following benefits: + +- By providing a unique identifier for each query, it becomes much easier to analyze query performance and identify problematic queries. +- Including query IDs in logs and performance statistics enables more detailed and accurate monitoring of database activity. +- The EXPLAIN command, which shows the execution plan for a query, can also display the query ID. 
This helps to link the execution plan with the actual query execution statistics. +- The pg_stat_statements extension (which is installed by default in YugabyteDB) can accurately track and report statistics even for queries with varying literal values (for example, different WHERE clause parameters). This makes it much easier to identify performance bottlenecks caused by specific query patterns. + +Generation of this unique query ID is controlled using the `compute_query_id` setting, which can have the following values: + +- on - Always compute query IDs. +- off - Never compute query IDs. +- auto (the default) - Automatically compute query IDs when needed, such as when pg_stat_statements is enabled (pg_stat_statements is enabled by default). + +You should enable `compute_query_id` to fully realize its benefits for monitoring and performance analysis. diff --git a/docs/content/stable/architecture/query-layer/join-strategies.md b/docs/content/stable/architecture/query-layer/join-strategies.md index 09d70c8a98c6..4e3928905113 100644 --- a/docs/content/stable/architecture/query-layer/join-strategies.md +++ b/docs/content/stable/architecture/query-layer/join-strategies.md @@ -9,7 +9,7 @@ menu: name: Join strategies identifier: joins-strategies-ysql parent: architecture-query-layer - weight: 100 + weight: 200 type: docs --- diff --git a/docs/content/stable/architecture/query-layer/planner-optimizer.md b/docs/content/stable/architecture/query-layer/planner-optimizer.md new file mode 100644 index 000000000000..d9dade51a353 --- /dev/null +++ b/docs/content/stable/architecture/query-layer/planner-optimizer.md @@ -0,0 +1,94 @@ +--- +title: Query Planner +headerTitle: Query Planner / CBO +linkTitle: Query Planner +headcontent: Understand how the planner chooses the optimal path for query execution +tags: + feature: early-access +menu: + stable: + identifier: query-planner + parent: architecture-query-layer + weight: 100 +type: docs +rightnav: + hideH4: true +--- + +The query 
planner is responsible for determining the most efficient way to execute a given query. The optimizer is the critical component in the planner that calculates the costs of different execution plans, taking into account factors like index lookups, table scans, network round trips, and storage costs. It then selects the most cost-effective path for query execution. YugabyteDB implements completely different types of optimizers for the YSQL and YCQL APIs. + +## Rule based optimizer (YCQL) + +YugabyteDB implements a simple rules-based optimizer (RBO) for YCQL. It operates by applying a predefined set of rules to optimize queries, such as reordering joins to minimize the number of rows processed, pushing selection conditions down the query tree, and using indexes and views to enhance performance. + +## Heuristics based optimizer (YSQL) + +YugabyteDB’s YSQL API uses a simple heuristics based optimizer to determine the most efficient execution plan for a query. It relies on basic statistics, like table sizes, and applies heuristics to estimate the cost of different plans. The cost model is based on PostgreSQL’s approach, using data such as row counts and index availability, and assigns some heuristic costs to the number of result rows depending on the type of scan. Although this works well for most queries, because this model was designed for single-node databases like PostgreSQL, it doesn’t account for YugabyteDB’s distributed architecture or take cluster topology into consideration during query planning. + +## Cost based optimizer (YSQL) + +To account for the distributed nature of the data, YugabyteDB has implemented a Cost based optimizer (CBO) for YSQL that uses an advanced cost model. The model considers accurate table statistics, the cost of network round trips, operations on the lower-level storage layer, and the cluster topology. + +{{}} + +The YugabyteDB CBO is {{}} and disabled by default.
To enable it, turn ON the [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) configuration parameter as follows: + +```sql +-- Enable for current session +SET yb_enable_base_scans_cost_model = TRUE; +``` + +{{}} + +### Plan search algorithm + +To optimize the search for the best plan, the CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the most optimal sub-plans for each part of the query. The sub-plans are then combined to find the best overall plan. + +### Statistics gathering + +The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data in columns, and the cardinality of results from operations. These statistics are essential for estimating the selectivity of filters and costs of various query plans accurately. These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display-friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. + +{{}} +Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. If you have enabled CBO, you must run ANALYZE on user tables after data load for the CBO to create optimal execution plans. Multiple projects are in progress to trigger this automatically. +{{}} + +### Cost estimation + +For each potential execution plan, the optimizer calculates costs in terms of storage layer lookups (both cache and disk), number of network round trips, and other factors. These costs help the optimizer compare which plan is likely to be the most efficient to execute given the current database state and query context.
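Editorial illustration (not part of the patch): the dynamic programming idea in the "Plan search algorithm" text above can be sketched as a search over table subsets, where the best plan for a larger subset is built from the best plan for a smaller one. This is a toy sketch under stated assumptions, not the YugabyteDB CBO: it considers only left-deep join orders, and its cost model (total rows produced by intermediate joins) is invented for the example. The function name `best_join_order` and the sample row counts and selectivities are hypothetical.

```python
from itertools import combinations


def best_join_order(row_counts, selectivity):
    """Left-deep join ordering by dynamic programming over table subsets.

    row_counts:  {table: estimated row count}
    selectivity: {frozenset({a, b}): fraction of the cross product kept}
    Toy cost of a plan = total rows produced by its intermediate joins.
    Returns (cost, output_rows, join_order).
    """
    tables = sorted(row_counts)
    # best[subset] = (cost, output_rows, join_order) for the cheapest sub-plan
    best = {frozenset([t]): (0.0, float(row_counts[t]), (t,)) for t in tables}

    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            for last in combo:  # try each table as the one joined last
                rest = subset - {last}
                cost, rows, order = best[rest]  # reuse the best sub-plan
                out = rows * row_counts[last]
                for t in rest:  # apply every predicate linking `last` to `rest`
                    out *= selectivity.get(frozenset({t, last}), 1.0)
                candidate = (cost + out, out, order + (last,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(tables)]


rows = {"orders": 1000, "customers": 100, "regions": 10}
sel = {
    frozenset({"orders", "customers"}): 0.01,
    frozenset({"customers", "regions"}): 0.1,
}
cost, out_rows, order = best_join_order(rows, sel)
```

With these sample inputs the cheapest order joins the two small tables first and brings in `orders` last, because that keeps the intermediate result small. A real optimizer would also weigh join methods, network round trips, and storage-layer operations for each sub-plan.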
+ +{{}} +You can see these estimates when using the DEBUG option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command, as in EXPLAIN (ANALYZE, DEBUG). +{{}} + +Some of the factors that the CBO considers in the cost estimation are as follows: + +1. **Data fetch** + + To estimate the cost of fetching a tuple from [DocDB](../../docdb/), the CBO takes into account factors such as the number of SST files that may need to be read, and the estimated number of [seek](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem. + +1. **Index scan** + + When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. However, this isn't an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. + +1. **Pushdown to storage layer** + + The CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. + +1. **Join strategies** + + For queries involving multiple tables, the CBO evaluates the cost of different join strategies like [nested loop](../join-strategies/#nested-loop-join), [batch nested loop](../join-strategies/#batched-nested-loop-join-bnl), [merge](../join-strategies/#merge-join), or [hash](../join-strategies/#hash-join) join, as well as various join orders. + +1. **Data transfer** + + The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. 
The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). Because each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. Note that the time spent transferring the data also depends on the network bandwidth. + +## Plan selection + +The CBO evaluates each candidate plan's estimated costs to determine the plan with the lowest cost, which is then selected for execution. This ensures the optimal use of system resources and improved query performance. + +After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. + +## Learn more + +- [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) +- [YugabyteDB Cost-Based Optimizer](https://dev.to/yugabyte/yugabytedb-cost-based-optimizer-and-cost-model-for-distributed-lsm-tree-1hb4) \ No newline at end of file diff --git a/docs/content/stable/develop/postgresql-compatibility.md b/docs/content/stable/develop/postgresql-compatibility.md index a3769f3fd542..3334eab887da 100644 --- a/docs/content/stable/develop/postgresql-compatibility.md +++ b/docs/content/stable/develop/postgresql-compatibility.md @@ -60,15 +60,19 @@ To learn about read committed isolation, see [Read Committed](../../architecture Configuration parameter: `yb_enable_base_scans_cost_model=true` -Cost-based optimizer (CBO) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. 
This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. +[Cost based optimizer (CBO)](../../architecture/query-layer/planner-optimizer/) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. {{}} -When enabling this parameter, you must run `ANALYZE` on user tables to maintain up-to-date statistics. +When enabling this parameter, you must run ANALYZE on user tables to maintain up-to-date statistics. When enabling the cost models, ensure that packed row for colocated tables is enabled by setting the `--ysql_enable_packed_row_for_colocated_table` flag to true. {{}} +{{}} +To learn how CBO works, see [Query Planner / CBO](../../architecture/query-layer/planner-optimizer/) +{{}} + ### Wait-on-conflict concurrency Flag: `enable_wait_queues=true` diff --git a/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md index a33d16ca5ae2..9d5cab41907c 100644 --- a/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -14,9 +14,9 @@ type: docs ## Synopsis -ANALYZE collects statistics about the contents of tables in the database, and stores the results in the `pg_statistic` system catalog. These statistics help the query planner to determine the most efficient execution plans for queries. 
+ANALYZE collects statistics about the contents of tables in the database, and stores the results in the [pg_statistic](../../../../../architecture/system-catalog/#data-statistics), [pg_class](../../../../../architecture/system-catalog/#schema), and [pg_stat_all_tables](../../../../../architecture/system-catalog/#table-activity) system catalogs. These statistics help the query planner to determine the most efficient execution plans for queries. -The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. +The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. {{< warning title="Run ANALYZE manually" >}} Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. @@ -51,6 +51,18 @@ Table name to be analyzed; may be schema-qualified. Optional. Omit to analyze al List of columns to be analyzed. Optional. Omit to analyze all columns of the table. +## Reset statistics + +Over time, statistics can reach a point where they no longer represent the current workload accurately. Resetting allows you to measure the impact of recent changes, like optimizations or new queries, without the influence of historical data. 
Also, when diagnosing issues, fresh statistics can help pinpoint current issues more effectively, rather than having to sift through historical data that may not be relevant. + +The `yb_reset_analyze_statistics()` function is a convenient helper that offers an easy way to clear statistics collected for a specific table or for all tables in a database. Call this function as follows: + +```sql +SELECT yb_reset_analyze_statistics ( table_oid ); +``` + +If table_oid is NULL, this function resets the statistics for all the tables in the current database that the user can analyze. + ## Examples ### Analyze a single table diff --git a/docs/content/v2024.1/architecture/docdb/lsm-sst.md b/docs/content/v2024.1/architecture/docdb/lsm-sst.md index 075b504d3fa3..181185afe82e 100644 --- a/docs/content/v2024.1/architecture/docdb/lsm-sst.md +++ b/docs/content/v2024.1/architecture/docdb/lsm-sst.md @@ -12,11 +12,11 @@ menu: type: docs --- -A log-structured merge-tree (LSM tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads. +A [log-structured merge-tree (LSM tree)](https://en.wikipedia.org/wiki/Log-structured_merge-tree) is a data structure and storage architecture used by [RocksDB](http://rocksdb.org/), the underlying key-value store of DocDB. LSM trees strike a balance between write and read performance, making them suitable for workloads that involve both frequent writes and efficient reads. The core idea behind an LSM tree is to separate the write and read paths, allowing writes to be sequential and buffered in memory making them faster than random writes, while reads can still access data efficiently through a hierarchical structure of sorted files on disk. -An LSM tree has 2 primary components - Memtable and SSTs. 
Let's look into each of them in detail and understand how they work during writes and reads. +An LSM tree has 2 primary components - [Memtable](#memtable) and [Sorted String Tables (SSTs)](#sst). Let's look into each of them in detail and understand how they work during writes and reads. {{}} Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses the Raft logs for this purpose. For more details, see [Raft log vs LSM WAL](../performance/#raft-vs-rocksdb-wal-logs). @@ -24,39 +24,63 @@ Typically in LSMs there is a third component - WAL (Write ahead log). DocDB uses ## Comparison to B-tree -Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree) based storage system. But YugabyteDB had to chose an LSM based storage to build a highly scalable database for of the following reasons. +Most traditional databases (for example, MySQL, PostgreSQL, Oracle) have a [B-tree](https://en.wikipedia.org/wiki/B-tree)-based storage system. But Yugabyte chose LSM-based storage to build a highly scalable database for the following reasons: -- Write operations (insert, update, delete) are more expensive in a B-tree. As it involves random writes and in place node splitting and rebalancing. In an LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch. +- Write operations (insert, update, delete) are more expensive in a B-tree, requiring random writes and in-place node splitting and rebalancing. In LSM-based storage, data is added to the [memtable](#memtable) and written onto a [SST](#sst) file as a batch. - The append-only nature of LSM makes it more efficient for concurrent write operations. ## Memtable -All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a Memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. 
When the Memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that Memtable. +All new write operations (inserts, updates, and deletes) are written as key-value pairs to an in-memory data structure called a memtable, which is essentially a sorted map or tree. The key-value pairs are stored in sorted order based on the keys. When the memtable reaches a certain size, it is made immutable, which means no new writes can be accepted into that memtable. -The immutable Memtable is then flushed to disk as an SST (Sorted String Table) file. This process involves writing the key-value pairs from the Memtable to disk in a sorted order, creating an SST file. DocDB maintains one active Memtable, and utmost one immutable Memtable at any point in time. This ensures that write operations can continue to be processed in the active Memtable, when the immutable memtable is being flushed to disk. +The immutable [memtable](#memtable) is then flushed to disk as an [SST (Sorted String Table)](#sst) file. This process involves writing the key-value pairs from the memtable to disk in a sorted order, creating an SST file. DocDB maintains one active memtable, and at most one immutable memtable at any point in time. This ensures that write operations can continue to be processed in the active memtable while the immutable memtable is being flushed to disk. ## SST -Each SST (Sorted String Table) file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. 
The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks. +Each SST file is an immutable, sorted file containing key-value pairs. The data is organized into data blocks, which are compressed using configurable compression algorithms (for example, Snappy, Zlib). Index blocks provide a mapping between key ranges and the corresponding data blocks, enabling efficient lookup of key-value pairs. Filter blocks containing bloom filters allow for quickly determining if a key might exist in an SST file or not, skipping entire files during lookups. The footer section of an SST file contains metadata about the file, such as the number of entries, compression algorithms used, and pointers to the index and filter blocks. Each SST file contains a bloom filter, which is a space-efficient data structure that helps quickly determine whether a key might exist in that file or not, avoiding unnecessary disk reads. {{}} -Most LSMs organize SSTS into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0). +Most LSMs organize SSTs into multiple levels, where each level contains one or more SST files. But DocDB maintains files in only one level (level0). {{}} +Three core low-level operations are used to iterate through the data in SST files. + +### Seek + +The _seek_ operation is used to locate a specific key or position in an SST file or memtable. When performing a seek, the system attempts to jump directly to the position of the specified key. If the exact key is not found, seek positions the iterator at the closest key that is greater than or equal to the specified key, enabling efficient range scans or prefix matching. + +### Next + +The _next_ operation moves the iterator to the following key in sorted order. 
It is typically used for sequential reads or scans, where a query iterates over multiple keys, such as retrieving a range of rows. After a seek, a sequence of next operations can scan through keys in ascending order. + +### Previous + +The _previous_ operation moves the iterator to the preceding key in sorted order. It is useful for reverse scans or for reading records in descending order. This is important for cases where backward traversal is required, such as reverse range queries. For example, after seeking to a key near the end of a range, previous can be used to iterate through keys in descending order, often needed in order-by-descending queries. + ## Write path -When new data is written to the LSM system, it is first inserted into the active Memtable. As the Memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups. +When new data is written to the LSM system, it is first inserted into the active memtable. As the memtable fills up, it is made immutable and written to disk as an SST file. Each SST file is sorted by key and contains a series of key-value pairs organized into data blocks, along with index and filter blocks for efficient lookups. ## Read Path -To read a key, the LSM tree first checks the Memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs. +To read a key, the LSM tree first checks the memtable for the most recent value. If not found, it checks the SST files and finds the key or determines that it doesn't exist. 
During this process, LSM uses the index and filter blocks in the SST files to efficiently locate the relevant data blocks containing the key-value pairs. + +## Delete path + +Rather than immediately removing the key from SSTs, the delete operation marks a key as deleted using a tombstone marker, indicating that the key should be ignored in future reads. The actual deletion happens during [compaction](#compaction), when tombstones are removed along with the data they mark as deleted. ## Compaction As data accumulates in SSTs, a process called compaction merges and sorts the SST files with overlapping key ranges producing a new set of SST files. The merge process during compaction helps to organize and sort the data, maintaining a consistent on-disk format and reclaiming space from obsolete data versions. +The [YB-TServer](../../yb-tserver/) manages multiple compaction queues and enforces throttling to avoid compaction storms. Although full compactions can be scheduled, they can also be triggered manually. Full compactions are also triggered automatically if the system detects tombstones and obsolete keys affecting read performance. + +{{}} +To learn more about YB-TServer compaction operations, refer to [YB-TServer](../../yb-tserver/) +{{}} + ## Learn more - [Blog: Background Compactions in YugabyteDB](https://www.yugabyte.com/blog/background-data-compaction/#what-is-a-data-compaction) diff --git a/docs/content/v2024.1/architecture/query-layer/_index.md b/docs/content/v2024.1/architecture/query-layer/_index.md index 07546be2159e..02c5f9c67fcf 100644 --- a/docs/content/v2024.1/architecture/query-layer/_index.md +++ b/docs/content/v2024.1/architecture/query-layer/_index.md @@ -19,8 +19,8 @@ The YugabyteDB Query Layer (YQL) is the primary layer that provides interfaces f Although YQL is designed with extensibility in mind, allowing for new APIs to be added, it currently supports two types of distributed SQL APIs: [YSQL](../../api/ysql/) and [YCQL](../../api/ycql/). 
-- [YSQL](../../api/ysql/) is a distributed SQL API that is built by reusing the PostgreSQL language layer code. It is a stateless SQL query engine that is wire-format compatible with PostgreSQL. The default port for YSQL is `5433`. -- [YCQL](../../api/ycql/) is a semi-relational language that has its roots in Cassandra Query Language. It is a SQL-like language built specifically to be aware of the clustering of data across nodes. The default port for YCQL is `9042`. +- [YSQL](../../api/ysql/) is a distributed SQL API that is built by reusing the PostgreSQL language layer code. It is a stateless SQL query engine that is wire-format compatible with PostgreSQL. The default port for YSQL is 5433. +- [YCQL](../../api/ycql/) is a semi-relational language that has its roots in Cassandra Query Language. It is a SQL-like language built specifically to be aware of the clustering of data across nodes. The default port for YCQL is 9042. ## Query processing @@ -38,7 +38,7 @@ The parser processes each query in several steps as follows: 1. Builds a parse tree: If the query is written correctly, the parser builds a structured representation of the query, called a parse tree. This parse tree captures the different parts of the query and how they are related. -1. Recognizes keywords and identifiers: To build the parse tree, the parser first identifies the different components of the query, such as keywords (like `SELECT`, `FROM`), table or column names, and other identifiers. +1. Recognizes keywords and identifiers: To build the parse tree, the parser first identifies the different components of the query, such as keywords (like SELECT, FROM), table or column names, and other identifiers. 1. Applies grammar rules: The parser then applies a set of predefined grammar rules to understand the structure and meaning of the query based on the identified components. @@ -56,38 +56,55 @@ Views are realized during this phase. 
Whenever a query against a view (that is, ### Planner -YugabyteDB needs to determine the most efficient way to execute a query and return the results. This process is handled by the query planner/optimizer component. +The YugabyteDB query planner plays a crucial role in efficiently executing SQL queries across multiple nodes. It extends the capabilities of the traditional single node query planner to handle distributed data and execution. The planner first analyzes different ways a query can be executed based on the available data and indexes. It considers various strategies like scanning tables sequentially or using indexes to quickly locate specific data. -If the query involves joining multiple tables, the planner evaluates different techniques to combine the data: +After determining the optimal plan, the planner generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. -- Nested loop join: Scanning one table for each row in the other table. This can be efficient if one table is small or has a good index. -- Merge join: Sorting both tables by the join columns and then merging them in parallel. This works well when the tables are already sorted or can be efficiently sorted. -- Hash join: Building a hash table from one table and then scanning the other table to find matches in the hash table. -For queries involving more than two tables, the planner considers different sequences of joining the tables to find the most efficient approach. +The execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. -The planner estimates the cost of each possible execution plan and chooses the one expected to be the fastest, taking into account factors like table sizes, indexes, sorting requirements, and so on. 
- -After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. - -{{}} -The execution plans are cached for prepared statements to avoid overheads associated with repeated parsing of statements. -{{}} +{{}} +To learn how the query planner decides the optimal path for query execution, see [Query Planner](./planner-optimizer/) +{{}} ### Executor -After the query planner determines the optimal execution plan, the query executor component runs the plan and retrieves the required data. The executor sends appropriate requests to the other YB-TServers that hold the needed data to performs sorts, joins, aggregations, and then evaluates qualifications and finally returns the derived rows. +After the query planner determines the optimal execution plan, the executor runs the plan and retrieves the required data. The executor sends requests to the other YB-TServers that hold the data needed to perform sorts, joins, and aggregations, then evaluates qualifications, and finally returns the derived rows. The executor works in a step-by-step fashion, recursively processing the plan from top to bottom. Each node in the plan tree is responsible for fetching or computing rows of data as requested by its parent node. -For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to get rows from them. +For example, if the top node is a "Merge Join" node, it first requests rows from its two child nodes (the left and right inputs to be joined). The executor recursively calls the child nodes to retrieve rows. 
A child node may be a "Sort" node, which requests rows from its child, sorts them, and returns the sorted rows. The bottom-most child could be a "Sequential Scan" node that reads rows directly from a table. As the executor requests rows from each node, that node fetches or computes the rows from its children, applies any filtering or data transformations specified in the query plan, and returns the requested rows up to its parent node. -This process continues recursively until the top node has received all the rows it needs to produce the final result. For a `SELECT` query, these final rows are sent to the client. For data modification queries like `INSERT`, `UPDATE`, or `DELETE`, the rows are used to make the requested changes in the database tables. +This process continues recursively until the top node has received all the rows it needs to produce the final result. For a SELECT query, these final rows are sent to the client. For data modification queries like INSERT, UPDATE, or DELETE, the rows are used to make the requested changes in the database tables. The executor is designed to efficiently pull rows through the pipeline defined by the plan tree, processing rows in batches where possible for better performance. +### Optimizations + +- **Incremental sort**. If an intermediate query result is known to be sorted by one or more leading keys of a required sort ordering, the additional sorting can be done considering only the remaining keys, if the rows are sorted in batches that have equal leading keys. + +- **Memoize results**. When only a small percentage of rows is checked on the inner side of a nested-loop join, the executor memoizes the results for improving performance. + +- **Disk-based hash aggregation**. Hash-based operations are generally more sensitive to memory availability and are highly efficient as long as the hash table fits within the memory specified by the work_mem parameter. 
When the hash table grows beyond the `work_mem` limit, the planner transitions to a disk-based hash aggregation plan. This avoids overloading memory and ensures that large datasets can be handled efficiently. + +## Query ID + +In YSQL, to provide a consistent way to track and identify specific queries across different parts of the system such as logs, performance statistics, and EXPLAIN plans, a unique identifier is generated for each query processed. The query ID is effectively a hash value based on the normalized form of the SQL query. This normalization process removes insignificant whitespace and converts literal values to placeholders, ensuring that semantically identical queries have the same ID. This provides the following benefits: + +- By providing a unique identifier for each query, it becomes much easier to analyze query performance and identify problematic queries. +- Including query IDs in logs and performance statistics enables more detailed and accurate monitoring of database activity. +- The EXPLAIN command, which shows the execution plan for a query, can also display the query ID. This helps to link the execution plan with the actual query execution statistics. +- The pg_stat_statements extension (which is installed by default in YugabyteDB) can accurately track and report statistics even for queries with varying literal values (for example, different WHERE clause parameters). This makes it much easier to identify performance bottlenecks caused by specific query patterns. + +Generation of this unique query ID is controlled using the `compute_query_id` setting, which can have the following values: + +- on - Always compute query IDs. +- off - Never compute query IDs. +- auto (the default) - Automatically compute query IDs when needed, such as when pg_stat_statements is enabled (pg_stat_statements is enabled by default). + +You should enable `compute_query_id` to fully realize its benefits for monitoring and performance analysis. 
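As a rough illustration of the `compute_query_id` setting described above, the following sketch turns on query ID computation for the current session and then looks up the ID recorded by pg_stat_statements. It assumes that the `queryid`, `calls`, and `query` columns match the upstream PostgreSQL pg_stat_statements view, and that your role is allowed to change the setting (in upstream PostgreSQL this requires elevated privileges). The `orders` table is hypothetical.

```sql
-- Always compute query IDs in this session (valid values: on, off, auto).
SET compute_query_id = on;

-- Confirm the current value.
SHOW compute_query_id;

-- Hypothetical table: these two statements differ only in a literal value,
-- so they are expected to normalize to the same query ID.
SELECT * FROM orders WHERE id = 1;
SELECT * FROM orders WHERE id = 2;

-- Look up the entry recorded for the normalized statement.
SELECT queryid, calls, query
FROM pg_stat_statements
WHERE query LIKE 'SELECT * FROM orders WHERE id%';
```

If normalization works as described, both statements should be counted under a single row whose query text contains a placeholder instead of the literal.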
diff --git a/docs/content/v2024.1/architecture/query-layer/join-strategies.md b/docs/content/v2024.1/architecture/query-layer/join-strategies.md index 3a6fe5306d17..24d2f236cc5b 100644 --- a/docs/content/v2024.1/architecture/query-layer/join-strategies.md +++ b/docs/content/v2024.1/architecture/query-layer/join-strategies.md @@ -9,7 +9,7 @@ menu: name: Join strategies identifier: joins-strategies-ysql parent: architecture-query-layer - weight: 100 + weight: 200 type: docs --- diff --git a/docs/content/v2024.1/architecture/query-layer/planner-optimizer.md b/docs/content/v2024.1/architecture/query-layer/planner-optimizer.md new file mode 100644 index 000000000000..56fae747bea2 --- /dev/null +++ b/docs/content/v2024.1/architecture/query-layer/planner-optimizer.md @@ -0,0 +1,94 @@ +--- +title: Query Planner +headerTitle: Query Planner / CBO +linkTitle: Query Planner +headcontent: Understand how the planner chooses the optimal path for query execution +tags: + feature: early-access +menu: + v2024.1: + identifier: query-planner + parent: architecture-query-layer + weight: 100 +type: docs +rightnav: + hideH4: true +--- + +The query planner is responsible for determining the most efficient way to execute a given query. The optimizer is the critical component in the planner that calculates the costs of different execution plans, taking into account factors like index lookups, table scans, network round trips, and storage costs. It then selects the most cost-effective path for query execution. YugabyteDB implements completely different types of optimizers for the YSQL and YCQL APIs. + +## Rule based optimizer (YCQL) + +YugabyteDB implements a simple rules-based optimizer (RBO) for YCQL. It operates by applying a predefined set of rules to optimize queries, such as reordering joins to minimize the number of rows processed, pushing selection conditions down the query tree, and using indexes and views to enhance performance. 
+ +## Heuristics based optimizer (YSQL) + +YugabyteDB’s YSQL API uses a simple heuristics-based optimizer to determine the most efficient execution plan for a query. It relies on basic statistics, like table sizes, and applies heuristics to estimate the cost of different plans. The cost model is based on PostgreSQL’s approach, using data such as row counts and index availability, and assigns some heuristic costs to the number of result rows depending on the type of scan. Although this works well for most queries, because this model was designed for single-node databases like PostgreSQL, it doesn’t account for YugabyteDB’s distributed architecture or take cluster topology into consideration during query planning. + +## Cost based optimizer (YSQL) + +To account for the distributed nature of the data, YugabyteDB has implemented a cost-based optimizer (CBO) for YSQL that uses an advanced cost model. The model considers accurate table statistics, the cost of network round trips, operations on the lower-level storage layer, and the cluster topology. + +{{}} + +The YugabyteDB CBO is {{}} and disabled by default. To enable it, turn ON the [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) configuration parameter as follows: + +```sql +-- Enable for current session +SET yb_enable_base_scans_cost_model = TRUE; +``` + +{{}} + +### Plan search algorithm + +To optimize the search for the best plan, the CBO uses a dynamic programming-based algorithm. Instead of enumerating and evaluating the cost of each possible execution plan, it breaks the problem down and finds the optimal sub-plans for each part of the query. The sub-plans are then combined to find the best overall plan. + +### Statistics gathering + +The optimizer relies on accurate statistics about the tables, including the number of rows, the distribution of data in columns, and the cardinality of results from operations.
These statistics are essential for accurately estimating the selectivity of filters and the costs of various query plans. These statistics are gathered by the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command and are provided in a display-friendly format by the [pg_stats](../../../architecture/system-catalog/#data-statistics) view. + +{{}} +Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. If you have enabled CBO, you must run ANALYZE on user tables after data load for the CBO to create optimal execution plans. Multiple projects are in progress to trigger this automatically. +{{}} + +### Cost estimation + +For each potential execution plan, the optimizer calculates costs in terms of storage layer lookups (both cache and disk), number of network round trips, and other factors. These costs help the optimizer determine which plan is likely to be the most efficient to execute given the current database state and query context. + +{{}} +You can see these estimates when using the DEBUG option in the [EXPLAIN](../../../api/ysql/the-sql-language/statements/perf_explain) command, as in EXPLAIN (ANALYZE, DEBUG). +{{}} + +Some of the factors that the CBO considers in the cost estimation are as follows: + +1. **Data fetch** + + To estimate the cost of fetching a tuple from [DocDB](../../docdb/), the CBO takes into account factors such as the number of SST files that may need to be read, and the estimated number of [seek](../../docdb/lsm-sst/#seek), [previous](../../docdb/lsm-sst/#previous), and [next](../../docdb/lsm-sst/#next) operations that may be executed in the LSM subsystem. + +1. **Index scan** + + When an index is used, any additional columns needed for the query must be retrieved from the corresponding row in the main table, which can be more costly than scanning only the base table. 
However, this isn't an issue if the index is a covering index. To determine the most efficient execution plan, the CBO compares the cost of an index scan with that of a main table scan. + +1. **Pushdown to storage layer** + + The CBO identifies possible operations that can be pushed down to the storage layer for aggregates, filters, and distinct clauses. This can considerably reduce network data transfer. + +1. **Join strategies** + + For queries involving multiple tables, the CBO evaluates the cost of different join strategies like [nested loop](../join-strategies/#nested-loop-join), [batch nested loop](../join-strategies/#batched-nested-loop-join-bnl), [merge](../join-strategies/#merge-join), or [hash](../join-strategies/#hash-join) join, as well as various join orders. + +1. **Data transfer** + + The CBO estimates the size and number of tuples that will be transferred, with data sent in pages. The page size is determined by the configuration parameters [yb_fetch_row_limit](../../../reference/configuration/yb-tserver/#yb-fetch-row-limit) and [yb_fetch_size_limit](../../../reference/configuration/yb-tserver/#yb-fetch-size-limit). Because each page requires a network round trip for the request and response, the CBO also estimates the total number of pages that will be transferred. Note that the time spent transferring the data also depends on the network bandwidth. + +## Plan selection + +The CBO evaluates each candidate plan's estimated costs to determine the plan with the lowest cost, which is then selected for execution. This ensures the optimal use of system resources and improved query performance. + +After the optimal plan is determined, YugabyteDB generates a detailed execution plan with all the necessary steps, such as scanning tables, joining data, filtering rows, sorting, and computing expressions. This execution plan is then passed to the query executor component, which carries out the plan and returns the final query results. 
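As a rough illustration of the workflow this page describes (enable the CBO, gather statistics, then inspect the estimates), a session might look like the following sketch. The `orders` table and its columns are hypothetical and exist only for this example; the configuration parameter, the ANALYZE command, the `pg_stats` view, and the DEBUG option of EXPLAIN are the ones named in the sections above. Actual plans and cost figures depend on the data and the cluster, so no output is shown.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT,
    total       NUMERIC
);

-- Enable the CBO for the current session.
SET yb_enable_base_scans_cost_model = TRUE;

-- Collect statistics manually; no background job does this automatically.
ANALYZE orders;

-- Inspect the gathered column statistics in a display-friendly form.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'orders';

-- Show the selected plan together with the optimizer's cost estimates.
EXPLAIN (ANALYZE, DEBUG)
SELECT * FROM orders WHERE customer_id = 42;
```

Re-running ANALYZE after a bulk load and comparing the EXPLAIN output before and after is one way to see how updated statistics can change the selected plan.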
+ +## Learn more + +- [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) +- [YugabyteDB Cost-Based Optimizer](https://dev.to/yugabyte/yugabytedb-cost-based-optimizer-and-cost-model-for-distributed-lsm-tree-1hb4) \ No newline at end of file diff --git a/docs/content/v2024.1/develop/postgresql-compatibility.md b/docs/content/v2024.1/develop/postgresql-compatibility.md index 07b0d727c1c7..b8d321c9746d 100644 --- a/docs/content/v2024.1/develop/postgresql-compatibility.md +++ b/docs/content/v2024.1/develop/postgresql-compatibility.md @@ -60,15 +60,19 @@ To learn about read committed isolation, see [Read Committed](../../architecture Configuration parameter: `yb_enable_base_scans_cost_model=true` -Cost-based optimizer (CBO) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. +[Cost based optimizer (CBO)](../../architecture/query-layer/planner-optimizer/) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. {{}} -When enabling this parameter, you must run `ANALYZE` on user tables to maintain up-to-date statistics. +When enabling this parameter, you must run ANALYZE on user tables to maintain up-to-date statistics. When enabling the cost models, ensure that packed row for colocated tables is enabled by setting the `--ysql_enable_packed_row_for_colocated_table` flag to true. 
{{}} +{{}} +To learn how CBO works, see [Query Planner / CBO](../../architecture/query-layer/planner-optimizer/) +{{}} + ### Wait-on-conflict concurrency Flag: `enable_wait_queues=true` From 64662725f9806622d95a44325bb81cebd91e0f5a Mon Sep 17 00:00:00 2001 From: Dwight Hodge Date: Wed, 22 Jan 2025 14:41:09 -0500 Subject: [PATCH 14/15] correction --- .../stable/architecture/query-layer/_index.md | 25 ------------------- .../architecture/query-layer/_index.md | 25 ------------------- 2 files changed, 50 deletions(-) diff --git a/docs/content/stable/architecture/query-layer/_index.md b/docs/content/stable/architecture/query-layer/_index.md index 2ddbab837d3a..1865b3df374c 100644 --- a/docs/content/stable/architecture/query-layer/_index.md +++ b/docs/content/stable/architecture/query-layer/_index.md @@ -83,28 +83,3 @@ As the executor requests rows from each node, that node fetches or computes the This process continues recursively until the top node has received all the rows it needs to produce the final result. For a SELECT query, these final rows are sent to the client. For data modification queries like INSERT, UPDATE, or DELETE, the rows are used to make the requested changes in the database tables. The executor is designed to efficiently pull rows through the pipeline defined by the plan tree, processing rows in batches where possible for better performance. - -### Optimizations - -- **Incremental sort**. If an intermediate query result is known to be sorted by one or more leading keys of a required sort ordering, the additional sorting can be done considering only the remaining keys, if the rows are sorted in batches that have equal leading keys. - -- **Memoize results**. When only a small percentage of rows is checked on the inner side of a nested-loop join, the executor memoizes the results for improving performance. - -- **Disk-based hash aggregation**. 
Hash-based operations are generally more sensitive to memory availability and are highly efficient as long as the hash table fits within the memory specified by the work_mem parameter. When the hash table grows beyond the `work_mem` limit, the planner transitions to a disk-based hash aggregation plan. This avoids overloading memory and ensures that large datasets can be handled efficiently. - -## Query ID - -In YSQL, to provide a consistent way to track and identify specific queries across different parts of the system such as logs, performance statistics, and EXPLAIN plans, a unique identifier is generated for each query processed. The query ID is effectively a hash value based on the normalized form of the SQL query. This normalization process removes insignificant whitespace and converts literal values to placeholders, ensuring that semantically identical queries have the same ID. This provides the following benefits: - -- By providing a unique identifier for each query, it becomes much easier to analyze query performance and identify problematic queries. -- Including query IDs in logs and performance statistics enables more detailed and accurate monitoring of database activity. -- The EXPLAIN command, which shows the execution plan for a query, can also display the query ID. This helps to link the execution plan with the actual query execution statistics. -- The pg_stat_statements extension (which is installed by default in YugabyteDB) can accurately track and report statistics even for queries with varying literal values (for example, different WHERE clause parameters). This makes it much easier to identify performance bottlenecks caused by specific query patterns. - -Generation of this unique query ID is controlled using the `compute_query_id` setting, which can have the following values: - -- on - Always compute query IDs. -- off - Never compute query IDs. 
-- auto (the default) - Automatically compute query IDs when needed, such as when pg_stat_statements is enabled (pg_stat_statements is enabled by default). - -You should enable `compute_query_id` to fully realize its benefits for monitoring and performance analysis. diff --git a/docs/content/v2024.1/architecture/query-layer/_index.md b/docs/content/v2024.1/architecture/query-layer/_index.md index 02c5f9c67fcf..80dd6aa29cb3 100644 --- a/docs/content/v2024.1/architecture/query-layer/_index.md +++ b/docs/content/v2024.1/architecture/query-layer/_index.md @@ -83,28 +83,3 @@ As the executor requests rows from each node, that node fetches or computes the This process continues recursively until the top node has received all the rows it needs to produce the final result. For a SELECT query, these final rows are sent to the client. For data modification queries like INSERT, UPDATE, or DELETE, the rows are used to make the requested changes in the database tables. The executor is designed to efficiently pull rows through the pipeline defined by the plan tree, processing rows in batches where possible for better performance. - -### Optimizations - -- **Incremental sort**. If an intermediate query result is known to be sorted by one or more leading keys of a required sort ordering, the additional sorting can be done considering only the remaining keys, if the rows are sorted in batches that have equal leading keys. - -- **Memoize results**. When only a small percentage of rows is checked on the inner side of a nested-loop join, the executor memoizes the results for improving performance. - -- **Disk-based hash aggregation**. Hash-based operations are generally more sensitive to memory availability and are highly efficient as long as the hash table fits within the memory specified by the work_mem parameter. When the hash table grows beyond the `work_mem` limit, the planner transitions to a disk-based hash aggregation plan. 
This avoids overloading memory and ensures that large datasets can be handled efficiently. - -## Query ID - -In YSQL, to provide a consistent way to track and identify specific queries across different parts of the system such as logs, performance statistics, and EXPLAIN plans, a unique identifier is generated for each query processed. The query ID is effectively a hash value based on the normalized form of the SQL query. This normalization process removes insignificant whitespace and converts literal values to placeholders, ensuring that semantically identical queries have the same ID. This provides the following benefits: - -- By providing a unique identifier for each query, it becomes much easier to analyze query performance and identify problematic queries. -- Including query IDs in logs and performance statistics enables more detailed and accurate monitoring of database activity. -- The EXPLAIN command, which shows the execution plan for a query, can also display the query ID. This helps to link the execution plan with the actual query execution statistics. -- The pg_stat_statements extension (which is installed by default in YugabyteDB) can accurately track and report statistics even for queries with varying literal values (for example, different WHERE clause parameters). This makes it much easier to identify performance bottlenecks caused by specific query patterns. - -Generation of this unique query ID is controlled using the `compute_query_id` setting, which can have the following values: - -- on - Always compute query IDs. -- off - Never compute query IDs. -- auto (the default) - Automatically compute query IDs when needed, such as when pg_stat_statements is enabled (pg_stat_statements is enabled by default). - -You should enable `compute_query_id` to fully realize its benefits for monitoring and performance analysis. 
From 5d67c36f85a807cdb4a00ee11e3ddc2ec7e0b4ea Mon Sep 17 00:00:00 2001 From: Dwight Hodge Date: Wed, 22 Jan 2025 16:35:42 -0500 Subject: [PATCH 15/15] minor edits --- .../statements/cmd_analyze.md | 2 +- .../query-layer/planner-optimizer.md | 2 +- .../develop/postgresql-compatibility.md | 4 +- .../statements/cmd_analyze.md | 2 +- .../develop/postgresql-compatibility.md | 4 +- .../postgresql-compatibility.md | 208 ------------------ .../statements/cmd_analyze.md | 2 +- .../query-layer/planner-optimizer.md | 2 +- .../develop/postgresql-compatibility.md | 4 +- 9 files changed, 11 insertions(+), 219 deletions(-) delete mode 100644 docs/content/stable/explore/ysql-language-features/postgresql-compatibility.md diff --git a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md index e61bfa36b360..67a28412ca05 100644 --- a/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/preview/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -16,7 +16,7 @@ type: docs ANALYZE collects statistics about the contents of tables in the database, and stores the results in the [pg_statistic](../../../../../architecture/system-catalog/#data-statistics), [pg_class](../../../../../architecture/system-catalog/#schema), and [pg_stat_all_tables](../../../../../architecture/system-catalog/#table-activity) system catalogs. These statistics help the query planner to determine the most efficient execution plans for queries. -The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. 
+The statistics are also used by the YugabyteDB [cost based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. {{< warning title="Run ANALYZE manually" >}} Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. diff --git a/docs/content/preview/architecture/query-layer/planner-optimizer.md b/docs/content/preview/architecture/query-layer/planner-optimizer.md index a212509efe12..45900f1b78d3 100644 --- a/docs/content/preview/architecture/query-layer/planner-optimizer.md +++ b/docs/content/preview/architecture/query-layer/planner-optimizer.md @@ -91,4 +91,4 @@ After the optimal plan is determined, YugabyteDB generates a detailed execution ## Learn more - [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/) -- [YugabyteDB Cost-Based Optimizer](https://dev.to/yugabyte/yugabytedb-cost-based-optimizer-and-cost-model-for-distributed-lsm-tree-1hb4) \ No newline at end of file +- [YugabyteDB Cost Based Optimizer](https://dev.to/yugabyte/yugabytedb-cost-based-optimizer-and-cost-model-for-distributed-lsm-tree-1hb4) \ No newline at end of file diff --git a/docs/content/preview/develop/postgresql-compatibility.md b/docs/content/preview/develop/postgresql-compatibility.md index 05eb2fed1cfe..e3f11c648073 100644 --- a/docs/content/preview/develop/postgresql-compatibility.md +++ b/docs/content/preview/develop/postgresql-compatibility.md @@ -23,7 +23,7 @@ To test and take advantage of features developed for enhanced PostgreSQL compati | :--- | :--- | :--- | :--- | | [Read committed](#read-committed) | 
[yb_enable_read_committed_isolation](../../reference/configuration/yb-tserver/#ysql-default-transaction-isolation) | {{}} | | | [Wait-on-conflict](#wait-on-conflict-concurrency) | [enable_wait_queues](../../reference/configuration/yb-tserver/#enable-wait-queues) | {{}} | {{}} | -| [Cost-based optimizer](#cost-based-optimizer) | [yb_enable_base_scans_cost_model](../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) | {{}} | | +| [Cost based optimizer](#cost-based-optimizer) | [yb_enable_base_scans_cost_model](../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) | {{}} | | | [Batch nested loop join](#batched-nested-loop-join) | [yb_enable_batchednl](../../reference/configuration/yb-tserver/#yb-enable-batchednl) | {{}} | {{}} | | [Ascending indexing by default](#default-ascending-indexing) | [yb_use_hash_splitting_by_default](../../reference/configuration/yb-tserver/#yb-use-hash-splitting-by-default) | {{}} | | | [YugabyteDB bitmap scan](#yugabytedb-bitmap-scan) | [yb_enable_bitmapscan](../../reference/configuration/yb-tserver/#yb-enable-bitmapscan) | {{}} | {{}} | @@ -59,7 +59,7 @@ Read Committed isolation level handles serialization errors and avoids the need To learn about read committed isolation, see [Read Committed](../../architecture/transactions/read-committed/). 
{{}} -### Cost-based optimizer +### Cost based optimizer Configuration parameter: `yb_enable_base_scans_cost_model=true` diff --git a/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md index 16dd8cb3368b..1d1a782acdb5 100644 --- a/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md +++ b/docs/content/stable/api/ysql/the-sql-language/statements/cmd_analyze.md @@ -16,7 +16,7 @@ type: docs ANALYZE collects statistics about the contents of tables in the database, and stores the results in the [pg_statistic](../../../../../architecture/system-catalog/#data-statistics), [pg_class](../../../../../architecture/system-catalog/#schema), and [pg_stat_all_tables](../../../../../architecture/system-catalog/#table-activity) system catalogs. These statistics help the query planner to determine the most efficient execution plans for queries. -The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. +The statistics are also used by the YugabyteDB [cost based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution. {{< warning title="Run ANALYZE manually" >}} Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually. 
diff --git a/docs/content/stable/develop/postgresql-compatibility.md b/docs/content/stable/develop/postgresql-compatibility.md index 3334eab887da..c605da6c8c8f 100644 --- a/docs/content/stable/develop/postgresql-compatibility.md +++ b/docs/content/stable/develop/postgresql-compatibility.md @@ -20,7 +20,7 @@ To test and take advantage of features developed for enhanced PostgreSQL compati | :--- | :--- | :--- | :--- | | [Read committed](#read-committed) | [yb_enable_read_committed_isolation](../../reference/configuration/yb-tserver/#ysql-default-transaction-isolation) | {{}} | | | [Wait-on-conflict](#wait-on-conflict-concurrency) | [enable_wait_queues](../../reference/configuration/yb-tserver/#enable-wait-queues) | {{}} | {{}} | -| [Cost-based optimizer](#cost-based-optimizer) | [yb_enable_base_scans_cost_model](../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) | {{}} | | +| [Cost based optimizer](#cost-based-optimizer) | [yb_enable_base_scans_cost_model](../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) | {{}} | | | [Batch nested loop join](#batched-nested-loop-join) | [yb_enable_batchednl](../../reference/configuration/yb-tserver/#yb-enable-batchednl) | {{}} | {{}} | | [Ascending indexing by default](#default-ascending-indexing) | [yb_use_hash_splitting_by_default](../../reference/configuration/yb-tserver/#yb-use-hash-splitting-by-default) | {{}} | | | [YugabyteDB bitmap scan](#yugabytedb-bitmap-scan) | [yb_enable_bitmapscan](../../reference/configuration/yb-tserver/#yb-enable-bitmapscan) | {{}} | {{}} | @@ -56,7 +56,7 @@ Read Committed isolation level handles serialization errors and avoids the need To learn about read committed isolation, see [Read Committed](../../architecture/transactions/read-committed/). 
{{}} -### Cost-based optimizer +### Cost based optimizer Configuration parameter: `yb_enable_base_scans_cost_model=true` diff --git a/docs/content/stable/explore/ysql-language-features/postgresql-compatibility.md b/docs/content/stable/explore/ysql-language-features/postgresql-compatibility.md deleted file mode 100644 index fb7a5067ee18..000000000000 --- a/docs/content/stable/explore/ysql-language-features/postgresql-compatibility.md +++ /dev/null @@ -1,208 +0,0 @@ ---- -title: PostgreSQL compatibility -linkTitle: PostgreSQL compatibility -description: Summary of YugabyteDB's PostgreSQL compatibility -menu: - stable: - identifier: explore-ysql-postgresql-compatibility - parent: explore-ysql-language-features - weight: 1200 -type: docs -rightNav: - hideH4: true ---- - -YugabyteDB is a [PostgreSQL-compatible](https://www.yugabyte.com/tech/postgres-compatibility/) distributed database that supports the majority of PostgreSQL syntax. This means that existing applications built on PostgreSQL can often be migrated to YugabyteDB without changing application code. - -Because YugabyteDB is PostgreSQL compatible, it works with the majority of PostgreSQL database tools such as various language drivers, ORM tools, schema migration tools, and many more third-party database tools. - -PostgreSQL compatibility has two aspects: - -- Feature compatibility - - Compatibility refers to whether YugabyteDB supports all the features of PostgreSQL and behaves as PostgreSQL does. With full PostgreSQL compatibility, you should be able to take an application running on PostgreSQL and run it on YugabyteDB without any code changes. The application will run without any errors, but it may not perform well because of the distributed nature of YugabyteDB. - -- Performance parity - - Performance parity refers to the capabilities of YugabyteDB that allow applications running on PostgreSQL to run with predictable performance on YugabyteDB. 
In other words, the performance degradation experienced by small and medium scale applications going from a single server database to a distributed database should be predictable and bounded. - -## Enhanced PostgreSQL Compatibility Mode - -To test and take advantage of features developed for PostgreSQL compatibility in YugabyteDB that are currently in {{}}, you can enable Enhanced PostgreSQL Compatibility Mode (EPCM). When this mode is turned on, YugabyteDB is configured to use all the latest features developed for feature and performance parity. EPCM is available in [v2024.1](/preview/releases/ybdb-releases/v2024.1/) and later. - - - -After turning this mode on, as you upgrade universes, YugabyteDB will automatically enable new designated PostgreSQL compatibility features. - -As features included in the PostgreSQL compatibility mode transition from {{}} to {{}} in subsequent versions of YugabyteDB, they become enabled by default on new universes, and are no longer managed under EPCM on your existing universes after the upgrade. - -{{}} -If you have set these features independent of EPCM, you cannot use EPCM. - -Conversely, if you are using EPCM on a universe, you cannot set any of the features independently. 
-{{}} - -| Feature | Flag/Configuration Parameter | EA | GA | -| :--- | :--- | :--- | :--- | -| [Read committed](#read-committed) | [yb_enable_read_committed_isolation](../../../reference/configuration/yb-tserver/#ysql-default-transaction-isolation) | {{}} | | -| [Wait-on-conflict](#wait-on-conflict-concurrency) | [enable_wait_queues](../../../reference/configuration/yb-tserver/#enable-wait-queues) | {{}} | {{}} | -| [Cost-based optimizer](#cost-based-optimizer) | [yb_enable_base_scans_cost_model](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) | {{}} | | -| [Batch nested loop join](#batched-nested-loop-join) | [yb_enable_batchednl](../../../reference/configuration/yb-tserver/#yb-enable-batchednl) | {{}} | {{}} | -| [Ascending indexing by default](#default-ascending-indexing) | [yb_use_hash_splitting_by_default](../../../reference/configuration/yb-tserver/#yb-use-hash-splitting-by-default) | {{}} | | -| [YugabyteDB bitmap scan](#yugabytedb-bitmap-scan) | [yb_enable_bitmapscan](../../../reference/configuration/yb-tserver/#yb-enable-bitmapscan) | {{}} | {{}} | -| [Efficient communication
between PostgreSQL and DocDB](#efficient-communication-between-postgresql-and-docdb) | [pg_client_use_shared_memory](../../../reference/configuration/yb-tserver/#pg-client-use-shared-memory) | {{}} | {{}} | - -| Planned Feature | Flag/Configuration Parameter | EA | -| :--- | :--- | :--- | -| [Parallel query](#parallel-query) | | Planned | - -### Released - -The following features are currently available in EPCM. - -#### Read committed - -Flag: `yb_enable_read_committed_isolation=true` - -Read Committed isolation level handles serialization errors and avoids the need to retry errors in the application logic. Read Committed provides feature compatibility, and is the default isolation level in PostgreSQL. When migrating applications from PostgreSQL to YugabyteDB, read committed is the preferred isolation level. - -{{}} -To learn about read committed isolation, see [Read Committed](../../../architecture/transactions/read-committed/). -{{}} - -#### Cost-based optimizer - -Configuration parameter: `yb_enable_base_scans_cost_model=true` - -Cost-based optimizer (CBO) creates optimal execution plans for queries, providing significant performance improvements both in single-primary and distributed PostgreSQL workloads. This feature reduces or eliminates the need to use hints or modify queries to optimize query execution. CBO provides improved performance parity. - -{{}} -When enabling this parameter, you must run `ANALYZE` on user tables to maintain up-to-date statistics. - -When enabling the cost models, ensure that packed row for colocated tables is enabled by setting the `--ysql_enable_packed_row_for_colocated_table` flag to true. - -{{}} - -#### Wait-on-conflict concurrency - -Flag: `enable_wait_queues=true` - -Enables use of wait queues so that conflicting transactions can wait for the completion of other dependent transactions, helping to improve P99 latencies. 
Wait-on-conflict concurrency control provides feature compatibility, and uses the same semantics as PostgreSQL. - -{{}} -To learn about concurrency control in YugabyteDB, see [Concurrency control](../../../architecture/transactions/concurrency-control/). -{{}} - -#### Batched nested loop join - -Configuration parameter: `yb_enable_batchednl=true` - -Batched nested loop join (BNLJ) is a join execution strategy that improves on nested loop joins by batching the tuples from the outer table into a single request to the inner table. By using batched execution, BNLJ helps reduce the latency for query plans that previously used nested loop joins. BNLJ provides improved performance parity. - -{{}} -To learn about join strategies in YugabyteDB, see [Join strategies](../../../architecture/query-layer/join-strategies/). -{{}} - -#### Default ascending indexing - -Configuration parameter: `yb_use_hash_splitting_by_default=false` - -Enable efficient execution for range queries on data that can be sorted into some ordering. In particular, the query planner will consider using an index whenever an indexed column is involved in a comparison using one of the following operators: `< <= = >= >`. - -Also enables retrieving data in sorted order, which can eliminate the need to sort the data. - -Default ascending indexing provides feature compatibility and is the default in PostgreSQL. - -#### YugabyteDB bitmap scan - -Configuration parameter: `yb_enable_bitmapscan=true` - -Bitmap scans use multiple indexes to answer a query, with only one scan of the main table. Each index produces a "bitmap" indicating which rows of the main table are interesting. Bitmap scans can improve the performance of queries containing AND and OR conditions across several index scans. YugabyteDB bitmap scan provides feature compatibility and improved performance parity. For YugabyteDB relations to use a bitmap scan, the PostgreSQL parameter `enable_bitmapscan` must also be true (the default). 
-
-#### Efficient communication between PostgreSQL and DocDB
-
-Configuration parameter: `pg_client_use_shared_memory=true`
-
-Enable more efficient communication between YB-TServer and PostgreSQL using shared memory. This feature provides improved performance parity.
-
-### Planned
-
-The following features are planned for EPCM in future releases.
-
-#### Parallel query
-
-Enables the use of PostgreSQL [parallel queries](https://www.postgresql.org/docs/11/parallel-query.html). Using parallel queries, the query planner can devise plans that leverage multiple CPUs to answer queries faster. Parallel query provides feature compatibility and improved performance parity.
-
-### Enable EPCM
-
-#### YugabyteDB
-
-To enable EPCM in YugabyteDB:
-
-- Pass the `enable_pg_parity_early_access` flag to [yugabyted](../../../reference/configuration/yugabyted/) when starting your cluster.
-
-For example, from your YugabyteDB home directory, run the following command:
-
-```sh
-./bin/yugabyted start --enable_pg_parity_early_access
-```
-
-Note: When enabling the cost models, ensure that packed row for colocated tables is enabled by setting the `--ysql_enable_packed_row_for_colocated_table` flag to true.
-
-#### YugabyteDB Anywhere
-
-To enable EPCM in YugabyteDB Anywhere v2024.1, see the [Release notes](/preview/releases/yba-releases/v2024.1/#v2024.1.0.0).
-
-To enable EPCM in YugabyteDB Anywhere v2024.2 or later:
-
-- When creating a universe, turn on the **Enable Enhanced Postgres Compatibility** option.
-
-  You can also change the setting on deployed universes using the **More > Edit Postgres Compatibility** option.
-
-{{}}
-Setting Enhanced Postgres Compatibility overrides any [flags you set](../../../yugabyte-platform/manage-deployments/edit-config-flags/) individually for the universe. The **G-Flags** tab will however continue to display the setting that you customized.
-{{}}
-
-#### YugabyteDB Aeon
-
-To enable EPCM in YugabyteDB Aeon:
-
-1. When creating a cluster, choose a track with database v2024.1.0 or later.
-1. Select the **Enhanced Postgres Compatibility** option (on by default).
-
-You can also change the setting on the **Settings** tab for deployed clusters.
-
-## Unsupported PostgreSQL features
-
-Because YugabyteDB is a distributed database, supporting all PostgreSQL features in a distributed system is not always feasible. This section documents the known list of differences between PostgreSQL and YugabyteDB. You need to consider these differences while porting an existing application to YugabyteDB.
-
-The following PostgreSQL features are not supported in YugabyteDB:
-
-| Unsupported PostgreSQL feature | Track feature request GitHub issue |
-| ----------- | ----------- |
-| LOCK TABLE to obtain a table-level lock | {{}}|
-| Table inheritance | {{}}|
-| Exclusion constraints | {{}}|
-| Deferrable constraints | {{}}|
-| Constraint Triggers|{{}}|
-| GiST indexes | {{}}|
-| Events (Listen/Notify) | {{}}|
-| XML Functions | {{}}|
-| XA syntax | {{}}|
-| ALTER TYPE | {{}}|
-| CREATE CONVERSION | {{}}|
-| Primary/Foreign key constraints on foreign tables | {{}}, {{}} |
-| GENERATED ALWAYS AS STORED columns | {{}}|
-| Multi-column GIN indexes| {{}}|
-| CREATE ACCESS METHOD | {{}}|
-| DESC/HASH on GIN indexes (ASC supported) | {{}}|
-| CREATE SCHEMA with elements | {{}}|
-| Index on citext column | {{}}|
-| ABSTIME type | {{}}|
-| transaction ids (xid) YugabyteDB uses [Hybrid logical clocks](../../../architecture/transactions/transactions-overview/#hybrid-logical-clocks) instead of transaction ids. | {{}}|
-| DDL operations within transaction| {{}}|
-| Some ALTER TABLE variants| {{}}|
-| UNLOGGED table | {{}} |
-| Indexes on complex datatypes such as INET, CITEXT, JSONB, ARRAYs, and so on.| {{}}, {{}}, {{}} |
-| %TYPE syntax in Functions/Procedures/Triggers|{{}}|
-| Storage parameters on indexes or constraints|{{}}|
diff --git a/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md b/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md
index 9d5cab41907c..445f315667f1 100644
--- a/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md
+++ b/docs/content/v2024.1/api/ysql/the-sql-language/statements/cmd_analyze.md
@@ -16,7 +16,7 @@ type: docs
 
 ANALYZE collects statistics about the contents of tables in the database, and stores the results in the [pg_statistic](../../../../../architecture/system-catalog/#data-statistics), [pg_class](../../../../../architecture/system-catalog/#schema), and [pg_stat_all_tables](../../../../../architecture/system-catalog/#table-activity) system catalogs. These statistics help the query planner to determine the most efficient execution plans for queries.
 
-The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution.
+The statistics are also used by the YugabyteDB [cost based optimizer](../../../../../architecture/query-layer/planner-optimizer) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution.
 
 {{< warning title="Run ANALYZE manually" >}}
 Currently, YugabyteDB doesn't run a background job like PostgreSQL autovacuum to analyze the tables. To collect or update statistics, run the ANALYZE command manually.
diff --git a/docs/content/v2024.1/architecture/query-layer/planner-optimizer.md b/docs/content/v2024.1/architecture/query-layer/planner-optimizer.md
index 56fae747bea2..f3940972da72 100644
--- a/docs/content/v2024.1/architecture/query-layer/planner-optimizer.md
+++ b/docs/content/v2024.1/architecture/query-layer/planner-optimizer.md
@@ -91,4 +91,4 @@ After the optimal plan is determined, YugabyteDB generates a detailed execution
 ## Learn more
 
 - [Exploring the Cost Based Optimizer](https://www.yugabyte.com/blog/yugabytedb-cost-based-optimizer/)
-- [YugabyteDB Cost-Based Optimizer](https://dev.to/yugabyte/yugabytedb-cost-based-optimizer-and-cost-model-for-distributed-lsm-tree-1hb4)
\ No newline at end of file
+- [YugabyteDB Cost Based Optimizer](https://dev.to/yugabyte/yugabytedb-cost-based-optimizer-and-cost-model-for-distributed-lsm-tree-1hb4)
\ No newline at end of file
diff --git a/docs/content/v2024.1/develop/postgresql-compatibility.md b/docs/content/v2024.1/develop/postgresql-compatibility.md
index b8d321c9746d..20a87d4f7978 100644
--- a/docs/content/v2024.1/develop/postgresql-compatibility.md
+++ b/docs/content/v2024.1/develop/postgresql-compatibility.md
@@ -20,7 +20,7 @@ To test and take advantage of features developed for enhanced PostgreSQL compati
 | :--- | :--- | :--- | :--- |
 | [Read committed](#read-committed) | [yb_enable_read_committed_isolation](../../reference/configuration/yb-tserver/#ysql-default-transaction-isolation) | {{}} | |
 | [Wait-on-conflict](#wait-on-conflict-concurrency) | [enable_wait_queues](../../reference/configuration/yb-tserver/#enable-wait-queues) | {{}} | {{}} |
-| [Cost-based optimizer](#cost-based-optimizer) | [yb_enable_base_scans_cost_model](../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) | {{}} | |
+| [Cost based optimizer](#cost-based-optimizer) | [yb_enable_base_scans_cost_model](../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) | {{}} | |
 | [Batch nested loop join](#batched-nested-loop-join) | [yb_enable_batchednl](../../reference/configuration/yb-tserver/#yb-enable-batchednl) | {{}} | {{}} |
 | [Ascending indexing by default](#default-ascending-indexing) | [yb_use_hash_splitting_by_default](../../reference/configuration/yb-tserver/#yb-use-hash-splitting-by-default) | {{}} | |
 | [YugabyteDB bitmap scan](#yugabytedb-bitmap-scan) | [yb_enable_bitmapscan](../../reference/configuration/yb-tserver/#yb-enable-bitmapscan) | {{}} | {{}} |
@@ -56,7 +56,7 @@ Read Committed isolation level handles serialization errors and avoids the need
 To learn about read committed isolation, see [Read Committed](../../architecture/transactions/read-committed/).
 {{}}
 
-### Cost-based optimizer
+### Cost based optimizer
 
 Configuration parameter: `yb_enable_base_scans_cost_model=true`