From 9528f0e0a4715e19a94a37d046a7336b1c671894 Mon Sep 17 00:00:00 2001 From: wangbo Date: Fri, 27 Dec 2024 21:12:32 +0800 Subject: [PATCH] refactor workload group en (#1651) ## Versions - [x] dev - [ ] 3.0 - [ ] 2.1 - [ ] 2.0 ## Languages - [ ] Chinese - [x] English ## Docs Checklist - [ ] Checked by AI - [ ] Test Cases Built --- .../workload-management/workload-group.md | 737 +++++++++++++++--- .../workload-management/workload-group.md | 6 +- .../workload-management/workload-group.md | 6 +- .../workload-management/workload-group.md | 6 +- 4 files changed, 622 insertions(+), 133 deletions(-) diff --git a/docs/admin-manual/workload-management/workload-group.md b/docs/admin-manual/workload-management/workload-group.md index dc491a96259cd..50e3cb65ca392 100644 --- a/docs/admin-manual/workload-management/workload-group.md +++ b/docs/admin-manual/workload-management/workload-group.md @@ -24,137 +24,171 @@ specific language governing permissions and limitations under the License. --> -Use Workload Groups in Doris to manage and limit resources. By employing resource control, you can effectively limit the CPU, memory, and IO resources used by queries and imports, and create query queues to manage the maximum concurrency of queries in the cluster. Since Doris version 2.1, CPU resource limitations are enforced using CGroup. Before using the Workload resource control feature, you need to configure the CGroup environment. When setting up Workload resource control, you must first choose whether to apply soft or hard limits based on your business needs: -- Soft Limit: Allows borrowing resources from other Workload Groups when there is no resource contention, potentially exceeding the soft limit. -- Hard Limit: Ensures that the resource allocation cannot exceed the specified quota, regardless of resource contention. -To use Workload resource control, you need to perform the following steps: -1. Create a Workload Group. -2. Add resource limitation rules to the Workload Group. -3. Bind tenants to the Workload Group. +Workload Group is an in-process mechanism for isolating workloads. +It achieves resource isolation by finely partitioning or limiting resources (CPU, IO, Memory) within the BE process. +Its principle is illustrated in the diagram below: -## Version Upgrade Notes -Workload resource control has been available since Doris version 2.0. In Doris 2.0, Workload resource control did not depend on CGroup, but Doris 2.1 requires CGroup. +![workload_group](/images/workload_group_arch.png) -Upgrading from Doris 2.0 to 2.1: Since Workload resource control in version 2.1 depends on CGroup, you must first configure the CGroup environment before upgrading to Doris 2.1. +The currently supported isolation capabilities include: -## Configuring the CGroup Environment -In Doris version 2.0, CPU resource limitation was implemented based on Doris's scheduling, which provided great flexibility but lacked precise CPU isolation. From version 2.1, Doris uses CGroup for CPU resource limitation. Users requiring strong resource isolation are recommended to upgrade to version 2.1 and ensure CGroup is installed on all BE nodes. +* Managing CPU resources, with support for both cpu hard limit and cpu soft limit; +* Managing memory resources, with support for both memory hard limit and memory soft limit; +* Managing IO resources, including IO generated by reading local and remote files. -If you used soft limits in Workload Groups in version 2.0 and upgraded to 2.1, you also need to configure CGroup to avoid losing soft limit functionality. Without CGroup configured, users can use all Workload Group features except CPU limitation. -:::tip -1. The Doris BE node can effectively utilize the CPU and memory resources of the machine. It is recommended to deploy only one BE instance per machine. Currently, the workload resource management does not support deploying multiple BE instances on a single machine. -2. After a machine restart, the following CGroup configurations will be cleared. If you want the configurations to persist after a reboot, you can use systemd to set the operation as a custom system service. This way, the creation and authorization operations will be automatically performed each time the machine restarts. -3. If CGroup is used within a container, the container needs to have permissions to operate on the host machine. - ::: +## Version Notes -### Verifying CGroup Installation on BE Nodes -Check /proc/filesystems to determine if CGroup is installed: -cat /proc/filesystems | grep cgroup -nodev cgroup -nodev cgroup2 -nodev cgroupfs -Look for cgroup, cgroup2, or cgroupfs in the output, indicating CGroup support. Further verify the CGroup version. +- The Workload Group feature has been available since Doris 2.0. In Doris 2.0, the Workload Group feature does not rely on CGroup, but starting with Doris 2.1, it requires CGroup. -#### Determining CGroup Version -For CGroup V1, multiple subsystems are mounted under /sys/fs/cgroup. The presence of /sys/fs/cgroup/cpu indicates CGroup V1 is in use: -``` -## CGroup V1 is in use -ls /sys/fs/cgroup/cpu -``` +- Upgrading from Doris 1.2 to 2.0: It is recommended to enable the Workload Group feature only after the entire cluster has been upgraded. If only some follower FE nodes are upgraded, queries on the upgraded follower FE nodes may fail due to the absence of Workload Group metadata on the non-upgraded FE nodes. -For CGroup V2, all controllers are managed in a unified hierarchy. The presence of /sys/fs/cgroup/cgroup.controllers indicates CGroup V2 is in use: -``` -## CGroup V2 is in use -ls /sys/fs/cgroup/cgroup.controllers +- Upgrading from Doris 2.0 to 2.1: Since the Workload Group feature in Doris 2.1 relies on CGroup, you need to configure the CGroup environment before upgrading to Doris 2.1. + +## Configuring Workload Group + +### Setting Up the CGroup Environment +Workload Group supports managing CPU, memory, and IO. CPU management relies on the CGroup component. +To use Workload Group for CPU resource management, you must first configure the CGroup environment. + +The following are the steps for configuring the CGroup environment: + +1. First, verify whether the node where the BE is located has CGroup installed. +If the output includes cgroup, it indicates that CGroup V1 is installed in the current environment. +If it includes cgroup2, it indicates that CGroup V2 is installed. You can determine which version is active in the next step. +```shell +cat /proc/filesystems | grep cgroup +nodev cgroup +nodev cgroup2 +nodev cgroupfs ``` -Configure CGroup based on its version when using Workload resource control in Doris. -### Using CGroup V1 -If using CGroup V1, you need to create a CPU management directory for Doris under the /sys/fs/cgroup/cpu directory. You can customize the directory name. In the following example, /sys/fs/cgroup/cpu/doris is used: +2. The active CGroup version can be confirmed based on the path name. +```shell +If this path exists, it indicates that CGroup V1 is currently active. +/sys/fs/cgroup/cpu/ + + +If this path exists, it indicates that CGroup V2 is currently active. +/sys/fs/cgroup/cgroup.controllers ``` -## Create cgroup dir for Doris + +3. Create a directory named doris under the CGroup path. The directory name can be customized by the user. + +```shell +If using CGroup V1, create the directory under the cpu directory. mkdir /sys/fs/cgroup/cpu/doris -## Modify the Doris cgroup directory permissions -chmod 770 /sys/fs/cgroup/cpu/doris -## Grant user permissions for Doris usage -chown -R doris:doris /sys/fs/cgroup/cpu/doris +If using CGroup V2, create the directory directly under the cgroup directory. +mkdir /sys/fs/cgroup/doris ``` -### Using CGroup V2 -Due to stricter permission control in CGroup V2, write access to the cgroup.procs file in the root directory is required to move processes between groups: -Grant permission to the cgroup.procs directory using the following command: +4. Ensure that the Doris BE process has read, write, and execute permissions for this directory. +```shell +// If using CGroup V1, the command is as follows: +// 1. Modify the directory's permissions to be readable, writable, and executable. +chmod 770 /sys/fs/cgroup/cpu/doris +// 2. Change the ownership of this directory to the doris account. +chown -R doris:doris /sys/fs/cgroup/cpu/doris + +// If using CGroup V2, the command is as follows: +// 1.Modify the directory's permissions to be readable, writable, and executable. +chmod 770 /sys/fs/cgroup/doris +// 2. Change the ownership of this directory to the doris account. +chown -R doris:doris /sys/fs/cgroup/doris ``` + +5. If using CGroup V2, the following steps are required. +This is because CGroup V2 has stricter permission control, and write permissions to the cgroup.procs file in the root directory are necessary to move processes between groups. + If it is CGroup V1, this step is not required. +```shell chmod a+w /sys/fs/cgroup/cgroup.procs ``` -### Configuring CGroup for BE Nodes -Before using Workload resource control, configure the CGroup path in the BE configuration file be/conf/be.conf: -``` +6. Modify the BE configuration to specify the path of the cgroup. +```shell +If using CGroup V1, the configuration path is as follows: doris_cgroup_cpu_path = /sys/fs/cgroup/cpu/doris + +If using CGroup V2, the configuration path is as follows: +doris_cgroup_cpu_path = /sys/fs/cgroup/doris ``` -Restart the BE node after configuring be.conf. Check the BE.INFO log for the "add thread {pid} to group" message to confirm successful configuration. -## Managing Resources with Workload Groups -After creating a Workload Group, you can add resource limitation rules. Doris currently supports the following rules: -- Hard or soft limits on CPU -- Hard or soft limits on memory -- Limits on remote or local IO -- Query queues for managing query jobs +7. Restart the BE, and in the log (be.INFO), the phrase 'add thread xxx to group' indicates that the configuration was successful. -### Creating Custom Workload Groups -Use an ADMIN user to create Workload Groups and add resource rules using the CREATE WORKLOAD GROUP statement. Since Doris 2.1, a default Workload Group named normal is automatically created, and users are bound to it by default. The following example creates a Workload Group g1 with CPU and memory resource limits: +:::tip +1. It is recommended to deploy only one BE per machine, as the current Workload Group feature does not support deploying multiple BE instances on a single machine. +2. After a machine is restarted, all configurations under the CGroup path will be cleared. +To persist the CGroup configuration, you can use systemd to set the operation as a custom system service, +so that the creation and authorization operations can be automatically performed each time the machine restarts. +3. If using CGroup within a container, the container must have permission to operate on the host machine. + ::: -``` -CREATE WORKLOAD GROUP IF NOT EXISTS g1 -PROPERTIES( - "cpu_share"="1024", - "memory_limit"="30%" -); -``` +#### Considerations for Using Workload Group in Containers +Workload's CPU management is based on CGroup. If you want to use Workload Group inside a container, +the container needs to be started in privileged mode so that the BE process inside the container has permission to read and write CGroup files on the host machine. -### Modifying Workload Group Resource Rules -You can view the created Workload Group information by accessing the Doris system table information_schema.workload_groups. -To delete a Workload Group, refer to [DROP-WORKLOAD-GROUP](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-WORKLOAD-GROUP); -The ALTER-WORKLOAD-GROUP command can be used to adjust and modify the Workload Group configuration, refer[ALTER-WORKLOAD-GROUP](../../sql-manual/sql-statements/Data-Definition-Statements/Alter/ALTER-WORKLOAD-GROUP). +When BE runs inside a container, the CPU resource usage for Workload Group is partitioned based on the available resources of the container. +For example, if the host machine has 64 cores and the container is allocated 8 cores, +and the Workload Group is configured with a 50% CPU hard limit, the actual available CPU cores for the Workload Group will be 4 (8 cores * 50%). -#### Adding or Modifying Resource Items -Modify the memory limit for the g1 Workload Group: -``` -ALTER WORKLOAD GROUP g1 PROPERTIES('memory_limit'='10%'); -``` +The memory and IO management functions of Workload Group are implemented internally by Doris and do not rely on external components, +so there is no difference in deployment between containers and physical machines. + +If you want to use Doris on K8S, it is recommended to deploy it using the Doris Operator, which can shield underlying permission issues. -You can view the modified memory limits through the information_schema.workload_groups system table: +### Create Workload Group ``` -SELECT name, memory_limit FROM information_schema.workload_groups; -+--------+--------------+ -| name | memory_limit | -+--------+--------------+ -| normal | 30% | -| g1 | 10% | -+--------+--------------+ +mysql [information_schema]>create workload group if not exists g1 + -> properties ( + -> "cpu_share"="1024" + -> ); +Query OK, 0 rows affected (0.03 sec) + ``` +You can refer to [CREATE-WORKLOAD-GROUP](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-WORKLOAD-GROUP)。 -#### Configuring Soft and Hard Limits -Using the Workload Group feature, you can set soft and hard limits for CPU and memory resources, while for remote and local I/O, only hard limits are available: -- Soft Limit: The soft limit acts as a warning threshold for the resource. Under normal operation, users will not exceed this limit. When other Workload Groups have lower loads, resources from those groups can be borrowed, exceeding the soft limit. -- Hard Limit: The hard limit is the absolute upper bound for resource usage. Regardless of whether other Workload Groups are underloaded, the hard limit cannot be exceeded. Hard limits are typically used to prevent resource misuse in the system. +The CPU limit configured at this point is a soft limit. Since version 2.1, Doris will automatically create a group named normal, which cannot be deleted. -| | soft limit switch and params | soft limit switch and params | Description | -|-----------|----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|-----| -| CPU | Switch:FE Config - enable_cpu_hard_limit = false params:Property - cpu_share | switch:FE Config - enable_cpu_hard_limit = true params: property - cpu_hard_limit |Only soft or hard limits can be set for different Workload Groups simultaneously | -| Memory | Switch:property - enable_memory_overcommit = true params:property - memory_limit | switch:property - enable_memory_overcommit = false params: property - memory_limit | Soft and hard limits can be set independently for different Workload Groups | -| local IO | None | params: read_bytes_per_second | Only hard limits are currently available for local IO | -| remote IO | None | params: remote_read_bytes_per_second | Only hard limits are currently available for remote IO | +### Workload Group Properties -### Binding Tenants to Workload Groups -Non-ADMIN users must first check their permissions for a Workload Group. Use the information_schema.workload_groups system table to verify permissions. Bind tenants to Workload Groups using user properties or session variables. Session variables take precedence over user properties. -``` +| Property | Data type | Default value | Value range | Description | +|------------------------------|----------------|---------------|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| cpu_share | Integer | -1 | [1, 10000] | Optional, effective under CPU soft limit mode. The valid range of values depends on the version of CGroup being used, which is described in detail later. cpu_share represents the weight of CPU time that the Workload Group can acquire; the larger the value, the more CPU time it can obtain. For example, if the user creates three Workload Groups, g-a, g-b, and g-c, with cpu_share values of 10, 30, and 40 respectively, and at a certain point g-a and g-b are running tasks while g-c has no tasks, g-a will receive 25% (10 / (10 + 30)) of the CPU resources, and g-b will receive 75% of the CPU resources. If only one Workload Group is running in the system, regardless of the cpu_share value, it will be able to acquire all the CPU resources. | +| memory_limit | Floating-point | -1 | (0%, 100%] | Optional. Enabling memory hard limit represents the maximum available memory percentage for the current Workload Group. The default value means no memory limit is applied. The cumulative value of memory_limit for all Workload Groups cannot exceed 100%, and it is typically used in conjunction with the enable_memory_overcommit attribute. For example, if a machine has 64GB of memory and the memory_limit for a Workload Group is set to 50%, the actual physical memory available for that group would be 64GB * 90% * 50% = 28.8GB, where 90% is the default value for the available memory configuration of the BE process. | +| enable_memory_overcommit | Boolean | true | true, false | Optional. Used to control whether the memory limit for the current Workload Group is a hard limit or a soft limit, with the default set to true. If set to false, the Workload Group will have hard memory limit, and when the system detects that the memory usage exceeds the limit, it will immediately cancel the tasks with the highest memory usage within the group to release the excess memory. If set to true, the Workload Group will have soft memory limit. If there is free memory available, the Workload Group can continue using system memory even after exceeding the memory_limit. When the system's total memory is under pressure, the system will cancel the tasks with the highest memory usage within the group and release some of the excess memory to alleviate system memory pressure. It is recommended that the total memory_limit of all Workload Groups be kept below 100% to reserve memory for other components of the BE process. | +| cpu_hard_limit | Integer | -1 | [1%, 100%] | Optional. Effective under CPU hard limit mode, it represents the maximum CPU percentage a Workload Group can use. Regardless of whether the machine's CPU resources are fully utilized, the Workload Group's CPU usage cannot exceed the cpu_hard_limit. The cumulative value of cpu_hard_limit for all Workload Groups cannot exceed 100%. This attribute was introduced in version 2.1 and is not supported in version 2.0. | +| max_concurrency | Integer | 2147483647 | [0, 2147483647] | Optional. Specifies the maximum query concurrency. The default value is the maximum value of an integer, meaning no concurrency limit. When the number of running queries reaches the maximum concurrency, new queries will enter a queue. | +| max_queue_size | Integer | 0 | [0, 2147483647] | Optional. Specifies the length of the query waiting queue. When the queue is full, new queries will be rejected. The default value is 0, which means no queuing. If the queue is full, new queries will fail directly. | +| queue_timeout | Integer | 0 | [0, 2147483647] | Optional. Specifies the maximum waiting time for a query in the waiting queue, in milliseconds. If the query's waiting time in the queue exceeds this value, an exception will be thrown directly to the client. The default value is 0, meaning no queuing; queries will immediately fail upon entering the queue. | +| scan_thread_num | Integer | -1 | [1, 2147483647] | Optional. Specifies the number of threads used for scanning in the current Workload Group. When this property is set to -1, it means it is not active, and the actual scan thread num on the BE will default to the doris_scanner_thread_pool_thread_num configuration in the BE. | +| max_remote_scan_thread_num | Integer | -1 | [1, 2147483647] | Optional. Specifies the maximum number of threads in the scan thread pool for reading external data sources. When this property is set to -1, the actual number of threads is determined by the BE, typically based on the number of CPU cores. | +| min_remote_scan_thread_num | Integer | -1 | [1, 2147483647] | Optional. Specifies the minimum number of threads in the scan thread pool for reading external data sources. When this property is set to -1, the actual number of threads is determined by the BE, typically based on the number of CPU cores. | +| tag | String | empty | - | Specifies tags for the Workload Group. The cumulative resource values of Workload Groups with the same tag cannot exceed 100%. To specify multiple values, use commas to separate them. | +| read_bytes_per_second | Integer | -1 | [1, 9223372036854775807] | Optional. Specifies the maximum I/O throughput when reading internal tables in Doris. The default value is -1, meaning no I/O bandwidth limit is applied. It is important to note that this value is not tied to individual disks but to directories. For example, if Doris is configured with two directories to store internal table data, the maximum read I/O for each directory will not exceed this value. If both directories are placed on the same disk, the maximum throughput will be doubled (i.e., 2 times read_bytes_per_second). The file directory for spill disk is also subject to this limit. | +| remote_read_bytes_per_second | Integer | -1 | [1, 9223372036854775807] | Optional. Specifies the maximum I/O throughput when reading external tables in Doris. The default value is -1, meaning no I/O bandwidth limit is applied. | + +:::tip + +1. Currently, the simultaneous use of both cpu hard limit and cpu soft limit is not supported. +At any given time, a cluster can only have either a soft limit or a hard limit. The method for switching between them will be described later. + +2. All properties are optional, but at least one property must be specified when creating a Workload Group. + +3. It is important to note that the default values for CPU soft limits differ between CGroup v1 and CGroup v2. The default CPU soft limit for CGroup v1 is 1024, with a valid range from 2 to 262144, while the default for CGroup v2 is 100, with a valid range from 1 to 10000. + If a value outside the range is set for the soft limit, it may cause the CPU soft limit modification to fail in BE. If the default value of 100 from CGroup v2 is applied in a CGroup v1 environment, it could result in this Workload Group having the lowest priority on the machine. + ::: + +## Set Workload Group for user +Before binding a user to a specific Workload Group, it is necessary to ensure that the user has the necessary permissions for the Workload Group. +You can use the user to query the information_schema.workload_groups system table, and the result will show the Workload Groups that the current user has permission to access. +The following query result indicates that the current user has access to the g1 and normal Workload Groups: + +```sql SELECT name FROM information_schema.workload_groups; +--------+ | name | @@ -164,49 +198,504 @@ SELECT name FROM information_schema.workload_groups; +--------+ ``` -If you cannot see the g1 Workload Group, you can use the GRANT statement to grant permissions to the user. -When binding a Workload Group to a tenant, you can do so either by setting a user property or specifying a session variable. When both methods are used, the session variable takes priority over the user property: +If the g1 Workload Group is not visible, you can use the ADMIN account to execute the GRANT statement to authorize the user. For example: +``` +"GRANT USAGE_PRIV ON WORKLOAD GROUP 'g1' TO 'user_1'@'%';" +``` +This statement means granting the user_1 the permission to use the Workload Group named g1. +More details can be found in [grant](../../sql-manual/sql-statements/Account-Management-Statements/GRANT)。 + +**Two ways to bind Workload Group to user** +1. By setting the user property, you can bind the user to a default Workload Group. The default is normal. It's important to note that the value here cannot be left empty; otherwise, the statement will fail. +``` +set property 'default_workload_group' = 'g1'; +``` +After executing this statement, the current user's queries will default to using the 'g1' Workload Group. + -- Binding Workload Group using user property: Typically, administrators use the SET-PROPERTY command to bind the default Workload Group for a tenant. In the following example, the default Workload Group g1 is bound to the test_wlg tenant。 +2. By specifying the Workload Group through a session variable, the default is empty: ``` -set property for 'test_wlg' 'default_workload_group' = 'g1'; +set workload_group = 'g1'; ``` +When both methods are used to specify a Workload Group for the user, the session variable takes priority over the user property. -- Using Session Variables: During development, even if an administrator has set a default Workload Group, it can be overridden in the session using the workload_group variable. In the following example, the Workload Group for the current session is set to g1: +## Show Workload Group +1. You can use the SHOW statement to view the Workload Group: ``` -SET workload_group = 'g1'; +show workload groups; ``` +More details can be found in [SHOW-WORKLOAD-GROUPS](../../sql-manual/sql-statements/Show-Statements/SHOW-WORKLOAD-GROUPS) -## Grouping Workload Groups -In a multi-workload or multi-tenant environment, a Doris cluster may be split into multiple sub-clusters, such as some nodes used for federated queries from external storage and some nodes used for real-time queries on internal tables. Workload Groups can tag BE nodes, and BE nodes with the same tag form a sub-cluster. The resources of each sub-cluster are calculated independently, and the total resource usage within each sub-cluster cannot exceed 100%. In the following example, seven machines are divided into two sub-clusters, sub_a and sub_b, with two Workload Groups created in each sub-cluster. -In a multi-workload or multi-tenant environment, a Doris cluster may be split into multiple sub-clusters, such as some nodes used for federated queries from external storage and some nodes used for fact queries on internal tables. The two sub-clusters are completely isolated in terms of data distribution and resource usage. Within the same sub-cluster, multiple tenants need to be created along with isolation rules for resource usage between tenants. For complex resource isolation requirements, you can combine the Resource Group and Workload Group features. Resource Groups can be used to achieve node-level isolation for multiple sub-clusters, while Workload Groups can be used within each sub-cluster to isolate resource usage between tenants. As shown in the diagram below, two sub-clusters, A and B, are defined, each with its own Workload Group for resource management: +2. You can view the Workload Group through the system table: +``` +mysql [information_schema]>select * from information_schema.workload_groups where name='g1'; ++-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+ +| ID | NAME | CPU_SHARE | MEMORY_LIMIT | ENABLE_MEMORY_OVERCOMMIT | MAX_CONCURRENCY | MAX_QUEUE_SIZE | QUEUE_TIMEOUT | CPU_HARD_LIMIT | SCAN_THREAD_NUM | MAX_REMOTE_SCAN_THREAD_NUM | MIN_REMOTE_SCAN_THREAD_NUM | MEMORY_LOW_WATERMARK | MEMORY_HIGH_WATERMARK | TAG | READ_BYTES_PER_SECOND | REMOTE_READ_BYTES_PER_SECOND | ++-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+ +| 14009 | g1 | 1024 | -1 | true | 2147483647 | 0 | 0 | -1 | -1 | -1 | -1 | 50% | 80% | | -1 | -1 | ++-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+ +1 row in set (0.05 sec) +``` + +## Alter Workload Group +``` +mysql [information_schema]>alter workload group g1 properties('cpu_share'='2048'); +Query OK, 0 rows affected (0.00 sec -![group_workload_group_1](/images/workload-management/group_wg_1.png) +mysql [information_schema]>select cpu_share from information_schema.workload_groups where name='g1'; ++-----------+ +| cpu_share | ++-----------+ +| 2048 | ++-----------+ +1 row in set (0.02 sec) -1. Create sub_cluster_a and sub_cluster_b Resource Groups, dividing seven machines into two sub-clusters: ``` --- create resource group sub_cluster_a -ALTER SYSTEM MODIFY BACKEND "192.168.88.31:9050" SET("tag.location" = "sub_cluster_a"); -ALTER SYSTEM MODIFY BACKEND "192.168.88.32:9050" SET("tag.location" = "sub_cluster_a"); -ALTER SYSTEM MODIFY BACKEND "192.168.88.33:9050" SET("tag.location" = "sub_cluster_a"); --- create resource group sub_cluster_b -ALTER SYSTEM MODIFY BACKEND "192.168.88.34:9050" SET("tag.location" = "sub_cluster_b"); -ALTER SYSTEM MODIFY BACKEND "192.168.88.35:9050" SET("tag.location" = "sub_cluster_b"); +More details can be found in [ALTER-WORKLOAD-GROUP](../../sql-manual/sql-statements/Data-Definition-Statements/Alter/ALTER-WORKLOAD-GROUP) + +## Drop Workload Group +``` +mysql [information_schema]>drop workload group g1; +Query OK, 0 rows affected (0.01 sec) ``` -2. Create Workload Groups for memory resource isolation within sub-clusters: +More details can be found in[DROP-WORKLOAD-GROUP](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-WORKLOAD-GROUP) + +## Explanation of Switching Between CPU Soft and Hard Limit Modes +Currently, Doris does not support running both CPU soft and hard limits simultaneously. At any given time, a Doris cluster can only operate in either CPU soft limit or CPU hard limit mode. +Users can switch between these two modes, and the switching method is as follows: + +1 If the current cluster configuration is set to the default CPU soft limit and you wish to change it to CPU hard limit, you need to modify the cpu_hard_limit parameter of the Workload Group to a valid value. ``` --- create workload groups for sub cluster A -CREATE WORKLOAD GROUP a_wlg_1 PROPERTIES('tag' = "sub_cluster_a", "memory_limit" = "30"); -CREATE WORKLOAD GROUP a_wlg_2 PROPERTIES('tag' = "sub_cluster_a", "memory_limit" = "30"); -CREATE WORKLOAD GROUP a_wlg_3 PROPERTIES('tag' = "sub_cluster_a", "memory_limit" = "30"); +alter workload group test_group properties ( 'cpu_hard_limit'='20%' ); +``` +All Workload Groups in the cluster need to be modified, and the cumulative value of cpu_hard_limit for all Workload Groups cannot exceed 100%. + +Since CPU hard limits cannot automatically have a valid value, simply enabling the switch without modifying the property will prevent the CPU hard limit from taking effect. --- create workload groups for sub cluster B -CREATE WORKLOAD GROUP b_wlg_1 PROPERTIES('tag' = "sub_cluster_b", "memory_limit" = "30"); -CREATE WORKLOAD GROUP b_wlg_2 PROPERTIES('tag' = "sub_cluster_b", "memory_limit" = "30"); +2 Enable the CPU hard limit on all FE nodes ``` +1 Modify the configuration in the fe.conf file on the disk. +experimental_enable_cpu_hard_limit = true + + +2 Modify the configuration in memory. +ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "true"); +``` + +If the user wishes to switch from CPU hard limit back to CPU soft limit, they need to set the value of enable_cpu_hard_limit to false on all FE nodes. +The CPU soft limit property cpu_share will default to a valid value of 1024 (if it was not previously specified). Users can adjust the cpu_share value based on the priority of the group. + +## Testing +### Memory hard limit +Adhoc-type queries typically have unpredictable SQL inputs and uncertain memory usage, which poses the risk of a few queries consuming a large amount of memory. +These types of workloads can be allocated to a separate group, and by using the Workload Group's memory hard limit feature, it helps prevent sudden large queries from consuming all memory, which could cause other queries to run out of available memory or result in OOM (Out of Memory) errors. +When the memory usage of this Workload Group exceeds the configured hard limit, the system will kill queries to release memory, preventing the process from running out of memory. + +**Testing environment** + +1 FE, 1 BE, with BE configured to 96 cores and 375GB of memory. + +The test dataset is clickbench, and the testing method involves using JMeter to run query Q29 with three concurrent executions. + +**Test without enabling memory hard limit for Workload Group** + +1. Check the memory usage of the process. The fourth column in the ps command output represents the physical memory usage of the process, in kilobytes (KB). It shows that under the current test load, the process uses approximately 7.7GB of memory. + + ```sql + [ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481 + 1407481 doris_be 2.0 7896792 + [ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481 + 1407481 doris_be 2.0 7929692 + [ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481 + 1407481 doris_be 2.0 8101232 + ``` + +2. Use Doris system tables to check the current memory usage of the Workload Group. The memory usage of the Workload Group is approximately 5.8GB. + + ```sql + mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201; + +-------------------+ + | wg_mem_used_mb | + +-------------------+ + | 5797.524360656738 | + +-------------------+ + 1 row in set (0.01 sec) + + mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201; + +-------------------+ + | wg_mem_used_mb | + +-------------------+ + | 5840.246627807617 | + +-------------------+ + 1 row in set (0.02 sec) + + mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201; + +-------------------+ + | wg_mem_used_mb | + +-------------------+ + | 5878.394917488098 | + +-------------------+ + 1 row in set (0.02 sec) + ``` + +Here, we can see that the process memory usage is typically much larger than the memory usage of a Workload Group, even if only one Workload Group is running. This is because the Workload Group only tracks the memory used by queries and loads The memory used by other components within the process, such as metadata and various caches, is not counted as part of the Workload Group's memory usage, nor is it managed by the Workload Group. + +**Test with the memory hard limit for Workload Group enabled** +1. Execute the SQL command to modify the memory configuration. + + ```sql + alter workload group g2 properties('memory_limit'='0.5%'); + alter workload group g2 properties('enable_memory_overcommit'='false'); + ``` + +2. Run the same test and check the memory usage in the system table; the memory usage is around 1.5G. + + ```sql + mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201; + +--------------------+ + | wg_mem_used_mb | + +--------------------+ + | 1575.3877239227295 | + +--------------------+ + 1 row in set (0.02 sec) + + mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201; + +------------------+ + | wg_mem_used_mb | + +------------------+ + | 1668.77405834198 | + +------------------+ + 1 row in set (0.01 sec) + + mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201; + +--------------------+ + | wg_mem_used_mb | + +--------------------+ + | 499.96760272979736 | + +--------------------+ + 1 row in set (0.01 sec) + ``` + +3. Use the ps command to check the memory usage of the process; the memory usage is around 3.8G. + + ```sql + [ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481 + 1407481 doris_be 1.0 4071364 + [ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481 + 1407481 doris_be 1.0 4059012 + [ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481 + 1407481 doris_be 1.0 4057068 + ``` + +4. At the same time, the client will observe a significant number of query failures caused by insufficient memory. + + ```sql + 1724074250162,14126,1c_sql,HY000 1105,"java.sql.SQLException: errCode = 2, detailMessage = (127.0.0.1)[MEM_LIMIT_EXCEEDED]GC wg for hard limit, wg id:11201, name:g2, used:1.71 GB, limit:1.69 GB, backend:10.16.10.8. cancel top memory used tracker consumption 1.71 GB. details:process memory used 3.01 GB exceed soft limit 304.41 GB or sys available memory 101.16 GB less than warning water mark 12.80 GB., Execute again after enough memory, details see be.INFO.",并发 1-3,text,false,,444,0,3,3,null,0,0,0 + ``` + +From the error message, it can be observed that the Workload Group used 1.7G of memory, but the Workload Group's limit is 1.69G. The calculation is as follows:1.69G = Physical machine memory (375G) * mem_limit (value from be.conf, default is 0.9) * 0.5% (Workload Group's configuration). +This means the memory percentage configured in the Workload Group is calculated based on the memory available to the BE process. + +**Recommendations** + +As demonstrated in the tests above, memory hard limits can control the memory usage of a Workload Group but do so by terminating queries to release memory. This approach can lead to a poor user experience and, in extreme cases, may cause all queries to fail. + +Therefore, in production environments, it is recommended to use memory hard limits in conjunction with query queuing functionality. This ensures controlled memory usage while maintaining query success rates. + + + +### CPU hard limit +Doris workloads can generally be categorized into three types: +1. Core Report Queries: These are typically used by company executives to view reports. While the load may not be very high, the availability requirements are strict. These queries can be assigned to a group with a higher-priority soft limit, ensuring they receive more CPU resources when resources are insufficient. +2. Adhoc queries are typically exploratory and analytical in nature, with random SQL and unpredictable resource consumption. Their priority is usually low. Therefore, CPU hard limits can be used to manage these queries, configuring lower values to prevent excessive CPU resource usage that could reduce cluster availability. +3. ETL queries typically have fixed SQL and stable resource consumption, although there may occasionally be spikes in resource usage due to increased upstream data. Therefore, CPU hard limits can be configured to manage these queries. + +Different workloads have varying CPU consumption, and users have different latency requirements. When the BE CPU is fully utilized, availability decreases, and response times increase. For example, an Adhoc analysis query may fully utilize the CPU of the entire cluster, causing core report queries to experience higher latency, which impacts SLA. Therefore, a CPU isolation mechanism is needed to separate different workloads and ensure cluster availability and SLA. + +Workload Group supports both CPU soft limits and hard limits. It is currently recommended to configure Workload Groups with hard limits in production environments. This is because CPU soft limits typically only show priority effects when the CPU is fully utilized. However, when the CPU is fully used, internal Doris components (such as the RPC component) and the operating system’s available CPU are reduced, leading to a significant drop in overall cluster availability. Therefore, in production environments, it is essential to avoid CPU resource exhaustion, and the same logic applies to other resources such as memory. + +**Test environment** + +1 FE, 1 BE, 96-core machine. +The dataset is clickbench, and the test SQL is q29. + +**Tesing** +1. Using JMeter to initiate 3 concurrent queries, the CPU usage of the BE process is pushed to a relatively high usage rate. The test machine has 96 cores, and using the top command, we can see that the BE process's CPU usage is 7600%, which means the process is currently using 76 cores. + + ![use workload group cpu](/images/workload-management/use_wg_cpu_1.png) + +2. Modify the CPU hard limit of the currently used Workload Group to 10%. + + ```sql + alter workload group g2 properties('cpu_hard_limit'='10%'); + ``` + +3. Switch to CPU hard limit mode. + + ```sql + ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "true"); + ``` + +4. Re-run the load test for queries, and you can see that the current process can only use 9 to 10 cores, which is about 10% of the total cores. + + ![use workload group cpu](/images/workload-management/use_wg_cpu_2.png) + +It is important to note that this test is best conducted using query workloads, as they are more likely to reflect the effect. If testing load, it may trigger Compaction, causing the actual observed values to be higher than the values configured in the Workload Group. Currently, Compaction workloads are not managed under the Workload Group. + +5. In addition to using Linux system commands, you can also observe the current CPU usage of the group through Doris's system tables, where the CPU usage is around 10%. + + ```sql + mysql [information_schema]>select CPU_USAGE_PERCENT from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201; + +-------------------+ + | CPU_USAGE_PERCENT | + +-------------------+ + | 9.57 | + +-------------------+ + 1 row in set (0.02 sec) + ``` + +**note** + +1. When configuring, it's better not to set the total CPU allocation of all groups to exactly 100%. This is mainly to ensure the availability of low-latency scenarios, as some resources need to be reserved for other components. However, for scenarios that are not very sensitive to latency and aim for maximum resource utilization, setting the total CPU allocation of all groups to 100% can be considered. +2. Currently, the interval for synchronizing Workload Group metadata from FE to BE is 30 seconds. Therefore, changes to Workload Group settings may take up to 30 seconds to take effect. + + +### Limit local IO +In OLAP systems, during ETL operations or large Adhoc queries, a significant amount of data needs to be read. To speed up the data analysis process, Doris uses multi-threaded parallel scanning across multiple disk files, which generates substantial disk IO that can impact other queries (such as report analysis). +By using Workload Groups, Doris can group offline ETL data processing and online report queries separately, limiting the offline data processing IO bandwidth. This helps reduce the impact of offline data processing on online report analysis. + +**Test environment** + +1 FE, 1 BE, 96-core machine. Dataset: clickbench. Test query: q29. + +**Testing without enabling IO hard limits** +1. Clear Cache. + + ```sql + // clear OS cache + sync; echo 3 > /proc/sys/vm/drop_caches + + // disable BE page cache + disable_storage_page_cache = true + ``` + +2. Perform a full table scan on the clickbench table, and execute a single concurrent query. + + ```sql + set dry_run_query = true; + select * from hits.hits; + ``` + +3. Check the maximum throughput of the current Group as 3GB per second through Doris's system table. + + ```sql + mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201; + +--------------------+ + | mb_per_sec | + +--------------------+ + | 1146.6208400726318 | + +--------------------+ + 1 row in set (0.03 sec) + + mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201; + +--------------------+ + | mb_per_sec | + +--------------------+ + | 3496.2762966156006 | + +--------------------+ + 1 row in set (0.04 sec) + + mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201; + +--------------------+ + | mb_per_sec | + +--------------------+ + | 2192.7690029144287 | + +--------------------+ + 1 row in set (0.02 sec) + ``` + +4. Use the pidstat command to check the process IO. The first column is the process ID, and the second column is the read IO throughput (in kb/s). It can be seen that when IO is not restricted, the maximum throughput is 2GB per second. + + ![use workload group io](/images/workload-management/use_wg_io_1.png) + + +**Test after enabling IO hard limit** +1. Clear cache. + + ```sql + // Clear OS cache. + sync; echo 3 > /proc/sys/vm/drop_caches + + // disable BE page cache + disable_storage_page_cache = true + ``` + +2. Modify the Workload Group configuration to limit the maximum throughput to 100M per second. + + ```sql + alter workload group g2 properties('read_bytes_per_second'='104857600'); + ``` + +3. Use Doris system tables to check that the maximum IO throughput of the Workload Group is 98M per second. + + ```sql + mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201; + +--------------------+ + | mb_per_sec | + +--------------------+ + | 97.94296646118164 | + +--------------------+ + 1 row in set (0.03 sec) + + mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201; + +--------------------+ + | mb_per_sec | + +--------------------+ + | 98.37584781646729 | + +--------------------+ + 1 row in set (0.04 sec) -## NOTE -1. Using Workload Resource Control in Kubernetes: Workload's CPU management relies on CGroup. If using Workload Groups in containers, start the container in privileged mode to allow the Doris process to read and write the host's CGroup files. When Doris runs in a container, the CPU resources allocated to the Workload Group are based on the container's available resources. -2. Memory and IO Management: Workload Group's memory and IO management are implemented internally by Doris and do not depend on external components, so there is no difference in deployment on containers or physical machines. For Doris deployment on K8S, using the Doris Operator is recommended to abstract away underlying permission details. \ No newline at end of file + mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201; + +--------------------+ + | mb_per_sec | + +--------------------+ + | 98.06641292572021 | + +--------------------+ + 1 row in set (0.02 sec) + ``` + +4. Use the pid tool to check that the maximum IO throughput of the process is 131M per second. + + ![use workload group io](/images/workload-management/use_wg_io_2.png) + +**Note** +1. The LOCAL_SCAN_BYTES_PER_SECOND field in the system table represents the summary value of the current Workload Group's statistics at the process level. For example, if 12 file paths are configured, LOCAL_SCAN_BYTES_PER_SECOND is the maximum IO value of these 12 file paths. If you wish to view the IO throughput for each file path separately, you can check the detailed values in Grafana. + +2. Due to the presence of the operating system and Doris's Page Cache, the IO observed through Linux's IO monitoring scripts is typically smaller than the IO seen in the system table. + + +### Limit remote IO +BrokerLoad and S3Load are commonly used methods for large-scale data load. Users can first upload data to HDFS or S3, and then use BrokerLoad and S3Load to load data in parallel. To speed up the load process, Doris uses multi-threading to pull data from HDFS/S3, which can generate significant pressure on HDFS/S3, potentially making other jobs running on HDFS/S3 unstable. + +To mitigate the impact on other workloads, the Workload Group's remote IO limit feature can be used to restrict the bandwidth used during the load process from HDFS/S3. This helps reduce the impact on other business operations. + + +**Test environment** + +1 FE and 1 BE are deployed on the same machine, configured with 16 cores and 64GB of memory. The test data is the clickbench dataset, and before testing, we need to upload the dataset to S3. Considering the upload time, we will only upload 10 million rows of data, and then use the TVF function to query the data from S3. + +After the upload is successful, you can use the command to view the schema information. + + ```sql + DESC FUNCTION s3 ( + "URI" = "https://bucketname/1kw.tsv", + "s3.access_key"= "ak", + "s3.secret_key" = "sk", + "format" = "csv", + "use_path_style"="true" + ); + ``` + +**Test without restricting remote read IO** +1. Initiate a single-threaded test to perform a full table scan on the clickbench table. + + ```sql + // Set the operation to only scan the data without returning results. + set dry_run_query = true; + + SELECT * FROM s3( + "URI" = "https://bucketname/1kw.tsv", + "s3.access_key"= "ak", + "s3.secret_key" = "sk", + "format" = "csv", + "use_path_style"="true" + ); + ``` + +2. Use the system table to check the current remote IO throughput. It shows that the remote IO throughput for this query is 837 MB per second. Note that the actual IO throughput here is highly dependent on the environment. If the machine hosting the BE has limited bandwidth to the external storage, the actual throughput may be lower. + + ```sql + MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage; + +---------+ + | read_mb | + +---------+ + | 837 | + +---------+ + 1 row in set (0.104 sec) + + MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage; + +---------+ + | read_mb | + +---------+ + | 867 | + +---------+ + 1 row in set (0.070 sec) + + MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage; + +---------+ + | read_mb | + +---------+ + | 867 | + +---------+ + 1 row in set (0.186 sec) + ``` + +3. Use the sar command (sar -n DEV 1 3600) to monitor the machine's network bandwidth. It shows that the maximum network bandwidth at the machine level is 1033 MB per second. + The first column of the output represents the number of bytes received per second by a specific network interface on the machine, in KB per second. + + ![use workload group rio](/images/workload-management/use_wg_rio_1.png) + +**Test limiting remote read IO** +1. Modify the Workload Group configuration to limit remote read IO throughput to 100M per second. + + ```sql + alter workload group normal properties('remote_read_bytes_per_second'='104857600'); + ``` + +2. Initiate a single concurrent full table scan query. + + ```sql + set dry_run_query = true; + + SELECT * FROM s3( + "URI" = "https://bucketname/1kw.tsv", + "s3.access_key"= "ak", + "s3.secret_key" = "sk", + "format" = "csv", + "use_path_style"="true" + ); + ``` + +3. Use the system table to check the current remote read IO throughput. At this time, the IO throughput is around 100M, with some fluctuations. These fluctuations are influenced by the current algorithm design, typically peaking briefly without persisting for long periods, which is considered normal. + + ```sql + MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage; + +---------+ + | read_mb | + +---------+ + | 56 | + +---------+ + 1 row in set (0.010 sec) + + MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage; + +---------+ + | read_mb | + +---------+ + | 131 | + +---------+ + 1 row in set (0.009 sec) + + MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage; + +---------+ + | read_mb | + +---------+ + | 111 | + +---------+ + 1 row in set (0.009 sec) + ``` + +4. Use the sar command (sar -n DEV 1 3600) to monitor the current network card's received traffic. The first column represents the amount of data received per second. The maximum value observed is now 207M per second, indicating that the read IO limit is effective. However, since the sar command reflects machine-level traffic, the observed value is slightly higher than what Doris reports. + + ![use workload group rio](/images/workload-management/use_wg_rio_2.png) \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/workload-group.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/workload-group.md index 8164181c886fc..a73792a025702 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/workload-group.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/workload-group.md @@ -146,7 +146,7 @@ Query OK, 0 rows affected (0.03 sec) | 属性名称 | 数据类型 | 默认值 | 取值范围 | 说明 | |------------------------------|---------|-----|-----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | cpu_share | 整型 | -1 | [1, 10000] | 可选,CPU软限模式下生效,取值范围和使用的CGroup版本有关,下文有详细描述。cpu_share 代表了 Workload Group 可获得CPU时间的权重,值越大,可获得的CPU时间越多。例如,用户创建了 3 个 Workload Group g-a、g-b 和 g-c,cpu_share 分别为 10、30、40,某一时刻 g-a 和 g-b 正在跑任务,而 g-c 没有任务,此时 g-a 可获得 25% (10 / (10 + 30)) 的 CPU 资源,而 g-b 可获得 75% 的 CPU 资源。如果系统只有一个 Workload Group 正在运行,则不管其 cpu_share 的值为多少,它都可获取全部的 CPU 资源 。 | -| memory_limit | 浮点 | -1 | (0%, 100%] | 可选,开启内存硬限时代表当前 Workload Group 最大可用内存百分比,默认值代表不限制内存。所有 Workload Group 的 memory_limit 累加值不可以超过 100%,通常与 enable_memory_overcommit 属性配合使用。如果一个机器的内存为 64G,Workload Group 的 memory_limit配置为50%,那么该 group 的实际物理内存=64G * 90% * 50%= 28.8G,这里的90%是 BE 进程可用内存配置的默认值。一个集群中所有 Workload Group 的 memory_limit 的累加值不能超过 100%。 | +| memory_limit | 浮点 | -1 | (0%, 100%] | 可选,开启内存硬限时代表当前 Workload Group 最大可用内存百分比,默认值代表不限制内存。所有 Workload Group 的 memory_limit 累加值不可以超过 100%,通常与 enable_memory_overcommit 属性配合使用。如果一个机器的内存为 64G,Workload Group 的 memory_limit配置为50%,那么该 group 的实际物理内存=64G * 90% * 50%= 28.8G,这里的90%是 BE 进程可用内存配置的默认值。 | | enable_memory_overcommit | 布尔 | true | true, false | 可选,用于控制当前 Workload Group 的内存限制是硬限还是软限,默认为 true。如果设置为 false,则该 workload group 为内存硬隔离,系统检测到 workload group 内存使用超出限制后将立即 cancel 组内内存占用最大的若干个任务,以释放超出的内存;如果设置为 true,则该 Workload Group 为内存软隔离,如果系统有空闲内存资源则该 Workload Group 在超出 memory_limit 的限制后可继续使用系统内存,在系统总内存紧张时会 cancel 组内内存占用最大的若干个任务,释放部分超出的内存以缓解系统内存压力。建议所有 workload group 的 memory_limit 总和低于 100%,为BE进程中的其他组件保留一些内存。 | | cpu_hard_limit | 整型 | -1 | [1%, 100%] | 可选,CPU 硬限制模式下生效,Workload Group 最大可用 CPU 百分比,不管当前机器的 CPU 资源是否被用满,Workload Group 的最大 CPU 用量都不能超过 cpu_hard_limit,所有 Workload Group 的 cpu_hard_limit 累加值不能超过 100%。2.1 版本新增属性,2.0版本不支持该功能。 | | max_concurrency | 整型 | 2147483647 | [0, 2147483647] | 可选,最大查询并发数,默认值为整型最大值,也就是不做并发的限制。运行中的查询数量达到最大并发时,新来的查询会进入排队的逻辑。 | @@ -192,7 +192,7 @@ SELECT name FROM information_schema.workload_groups; 更多授权操作可以参考[grant 语句](../../sql-manual/sql-statements/Account-Management-Statements/GRANT)。 **两种绑定方式** -1. 通过设置 user property 将 user 默认绑定到 workload group,默认为`normal`,需要注意的这里的value不能填空,否则语句会执行失败,如果不知道要设置哪些group,可以设置为`normal`,`normal`为全局默认的group。 +1. 通过设置 user property 将 user 默认绑定到 workload group,默认为`normal`,需要注意的这里的value不能填空,否则语句会执行失败。 ``` set property 'default_workload_group' = 'g1'; ``` @@ -558,7 +558,7 @@ OLAP 系统在做 ETL 或者大的 Adhoc 查询时,需要读取大量的数据 ![use workload group io](/images/workload-management/use_wg_io_2.png) **注意事项** -1. 系统表中的 LOCAL_SCAN_BYTES_PER_SECOND 字段代表的是当前 Workload Group 在进程粒度的统计汇总值,比如配置了 12 个文件路径,那么 LOCAL_SCAN_BYTES_PER_SECOND 就是这 12 个文件路径 IO 的最大值,如果期望查看每个文件路径分别的 IO 吞吐,可以在 grafana 上或者 BE 的 bvar 监控查看明细的值。 +1. 系统表中的 LOCAL_SCAN_BYTES_PER_SECOND 字段代表的是当前 Workload Group 在进程粒度的统计汇总值,比如配置了 12 个文件路径,那么 LOCAL_SCAN_BYTES_PER_SECOND 就是这 12 个文件路径 IO 的最大值,如果期望查看每个文件路径分别的 IO 吞吐,可以在 grafana 监控查看明细的值。 2. 由于操作系统和 Doris 的 Page Cache 的存在,通过 linux 的 IO 监控脚本看到的 IO 通常要比系统表看到的要小。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/admin-manual/workload-management/workload-group.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/admin-manual/workload-management/workload-group.md index 419ea650b499f..ca5c02b911c96 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/admin-manual/workload-management/workload-group.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/admin-manual/workload-management/workload-group.md @@ -146,7 +146,7 @@ Query OK, 0 rows affected (0.03 sec) | 属性名称 | 数据类型 | 默认值 | 取值范围 | 说明 | |------------------------------|---------|-----|-----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | cpu_share | 整型 | -1 | [1, 10000] | 可选,CPU软限模式下生效,取值范围和使用的CGroup版本有关,下文有详细描述。cpu_share 代表了 Workload Group 可获得CPU时间的权重,值越大,可获得的CPU时间越多。例如,用户创建了 3 个 Workload Group g-a、g-b 和 g-c,cpu_share 分别为 10、30、40,某一时刻 g-a 和 g-b 正在跑任务,而 g-c 没有任务,此时 g-a 可获得 25% (10 / (10 + 30)) 的 CPU 资源,而 g-b 可获得 75% 的 CPU 资源。如果系统只有一个 Workload Group 正在运行,则不管其 cpu_share 的值为多少,它都可获取全部的 CPU 资源 。 | -| memory_limit | 浮点 | -1 | (0%, 100%] | 可选,开启内存硬限时代表当前 Workload Group 最大可用内存百分比,默认值代表不限制内存。所有 Workload Group 的 memory_limit 累加值不可以超过 100%,通常与 enable_memory_overcommit 属性配合使用。如果一个机器的内存为 64G,Workload Group 的 memory_limit配置为50%,那么该 group 的实际物理内存=64G * 90% * 50%= 28.8G,这里的90%是 BE 进程可用内存配置的默认值。一个集群中所有 Workload Group 的 memory_limit 的累加值不能超过 100%。 | +| memory_limit | 浮点 | -1 | (0%, 100%] | 可选,开启内存硬限时代表当前 Workload Group 最大可用内存百分比,默认值代表不限制内存。所有 Workload Group 的 memory_limit 累加值不可以超过 100%,通常与 enable_memory_overcommit 属性配合使用。如果一个机器的内存为 64G,Workload Group 的 memory_limit配置为50%,那么该 group 的实际物理内存=64G * 90% * 50%= 28.8G,这里的90%是 BE 进程可用内存配置的默认值。 | | enable_memory_overcommit | 布尔 | true | true, false | 可选,用于控制当前 Workload Group 的内存限制是硬限还是软限,默认为 true。如果设置为 false,则该 workload group 为内存硬隔离,系统检测到 workload group 内存使用超出限制后将立即 cancel 组内内存占用最大的若干个任务,以释放超出的内存;如果设置为 true,则该 Workload Group 为内存软隔离,如果系统有空闲内存资源则该 Workload Group 在超出 memory_limit 的限制后可继续使用系统内存,在系统总内存紧张时会 cancel 组内内存占用最大的若干个任务,释放部分超出的内存以缓解系统内存压力。建议所有 workload group 的 memory_limit 总和低于 100%,为BE进程中的其他组件保留一些内存。 | | cpu_hard_limit | 整型 | -1 | [1%, 100%] | 可选,CPU 硬限制模式下生效,Workload Group 最大可用 CPU 百分比,不管当前机器的 CPU 资源是否被用满,Workload Group 的最大 CPU 用量都不能超过 cpu_hard_limit,所有 Workload Group 的 cpu_hard_limit 累加值不能超过 100%。2.1 版本新增属性,2.0版本不支持该功能。 | | max_concurrency | 整型 | 2147483647 | [0, 2147483647] | 可选,最大查询并发数,默认值为整型最大值,也就是不做并发的限制。运行中的查询数量达到最大并发时,新来的查询会进入排队的逻辑。 | @@ -192,7 +192,7 @@ SELECT name FROM information_schema.workload_groups; 更多授权操作可以参考[grant 语句](../../sql-manual/sql-statements/Account-Management-Statements/GRANT)。 **两种绑定方式** -1. 通过设置 user property 将 user 默认绑定到 workload group,默认为`normal`,需要注意的这里的value不能填空,否则语句会执行失败,如果不知道要设置哪些group,可以设置为`normal`,`normal`为全局默认的group。 +1. 通过设置 user property 将 user 默认绑定到 workload group,默认为`normal`,需要注意的这里的value不能填空,否则语句会执行失败。 ``` set property 'default_workload_group' = 'g1'; ``` @@ -558,7 +558,7 @@ OLAP 系统在做 ETL 或者大的 Adhoc 查询时,需要读取大量的数据 ![use workload group io](/images/workload-management/use_wg_io_2.png) **注意事项** -1. 系统表中的 LOCAL_SCAN_BYTES_PER_SECOND 字段代表的是当前 Workload Group 在进程粒度的统计汇总值,比如配置了 12 个文件路径,那么 LOCAL_SCAN_BYTES_PER_SECOND 就是这 12 个文件路径 IO 的最大值,如果期望查看每个文件路径分别的 IO 吞吐,可以在 grafana 上或者 BE 的 bvar 监控查看明细的值。 +1. 系统表中的 LOCAL_SCAN_BYTES_PER_SECOND 字段代表的是当前 Workload Group 在进程粒度的统计汇总值,比如配置了 12 个文件路径,那么 LOCAL_SCAN_BYTES_PER_SECOND 就是这 12 个文件路径 IO 的最大值,如果期望查看每个文件路径分别的 IO 吞吐,可以在 grafana 监控查看明细的值。 2. 由于操作系统和 Doris 的 Page Cache 的存在,通过 linux 的 IO 监控脚本看到的 IO 通常要比系统表看到的要小。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/admin-manual/workload-management/workload-group.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/admin-manual/workload-management/workload-group.md index 419ea650b499f..ca5c02b911c96 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/admin-manual/workload-management/workload-group.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/admin-manual/workload-management/workload-group.md @@ -146,7 +146,7 @@ Query OK, 0 rows affected (0.03 sec) | 属性名称 | 数据类型 | 默认值 | 取值范围 | 说明 | |------------------------------|---------|-----|-----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | cpu_share | 整型 | -1 | [1, 10000] | 可选,CPU软限模式下生效,取值范围和使用的CGroup版本有关,下文有详细描述。cpu_share 代表了 Workload Group 可获得CPU时间的权重,值越大,可获得的CPU时间越多。例如,用户创建了 3 个 Workload Group g-a、g-b 和 g-c,cpu_share 分别为 10、30、40,某一时刻 g-a 和 g-b 正在跑任务,而 g-c 没有任务,此时 g-a 可获得 25% (10 / (10 + 30)) 的 CPU 资源,而 g-b 可获得 75% 的 CPU 资源。如果系统只有一个 Workload Group 正在运行,则不管其 cpu_share 的值为多少,它都可获取全部的 CPU 资源 。 | -| memory_limit | 浮点 | -1 | (0%, 100%] | 可选,开启内存硬限时代表当前 Workload Group 最大可用内存百分比,默认值代表不限制内存。所有 Workload Group 的 memory_limit 累加值不可以超过 100%,通常与 enable_memory_overcommit 属性配合使用。如果一个机器的内存为 64G,Workload Group 的 memory_limit配置为50%,那么该 group 的实际物理内存=64G * 90% * 50%= 28.8G,这里的90%是 BE 进程可用内存配置的默认值。一个集群中所有 Workload Group 的 memory_limit 的累加值不能超过 100%。 | +| memory_limit | 浮点 | -1 | (0%, 100%] | 可选,开启内存硬限时代表当前 Workload Group 最大可用内存百分比,默认值代表不限制内存。所有 Workload Group 的 memory_limit 累加值不可以超过 100%,通常与 enable_memory_overcommit 属性配合使用。如果一个机器的内存为 64G,Workload Group 的 memory_limit配置为50%,那么该 group 的实际物理内存=64G * 90% * 50%= 28.8G,这里的90%是 BE 进程可用内存配置的默认值。 | | enable_memory_overcommit | 布尔 | true | true, false | 可选,用于控制当前 Workload Group 的内存限制是硬限还是软限,默认为 true。如果设置为 false,则该 workload group 为内存硬隔离,系统检测到 workload group 内存使用超出限制后将立即 cancel 组内内存占用最大的若干个任务,以释放超出的内存;如果设置为 true,则该 Workload Group 为内存软隔离,如果系统有空闲内存资源则该 Workload Group 在超出 memory_limit 的限制后可继续使用系统内存,在系统总内存紧张时会 cancel 组内内存占用最大的若干个任务,释放部分超出的内存以缓解系统内存压力。建议所有 workload group 的 memory_limit 总和低于 100%,为BE进程中的其他组件保留一些内存。 | | cpu_hard_limit | 整型 | -1 | [1%, 100%] | 可选,CPU 硬限制模式下生效,Workload Group 最大可用 CPU 百分比,不管当前机器的 CPU 资源是否被用满,Workload Group 的最大 CPU 用量都不能超过 cpu_hard_limit,所有 Workload Group 的 cpu_hard_limit 累加值不能超过 100%。2.1 版本新增属性,2.0版本不支持该功能。 | | max_concurrency | 整型 | 2147483647 | [0, 2147483647] | 可选,最大查询并发数,默认值为整型最大值,也就是不做并发的限制。运行中的查询数量达到最大并发时,新来的查询会进入排队的逻辑。 | @@ -192,7 +192,7 @@ SELECT name FROM information_schema.workload_groups; 更多授权操作可以参考[grant 语句](../../sql-manual/sql-statements/Account-Management-Statements/GRANT)。 **两种绑定方式** -1. 通过设置 user property 将 user 默认绑定到 workload group,默认为`normal`,需要注意的这里的value不能填空,否则语句会执行失败,如果不知道要设置哪些group,可以设置为`normal`,`normal`为全局默认的group。 +1. 通过设置 user property 将 user 默认绑定到 workload group,默认为`normal`,需要注意的这里的value不能填空,否则语句会执行失败。 ``` set property 'default_workload_group' = 'g1'; ``` @@ -558,7 +558,7 @@ OLAP 系统在做 ETL 或者大的 Adhoc 查询时,需要读取大量的数据 ![use workload group io](/images/workload-management/use_wg_io_2.png) **注意事项** -1. 系统表中的 LOCAL_SCAN_BYTES_PER_SECOND 字段代表的是当前 Workload Group 在进程粒度的统计汇总值,比如配置了 12 个文件路径,那么 LOCAL_SCAN_BYTES_PER_SECOND 就是这 12 个文件路径 IO 的最大值,如果期望查看每个文件路径分别的 IO 吞吐,可以在 grafana 上或者 BE 的 bvar 监控查看明细的值。 +1. 系统表中的 LOCAL_SCAN_BYTES_PER_SECOND 字段代表的是当前 Workload Group 在进程粒度的统计汇总值,比如配置了 12 个文件路径,那么 LOCAL_SCAN_BYTES_PER_SECOND 就是这 12 个文件路径 IO 的最大值,如果期望查看每个文件路径分别的 IO 吞吐,可以在 grafana 监控查看明细的值。 2. 由于操作系统和 Doris 的 Page Cache 的存在,通过 linux 的 IO 监控脚本看到的 IO 通常要比系统表看到的要小。