Summary:
Customers can load data into the system until they run out of disk space. Once a node runs out of disk, the tservers `FATAL` and stay down until extra disk space is added. This makes the universe unavailable for reads and backups, and for other operations such as DB/Table drop. It also crashes the DR xCluster.
This change rejects user writes at `TabletServiceImpl::PerformWrite` if the free disk space is less than `FLAGS_reject_writes_min_disk_space_mb` (3GB default).
This covers the majority of cases, since all nodes have more or less the same size and data distribution. If a node has a skewed count of followers, it may still run out of disk space.
This keeps the cluster functional from a system perspective, so it can still service Reads, Backups, xCluster, CDC, and so on.
To avoid performance issues, the `GetFreeSpaceBytes` system call is performed only every 60 seconds (`FLAGS_reject_writes_min_disk_space_check_interval_sec`) as long as at least `FLAGS_reject_writes_min_disk_space_aggressive_check_mb` (18GB default) of free space remains. Once free space drops below `FLAGS_reject_writes_min_disk_space_aggressive_check_mb`, we check every 10s.
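The threshold and check cadence described above can be sketched roughly as follows. This is an illustrative, single-threaded sketch and not the code in this diff: `DiskSpaceOptions`, `DiskSpaceGate`, and `ShouldRejectWrite` are hypothetical names, the free-space source is injected as a callback standing in for `GetFreeSpaceBytes`, and the defaults assume 3GB = 3 * 1024 MB and 18GB = 18 * 1024 MB.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-ins for the flags named in this summary.
struct DiskSpaceOptions {
  uint64_t min_disk_space_mb = 3 * 1024;         // reject_writes_min_disk_space_mb
  uint64_t aggressive_check_mb = 18 * 1024;      // ..._aggressive_check_mb
  std::chrono::seconds check_interval{60};       // ..._check_interval_sec
  std::chrono::seconds aggressive_interval{10};  // the 10s cadence when space is low
};

// Caches the reject decision so the free-space probe runs only when the
// current check interval has elapsed. Not thread-safe; a real server
// would need synchronization around this state.
class DiskSpaceGate {
 public:
  using Clock = std::chrono::steady_clock;
  using FreeBytesFn = std::function<uint64_t()>;

  DiskSpaceGate(DiskSpaceOptions opts, FreeBytesFn free_bytes)
      : opts_(opts), free_bytes_(std::move(free_bytes)) {
    // The aggressive-check threshold is expected to sit above the reject threshold.
    assert(opts_.aggressive_check_mb >= opts_.min_disk_space_mb);
  }

  // Returns true if a user write arriving at `now` should be rejected.
  bool ShouldRejectWrite(Clock::time_point now) {
    if (!has_sample_ || now >= next_check_) {
      Refresh(now);
    }
    return reject_;
  }

 private:
  void Refresh(Clock::time_point now) {
    const uint64_t free_mb = free_bytes_() / (1024 * 1024);
    reject_ = free_mb < opts_.min_disk_space_mb;
    // Probe more often once free space drops under the aggressive threshold.
    const std::chrono::seconds interval =
        free_mb < opts_.aggressive_check_mb ? opts_.aggressive_interval
                                            : opts_.check_interval;
    next_check_ = now + interval;
    has_sample_ = true;
  }

  DiskSpaceOptions opts_;
  FreeBytesFn free_bytes_;
  bool has_sample_ = false;
  bool reject_ = false;
  Clock::time_point next_check_{};
};
```

Injecting the probe keeps the sketch testable without touching a real filesystem; in a standalone program the callback could, for example, return `std::filesystem::space(path).available`. Because the decision is cached between probes, a node can keep accepting writes for up to one interval after crossing the threshold, which is presumably why the interval shrinks as free space gets low.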
Delete and Truncate table work for YCQL even if the master is out of disk space, since we never call the `PerformWrite` API on sys_catalog. However, all YSQL DDLs require updates to the PG catalog, which invoke `PerformWrite` and will fail if the master is out of disk space.
The feature is guarded by the flag `reject_writes_when_disk_full`.
Failure error message:
> Write to tablet $0 rejected. Node $1 has insufficient disk space
Ex:
> 2024-06-13 16:29:07.183 PDT [7439] ERROR: Write to tablet 2dc52a9067bc489c8c19194d05f13df7 rejected. Node 14e84287736647a3a07af32f85aa09d6 has insufficient disk space
Fixes #22430
Jira: DB-11337
Test Plan:
Ran 8 iterations of SYSBENCH read_write tests and noticed no performance degradation. Even the `95th percentile Latency(ms)` shows no impact from this change.
YCqlDiskFullTest.TestDiskFull
YSqlDiskFullTest.TestDiskFull
Reviewers: rthallam, slingam, yyan
Reviewed By: rthallam, yyan
Subscribers: ybase
Differential Revision: https://phorge.dev.yugabyte.com/D35145