2.23.0.0-b471

Summary:
Customers can load data into the system until they run out of disk space. Once the disk is full, the tservers `FATAL` until extra disk space is added. This makes the universe unavailable for reads, backups, and other operations such as DB/Table drop. It also crashes the DR xCluster.

This change rejects user writes at `TabletServiceImpl::PerformWrite` if the free disk space is less than `FLAGS_reject_writes_min_disk_space_mb` (3GB default).
This covers the majority of cases, since all nodes have more or less the same size and data distribution. If any node has a skewed count of followers, it may still run out of disk space.

This keeps the cluster functional from a system perspective, so it can still service Reads/Backups/xCluster/CDC and similar operations.

To avoid performance issues, the `GetFreeSpaceBytes` system call is performed only every 60 seconds (`FLAGS_reject_writes_min_disk_space_check_interval_sec`) as long as at least `FLAGS_reject_writes_min_disk_space_aggressive_check_mb` (18GB default) of space is left. If the free space is under `FLAGS_reject_writes_min_disk_space_aggressive_check_mb`, we check every 10 seconds.
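
The check cadence described above can be sketched roughly as follows. This is a minimal illustration, not the code in this change: `DiskSpaceGate`, `DiskSpaceOptions`, the injected free-space callback, and the explicit `now` parameter are all hypothetical names chosen to keep the example self-contained and testable. The defaults mirror the values quoted above, assuming 1 GB = 1024 MB.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical options struct; the defaults mirror the flag values in the text.
struct DiskSpaceOptions {
  std::uint64_t min_free_mb = 3 * 1024;           // reject_writes_min_disk_space_mb
  std::uint64_t aggressive_check_mb = 18 * 1024;  // ..._aggressive_check_mb
  std::chrono::seconds normal_interval{60};       // ..._check_interval_sec
  std::chrono::seconds aggressive_interval{10};   // "every 10s" from the text
};

// Hypothetical helper: caches the last free-space sample and refreshes it
// only when the current check interval has elapsed.
class DiskSpaceGate {
 public:
  using Clock = std::chrono::steady_clock;

  DiskSpaceGate(DiskSpaceOptions opts, std::function<std::uint64_t()> free_bytes_fn)
      : opts_(opts), free_bytes_fn_(std::move(free_bytes_fn)) {}

  // Returns true if a write arriving at `now` should be rejected.
  bool ShouldRejectWrite(Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!has_sample_ || now >= next_check_) {
      // The expensive call happens only here, at most once per interval.
      free_mb_ = free_bytes_fn_() / (1024 * 1024);
      has_sample_ = true;
      // Poll faster once free space drops below the aggressive threshold.
      next_check_ = now + (free_mb_ < opts_.aggressive_check_mb
                               ? opts_.aggressive_interval
                               : opts_.normal_interval);
    }
    return free_mb_ < opts_.min_free_mb;
  }

 private:
  DiskSpaceOptions opts_;
  std::function<std::uint64_t()> free_bytes_fn_;
  std::mutex mu_;
  bool has_sample_ = false;
  std::uint64_t free_mb_ = 0;
  Clock::time_point next_check_{};
};
```

One consequence of caching the sample is that the reject decision can lag the real disk state by up to one interval, which is presumably why the polling interval shortens well before the reject threshold is reached.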

Delete and Truncate table work for YCQL even if the master is out of disk space, since we never call the `PerformWrite` API on sys_catalog. However, all YSQL DDLs require updates to the PG catalog, which invoke `PerformWrite` and will fail if the master is out of disk space.

The feature is guarded by the flag `reject_writes_when_disk_full`.

Failure error message:
> Write to tablet $0 rejected. Node $1 has insufficient disk space

Ex:
> 2024-06-13 16:29:07.183 PDT [7439] ERROR:  Write to tablet 2dc52a9067bc489c8c19194d05f13df7 rejected. Node 14e84287736647a3a07af32f85aa09d6 has insufficient disk space
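
For tooling that needs to recognize this rejection in logs or client errors, a minimal sketch of matching the message format above might look like the following. `DiskFullRejection` and `ParseDiskFullRejection` are hypothetical names used only for illustration, and the pattern assumes the message text stays exactly as shown. It requires C++17 for `std::optional`.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type: the two placeholders ($0, $1) from the message.
struct DiskFullRejection {
  std::string tablet_id;
  std::string node_id;
};

// Returns the tablet and node IDs if `line` contains the rejection message,
// or std::nullopt otherwise. Any prefix (timestamp, PID, "ERROR:") is ignored.
std::optional<DiskFullRejection> ParseDiskFullRejection(const std::string& line) {
  static const std::regex kPattern(
      R"(Write to tablet (\S+) rejected\. Node (\S+) has insufficient disk space)");
  std::smatch match;
  if (!std::regex_search(line, match, kPattern)) {
    return std::nullopt;
  }
  return DiskFullRejection{match[1].str(), match[2].str()};
}
```

Matching on message text is brittle, so a caller would want to treat a failed match as "unknown error" rather than as proof that disk space is fine.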

Fixes #22430
Jira: DB-11337

Test Plan:
Ran 8 iterations of SYSBENCH read_write tests and noticed no performance degradation. Even the `95th percentile Latency(ms)` shows no impact from this change.

YCqlDiskFullTest.TestDiskFull
YSqlDiskFullTest.TestDiskFull

Reviewers: rthallam, slingam, yyan

Reviewed By: rthallam, yyan

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D35145