Do not rebalance spilled keys #6002

Open
crusaderky opened this issue Mar 25, 2022 · 5 comments

@crusaderky (Collaborator)

Scheduler.rebalance() moves key/value pairs from the workers with the heaviest memory load to those with the lightest.
"Memory load" is, by default, measured in terms of optimistic memory: managed (not spilled) + unmanaged older than 30s, which is the same as process - unmanaged recent.
(read: https://distributed.dask.org/en/latest/worker-memory.html#using-the-dashboard-to-monitor-memory-usage)
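
For reference, a minimal sketch of that measure, using the metric names from the docs linked above (this is not the actual scheduler code):

```python
# Minimal sketch of the "optimistic memory" measure described above, using the
# metric names from the linked docs; NOT the actual distributed.scheduler code.
def optimistic_memory(process: int, unmanaged_recent: int) -> int:
    """managed (not spilled) + unmanaged older than 30s,
    i.e. process memory minus the recently-appeared unmanaged portion."""
    return process - unmanaged_recent
```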

rebalance uses a least-recently-inserted algorithm to pick which keys to move; the rationale is that implementing a least-recently-used algorithm, like the one in the worker-side spill system, would require an unreasonable amount of extra synchronization from worker to scheduler.
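
Roughly, the selection looks like this (a hypothetical sketch relying only on dict insertion order; names are illustrative, not the actual Scheduler.rebalance() internals):

```python
# Hypothetical sketch of least-recently-inserted selection. It relies only on
# dict insertion order (the scheduler does not know when a key was last *used*
# on the worker); names are illustrative, not Scheduler.rebalance() internals.
def pick_keys_to_move(worker_keys: dict, nbytes_to_free: int) -> list:
    """worker_keys maps key -> size in bytes, in insertion order."""
    picked = []
    freed = 0
    for key, nbytes in worker_keys.items():  # oldest-inserted first
        if freed >= nbytes_to_free:
            break
        picked.append(key)
        freed += nbytes
    return picked
```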

In the fairly common scenario where least-recently-inserted is a crude approximation of least-recently-used, this means that rebalance will tend to move spilled keys first. The consequence is that rebalance not only fails to improve the situation on the sending worker in any way, but actually makes its memory usage spike for the whole duration of the transfer, as the data is copied out of the spill disk and back into memory to prepare it for sending (xref: #5996 for mitigation).

It will also aggravate memory usage on the receiving worker. If that worker later reaches the target threshold itself, it will hold a key received from rebalance that used to be spilled (read: it had not been used for a long time) and that may be spilled later than more recently used keys, because the SpillBuffer makes no distinction between key/value pairs inserted by rebalance/AMM and those produced by the computation.

Proposed design

1. Sync a spilled flag from worker to scheduler.
2. Completely ignore spilled keys in rebalance() (see the sketch below).
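
A minimal sketch of point 2, assuming a hypothetical per-key spilled set already synced to the scheduler (not an existing scheduler attribute today):

```python
# Sketch of point 2, assuming a hypothetical per-key `spilled` set already
# synced from worker to scheduler (not an existing scheduler attribute).
def rebalance_candidates(worker_keys: dict, spilled: set) -> dict:
    """Keys rebalance() may consider on a sender worker; skipping spilled keys
    means moving data never forces a read back from disk."""
    return {k: n for k, n in worker_keys.items() if k not in spilled}
```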

Both the issue and the proposed design apply to the current (legacy) implementation of rebalance() as well as to the future reimplementation on top of AMM.

@gjoseph92 (Collaborator)

Sync a spilled flag from worker to scheduler

Having this information on the scheduler would be generally helpful beyond this issue. For example, it could let decide_worker pick workers for tasks in a way that avoids needless un-spilling. It would probably also be useful to AMM policies in general. This opens a few doors.
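
For instance (a purely speculative sketch; none of these names exist in distributed today), worker selection could penalise candidates by the amount of data they would have to read back from disk:

```python
# Purely speculative sketch: with per-key spilled info on the scheduler,
# worker selection could penalise candidates by the bytes they would have to
# unspill. None of these names are real distributed APIs.
def unspill_penalty(dep_keys: list, spilled_on_worker: set, nbytes: dict) -> int:
    """Bytes that would be read back from disk if this worker ran the task."""
    return sum(nbytes[k] for k in dep_keys if k in spilled_on_worker)
```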

How would this interact with #5900? Would we want more detail than spilled: bool, if data could eventually be at various levels of storage hierarchy (in memory->serialized&compressed in memory->disk)?

@crusaderky (Collaborator, Author)

How would this interact with #5900? Would we want more detail than spilled: bool, if data could eventually be at various levels of storage hierarchy (in memory->serialized&compressed in memory->disk)?

I suspect that starting to account for the cost of serialization scheduler-side would be a massive and ultimately unjustified complication. Unspilling is an order of magnitude more expensive than just deserialising a key that is already in memory. So I'm moderately inclined to completely ignore any memory management cost that is not disk or network I/O.

@gjoseph92 (Collaborator)

I think that's entirely sensible as the initial step.

@crusaderky (Collaborator, Author)

Soft-blocked by #4906 (we could implement it for the current rebalance, but there would be a lot of throwaway work).

@crusaderky (Collaborator, Author)

crusaderky commented Oct 7, 2022

Having scheduler-side data about individual spilled keys would also help resolve this issue, described in the docstring of distributed.scheduler.WorkerState.memory:

    The sudden deletion of spilled keys will cause a negative blip of managed_in_memory:

    1. Delete 100MB of spilled data
    2. The updated managed memory *total* reaches the scheduler faster than the
       updated spilled portion
    3. This causes managed_in_memory to temporarily plummet and be replaced by
       unmanaged_recent, while managed_spilled remains unaltered
    4. When the heartbeat arrives, managed_in_memory goes back up, unmanaged_recent
       goes back down, and managed_spilled goes down by 100MB as it should have to
       begin with.
