Do not rebalance spilled keys #6002

Open
crusaderky opened this issue Mar 25, 2022 · 5 comments

@crusaderky (Collaborator)

Scheduler.rebalance() moves key/value pairs from the workers with the heaviest memory load to those with the lightest.
"Memory load" is, by default, measured in terms of optimistic memory: managed (not spilled) + unmanaged older than 30s, which is the same as process - unmanaged recent.
(read: https://distributed.dask.org/en/latest/worker-memory.html#using-the-dashboard-to-monitor-memory-usage)
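
For reference, a minimal sketch of that measure, using the metric names from the docs linked above (this is not the actual scheduler code):

```python
# Minimal sketch of the "optimistic memory" measure described above, using the
# metric names from the linked docs; NOT the actual distributed.scheduler code.
def optimistic_memory(process: int, unmanaged_recent: int) -> int:
    """managed (not spilled) + unmanaged older than 30s,
    i.e. process memory minus the recently-appeared unmanaged portion."""
    return process - unmanaged_recent
```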

rebalance uses a least-recently-inserted algorithm to pick which keys to move; the rationale is that implementing a least-recently-used algorithm, like the one in the worker-side spill system, would require an unreasonable amount of extra synchronization from worker to scheduler.
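
Roughly, the selection looks like this (a hypothetical sketch relying only on dict insertion order; names are illustrative, not the actual Scheduler.rebalance() internals):

```python
# Hypothetical sketch of least-recently-inserted selection. It relies only on
# dict insertion order (the scheduler does not know when a key was last *used*
# on the worker); names are illustrative, not Scheduler.rebalance() internals.
def pick_keys_to_move(worker_keys: dict, nbytes_to_free: int) -> list:
    """worker_keys maps key -> size in bytes, in insertion order."""
    picked = []
    freed = 0
    for key, nbytes in worker_keys.items():  # oldest-inserted first
        if freed >= nbytes_to_free:
            break
        picked.append(key)
        freed += nbytes
    return picked
```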

In the fairly common scenario where least-recently-inserted is a crude approximation of least-recently-used, this means that rebalance will tend to move spilled keys first. The consequence is that rebalance not only fails to improve the situation on the sending worker in any way, but actually makes its memory usage spike for the whole duration of the transfer, as the data is copied out of the spill disk and back into memory to prepare it for sending (xref: #5996 for mitigation).

It will also aggravate memory usage on the receiving worker. If that worker later reaches the target threshold itself, it will hold a key received from rebalance that used to be spilled (read: it had not been used for a long time) and that may be spilled later than more recently used keys, because the SpillBuffer makes no distinction between key/value pairs inserted by rebalance/AMM and those produced by the computation.

Proposed design

1. Sync a spilled flag from worker to scheduler.
2. Completely ignore spilled keys in rebalance() (see the sketch below).
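
A minimal sketch of point 2, assuming a hypothetical per-key spilled set already synced to the scheduler (not an existing scheduler attribute today):

```python
# Sketch of point 2, assuming a hypothetical per-key `spilled` set already
# synced from worker to scheduler (not an existing scheduler attribute).
def rebalance_candidates(worker_keys: dict, spilled: set) -> dict:
    """Keys rebalance() may consider on a sender worker; skipping spilled keys
    means moving data never forces a read back from disk."""
    return {k: n for k, n in worker_keys.items() if k not in spilled}
```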

Both the issue and the proposed design apply to the current (legacy) implementation of rebalance() as well as to the future reimplementation on top of AMM.

@gjoseph92 (Collaborator)

Sync a spilled flag from worker to scheduler

Having this information on the scheduler would be generally helpful beyond this issue. For example, it could let decide_worker pick workers for tasks in a way that avoids needless un-spilling. It would probably also be useful to AMM policies in general. This opens a few doors.
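
For instance (a purely speculative sketch; none of these names exist in distributed today), worker selection could penalise candidates by the amount of data they would have to read back from disk:

```python
# Purely speculative sketch: with per-key spilled info on the scheduler,
# worker selection could penalise candidates by the bytes they would have to
# unspill. None of these names are real distributed APIs.
def unspill_penalty(dep_keys: list, spilled_on_worker: set, nbytes: dict) -> int:
    """Bytes that would be read back from disk if this worker ran the task."""
    return sum(nbytes[k] for k in dep_keys if k in spilled_on_worker)
```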

How would this interact with #5900? Would we want more detail than spilled: bool, if data could eventually be at various levels of storage hierarchy (in memory->serialized&compressed in memory->disk)?

@crusaderky (Collaborator, Author)

How would this interact with #5900? Would we want more detail than spilled: bool, if data could eventually be at various levels of storage hierarchy (in memory->serialized&compressed in memory->disk)?

I suspect that starting to account for the cost of serialization scheduler-side would be a massive and ultimately unjustified complication. Unspilling is an order of magnitude more expensive than just deserialising a key that is already in memory. So I'm moderately inclined to completely ignore any memory management cost that is not disk or network I/O.

@gjoseph92 (Collaborator)

I think that's entirely sensible as the initial step.

@crusaderky (Collaborator, Author)

Soft-blocked by #4906 (we could implement it for the current rebalance, but there would be a lot of throwaway work).

@crusaderky (Collaborator, Author)

crusaderky commented Oct 7, 2022

Having scheduler-side data about individual spilled keys would also help resolve this issue, described in the docstring of distributed.scheduler.WorkerState.memory:

    The sudden deletion of spilled keys will cause a negative blip of managed_in_memory:

    1. Delete 100MB of spilled data
    2. The updated managed memory *total* reaches the scheduler faster than the
       updated spilled portion
    3. This causes managed_in_memory to temporarily plummet and be replaced by
       unmanaged_recent, while managed_spilled remains unaltered
    4. When the heartbeat arrives, managed_in_memory goes back up, unmanaged_recent
       goes back down, and managed_spilled goes down by 100MB as it should have to
       begin with.
