
Reserve memory capacity in worker operations? #6212

Open
gjoseph92 opened this issue Apr 26, 2022 · 0 comments
Labels: discussion (Discussing a topic with no specific actions yet), memory


@gjoseph92
Collaborator

I'm just thinking here about what mechanisms we'd need to fix #6208. If we want to avoid fetching more data than we have memory for, we need to know how much space we actually have for fetched data. And because there are many subsystems producing data in memory, they may need some way to cooperate on how much memory each is going to need.

Currently, workers don't take into account how much memory an operation will require before starting it: they transfer and un-spill as much data as they want.

If they start using too much memory, hopefully they pause. (Pausing has a limited effect on actually stopping more transfers or un-spilling, but that will get better: #6195 #5900 #5996.) Because pause-checking runs at an interval, it's also possible to get unlucky and use up enough memory to start the OS thrashing, which then prevents the memory monitor from intervening further. But even when pausing works, it's still disruptive and not ideal. Sprinting and stopping all the time is not a good way to win a race, compared to keeping a steady pace.
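For intuition, an interval-based pause check can be sketched roughly like this (a sketch with hypothetical names, not distributed's actual memory monitor) — it makes the race visible: anything allocated between two ticks goes unnoticed until the next check.

```python
import asyncio

async def memory_monitor(memory_fraction, pause, *, interval=0.1, threshold=0.8):
    """Minimal sketch of interval-based pause checking (illustrative API,
    not distributed's actual monitor).

    `memory_fraction` returns current memory use as a fraction of the limit;
    `pause` tells the worker to stop starting new work. Because the check
    only runs every `interval` seconds, allocations made between two ticks
    can blow well past `threshold` before anyone notices.
    """
    while True:
        if memory_fraction() > threshold:
            pause()
            return  # stop monitoring once paused, for this illustration
        await asyncio.sleep(interval)
```

If the OS starts thrashing in that blind window, even this loop stops making timely progress, which is the failure mode described above.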

Instead of waiting for too much memory to be used and then pausing, maybe we could avoid starting a memory-producing operation until we had some memory capacity "reserved" for its output? A simple data structure similar to a CapacityLimiter would work here (with a variable-sized acquire and release). This would create backpressure and let subsystems cooperate to proactively avoid using too much memory, rather than reactively pausing everything after it happens.
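A variable-sized limiter of that sort might look something like this — a sketch assuming an asyncio worker; `MemoryReservation` and its byte-sized acquire/release are illustrative names, not an existing distributed or anyio API:

```python
import asyncio

class MemoryReservation:
    """Sketch of a CapacityLimiter-like primitive with variable-sized
    acquire/release: callers reserve an estimated number of bytes before
    starting a memory-producing operation, and block until enough
    capacity is free. Hypothetical API, for illustration only."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reserved = 0
        self._condition = asyncio.Condition()

    async def acquire(self, nbytes: int) -> None:
        if nbytes > self.capacity:
            # A single request larger than total capacity could never be
            # satisfied and would deadlock; fail loudly instead (see the
            # deadlock caveat below).
            raise ValueError(f"requested {nbytes} > capacity {self.capacity}")
        async with self._condition:
            # Block until this reservation fits alongside existing ones.
            await self._condition.wait_for(
                lambda: self.reserved + nbytes <= self.capacity
            )
            self.reserved += nbytes

    async def release(self, nbytes: int) -> None:
        async with self._condition:
            self.reserved -= nbytes
            # Wake all waiters; differently-sized requests may now fit.
            self._condition.notify_all()
```

A transfer or un-spill would then `await limiter.acquire(estimated_nbytes)` before producing the data and `release` once the data is spilled or freed again — that blocking `acquire` is what creates the backpressure.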

Of course, we'd have to be careful to a) not deadlock because of this, and b) not thrash with this just like the OS does (if all the operations together use up nearly the full CapacityLimiter, everything will get done eventually, but so slowly that it'll feel like nothing's happening — better to just fail, probably, and say "this can't be done without increasing worker memory"). Note that we face the same problems with pausing.
