I'm just thinking here about what mechanisms we'd need to fix #6208. If we want to avoid fetching more data than we have memory for, we need to know how much space we do have for fetched data. But because there are many systems producing memory, maybe they need some way to cooperate on how much memory each is going to need.
Currently, workers don't take into account how much memory an operation will require: they transfer and un-spill as much data as they want.
If they start using too much memory, hopefully they pause. (Pausing has a limited effect on actually stopping more transfers or un-spilling, but that will get better: #6195 #5900 #5996.) Also, because pause-checking runs at an interval, it's possible to get unlucky and use up enough memory to start the OS thrashing, which then prevents the memory monitor from intervening further. But even if pausing works, it's still disruptive and not ideal. Sprinting and stopping all the time is not a good way to win a race, compared to a steady pace.
Instead of waiting for too much memory to be used, then pausing, maybe we could not start the memory-producing operation until we had some memory capacity "reserved" for its output? A simple data structure similar to a CapacityLimiter would work here (with a variable-sized acquire and release). This would create backpressure and let subsystems cooperate to proactively avoid using too much memory, versus reactively pausing everything when it happens.
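As a rough illustration, such a primitive could be a few dozen lines around an `asyncio.Condition`. This is only a sketch, not an existing API: the class name `MemoryCapacityLimiter` and its methods are hypothetical, and unlike trio's `CapacityLimiter` (which acquires one slot at a time), `acquire`/`release` here take a byte count:

```python
import asyncio

class MemoryCapacityLimiter:
    """Hypothetical CapacityLimiter-style primitive with variable-sized
    acquire/release. Callers reserve an estimated number of bytes before
    starting a memory-producing operation (transfer, un-spill); the call
    blocks until enough capacity is free, creating backpressure."""

    def __init__(self, total_bytes: int) -> None:
        self._total = total_bytes
        self._reserved = 0
        self._condition = asyncio.Condition()

    async def acquire(self, nbytes: int) -> None:
        if nbytes > self._total:
            # A single request larger than the whole budget can never
            # succeed -- fail fast instead of deadlocking forever.
            raise ValueError("request exceeds total capacity")
        async with self._condition:
            # Wait until the reservation fits within the budget.
            await self._condition.wait_for(
                lambda: self._total - self._reserved >= nbytes
            )
            self._reserved += nbytes

    async def release(self, nbytes: int) -> None:
        async with self._condition:
            self._reserved -= nbytes
            # Wake all waiters; any whose request now fits can proceed.
            self._condition.notify_all()
```

A worker would then wrap each memory-producing operation in `await limiter.acquire(estimated_nbytes)` / `await limiter.release(actual_nbytes)`, assuming we can estimate output sizes up front (which is its own open question).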
Of course, we'd have to be careful to a) not deadlock because of this and b) not thrash with this just like the OS does (if all the operations use up nearly the full CapacityLimiter, they will get done eventually, but so slowly that it'll feel like nothing's happening—better to just fail, probably, and say "this can't be done without increasing worker memory"). Note that we face the same problems with pausing.