Instrumentation of SpillBuffer #7351

Closed · Tracked by #7345

fjetter opened this issue Nov 25, 2022 · 14 comments · Fixed by #7368, #7370, #7405, dask/dask#9761, coiled/benchmarks#629
@fjetter (Member) commented Nov 25, 2022

The only way we currently have to observe disk access is the startstops we measure whenever we load/store data.

However, with our SpillBuffer we have the possibility to introduce many instrumentation hooks that give much better insight into what's going on.

For instance:

  • Total number of keys written / read
  • Number of currently managed files
  • Average and total size of data written to disk per key
  • Total duration spent writing / reading
  • Average / max time spent reading / writing per key
  • Number of evicts (memory.spill)
  • Number of writes without evict because buffer is full right away
  • ...

These metrics should not be tracked at the TaskState level.
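
A minimal sketch of what such instrumentation could look like, purely as an illustration; the names SpillMetrics and InstrumentedBuffer are invented here and are not part of distributed.spill:

```python
# Hypothetical sketch: wrap the slow (on-disk) layer and record counters and
# timings around every spill/unspill. Not the actual SpillBuffer implementation.
import time
from dataclasses import dataclass


@dataclass
class SpillMetrics:
    keys_written: int = 0
    keys_read: int = 0
    bytes_written: int = 0
    write_seconds: float = 0.0
    read_seconds: float = 0.0
    max_write_seconds: float = 0.0


class InstrumentedBuffer:
    """Wraps a dict-like slow layer and records per-operation metrics."""

    def __init__(self, slow, sizeof=len):
        self.slow = slow
        self.sizeof = sizeof
        self.metrics = SpillMetrics()

    def spill(self, key, value):
        start = time.perf_counter()
        self.slow[key] = value
        elapsed = time.perf_counter() - start
        m = self.metrics
        m.keys_written += 1
        m.bytes_written += self.sizeof(value)
        m.write_seconds += elapsed
        m.max_write_seconds = max(m.max_write_seconds, elapsed)

    def unspill(self, key):
        start = time.perf_counter()
        value = self.slow[key]
        self.metrics.keys_read += 1
        self.metrics.read_seconds += time.perf_counter() - start
        return value


buf = InstrumentedBuffer(slow={}, sizeof=len)
buf.spill("x", b"0" * 1024)
buf.unspill("x")
print(buf.metrics)  # keys_written=1, keys_read=1, bytes_written=1024, ...
```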

@crusaderky (Collaborator):

> Number of writes without evict because buffer is full right away

Not sure if I understand this one. The only case where a key doesn't end up in fast when you write to the SpillBuffer is when it's individually larger than the target threshold.

@fjetter (Member, Author) commented Dec 2, 2022

> Not sure if I understand this one. The only case where a key doesn't end up in fast when you write to the SpillBuffer is when it's individually larger than the target threshold.

I thought there was some logic that would put a key into slow if this key would push us over the limit. If that's not the case, ignore this; I don't think data shards larger than the limit are a common enough problem to build instrumentation for.

@crusaderky (Collaborator):

> I thought there was some logic that would put a key into slow if this key would push us over the limit.

No, if inserting a key pushes us over the limit, the least recently used keys are pushed out. The latest inserted one is on top of the LRU pipe and is the only one guaranteed to be in fast at the end of the insertion.
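
A toy illustration of this behaviour (not the actual zict/SpillBuffer code): inserting a key that pushes the buffer over its target evicts keys in least-recently-used order, and only the key just inserted is guaranteed to remain in fast.

```python
# Toy model of the LRU eviction described above; names and structure are
# simplified and do not mirror distributed.spill.SpillBuffer.
from collections import OrderedDict


class ToyBuffer:
    def __init__(self, target_bytes):
        self.target = target_bytes
        self.fast = OrderedDict()  # key -> nbytes, ordered oldest -> newest
        self.slow = {}             # stands in for the on-disk layer

    def __setitem__(self, key, nbytes):
        self.fast[key] = nbytes
        self.fast.move_to_end(key)  # mark as most recently used
        # Evict least recently used keys until we're back under target,
        # but never evict the key that was just inserted.
        while sum(self.fast.values()) > self.target and len(self.fast) > 1:
            old_key, old_nbytes = self.fast.popitem(last=False)
            self.slow[old_key] = old_nbytes


buf = ToyBuffer(target_bytes=100)
buf["a"] = 60
buf["b"] = 60  # pushes over target -> "a" is spilled, "b" stays in fast
assert "a" in buf.slow and "b" in buf.fast
```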

@fjetter (Member, Author) commented Dec 5, 2022

The thing I was hoping to differentiate with this comment is the spilling that happens as a result of setitem vs the spilling that happens when the memory_manager is evicting.
If it's not the same key, that's fine, but if there is an evict coupled to the setitem, that might be an interesting data point.

@crusaderky (Collaborator):

> spilling that happens as a result of setitem vs the spilling that happens when the memory_manager is evicting.

Both will evict the same keys, in the same order. The only difference is that memory_manager kicks in when there's substantial unmanaged memory (but on the flip side it's less responsive).

@crusaderky (Collaborator) commented Dec 13, 2022

A modified copy of the Coiled Grafana dashboard is now available at https://grafana.dev-sandbox.coiledhq.com/d/eU1bT-nVw

New plots:
[dashboard screenshot: new spill/unspill plots]

The plots above were produced by running coiled-runtime/tests/stability/test_spill.py::test_tensordot_stress.
They offer a wealth of new insights:

  • the time spent pickling / unpickling is negligible (the plots on the right are stacked). Note that the test case uses exclusively native, uncompressible numpy data.
  • the unspill events triggered by get-data are modest compared to those triggered by local execute events. Note that the test ran on 5 workers; it is advisable to re-run on a much higher number of workers and see if this changes.
  • If the above two observations were to be confirmed in more general use cases (pandas/arrow data and a large number of workers), they suggest that these two proposals would offer a fairly poor cost/benefit ratio.
  • Unsurprisingly, worst-case tick duration and worst-case spill duration are tightly correlated. From this picture we're notably missing garbage collection time, which, however, could also be metered (out of scope).
  • The contiguous time in which the whole event loop goes into apnoea due to spilling, thus neglecting to start new tasks and perform any sort of network comms, is pretty horrid. This, together with the mild cost of pickling/unpickling (which holds the GIL), suggests that this proposal should be treated as high priority.
  • The spill threshold is designed to be an emergency release valve for when the target threshold fails. This happens when (a) the output of sizeof() is inaccurate and/or (b) there are large amounts of unmanaged memory. In this use case (a) is not applicable (it's all trivial numpy data), and yet we see that the spill threshold is crossed very frequently. This is not ideal, and we should maybe reconsider the gap between the target and spill thresholds.
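
For reference, the two thresholds in question are the distributed.worker.memory.target and distributed.worker.memory.spill settings (0.6 and 0.7 of the memory limit by default). The snippet below only shows where that gap would be adjusted; the numbers are arbitrary examples, not a recommendation.

```python
# Illustrative only: widen the gap between the target threshold (based on
# sizeof()-managed memory) and the spill threshold (based on process memory).
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.5,
    "distributed.worker.memory.spill": 0.8,
})
```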

@ntabris could you please review the modified grafana dashboard and, if you're happy with it, merge the new plots into the main one? (note that the PRs producing the new data have not been merged yet).

@gjoseph92 (Collaborator):

This is great work—having this sort of information is extremely helpful both operationally when using dask, and for prioritizing what to improve next.

> they suggest that these two proposals would offer a fairly poor cost/benefit ratio

It seems like the gist of what you're saying here is "we have to un-spill in order to execute tasks a lot more often than we un-spill to transfer keys". I'm curious how well that generalizes, or how specific it is to the scheduling and transfer patterns of test_tensordot_stress. Given what we see here, I assume that if a workload did exist where most un-spilling was done for transferring keys, then it would look just as bad as this one; the question is just how common that case is, and what kinds of workloads cause it.

For #5996, I think the metric we need to assess its importance is not "how much time is spent un-spilling keys to transfer them", but "how much extra memory is used by keys that were un-spilled for transfer which otherwise could have remained spilled". Presumably, async disk access would address the time component for un-spilling, whether due to execute or transfer. The purpose of sendfile would be to reduce the extra memory used.

@crusaderky (Collaborator):

> For #5996, I think the metric we need to assess its importance is not "how much time is spent un-spilling keys to transfer them", but "how much extra memory is used by keys that were un-spilled for transfer which otherwise could have remained spilled".

Not a trivial thing to answer, because the same key may be also requested by task execution shortly afterwards. In that case, #5996 would actually double the amount of disk I/O and only slightly delay memory usage.

The plot on the top right suggests that unspilling a key for get-data not shortly after the same key has been unspilled for execute is a fairly uncommon event. This makes me infer that the opposite may also be true: that needing a key for get-data shortly after the same key has been unspilled for execute is a fairly common event.

@jakirkham (Member):

Curious whether disabling compression was explored in that experiment?

@gjoseph92 (Collaborator):

> Not a trivial thing to answer, because the same key may be also requested by task execution shortly afterwards

Agreed, that would be a separate task to figure out how to instrument it (but it does seem like something worth instrumenting).

> The plot on the top right suggests that unspilling a key for get-data not shortly after the same key has been unspilled for execute is a fairly uncommon event

I'm not following how to infer that from the graph? Use of a key that's already in memory simply wouldn't show up on the graph. I'm seeing yellow (spill for transfer) go up a little, but green (un-spill for execute) doesn't go down by the same amount after (in fact, it usually spikes too). To me, that could even imply that plenty of keys which are un-spilled for transfer aren't immediately used for execute, otherwise we'd have seen green go down more after yellow. But I think all of this is very speculative since the chart doesn't show cache hits. If we could look at the percentage of SpillBuffer accesses that touched disk alongside this, that might tell more of the story.
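
A sketch of that extra metric, purely as an illustration (HitRateBuffer is a made-up name, not part of distributed): count every buffer access and how many of those had to touch disk, so a disk-access percentage could be plotted alongside the byte counts.

```python
# Hypothetical cache-hit instrumentation; the real SpillBuffer works differently.
class HitRateBuffer:
    def __init__(self, fast, slow):
        self.fast = fast        # in-memory mapping
        self.slow = slow        # on-disk mapping
        self.accesses = 0
        self.disk_reads = 0

    def __getitem__(self, key):
        self.accesses += 1
        if key in self.fast:
            return self.fast[key]
        self.disk_reads += 1         # this access had to touch disk
        value = self.slow[key]
        self.fast[key] = value       # unspill back into memory
        return value

    @property
    def disk_fraction(self):
        return self.disk_reads / self.accesses if self.accesses else 0.0


buf = HitRateBuffer(fast={"a": 1}, slow={"b": 2})
buf["a"], buf["b"]
print(buf.disk_fraction)  # 0.5 -> one of the two accesses touched disk
```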

@fjetter (Member, Author) commented Dec 14, 2022

Good job @crusaderky. This is very interesting. Looking forward to seeing this for other kinds of workloads.

Another question this raises is whether LRU is a good policy for picking the to-be-spilled keys. disk-read-execute is strongly coupled to assigned priorities, and I guess a priority-based system would perform better than LRU and would reduce the total amount of spilling. I don't think we can estimate the impact of this easily from the provided measurements.
Unblocking the event loop is very likely more impactful, though.
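
Purely as an illustration of that suggestion (not code from distributed), a priority-aware choice of eviction victim could look like the sketch below, assuming smaller priority tuples mean "needed sooner":

```python
# Hypothetical helper; distributed's SpillBuffer currently evicts in LRU order.
def pick_eviction_victim(fast_keys, priorities):
    """Pick the key to spill: the one we expect to need last.

    ``priorities`` maps key -> priority tuple; assuming smaller tuples mean
    "needed sooner", the largest tuple is the cheapest key to spill. An LRU
    policy would instead pick the least recently *used* key, regardless of
    when it will be needed again.
    """
    return max(fast_keys, key=lambda k: priorities[k])


# Example: key "c" has the lowest urgency, so it would be spilled first.
print(pick_eviction_victim(["a", "b", "c"], {"a": (0, 1), "b": (0, 2), "c": (0, 5)}))
```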

@crusaderky (Collaborator):

> Curious whether disabling compression was explored in that experiment?

It wasn't. I don't think there will be much of a difference here, because the test case runs on uniformly distributed random floats, i.e. uncompressible data.
I would expect that, on highly compressible data, compression would increase pickle/unpickle time and decrease write/read time.

> > The plot on the top right suggests that unspilling a key for get-data not shortly after the same key has been unspilled for execute is a fairly uncommon event
>
> I'm not following how to infer that from the graph? Use of a key that's already in memory simply wouldn't show up on the graph.

Exactly; the yellow part of the graph shows only keys that are requested by other workers and were neither produced nor consumed by execute recently.

@crusaderky (Collaborator):

Another insight:
the above plot shows that pickle5 buffers are deep-copied upon unspill. I've opened
