Instrumentation of SpillBuffer #7351

Closed · Tracked by #7345

fjetter opened this issue Nov 25, 2022 · 14 comments · Fixed by #7368, #7370, #7405, dask/dask#9761, coiled/benchmarks#629
@fjetter (Member) commented Nov 25, 2022

The only way we currently have to observe disk access is the startstops we measure whenever we load/store data.

However, with our SpillBuffer we have the possibility to introduce many instrumentation hooks that give much better insight into what's going on.

For instance:

  • Total number of keys written / read
  • Number of currently managed files
  • Average and total size of data written to disk per key
  • Total duration spent writing / reading
  • Average / max time spent reading / writing per key
  • Number of evicts (memory.spill)
  • Number of writes without evict because buffer is full right away
  • ...

These metrics should not be tracked at the TaskState level.
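
A minimal sketch of what such instrumentation could look like, purely as an illustration; the names SpillMetrics and InstrumentedBuffer are invented here and are not part of distributed.spill:

```python
# Hypothetical sketch: wrap the slow (on-disk) layer and record counters and
# timings around every spill/unspill. Not the actual SpillBuffer implementation.
import time
from dataclasses import dataclass


@dataclass
class SpillMetrics:
    keys_written: int = 0
    keys_read: int = 0
    bytes_written: int = 0
    write_seconds: float = 0.0
    read_seconds: float = 0.0
    max_write_seconds: float = 0.0


class InstrumentedBuffer:
    """Wraps a dict-like slow layer and records per-operation metrics."""

    def __init__(self, slow, sizeof=len):
        self.slow = slow
        self.sizeof = sizeof
        self.metrics = SpillMetrics()

    def spill(self, key, value):
        start = time.perf_counter()
        self.slow[key] = value
        elapsed = time.perf_counter() - start
        m = self.metrics
        m.keys_written += 1
        m.bytes_written += self.sizeof(value)
        m.write_seconds += elapsed
        m.max_write_seconds = max(m.max_write_seconds, elapsed)

    def unspill(self, key):
        start = time.perf_counter()
        value = self.slow[key]
        self.metrics.keys_read += 1
        self.metrics.read_seconds += time.perf_counter() - start
        return value


buf = InstrumentedBuffer(slow={}, sizeof=len)
buf.spill("x", b"0" * 1024)
buf.unspill("x")
print(buf.metrics)  # keys_written=1, keys_read=1, bytes_written=1024, ...
```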

@crusaderky (Collaborator):

> Number of writes without evict because buffer is full right away

Not sure if I understand this one. The only case where a key doesn't end up in fast when you write to the SpillBuffer is when it's individually larger than the target threshold.

@fjetter (Member, Author) commented Dec 2, 2022

> Not sure if I understand this one. The only case where a key doesn't end up in fast when you write to the SpillBuffer is when it's individually larger than the target threshold.

I thought there was some logic that would put a key into slow if this key would push us over the limit. If that's not the case, ignore this; I don't think data shards larger than the limit are a common enough problem to build instrumentation for.

@crusaderky (Collaborator):

> I thought there was some logic that would put a key into slow if this key would push us over the limit.

No, if inserting a key pushes us over the limit, the least recently used keys are pushed out. The latest inserted one is on top of the LRU pipe and is the only one guaranteed to be in fast at the end of the insertion.
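
A toy illustration of this behaviour (not the actual zict/SpillBuffer code): inserting a key that pushes the buffer over its target evicts keys in least-recently-used order, and only the key just inserted is guaranteed to remain in fast.

```python
# Toy model of the LRU eviction described above; names and structure are
# simplified and do not mirror distributed.spill.SpillBuffer.
from collections import OrderedDict


class ToyBuffer:
    def __init__(self, target_bytes):
        self.target = target_bytes
        self.fast = OrderedDict()  # key -> nbytes, ordered oldest -> newest
        self.slow = {}             # stands in for the on-disk layer

    def __setitem__(self, key, nbytes):
        self.fast[key] = nbytes
        self.fast.move_to_end(key)  # mark as most recently used
        # Evict least recently used keys until we're back under target,
        # but never evict the key that was just inserted.
        while sum(self.fast.values()) > self.target and len(self.fast) > 1:
            old_key, old_nbytes = self.fast.popitem(last=False)
            self.slow[old_key] = old_nbytes


buf = ToyBuffer(target_bytes=100)
buf["a"] = 60
buf["b"] = 60  # pushes over target -> "a" is spilled, "b" stays in fast
assert "a" in buf.slow and "b" in buf.fast
```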

@fjetter (Member, Author) commented Dec 5, 2022

The thing I was hoping to differentiate with this comment is the spilling that happens as a result of setitem vs the spilling that happens when the memory_manager is evicting.
If it's not the same key, that's fine, but if there is an evict coupled to the setitem, that might be an interesting data point.

@crusaderky (Collaborator):

> spilling that happens as a result of setitem vs the spilling that happens when the memory_manager is evicting.

Both will evict the same keys, in the same order. The only difference is that memory_manager kicks in when there's substantial unmanaged memory (but on the flip side it's less responsive).

@crusaderky (Collaborator) commented Dec 13, 2022

A modified copy of the Coiled Grafana dashboard is now available at https://grafana.dev-sandbox.coiledhq.com/d/eU1bT-nVw

New plots:
[dashboard screenshot: new spill/unspill plots]

The plots above were produced by running coiled-runtime/tests/stability/test_spill.py::test_tensordot_stress.
They offer a wealth of new insights:

  • the time spent pickling / unpickling is negligible (the plots on the right are stacked). Note that the test case uses exclusively native, uncompressible numpy data.
  • the unspill events triggered by get-data are modest compared to those triggered by local execute events. Note that the test ran on 5 workers; it is advisable to re-run on a much higher number of workers and see if this changes.
  • If the above two observations were to be confirmed in more general use cases (pandas/arrow data and a large number of workers), they suggest that these two proposals would offer a fairly poor cost/benefit ratio.
  • Unsurprisingly, worst-case tick duration and worst-case spill duration are tightly correlated. From this picture we're notably missing garbage collection time, which, however, could also be metered (out of scope).
  • The contiguous time in which the whole event loop goes into apnoea due to spilling, thus neglecting to start new tasks and perform any sort of network comms, is pretty horrid. This, together with the mild cost of pickling/unpickling (which holds the GIL), suggests that this proposal should be treated as high priority.
  • The spill threshold is designed to be an emergency release valve for when the target threshold fails. This happens when (a) the output of sizeof() is inaccurate and/or (b) there are large amounts of unmanaged memory. In this use case (a) is not applicable (it's all trivial numpy data), and yet we see that the spill threshold is crossed very frequently. This is not ideal, and we should maybe reconsider the gap between the target and spill thresholds.
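
For reference, the two thresholds in question are the distributed.worker.memory.target and distributed.worker.memory.spill settings (0.6 and 0.7 of the memory limit by default). The snippet below only shows where that gap would be adjusted; the numbers are arbitrary examples, not a recommendation.

```python
# Illustrative only: widen the gap between the target threshold (based on
# sizeof()-managed memory) and the spill threshold (based on process memory).
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.5,
    "distributed.worker.memory.spill": 0.8,
})
```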

@ntabris could you please review the modified grafana dashboard and, if you're happy with it, merge the new plots into the main one? (note that the PRs producing the new data have not been merged yet).

@gjoseph92 (Collaborator):

This is great work—having this sort of information is extremely helpful both operationally when using dask, and for prioritizing what to improve next.

> they suggest that these two proposals would offer a fairly poor cost/benefit ratio

It seems like the gist of what you're saying here is "we have to un-spill in order to execute tasks a lot more often than we un-spill to transfer keys". I'm curious how well that generalizes, or how specific it is to the scheduling and transfer patterns of test_tensordot_stress. Given what we see here, I assume that if a workload did exist where most un-spilling was done for transferring keys, then it would look just as bad as this one; the question is just how common that case is, and what kinds of workloads cause it.

For #5996, I think the metric we need to assess its importance is not "how much time is spent un-spilling keys to transfer them", but "how much extra memory is used by keys that were un-spilled for transfer which otherwise could have remained spilled". Presumably, async disk access would address the time component for un-spilling, whether due to execute or transfer. The purpose of sendfile would be to reduce the extra memory used.

@crusaderky (Collaborator):

> For #5996, I think the metric we need to assess its importance is not "how much time is spent un-spilling keys to transfer them", but "how much extra memory is used by keys that were un-spilled for transfer which otherwise could have remained spilled".

Not a trivial thing to answer, because the same key may be also requested by task execution shortly afterwards. In that case, #5996 would actually double the amount of disk I/O and only slightly delay memory usage.

The plot on the top right suggests that unspilling a key for get-data not shortly after the same key has been unspilled for execute is a fairly uncommon event. This makes me infer that the opposite may also be true: that needing a key for get-data shortly after the same key has been unspilled for execute is a fairly common event.

@jakirkham (Member):

Curious whether disabling compression was explored in that experiment?

@gjoseph92 (Collaborator):

> Not a trivial thing to answer, because the same key may be also requested by task execution shortly afterwards

Agreed, that would be a separate task to figure out how to instrument it (but it does seem like something worth instrumenting).

> The plot on the top right suggests that unspilling a key for get-data not shortly after the same key has been unspilled for execute is a fairly uncommon event

I'm not following how to infer that from the graph? Use of a key that's already in memory simply wouldn't show up on the graph. I'm seeing yellow (spill for transfer) go up a little, but green (un-spill for execute) doesn't go down by the same amount after (in fact, it usually spikes too). To me, that could even imply that plenty of keys which are un-spilled for transfer aren't immediately used for execute, otherwise we'd have seen green go down more after yellow. But I think all of this is very speculative since the chart doesn't show cache hits. If we could look at the percentage of SpillBuffer accesses that touched disk alongside this, that might tell more of the story.
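
A sketch of that extra metric, purely as an illustration (HitRateBuffer is a made-up name, not part of distributed): count every buffer access and how many of those had to touch disk, so a disk-access percentage could be plotted alongside the byte counts.

```python
# Hypothetical cache-hit instrumentation; the real SpillBuffer works differently.
class HitRateBuffer:
    def __init__(self, fast, slow):
        self.fast = fast        # in-memory mapping
        self.slow = slow        # on-disk mapping
        self.accesses = 0
        self.disk_reads = 0

    def __getitem__(self, key):
        self.accesses += 1
        if key in self.fast:
            return self.fast[key]
        self.disk_reads += 1         # this access had to touch disk
        value = self.slow[key]
        self.fast[key] = value       # unspill back into memory
        return value

    @property
    def disk_fraction(self):
        return self.disk_reads / self.accesses if self.accesses else 0.0


buf = HitRateBuffer(fast={"a": 1}, slow={"b": 2})
buf["a"], buf["b"]
print(buf.disk_fraction)  # 0.5 -> one of the two accesses touched disk
```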

@fjetter (Member, Author) commented Dec 14, 2022

Good job @crusaderky. This is very interesting. Looking forward to seeing this for other kinds of workloads.

Another question this raises is whether LRU is a good policy for picking the to-be-spilled keys. disk-read-execute is strongly coupled to assigned priorities, and I guess a priority-based system would perform better than LRU and would reduce the total amount of spilling. I don't think we can estimate the impact of this easily from the provided measurements.
Unblocking the event loop is very likely more impactful, though.
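
Purely as an illustration of that suggestion (not code from distributed), a priority-aware choice of eviction victim could look like the sketch below, assuming smaller priority tuples mean "needed sooner":

```python
# Hypothetical helper; distributed's SpillBuffer currently evicts in LRU order.
def pick_eviction_victim(fast_keys, priorities):
    """Pick the key to spill: the one we expect to need last.

    ``priorities`` maps key -> priority tuple; assuming smaller tuples mean
    "needed sooner", the largest tuple is the cheapest key to spill. An LRU
    policy would instead pick the least recently *used* key, regardless of
    when it will be needed again.
    """
    return max(fast_keys, key=lambda k: priorities[k])


# Example: key "c" has the lowest urgency, so it would be spilled first.
print(pick_eviction_victim(["a", "b", "c"], {"a": (0, 1), "b": (0, 2), "c": (0, 5)}))
```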

@crusaderky (Collaborator):

> Curious whether disabling compression was explored in that experiment?

It wasn't. I don't think there will be much of a difference here, because the test case runs on uniformly distributed random floats, i.e. uncompressible data.
I would expect that, on highly compressible data, compression would increase pickle/unpickle time and decrease write/read time.

> > The plot on the top right suggests that unspilling a key for get-data not shortly after the same key has been unspilled for execute is a fairly uncommon event
>
> I'm not following how to infer that from the graph? Use of a key that's already in memory simply wouldn't show up on the graph.

Exactly; the yellow part of the graph shows only keys that are requested by other workers and were neither produced nor consumed by execute recently.

@crusaderky (Collaborator):

Another insight:
the above plot shows that pickle5 buffers are deep-copied upon unspill. I've opened
