Systems challenges in Materialize #7619
antiguru
started this conversation in Technical musings
I ran Materialize with the LazyExchange and message-length changes; here are the jemalloc stats. Master: prof_master.txt. Patched: prof_patched.txt. The allocation pattern shifts towards smaller allocations, especially around the 8 KiB bucket. The total memory consumption is lower, but not directly comparable here because of different Materialize uptimes.
After running it a bit longer: prof_patched2.txt. Now the memory consumption is about halved, and the allocation classes are still shifted towards the 8 KiB class.
(Converted from #7337)
Summary
Materialize scales sublinearly with the number of worker threads.
While some of this is expected, we need to gain a better understanding of how much this affects us and potential remedies to improve the scaling behavior.
This document outlines the problems and suggests solutions that can help avoid them in the future.
Goals
The goal of this discussion is to analyze memory and CPU-utilization problems and point out possible solutions.
Non-Goals
The goal is to provide insights into Materialize's behavior assuming a fixed set of queries.
It is a non-goal to improve the performance within the SQL/optimization layer or by guiding users on how to rewrite queries.
Description
We need to validate the following three challenges.
1. Materialize's memory consumption increases superlinearly with the number of worker threads. The number of worker threads influences buffer allocations for exchange channels, and the communication between workers follows an all-to-all pattern, which means that the buffer space alone increases quadratically with the number of workers.
2. Timely communicates updates to operator capabilities through broadcasts to all worker threads. In some situations this results in many fine-grained updates carrying only little information, and each update might be a small allocation. We should investigate whether this is a problem and how significant it is.
3. Materialize needs to transfer relatively large pieces of data between operators, including across thread boundaries. This can stress the memory allocator, which needs to hand memory back to other threads.
For the first and second challenge, Timely does not yet expose the right mechanisms to provide a different policy.
The last problem could be addressed by not passing owned data but instead only revealing references to data.
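As a minimal, generic sketch (not Timely's actual operator API), the difference between handing over owned data and exposing a reference looks like the following; `Row` here is just a stand-in type, not Materialize's actual `Row`.

```rust
// Owned vs. borrowed hand-off: with owned data, the receiver eventually drops
// the allocation, possibly on a different thread than the one that allocated
// it; with a reference, the sender keeps the allocation and can reuse it.
type Row = Vec<u8>;

// The callee takes ownership and drops the Vec when done, returning memory to
// the allocator, potentially across thread boundaries.
fn consume_owned(batch: Vec<Row>) -> usize {
    batch.iter().map(|row| row.len()).sum()
}

// The callee only reads; the caller retains the allocation for the next batch.
fn consume_borrowed(batch: &[Row]) -> usize {
    batch.iter().map(|row| row.len()).sum()
}

fn main() {
    let mut batch: Vec<Row> = vec![vec![1, 2, 3], vec![4, 5]];
    assert_eq!(consume_borrowed(&batch), 5);
    batch.clear(); // allocation retained for reuse
    assert_eq!(consume_owned(batch), 0);
}
```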
The set of challenges is based on experience and observations, but not on reproducible tests.
Before we tackle any of them, we should establish a benchmark that clearly highlights the problem and can be used to validate any solution.
Let's look at the problems in detail.
Memory consumption when scaling worker threads
We conduct an experiment to measure the idle memory consumption of Materialize.
First, we build Materialize in release mode and clean the `mzdata` directory, which leaves only the system tables and collections. Next, we start Materialize with a specific number of workers and measure the RSS after a fixed delay, after which we stop the instance.
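As an illustration of this procedure, a minimal harness along the following lines can be used; the binary path, the `--workers` flag, and the chosen delay are assumptions for the sketch rather than the exact setup used for the numbers below.

```rust
// Hypothetical measurement harness: start materialized with a given worker
// count, wait a fixed delay, read the resident set size from
// /proc/<pid>/status, then stop the instance.
use std::fs;
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn rss_kib(pid: u32) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    line.split_whitespace().nth(1)?.parse().ok()
}

fn main() -> std::io::Result<()> {
    for workers in [1u32, 2, 4, 8, 16, 32] {
        // Binary path and flag name are assumptions for illustration.
        let mut child = Command::new("./target/release/materialized")
            .arg("--workers")
            .arg(workers.to_string())
            .spawn()?;
        sleep(Duration::from_secs(120)); // fixed settle delay
        println!("workers={workers} rss={} KiB", rss_kib(child.id()).unwrap_or(0));
        child.kill()?;
        child.wait()?;
    }
    Ok(())
}
```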
The following graphs show the idle memory consumption of Materialize.
The x-axis shows the number of threads, and the y-axis indicates the RSS in GiB.
The data is presented in a log-log plot.
The first plot shows data up to 32 threads, the second up to 128 threads.
We collect the data on an AMD Ryzen 3955WX CPU with 16 physical cores and 32 hardware threads, combined with 128 GiB RAM.
Note that the second graph oversubscribes the available CPU cores.
We observe that the memory consumption grows superlinearly with the number of threads. A single worker consumes 330 MiB RSS, which grows to 7.2 GiB for 32 workers and increases further to 48.6 GiB at 128 worker threads.
Worker CPU time overhead
In this experiment, we again start an idle and unconfigured Materialize instance while varying the number of Timely workers.
After 2 minutes, we measure how much time it has spent in user and kernel mode (utime and stime).
Each test is repeated eight times.
For each worker configuration on the x-axis, we show how much time Materialize spends in user and kernel mode on the y-axis, along with the same number normalized by the number of Timely worker threads.
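For reference, utime and stime can be sampled on Linux along the following lines; the 100 Hz clock-tick divisor is an assumption (the common default), and a robust version would query sysconf(_SC_CLK_TCK).

```rust
// Sketch: read cumulative user and system CPU time of a process from
// /proc/<pid>/stat (fields 14 and 15, in clock ticks).
use std::fs;

fn cpu_times_secs(pid: u32) -> Option<(f64, f64)> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat")).ok()?;
    // The command name is wrapped in parentheses and may contain spaces,
    // so split on the last closing parenthesis first.
    let rest = stat.rsplit_once(')')?.1;
    let fields: Vec<&str> = rest.split_whitespace().collect();
    let utime_ticks: f64 = fields.get(11)?.parse().ok()?; // field 14: utime
    let stime_ticks: f64 = fields.get(12)?.parse().ok()?; // field 15: stime
    const TICKS_PER_SEC: f64 = 100.0; // assumption; see sysconf(_SC_CLK_TCK)
    Some((utime_ticks / TICKS_PER_SEC, stime_ticks / TICKS_PER_SEC))
}

fn main() {
    // Example: measure this process itself.
    let pid = std::process::id();
    if let Some((utime, stime)) = cpu_times_secs(pid) {
        println!("pid {pid}: utime={utime:.2}s stime={stime:.2}s");
    }
}
```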
We observe that the time Materialize spends in user and kernel mode increases with the number of Timely worker threads.
Normalizing the time by the number of workers shows that the time per worker first decreases until it starts to increase again. Both user and system time follow a similar pattern, albeit with the system time at a lower magnitude.
Where does Materialize consume memory?
Materialize handles queries and maintains tables and indexes, both of which require storing the data and maintaining metadata. Dataflows often maintain their outputs as an arrangement, which contains a view of a relation, potentially at multiple times, along with future updates and non-consolidated data. Timely operators can retain memory for efficiency reasons to reuse allocations. The communication layer requires buffer allocations to send data to downstream operators or across thread/process boundaries. On top of this, all coordination metadata needs to be retained as well. In this analysis, we want to focus on aspects related to Timely and Differential and drill down into each of them.
Relating the number of Timely workers to memory
Above, we described that with an increase in the number of workers, the memory consumption increases superlinearly, at least quadratically. A reason for this can be found in the structure of exchange channels that transfer data across thread boundaries. For efficiency reasons, an exchange channel maintains a buffer for each destination until all data has been seen, and only then flushes the per-destination buffers (simplified; in practice, buffers are also flushed when full).
The current implementation eagerly allocates the per-destination buffers. This explains a quadratic growth in memory consumption because every worker maintains exchange channels to all other workers. The benefit of eagerly allocating the buffers is that no allocations happen while distributing data, and buffers are of equal size and can potentially be reused in the future.
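A back-of-the-envelope calculation makes the quadratic term concrete; the 80-byte update type and the assumption of exactly one default-sized buffer per (sender, receiver) pair are simplifications of the real channel structure.

```rust
// Estimate of eagerly allocated exchange buffers across a cluster, assuming
// one buffer of Timely's default 1024-element capacity per worker pair.
use std::mem::size_of;

const CAPACITY: usize = 1024;

// Illustrative stand-in for a (key, value, time, diff) update of 80 bytes.
type Update = ([u8; 32], [u8; 32], u64, i64);

fn main() {
    for workers in [1usize, 8, 32, 128] {
        let buffers = workers * workers; // all-to-all pattern
        let bytes = buffers * CAPACITY * size_of::<Update>();
        println!(
            "{workers:>4} workers: {buffers:>6} buffers, {:>8.1} MiB",
            bytes as f64 / (1024.0 * 1024.0)
        );
    }
}
```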
We can change this behavior in various ways; deferring allocations until they are needed or releasing memory after use are two options to explore. Also see TimelyDataflow/timely-dataflow#397, TimelyDataflow/timely-dataflow#400, TimelyDataflow/timely-dataflow#403, and TimelyDataflow/timely-dataflow#406 for related discussions.
Default buffer sizes
Timely's default is to allocate buffers of 1024 elements. In the past, this was a reasonable default, but with different kinds of data and correspondingly different allocation sizes, we might want to revisit this choice. For a `usize` element type, this results in 8 KiB buffers on a 64-bit architecture; for `Row` with a 32-byte base size, it results in 32 KiB. In Materialize, many stream types are combinations of `Row`s (key/value), a timestamp, and a delta, which results in 80 KiB allocations (2*32+8+8 = 80 bytes per element). Such allocations spill into jemalloc's shared pool and do not fit into CPU L1 caches.
An alternative to buffers with a fixed element count is buffers with a fixed byte size, which can be achieved by setting the default length of a buffer to the buffer size divided by the element size. In TimelyDataflow/timely-dataflow#403, we demonstrate this by basing the default length off a fixed buffer size, ensuring that at least a single element fits if the buffer size would otherwise not be large enough for one element.
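A minimal sketch of deriving the default length from a byte budget, in the spirit of the linked change; the 8 KiB budget and the helper name are illustrative, not Timely's actual API.

```rust
// Derive a default buffer length from a fixed byte budget, keeping at least
// one element for oversized types.
use std::mem::size_of;

const BUFFER_BYTES: usize = 8 << 10; // hypothetical 8 KiB budget

fn default_len<T>() -> usize {
    std::cmp::max(1, BUFFER_BYTES / size_of::<T>().max(1))
}

fn main() {
    println!("usize: {} elements", default_len::<usize>()); // 1024
    // An 80-byte (Row, Row, time, diff) update gets ~102 elements.
    println!("80-byte update: {} elements", default_len::<[u8; 80]>());
}
```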
Applying such a change to all of Timely might not be ideal. It would enforce the same sizing policy on all of Timely's clients, likely causing a change in behavior. From this follows the question: is it possible to introduce a policy into Timely such that clients can express their default buffer behavior in a more granular fashion?
A similar problem exists for Timely's logging infrastructure. It allows observing events within Timely, which are processed when a staging buffer reaches its capacity. The size of this buffer is set to 1024 elements, independent of the size of the elements. The element size depends on the logger and can be large; it is around 80 bytes for Timely's native logging. Again, this causes large allocations, which we would like to avoid.
Memory allocations in Differential
Differential arrangements store timestamped data in batches. Once the frontier advances, some batches become eligible for merging. During a merge, two batches are combined into a potentially larger one. Storing logarithmically sized batches amortizes the need to merge batches and reduces memory allocation, but still presents the memory allocator with possibly large allocations.
The size of retained data in arrangements depends on the data, the number of updates and timestamps, and the merge effort Differential spends to compact batches. The current implementation stores each batch as a contiguous allocation in memory, which potentially presents many differently sized allocations to the allocator during the lifetime of a query.
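To make the merge step concrete, the following is a heavily simplified sketch of combining two sorted batches of (data, time, diff) updates into one consolidated batch; real Differential batches are organized in traces, merged with bounded fuel, and laid out differently, but the allocator effect is similar: two allocations are replaced by one potentially larger one.

```rust
// Minimal illustration of a batch merge: concatenate, sort, and consolidate
// (data, time, diff) updates, dropping updates whose diffs cancel.
type Update = (String, u64, i64);

fn merge(a: Vec<Update>, b: Vec<Update>) -> Vec<Update> {
    let mut merged: Vec<Update> = a.into_iter().chain(b).collect();
    merged.sort();
    let mut out: Vec<Update> = Vec::new();
    for (data, time, diff) in merged {
        match out.last_mut() {
            // Sum diffs for identical (data, time) pairs.
            Some(last) if last.0 == data && last.1 == time => last.2 += diff,
            _ => out.push((data, time, diff)),
        }
    }
    out.retain(|&(_, _, diff)| diff != 0); // drop fully cancelled updates
    out
}

fn main() {
    let a = vec![("k1".into(), 1, 1), ("k2".into(), 1, 1)];
    let b = vec![("k1".into(), 2, -1), ("k2".into(), 1, -1)];
    println!("{:?}", merge(a, b)); // [("k1", 1, 1), ("k1", 2, -1)]
}
```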
Here, we might want to experiment with different allocation strategies, some of which would likely require `unsafe` code.
For example, the `mz_metrics` table stores a sliding window of five minutes of metrics, which are collected at one-second intervals. Each metric is a JSON key, a timestamp, and a value. The value is a floating-point number, which prevents using the value as the difference.
Alternatives
Open questions