Systems challenges in Materialize #7619
antiguru
started this conversation in Technical musings
I ran Materialize with the LazyExchange and message-length changes; here are the jemalloc stats. Master: prof_master.txt. Patched: prof_patched.txt. The allocation pattern shifts towards smaller allocations, especially around the 8 KiB bucket. The total memory consumption is lower, but not directly comparable here because of different Materialize uptimes.
After running it a bit longer: prof_patched2.txt. Now the memory consumption is about halved, and the allocation classes are still shifted towards the 8 KiB class.
(Converted from #7337)
Summary
Materialize scales sublinearly with the number of worker threads.
While some of this is expected, we need to gain a better understanding of how much this affects us and potential remedies to improve the scaling behavior.
This document outlines the problems and suggests solutions that can help avoid them in the future.
Goals
The goal of this discussion is to analyze memory and CPU-utilization problems and point out possible solutions.
Non-Goals
The goal is to provide insights into Materialize's behavior assuming a fixed set of queries.
It is a non-goal to improve the performance within the SQL/optimization layer or by guiding users on how to rewrite queries.
Description
We need to validate the following three challenges.
1. Materialize's memory consumption increases superlinearly with the number of worker threads. The number of worker threads influences buffer allocations for exchange channels, and the communication between workers follows an all-to-all pattern, which means that the buffer space alone increases quadratically with the number of workers.
2. Timely communicates updates to operator capabilities through broadcasts to all worker threads. In some situations this results in many fine-grained updates carrying only little information, and each update might be a small allocation. We should investigate whether this is a problem and how significant it is.
3. Materialize needs to transfer relatively large pieces of data between operators, including across thread boundaries. This can stress the memory allocator, which needs to hand memory back to other threads.
For the first and second challenge, Timely does not yet expose the right mechanisms to provide a different policy.
The last problem could be addressed by not passing owned data but instead only revealing references to data.
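As a minimal, generic sketch (not Timely's actual operator API), the difference between handing over owned data and exposing a reference looks like the following; `Row` here is just a stand-in type, not Materialize's actual `Row`.

```rust
// Owned vs. borrowed hand-off: with owned data, the receiver eventually drops
// the allocation, possibly on a different thread than the one that allocated
// it; with a reference, the sender keeps the allocation and can reuse it.
type Row = Vec<u8>;

// The callee takes ownership and drops the Vec when done, returning memory to
// the allocator, potentially across thread boundaries.
fn consume_owned(batch: Vec<Row>) -> usize {
    batch.iter().map(|row| row.len()).sum()
}

// The callee only reads; the caller retains the allocation for the next batch.
fn consume_borrowed(batch: &[Row]) -> usize {
    batch.iter().map(|row| row.len()).sum()
}

fn main() {
    let mut batch: Vec<Row> = vec![vec![1, 2, 3], vec![4, 5]];
    assert_eq!(consume_borrowed(&batch), 5);
    batch.clear(); // allocation retained for reuse
    assert_eq!(consume_owned(batch), 0);
}
```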
The set of challenges is based on experience and observations, but not on reproducible tests.
Before we tackle any of them, we should establish a benchmark that clearly highlights the problem and can be used to validate any solution.
Let's look at the problems in detail.
Memory consumption when scaling worker threads
We conduct an experiment to measure the idle memory consumption of Materialize.
First, we build Materialize in release mode and clean the `mzdata` directory, which leaves only the system tables and collections. Next, we start Materialize with a specific number of workers and measure the RSS after a fixed delay, after which we stop the instance.
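As an illustration of this procedure, a minimal harness along the following lines can be used; the binary path, the `--workers` flag, and the chosen delay are assumptions for the sketch rather than the exact setup used for the numbers below.

```rust
// Hypothetical measurement harness: start materialized with a given worker
// count, wait a fixed delay, read the resident set size from
// /proc/<pid>/status, then stop the instance.
use std::fs;
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn rss_kib(pid: u32) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    line.split_whitespace().nth(1)?.parse().ok()
}

fn main() -> std::io::Result<()> {
    for workers in [1u32, 2, 4, 8, 16, 32] {
        // Binary path and flag name are assumptions for illustration.
        let mut child = Command::new("./target/release/materialized")
            .arg("--workers")
            .arg(workers.to_string())
            .spawn()?;
        sleep(Duration::from_secs(120)); // fixed settle delay
        println!("workers={workers} rss={} KiB", rss_kib(child.id()).unwrap_or(0));
        child.kill()?;
        child.wait()?;
    }
    Ok(())
}
```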
The following graphs show the idle memory consumption of Materialize.
The x-axis shows the number of threads, and the y-axis indicates the RSS in GiB.
The data is presented in a log-log plot.
The first plot shows data up to 32 threads, the second up to 128 threads.
We collect the data on an AMD Ryzen 3955WX CPU with 16 physical cores and 32 hardware threads, combined with 128 GiB RAM.
Note that the second graph oversubscribes the available CPU cores.
We observe that the memory consumption grows superlinearly with the number of threads. A single worker consumes 330 MiB RSS, which grows to 7.2 GiB for 32 workers and increases further to 48.6 GiB at 128 worker threads.
Worker CPU time overhead
In this experiment, we again start an idle and unconfigured Materialize instance while varying the number of Timely workers.
After 2 minutes, we measure how much time it has spent in user and kernel mode (utime and stime).
Each test is repeated eight times.
For each worker configuration on the x-axis, we show how much time Materialize spends in user and kernel mode on the y-axis, along with the same number normalized by the number of Timely worker threads.
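For reference, utime and stime can be sampled on Linux along the following lines; the 100 Hz clock-tick divisor is an assumption (the common default), and a robust version would query sysconf(_SC_CLK_TCK).

```rust
// Sketch: read cumulative user and system CPU time of a process from
// /proc/<pid>/stat (fields 14 and 15, in clock ticks).
use std::fs;

fn cpu_times_secs(pid: u32) -> Option<(f64, f64)> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat")).ok()?;
    // The command name is wrapped in parentheses and may contain spaces,
    // so split on the last closing parenthesis first.
    let rest = stat.rsplit_once(')')?.1;
    let fields: Vec<&str> = rest.split_whitespace().collect();
    let utime_ticks: f64 = fields.get(11)?.parse().ok()?; // field 14: utime
    let stime_ticks: f64 = fields.get(12)?.parse().ok()?; // field 15: stime
    const TICKS_PER_SEC: f64 = 100.0; // assumption; see sysconf(_SC_CLK_TCK)
    Some((utime_ticks / TICKS_PER_SEC, stime_ticks / TICKS_PER_SEC))
}

fn main() {
    // Example: measure this process itself.
    let pid = std::process::id();
    if let Some((utime, stime)) = cpu_times_secs(pid) {
        println!("pid {pid}: utime={utime:.2}s stime={stime:.2}s");
    }
}
```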
We observe that the time Materialize spends in user and kernel mode increases with the number of Timely worker threads.
Normalizing the time by the number of workers shows that the time per worker first decreases until it starts to increase again. Both user and system time follow a similar pattern, albeit with the system time at a lower magnitude.
Where does Materialize consume memory?
Materialize handles queries and maintains tables and indexes, both of which require storing the data and maintaining metadata. Dataflows often maintain their outputs as an arrangement, which contains a view of a relation, potentially at multiple times, along with future updates and non-consolidated data. Timely operators can retain memory for efficiency reasons to reuse allocations. The communication layer requires buffer allocations to send data to downstream operators or across thread/process boundaries. On top of this, all coordination metadata needs to be retained as well. In this analysis, we want to focus on aspects related to Timely and Differential and drill down into each of them.
Relating the number of Timely workers to memory
Above, we described that with an increase in the number of workers, the memory consumption increases superlinearly, at least quadratically. A reason for this can be found in the structure of exchange channels that transfer data across thread boundaries. For efficiency reasons, an exchange channel maintains a buffer for each destination until all data has been seen, and only then flushes the per-destination buffers (simplified; in practice, buffers are also flushed when full).
The current implementation eagerly allocates the per-destination buffers. This explains a quadratic growth in memory consumption because every worker maintains exchange channels to all other workers. The benefit of eagerly allocating the buffers is that no allocations happen while distributing data, and buffers are of equal size and can potentially be reused in the future.
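A back-of-the-envelope calculation makes the quadratic term concrete; the 80-byte update type and the assumption of exactly one default-sized buffer per (sender, receiver) pair are simplifications of the real channel structure.

```rust
// Estimate of eagerly allocated exchange buffers across a cluster, assuming
// one buffer of Timely's default 1024-element capacity per worker pair.
use std::mem::size_of;

const CAPACITY: usize = 1024;

// Illustrative stand-in for a (key, value, time, diff) update of 80 bytes.
type Update = ([u8; 32], [u8; 32], u64, i64);

fn main() {
    for workers in [1usize, 8, 32, 128] {
        let buffers = workers * workers; // all-to-all pattern
        let bytes = buffers * CAPACITY * size_of::<Update>();
        println!(
            "{workers:>4} workers: {buffers:>6} buffers, {:>8.1} MiB",
            bytes as f64 / (1024.0 * 1024.0)
        );
    }
}
```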
We can change this behavior in various ways; deferring allocations until they are needed or releasing memory after use are two options to explore. Also see TimelyDataflow/timely-dataflow#397, TimelyDataflow/timely-dataflow#400, TimelyDataflow/timely-dataflow#403, and TimelyDataflow/timely-dataflow#406 for related discussions.
Default buffer sizes
Timely's default is to allocate buffers of 1024 elements. In the past, this was a reasonable default, but with different kinds of data and correspondingly different allocation sizes, we might want to revisit this choice. For a `usize` element type, this results in 8 KiB buffers on a 64-bit architecture; for `Row` with a 32-byte base size, it results in 32 KiB. In Materialize, many stream types are combinations of `Row`s (key/value), a timestamp, and a delta, which results in 80 KiB allocations (2*32+8+8 = 80 bytes per element). Such allocations spill into jemalloc's shared pool and do not fit into CPU L1 caches.
An alternative to buffers with a fixed element count is buffers with a fixed byte size, which can be achieved by setting the default length of a buffer to the buffer size divided by the element size. In TimelyDataflow/timely-dataflow#403, we demonstrate this by basing the default length off a fixed buffer size, ensuring that at least a single element fits if the buffer size would otherwise not be large enough for one element.
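A minimal sketch of deriving the default length from a byte budget, in the spirit of the linked change; the 8 KiB budget and the helper name are illustrative, not Timely's actual API.

```rust
// Derive a default buffer length from a fixed byte budget, keeping at least
// one element for oversized types.
use std::mem::size_of;

const BUFFER_BYTES: usize = 8 << 10; // hypothetical 8 KiB budget

fn default_len<T>() -> usize {
    std::cmp::max(1, BUFFER_BYTES / size_of::<T>().max(1))
}

fn main() {
    println!("usize: {} elements", default_len::<usize>()); // 1024
    // An 80-byte (Row, Row, time, diff) update gets ~102 elements.
    println!("80-byte update: {} elements", default_len::<[u8; 80]>());
}
```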
Applying such a change to all of Timely might not be ideal. It would enforce the same sizing policy on all of Timely's clients, likely causing a change in behavior. From this follows the question: is it possible to introduce a policy into Timely such that clients can express their default buffer behavior in a more granular fashion?
A similar problem exists for Timely's logging infrastructure. It allows observing events within Timely, which are processed when a staging buffer reaches its capacity. The size of this buffer is set to 1024 elements, independent of the size of the elements. The element size depends on the logger and can be large; it is around 80 bytes for Timely's native logging. Again, this causes large allocations, which we would like to avoid.
Memory allocations in Differential
Differential arrangements store timestamped data in batches. Once the frontier advances, some batches become eligible for merging. During a merge, two batches are combined into a potentially larger one. Storing logarithmically sized batches amortizes the need to merge batches and reduces memory allocation, but still presents the memory allocator with possibly large allocations.
The size of retained data in arrangements depends on the data, the number of updates and timestamps, and the merge effort Differential spends to compact batches. The current implementation stores each batch as a contiguous allocation in memory, which potentially presents many differently sized allocations to the allocator during the lifetime of a query.
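To make the merge step concrete, the following is a heavily simplified sketch of combining two sorted batches of (data, time, diff) updates into one consolidated batch; real Differential batches are organized in traces, merged with bounded fuel, and laid out differently, but the allocator effect is similar: two allocations are replaced by one potentially larger one.

```rust
// Minimal illustration of a batch merge: concatenate, sort, and consolidate
// (data, time, diff) updates, dropping updates whose diffs cancel.
type Update = (String, u64, i64);

fn merge(a: Vec<Update>, b: Vec<Update>) -> Vec<Update> {
    let mut merged: Vec<Update> = a.into_iter().chain(b).collect();
    merged.sort();
    let mut out: Vec<Update> = Vec::new();
    for (data, time, diff) in merged {
        match out.last_mut() {
            // Sum diffs for identical (data, time) pairs.
            Some(last) if last.0 == data && last.1 == time => last.2 += diff,
            _ => out.push((data, time, diff)),
        }
    }
    out.retain(|&(_, _, diff)| diff != 0); // drop fully cancelled updates
    out
}

fn main() {
    let a = vec![("k1".into(), 1, 1), ("k2".into(), 1, 1)];
    let b = vec![("k1".into(), 2, -1), ("k2".into(), 1, -1)];
    println!("{:?}", merge(a, b)); // [("k1", 1, 1), ("k1", 2, -1)]
}
```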
Here, we might want to experiment with different allocation strategies, some of which would likely require `unsafe` code.
For example, the `mz_metrics` table stores a sliding window of five minutes of metrics, which are collected at one-second intervals. Each metric is a JSON key, a timestamp, and a value. The value is a floating-point number, which prevents using the value as the difference.
Alternatives
Open questions