[GAIAPLAT-1752] INDEX: eliminate memory allocation when garbage collecting offsets #1124
Conversation
Commits:
…cting offsets from index
yeet unordered_set outta index_builder
This has reminded me that I need to finally get rid of the …
I think you should just be able to use std::vector instead of std::array everywhere, for simpler code and no hardcoded array sizes.
{
    const auto& log_record = records.log_records[i];
    gaia_offset_t offset = deallocate_new_offsets ? log_record.new_offset : log_record.old_offset;
    std::array<gaia_offset_t, c_offset_buffer_size> collected_offsets;
I think this code would be simpler if you used std::vector instead of std::array, and instead of maintaining parallel arrays of offsets and their types, you just used a single vector of (offset, type) pairs. That would allow you to just use push_back() to add elements instead of maintaining a "last index" variable, and you wouldn't need to keep the two arrays' current insert positions in sync. You also wouldn't need to hardcode the array size.
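A minimal sketch of that idea, reusing names from the diff above (the loop bound records.count and the log_record.type field are assumptions about the txn-log layout, not actual API):

std::vector<std::pair<gaia_offset_t, gaia_type_t>> collected_entries;
collected_entries.reserve(c_offset_buffer_size); // single up-front allocation

for (size_t i = 0; i < records.count; ++i) // records.count is assumed
{
    const auto& log_record = records.log_records[i];
    gaia_offset_t offset = deallocate_new_offsets ? log_record.new_offset : log_record.old_offset;
    // push_back() replaces the manual "last index" bookkeeping, and the
    // offset and its type can never get out of sync.
    collected_entries.push_back({offset, log_record.type}); // .type is assumed
}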
If the goal is to remove any heap allocations whatsoever, the size needs to be hardcoded and the code needs to keep track of the number of elements. At that point the code is going to be similarly complicated.
I don't think "removing any heap allocations whatsoever" is a reasonable goal; this is not an embedded or hard real-time system. We should just try to minimize dynamic allocations in hot paths. If you don't call shrink_to_fit(), std::vector will never reallocate as long as you don't exceed its capacity, and you can call reserve() (or use the sized constructor) to allocate the maximum "reasonable" size up front, letting the vector automatically handle exceptional cases that overflow that size. This normally performs only one dynamic allocation per instance, so it is nearly as efficient as stack-allocating with std::array, except that you don't need special code to handle overflow.
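To illustrate that pattern (the constant name here is hypothetical):

// Reserve the maximum "reasonable" capacity up front; this is normally the
// only dynamic allocation the vector ever performs.
constexpr size_t c_max_expected_offsets = 32; // hypothetical bound
std::vector<gaia_offset_t> offsets;
offsets.reserve(c_max_expected_offsets);
// push_back() as usual; if the bound is ever exceeded, the vector grows
// automatically instead of requiring special overflow-handling code.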
Another possibility: in C++17, you can use a stack-allocated or static thread_local fixed-size buffer with std::vector via std::pmr::monotonic_buffer_resource. Example from https://www.cppstories.com/2020/06/pmr-hacking.html/:
#include <algorithm>       // std::fill_n
#include <iostream>
#include <iterator>        // std::begin, std::data, std::size
#include <memory_resource> // pmr core types
#include <vector>          // std::pmr::vector

int main()
{
    char buffer[64] = {}; // a small buffer on the stack
    std::fill_n(std::begin(buffer), std::size(buffer) - 1, '_');
    std::cout << buffer << '\n';

    // All vector growth is served from the stack buffer; the heap is touched
    // only if the buffer is exhausted (via the default upstream resource).
    std::pmr::monotonic_buffer_resource pool{std::data(buffer), std::size(buffer)};
    std::pmr::vector<char> vec{&pool};
    for (char ch = 'a'; ch <= 'z'; ++ch)
        vec.push_back(ch);
    std::cout << buffer << '\n';
}
(Note that I'm not suggesting this is actually warranted in this case.)
gaia_type_t indexed_type = it.second->table_type();

// This index does not contain any of the deleted offsets.
if (std::find(offset_types.begin(), offset_types.end(), indexed_type) == offset_types.end())
If you used std::find_if, I think you could use (offset, type) pairs and the code could be just as concise.
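Something along these lines (a sketch; offset_entries is the hypothetical vector of (offset, type) pairs suggested earlier):

// Requires <algorithm> for std::find_if.
auto matches_indexed_type = [indexed_type](const std::pair<gaia_offset_t, gaia_type_t>& entry) {
    return entry.second == indexed_type;
};

// This index does not contain any of the deleted offsets.
if (std::find_if(offset_entries.begin(), offset_entries.end(), matches_indexed_type) == offset_entries.end())
{
    continue;
}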
Just a note that remove_entries_with_offsets could be updated to also work with an array of pairs, so that only one type of array is needed.
@@ -48,6 +49,8 @@ class index_writer_guard_t
    T_structure& m_data;
};

constexpr size_t c_offset_buffer_size = 32;
There shouldn't be any need to hardcode the array size if you just use std::vector.
production/db/inc/index/index.inc (outdated)
@@ -42,7 +42,7 @@ void index_t<T_structure, T_iterator>::remove_index_entry_with_offsets(const std

for (auto it = m_data.begin(); it != m_data.end();)
{
-   if (offsets.find(it->second.offset) != offsets.end())
+   if (std::find(offsets.begin(), offsets.end(), it->second.offset) != offsets.end())
See above, I think you could use std::find_if with a vector of (offset, type) pairs.
LGTM (don't have anything to add to the given comments).
const auto& log_record = records.log_records[i];
gaia_offset_t offset = deallocate_new_offsets ? log_record.new_offset : log_record.old_offset;
std::array<gaia_offset_t, c_offset_buffer_size> collected_offsets;
std::array<gaia_type_t, c_offset_buffer_size> offset_types;
Do you really need two arrays or can you just use an array of pairs to hold two values for each entry?
An array of pairs is OK, but the memory layout is better with two separate arrays.
Well, "array of structs" can be cheaper or more expensive WRT cache efficiency than "struct of arrays". It all depends on the access patterns (e.g. OLTP row-oriented tuple storage vs. OLAP column-oriented tuple storage).
switch (it.second->type())
{
case catalog::index_type_t::range:
    remove_entries_with_offsets<range_index_t>(it.second.get(), collected_offsets, records.begin_ts);
I'm a bit confused - doesn't collected_offsets contain offsets of different types? Why would we pass the full array to an index of a specific type?
Perhaps a better approach is to build a map of types to offset arrays/vectors, so that we pass only the relevant offsets in these calls. Building that map can be done with trivial changes to the code that fills the 2 arrays.
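A sketch of that map-based grouping (the container choice and field names are illustrative, and it assumes remove_entries_with_offsets is updated to accept a vector):

std::unordered_map<gaia_type_t, std::vector<gaia_offset_t>> offsets_by_type;

// While walking the txn log (log_record.type is an assumed field):
offsets_by_type[log_record.type].push_back(offset);

// Later, per index: hand each index only the offsets relevant to it.
auto it_offsets = offsets_by_type.find(indexed_type);
if (it_offsets == offsets_by_type.end())
{
    continue; // this index contains none of the deleted offsets
}
remove_entries_with_offsets<range_index_t>(it.second.get(), it_offsets->second, records.begin_ts);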
Alternatively, if the offset array is small enough it would be reasonable to just scan the whole array (assuming it contains offset/type pairs) and filter out offsets with inapplicable types. This would be really concise with C++20 ranges but still wouldn't be much code in a for loop.
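For example, the filter could be as simple as this sketch (over the hypothetical (offset, type) pairs):

// Scan the whole (small) array and act only on entries of the indexed type.
for (const auto& [offset, type] : offset_entries)
{
    if (type != indexed_type)
    {
        continue; // offset belongs to an index of another type
    }
    // ... remove 'offset' from this index ...
}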
The offset array is small, limited to 32 records. Passing the type does not provide any benefit over checking the offset directly: if the type matches, we still need to do another comparison against the offset, and if it doesn't, we have already spent one comparison. Comparing directly against the offset is one comparison regardless.
This seems like a fragile assumption. If you rely on the size being small, then that should be statically asserted somewhere, otherwise someone can easily edit the constant value and bump it to a much larger value than you expect to have to deal with in this code.
32 is a somewhat arbitrary size set by me. I have not done any benchmarking and picked this value because it seems reasonable. I think it's fine if someone bumps it up (in fact, my guess is that performance at 128 or 256 elements is acceptable). The important thing is that the array is bounded at some size.
The dangerous case for GC still remains on 'hot' indexes with many, many entries, and I have not addressed this danger. That case would require a different strategy altogether.
We should add more comments to the constant declaration and, if possible, some static_asserts to document our expectations on the size of these arrays. Their length is declared in one place and the reliance on that length being small happens in a different place. It's very easy for someone to update the constant without being aware of the reliance on a small value.
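For instance (a sketch; the bound of 64 and the comment wording are illustrative):

// These buffers are scanned linearly once per index during GC, so the code
// relies on them staying small (a cache line or two), not just on being bounded.
constexpr size_t c_offset_buffer_size = 32;
static_assert(
    c_offset_buffer_size <= 64,
    "GC offset buffers are linearly scanned per index; keep them small.");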
The fact that the offset array is small is rather irrelevant, given that the number of arrays you can batch here is not really bounded.
Furthermore, any comparison that you avoid once when you build your array will add overhead to the find operation executed for all indexes of a different type. The penalty of including offsets of the wrong type gets multiplied by the number of indexes of a different type.
I.e.: let's say you collect 2 values each for 8 different types. You avoided 16 type comparisons during this collection. But now, for each index, you'll perform a linear search through a 16-element array instead of a search through a 2-element array (or no search at all for indexes of a different type). Your performance is 8 times worse in this example!
This is why I'd rather see an implementation done with no regard to memory allocation, and then look to see whether it can be optimized further to remove that allocation. As it is now, you're probably spending more work on linear array searches than you would spend allocating memory for some vectors. This is premature optimization.
I agree that the important worst-case bound here is not on the size of a single array, but on the work done by a thread in a single GC pass. So far I've avoided bounding per-thread GC work in favor of approximating fairness by GC handoff on contention. But if there's no opportunity for contention detection and backoff in this case, we may need to design fairness explicitly into the protocol, which implies batching work across GC threads. I think I need to understand the possible worst-case scenarios better here before commenting further.
I'm not terribly worried about optimizing comparisons, given that basically any computation done in an already-resident cache line is free. Indirections from a mapping structure could easily swamp the overhead of unnecessary comparisons. But of course we don't want to scan thousands or millions of entries for a single entry of a given type. We could possibly optimize this without introducing expensive mappings, e.g. by sorting the (offset, type) pairs by type and binary searching for the first occurrence of a type. Seems like maybe this would be better discussed in a meeting, though.
So to clarify, I'm not worried at all about linear searches of an array that fits into 1 or 2 cache lines. Those comparisons are effectively free (once you've taken the cache miss). Performance will definitely not be "8 times worse" for a 16-element array than a 2-element array. But I do think we could safely use much larger arrays if we sorted (offset, type) pairs by type and binary-searched for the first occurrence of a type.
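A sketch of that sorted approach, again assuming a vector of (offset, type) pairs and that gaia_offset_t is default-constructible:

// Requires <algorithm>. Sort entries by type once after collection...
auto by_type = [](const auto& a, const auto& b) { return a.second < b.second; };
std::sort(offset_entries.begin(), offset_entries.end(), by_type);

// ...then binary-search for the contiguous run of entries with this type.
// (The offset in the probe value is a dummy; only the type is compared.)
auto [first, last] = std::equal_range(
    offset_entries.begin(), offset_entries.end(),
    std::pair<gaia_offset_t, gaia_type_t>{gaia_offset_t{}, indexed_type},
    by_type);

for (auto entry = first; entry != last; ++entry)
{
    // ... remove entry->first from this index ...
}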
I left several comments and suggestions, some of which apply to the original approach too - my apologies for not making those comments when the code was originally introduced.
Needed to do a lot of rewriting to eliminate whole index scans.
Has this PR been superseded by #1151?
OK, I still think the data structure design should be revisited, but we can do that after this is merged, given that another PR depends on it.
…cting offsets (#1151)
* [GAIAPLAT-1810] Do not scan entire index structure when garbage-collecting offsets
* Addressed one comment.
Co-authored-by: yiwen-wong <[email protected]>
Removed unordered_set and replaced it with fixed-size arrays/buffers for garbage collection of offsets for indexes.
Also opportunistically switched the catalog integrity check against dropped types to std::vector. Note: memory allocation has not been removed from that path.