-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ann-bench multithreading #2021
Fix ann-bench multithreading #2021
Conversation
…cing the raft stream to CPU in all raft algos
Progress notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one substantive question, but this looks great. Love the design of configured_raft_resources
.
* | ||
* To avoid copying the index (database) in each thread, we make both the index and | ||
* the gpu_resource shared. | ||
* This means faiss GPU streams are possibly shared among the CPU threads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an issue tracking this?
@wphicks this PR isn't quite ready to be merged yet. We have some further discussions before this is ready. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making sure we discuss this before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjnolet pointed me to this PR because I was asking about uses of MR comparisons. Looking at the call to is_equal
used in this PR I had a few other suggestions to improve the surrounding code.
configured_raft_resources() | ||
: configured_raft_resources{ | ||
{[]() { | ||
auto* mr = new rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>{ | ||
rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull}; | ||
rmm::mr::set_current_device_resource(mr); | ||
return mr; | ||
}(), | ||
[](rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>* mr) { | ||
if (rmm::mr::get_current_device_resource()->is_equal(*mr)) { | ||
rmm::mr::set_current_device_resource(mr->get_upstream()); | ||
} | ||
delete mr; | ||
}}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Initialize the members of the class by name, not as a struct.
suggestion: don't bury the re-setting of the previous device resource inside an unnamed deleter lambda. Just reset it in your class destructor.
question: does the mr_ really need to be shared?
suggestion: use a unique_ptr.
configured_raft_resources() | |
: configured_raft_resources{ | |
{[]() { | |
auto* mr = new rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>{ | |
rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull}; | |
rmm::mr::set_current_device_resource(mr); | |
return mr; | |
}(), | |
[](rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>* mr) { | |
if (rmm::mr::get_current_device_resource()->is_equal(*mr)) { | |
rmm::mr::set_current_device_resource(mr->get_upstream()); | |
} | |
delete mr; | |
}}} | |
configured_raft_resources() | |
: mr_{make_unique<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>>( | |
rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull)} | |
{ | |
} | |
~configured_raft_resources() | |
{ | |
// reset the previous current_device_resource before deleting our pool. | |
rmm::mr::set_current_device_resource(mr_->get_upstream()); | |
} | |
Note that this also removes the call to is_equal()
-- is there a reason in this benchmark that you are worried about the possibility of the current resource changing during the lifetime of this class?
In the future, please use operator==
on MRs, because is_equal
is likely to be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Mark, thanks for having a look at this!
First of all, yes, we must use the shared pointer here. We create a new configured_raft_resources
at most once per benchmark case, and let multiple concurrent threads spawned by gbench copy it. We do this because we need some of the resources not be shared due to thread safety (the raft handle and the event). However, we must have exactly one rmm pool resource shared by everything, which is dictated by rmm current_device_resource semantics. Hence we need the pool resource be initialized with the original configured_raft_resources
and destructed after the last copy of configured_raft_resources
is destructed.
The same applies to setting/unsetting the rmm current_device_resource. Hence I cannot just reset it my class destructor and not bury it inside an unnamed deleter lambda. An alternative would be to create a new class with the destructor that does the same and put it in the shared pointer, but why bother?
Also, I'm just curious as I'm not that experienced with the latest C++ best practices. What are the downsides of using curly braces to initialize the members "as a struct" as I did in this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downsides are that you aren't initializing them by name, so it is less readable, and if the order of the class members changes this initialization will be invalid.
My suggestions were aimed at improving the readability and maintainability of the code, which currently I find obfuscated and error prone. But really this is not a library I have to read or maintain so it's up to you and the RAFT reviewers.
The one thing that I ask (and the thing that brought me here) is that you remove the call to is_equal()
for the resources, as this is probably going away. If you must compare the resources, use ==
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out to the rmm issue, I've updated the code and added a bit more comments alongside to make the intent clear.
I'm genuinely willing to improve the readability of the code, but I'm still not sure I understand what you mean regarding the initializing-as-a-struct issue. To me the pattern seems pretty similar to your rmm::cuda_stream, which I don't find obfuscated or unreadable. Could you please adjust your suggestion to not change the semantics of the code (lifetime of the pool)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achirkin the big difference between the rmm example you've pointed out and the example here is that the rmm initializers are still naming the members so if the names should change or he rearranged, maintenance becomes much more straightforward.
I do agree with Mark here, and to be quite honest, I find the use of the pattern here (unnamed parameter initialization through lambdas) to be quite hard to read myself. I also prefer that we at least name the parameters, especially for a list that includes multiple different members. It seems like not much additional work in proportion to the significant gain in readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, now I really feel dumb staring at the code and not seeing what you both mean :) I'm just calling here another constructor with a single argument that takes the pool, am I not? If you're referring to the aggregate initialization, it's not even possible for this class, because it has two user-declared constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
/merge |
With the changes introduced by #2021, the copied FAISS benchmark wrapper contains a cuda event that is used for synchronizing between streams during search. The lifetime of the event is the same as of the wrapper, but the event handle itself is copied between the wrappers; this leads to illegal memory accesses and crashes. This PR fixes the bug by creating a new cuda event on each wrapper copy, so that the wrappers do not share their synchronization events. Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2062
In the current state, ann-benchmarks running in the `--throughput` mode (multi-threaded) share ANN wrappers among CPU threads. This is not thread-safe and may result in incorrectly measured time (e.g. sharing cuda events among CPU threads) or various exceptions and segfaults (e.g. doing state-changing cublas calls from multiple CPU threads). This PR makes the search benchmarks copy ANN wrappers in each thread. The copies of the wrappers then selectively: - share thread-safe resources (e.g. rmm memory pool) and large objects that are not expected to change during search (e.g. index data); - duplicate the resources that are not thread-safe or carry the thread-specific state (e.g. cublas handles, CUDA events and streams). Alongside, the PR adds a few small changes, including: - enables ann-bench NVTX annotations for the non-common-executable mode (shows benchmark labels and iterations in nsys timeline); - fixes compile errors for the common-executable mode. Authors: - Artem M. Chirkin (https://github.com/achirkin) - William Hicks (https://github.com/wphicks) Approvers: - William Hicks (https://github.com/wphicks) - Mark Harris (https://github.com/harrism) - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#2021
…apidsai#2062) With the changes introduced by rapidsai#2021, the copied FAISS benchmark wrapper contains a cuda event that is used for synchronizing between streams during search. The lifetime of the event is the same as of the wrapper, but the event handle itself is copied between the wrappers; this leads to illegal memory accesses and crashes. This PR fixes the bug by creating a new cuda event on each wrapper copy, so that the wrappers do not share their synchronization events. Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#2062
In the current state, ann-benchmarks running in the
--throughput
mode (multi-threaded) share ANN wrappers among CPU threads. This is not thread-safe and may result in incorrectly measured time (e.g. sharing cuda events among CPU threads) or various exceptions and segfaults (e.g. doing state-changing cublas calls from multiple CPU threads).This PR makes the search benchmarks copy ANN wrappers in each thread. The copies of the wrappers then selectively:
Alongside, the PR adds a few small changes, including: