Support both row-wise and col-wise multi-threading #2699
@Laurae2 for more benchmarks |
src/treelearner/ocl/histogram16.cl
Outdated
@@ -437,6 +404,8 @@ R""()
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 0, 1, 2, 3, 4, 5, 6, 7's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#else
atom_inc((uint*)(gh_hist + addr2));
@huanzhang12 is this addition okay? Or should we simply use atomic_local_add_f(gh_hist + addr2, 1.0f)?
If the content at address (gh_hist + addr2) is an integer, we can use atom_inc(). If it is a floating point number, we must use atomic_local_add_f(gh_hist + addr2, 1.0f). I am not sure whether you want to store a float or an int here.
atomic_local_add_f() can be hundreds of times slower than atom_inc(), since there is no native floating point atomics support on GPUs. So use atom_inc() if possible (but we must make sure the content is an integer).
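To illustrate why the float path can be so much slower, here is a minimal host-side C++ sketch. It is not the OpenCL kernel code: the thread does not show how atomic_local_add_f() is implemented, so the compare-and-swap retry loop below is an assumption about how float atomics are commonly emulated, and the names AtomicAddFloat, AtomicInc, BitsToFloat and FloatToBits are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: move between a float and its 32-bit pattern.
inline float BitsToFloat(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

inline std::uint32_t FloatToBits(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}

// Emulated float atomic add (assumed technique): retry a compare-and-swap
// until no other thread has modified the slot in between. Under contention
// the loop may run many times per update.
inline void AtomicAddFloat(std::atomic<std::uint32_t>* slot, float value) {
  std::uint32_t expected = slot->load(std::memory_order_relaxed);
  std::uint32_t desired;
  do {
    desired = FloatToBits(BitsToFloat(expected) + value);
  } while (!slot->compare_exchange_weak(expected, desired,
                                        std::memory_order_relaxed));
}

// Integer counter: a single native read-modify-write, no retry loop.
inline void AtomicInc(std::atomic<std::uint32_t>* slot) {
  slot->fetch_add(1u, std::memory_order_relaxed);
}
```

Both functions operate on the same 32-bit slot type, which is why the slot's contents must be known to be an integer before the cheap increment is safe to use.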
Could we store it as an int first and use inc(), then multiply it by hessian[0] and convert it to float in place?
src/treelearner/ocl/histogram16.cl
Outdated
// now thread 0 - 7 holds feature 0 - 7's gradient for bin 0 and counter bin 0
// now thread 8 - 15 holds feature 0 - 7's hessian for bin 0 and counter bin 1
// now thread 16- 23 holds feature 0 - 7's gradient for bin 1 and counter bin 2
// now thread 24- 31 holds feature 0 - 7's hessian for bin 1 and counter bin 3
// etc,

// FIXME: correct way to fix hessians
#if CONST_HESSIAN == 1
@huanzhang12 could you help correct the hessian fix here?
Can you elaborate a little bit on what you want to do here? Do we need to add anything new to the histogram?
Thanks! Since the count has been removed, the naive solution is to accumulate gh_hist as before and drop cnt_hist. However, in the const_hessian case, if +1 is faster than +hessians[i], we can first add +1 into h_hist and then multiply all of h_hist by hessians[0]. So here we need to multiply all h_hist entries by hessians[0].
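The count-then-scale idea can be sketched on the host side in C++ as follows. This is only an illustration under assumptions: the slot layout (one 32-bit slot per bin, reused for the float result) and the names CountBins, ScaleCountsInPlace and ReadHessian are hypothetical and are not taken from the kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout: one 32-bit slot per bin. It holds an integer count
// while accumulating and a float hessian sum after the conversion pass.
inline void CountBins(const std::vector<int>& bin_of_row,
                      std::vector<std::uint32_t>* slots) {
  for (int bin : bin_of_row) {
    ++(*slots)[static_cast<std::size_t>(bin)];  // "+1" instead of "+hessians[i]"
  }
}

// Convert every slot in place: count -> count * const_hessian, stored as
// the bit pattern of a float in the same 32 bits.
inline void ScaleCountsInPlace(std::vector<std::uint32_t>* slots,
                               float const_hessian) {
  for (std::uint32_t& slot : *slots) {
    const float h = static_cast<float>(slot) * const_hessian;
    std::memcpy(&slot, &h, sizeof(h));
  }
}

// Read a slot back as a float after the conversion pass.
inline float ReadHessian(const std::vector<std::uint32_t>& slots,
                         std::size_t bin) {
  float h;
  std::memcpy(&h, &slots[bin], sizeof(h));
  return h;
}
```

Because every row contributes the same hessian in this case, the scaled count equals the sum that per-row float additions would have produced, up to floating point rounding.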
@huanzhang12 I made some changes on |
@jameslamb could you help for the R's tests? |
Reminder for myself: run tests against swapped compilers before merging. |
sure! Want me to push to this PR or make a separate one? |
@jameslamb you can directly push to this PR. |
32f3f6f
to
2ad4af5
Co-Authored-By: James Lamb <[email protected]>
src/io/multi_val_dense_bin.hpp
Outdated
#include <cstdint>
#include <cstring>
#include <omp.h>
@guolinke Shouldn't this include (here and in all other files) be handled via openmp_wrapper, so that building the threadless version does not break?
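For readers unfamiliar with the pattern, a wrapper of this kind typically guards the include and supplies single-threaded fallbacks. The sketch below is an assumption about the general shape only; the actual openmp_wrapper in the repository may differ, and NumHistogramBuffers is a hypothetical caller.

```cpp
// Sketch of a wrapper header: include the real OpenMP header only when the
// compiler has OpenMP enabled, otherwise provide single-threaded stand-ins.
#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_max_threads() { return 1; }
inline int omp_get_num_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// Hypothetical caller: compiles and behaves sensibly with or without OpenMP.
inline int NumHistogramBuffers() {
  return omp_get_max_threads();  // one buffer per available thread
}
```

Source files then include the wrapper instead of <omp.h> directly, so a build without OpenMP still compiles.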
@StrikerRUS do you know why updated: it seems the |
also refer to scikit-learn/scikit-learn#14106 |
Nope, I have never seen this test fail before. Is AVX2 critical for this PR? |
@StrikerRUS |
@StrikerRUS I am going to merge this PR. Do we need to swap the compilers back in CI? |
Hmm, really interesting...
Yes. As it's already merged, I'll do this in a separate PR. |
To continue #216
Before this PR, LightGBM only supported col-wise multi-threading, which may be inefficient for sparse data.
Row-wise multi-threading is efficient for sparse data, but its overhead (more histograms and the merge cost) increases with num_threads.
Since neither is perfect, we implement both and automatically choose the faster one at run time.
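The trade-off can be sketched in C++ as follows. This is a simplified illustration, not the PR's implementation: the dense BinnedData layout, the function names, and the timing-based selection are assumptions, and plain sequential loops over worker ids stand in for real threads so that the work partition stays visible.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dense layout: bins[row * num_features + feature] is a bin index.
struct BinnedData {
  std::size_t num_rows;
  std::size_t num_features;
  std::size_t num_bins;  // bins per feature
  std::vector<std::uint8_t> bins;
};

using Histogram = std::vector<double>;  // size: num_features * num_bins

// Col-wise: each worker owns a subset of features and writes to disjoint
// histogram regions, so one shared histogram is enough. Requires num_workers > 0.
Histogram BuildColWise(const BinnedData& d, const std::vector<double>& grad,
                       std::size_t num_workers) {
  Histogram hist(d.num_features * d.num_bins, 0.0);
  for (std::size_t w = 0; w < num_workers; ++w) {  // would run in parallel
    for (std::size_t f = w; f < d.num_features; f += num_workers) {
      for (std::size_t r = 0; r < d.num_rows; ++r) {
        hist[f * d.num_bins + d.bins[r * d.num_features + f]] += grad[r];
      }
    }
  }
  return hist;
}

// Row-wise: each worker owns a block of rows and needs a private histogram.
// The private histograms are merged at the end, which is the overhead that
// grows with the number of workers. Requires num_workers > 0.
Histogram BuildRowWise(const BinnedData& d, const std::vector<double>& grad,
                       std::size_t num_workers) {
  const std::size_t size = d.num_features * d.num_bins;
  std::vector<Histogram> partial(num_workers, Histogram(size, 0.0));
  const std::size_t block = (d.num_rows + num_workers - 1) / num_workers;
  for (std::size_t w = 0; w < num_workers; ++w) {  // would run in parallel
    const std::size_t begin = w * block;
    const std::size_t end = std::min(d.num_rows, begin + block);
    for (std::size_t r = begin; r < end; ++r) {
      for (std::size_t f = 0; f < d.num_features; ++f) {
        partial[w][f * d.num_bins + d.bins[r * d.num_features + f]] += grad[r];
      }
    }
  }
  Histogram hist(size, 0.0);
  for (const Histogram& p : partial) {  // merge step
    for (std::size_t i = 0; i < size; ++i) hist[i] += p[i];
  }
  return hist;
}

// Wall-clock time of one call, in seconds.
template <typename F>
double TimeSeconds(F&& f) {
  const auto start = std::chrono::steady_clock::now();
  f();
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - start)
      .count();
}

// Assumed selection rule: time one build of each kind and keep the faster.
bool RowWiseIsFaster(const BinnedData& d, const std::vector<double>& grad,
                     std::size_t num_workers) {
  const double col = TimeSeconds([&] { BuildColWise(d, grad, num_workers); });
  const double row = TimeSeconds([&] { BuildRowWise(d, grad, num_workers); });
  return row < col;
}
```

Both builders accumulate the same per-bin sums, so the choice between them affects only speed and memory, apart from floating point summation order.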
Two new parameters are added, force_col_wise and force_row_wise.
Some other changes in this PR:
cnt in Histogram, to reduce histogram merge cost
Timer to profile the run time cost
Benchmark:
Run with 16 threads on an Azure ND24s VM
Todo:
updated:
A new PR for lint fix.