-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix specifying gpu_id, add tests. #3851
Conversation
@trivialfis We may want to list this as a known issue. Do you have a short description of the issue? |
@hcho3 I'm working on it. Please give me some time. If I failed to make a fix today I will let you know. |
@trivialfis Got it. I'll trust your judgment on this. |
@hcho3 Tested with the R script provided by @joegaotao . Thanks. I think it should now be fixed. @RAMitchell But it's not clear to me, what's the point of having |
@hcho3 Not really. Just talked to @RAMitchell , the current fix will disable dividing training sessions across GPUs. Although the script won't fail but it defeats the purpose of specifying |
Let me keep trying. |
@RAMitchell Ready for a review. I haven't figure out how to make a unified way to allocate GPU, but the current fix won't add any burden to the code base, see if we can merge it. |
src/common/common.h
Outdated
<< "n_gpu + gpu_id should less than or equal to total number of available devices." | ||
<< "\n n_gpu: " << Size() | ||
<< ", gpu_id: " << gpu_id | ||
<< ", number of available devices: " << n_devices_visible; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets remove Unnormalised(), I think it is wrong as the range can be larger than the number of GPUs. This unnormalised/normalised terminology is confusing and I can see it somehow has different meanings across different parts of the code. From now on whenever a device index is referred to it means the physical device ordinal of the GPU.
src/common/common.h
Outdated
/*! \brief Counting from gpu_id */ | ||
GPUSet Normalised(int gpu_id) const { | ||
int n_devices_visible = AllVisible().Size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this function , without unnormalised/normalised terminology.
src/common/common.h
Outdated
@@ -197,7 +198,9 @@ class GPUSet { | |||
bool IsEmpty() const { return Size() == 0; } | |||
/*! \brief Get un-normalised index. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above let's remove the function Index().
src/tree/updater_gpu_hist.cu
Outdated
@@ -251,15 +251,15 @@ struct DeviceHistogram { | |||
thrust::device_vector<GradientPairSumT::ValueT> data; | |||
const size_t kStopGrowingSize = 1 << 26; // Do not grow beyond this size | |||
int n_bins; | |||
int device_idx; | |||
int normalised_device_idx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should all be device_idx, as before let's remove all normalised/unnormalised terminology.
src/common/transform.h
Outdated
// Ignore other attributes of GPUDistribution for spliting index. | ||
int d_unnormalised = devices.Index(device); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In situations like this you need to have some index beginning from 0 you could get it from the enclosing for loop:
for (omp_ulong i = 0; i < devices.Size(); ++i) {
auto device_idx = devices[i];
size_t shard_size =
GPUDistribution::Block(devices).ShardSize(range_size, i);
src/tree/updater_gpu_hist.cu
Outdated
@@ -1336,6 +1338,7 @@ class GPUHistMaker : public TreeUpdater { | |||
common::Monitor monitor_; | |||
dh::AllReducer reducer_; | |||
std::vector<ValueConstraint> node_value_constraints_; | |||
/*! List storing normalised device index*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to remove this.
@hcho3 I can't address these problems today. I need some more time to completely remove the normalise/unnormalise stuffs. So known issue it is. |
@trivialfis Okay, I'll add it to the known issues. How should I summarize this issue (in a sentence)? How about " |
"Specifying gpu-id is not yet supported." is more realistic from my perspective. |
Due to #3794 , even if we try to support specifying |
1a389ea
to
ff826b6
Compare
* Remove normalised/unnormalised operatios. * Address difference between `Index' and `Device ID'. * Modify doc for `gpu_id'. * Better LOG for GPUSet. * Check specified n_gpus. * Remove inappropriate `device_idx' term.
eec547b
to
f5feeb8
Compare
@RAMitchell Ready for another review now. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR looks good, please just clarify the usage of GpuIndex.
src/common/common.h
Outdated
* Hence, `DeviceId' converts a zero-based index to actual device id, | ||
* `Index' converts a device id to a zero-based index. | ||
*/ | ||
GpuIndex DeviceId(GpuIndex index) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest index is NOT of type GpuIndex. I assume GpuIndex is intended to refer to a physical device?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. :)
src/common/common.h
Outdated
<< std::endl; | ||
return result; | ||
} | ||
GpuIndex Index(GpuIndex device) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above maybe this function should return an integer and not GpuIndex? You should be clear about what this type is and not use it for both these purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed GpuIndex
to GpuIdType
, and use it along with size_t
to indicate different types of indices.
@trivialfis I just added a new multi-GPU worker instance (so that we have two instances instead of one), but somehow I broke the old one. Let me reboot it. |
@hcho3 Thanks. :) |
@trivialfis Both workers are now up and running. |
Address #3850 .
I don't have the multi-gpu machine for testing currently.