Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix specifying gpu_id, add tests. #3851

Merged
merged 4 commits into from
Nov 6, 2018
Merged

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Nov 1, 2018

Address #3850 .

I don't have the multi-gpu machine for testing currently.

@hcho3
Copy link
Collaborator

hcho3 commented Nov 1, 2018

@trivialfis We may want to list this as a known issue. Do you have a short description of the issue?

@trivialfis
Copy link
Member Author

@hcho3 I'm working on it. Please give me some time. If I failed to make a fix today I will let you know.

@hcho3
Copy link
Collaborator

hcho3 commented Nov 1, 2018

@trivialfis Got it. I'll trust your judgment on this.

@trivialfis trivialfis changed the title Fix gpu_id in transform, add tests. Fix specifying gpu_id, add tests. Nov 2, 2018
@trivialfis trivialfis requested a review from RAMitchell November 2, 2018 00:20
@trivialfis
Copy link
Member Author

@hcho3 Tested with the R script provided by @joegaotao . Thanks. I think it should now be fixed.

@RAMitchell But it's not clear to me, what's the point of having normalised_device_index. Can we just use the device index obtained from CUDA?

@trivialfis
Copy link
Member Author

@hcho3 Not really. Just talked to @RAMitchell , the current fix will disable dividing training sessions across GPUs. Although the script won't fail but it defeats the purpose of specifying gpu_id.

@trivialfis
Copy link
Member Author

Let me keep trying.

@trivialfis
Copy link
Member Author

@RAMitchell Ready for a review. I haven't figure out how to make a unified way to allocate GPU, but the current fix won't add any burden to the code base, see if we can merge it.

<< "n_gpu + gpu_id should less than or equal to total number of available devices."
<< "\n n_gpu: " << Size()
<< ", gpu_id: " << gpu_id
<< ", number of available devices: " << n_devices_visible;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets remove Unnormalised(), I think it is wrong as the range can be larger than the number of GPUs. This unnormalised/normalised terminology is confusing and I can see it somehow has different meanings across different parts of the code. From now on whenever a device index is referred to it means the physical device ordinal of the GPU.

/*! \brief Counting from gpu_id */
GPUSet Normalised(int gpu_id) const {
int n_devices_visible = AllVisible().Size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this function , without unnormalised/normalised terminology.

@@ -197,7 +198,9 @@ class GPUSet {
bool IsEmpty() const { return Size() == 0; }
/*! \brief Get un-normalised index. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above let's remove the function Index().

@@ -251,15 +251,15 @@ struct DeviceHistogram {
thrust::device_vector<GradientPairSumT::ValueT> data;
const size_t kStopGrowingSize = 1 << 26; // Do not grow beyond this size
int n_bins;
int device_idx;
int normalised_device_idx;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should all be device_idx, as before let's remove all normalised/unnormalised terminology.

// Ignore other attributes of GPUDistribution for spliting index.
int d_unnormalised = devices.Index(device);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In situations like this you need to have some index beginning from 0 you could get it from the enclosing for loop:

 for (omp_ulong i = 0; i < devices.Size(); ++i) {
    auto device_idx = devices[i];
    size_t shard_size =
            GPUDistribution::Block(devices).ShardSize(range_size, i);

@@ -1336,6 +1338,7 @@ class GPUHistMaker : public TreeUpdater {
common::Monitor monitor_;
dh::AllReducer reducer_;
std::vector<ValueConstraint> node_value_constraints_;
/*! List storing normalised device index*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to remove this.

@trivialfis
Copy link
Member Author

@hcho3 I can't address these problems today. I need some more time to completely remove the normalise/unnormalise stuffs. So known issue it is.

@hcho3
Copy link
Collaborator

hcho3 commented Nov 2, 2018

@trivialfis Okay, I'll add it to the known issues. How should I summarize this issue (in a sentence)? How about "gpu_id cannot be set when one GPU is used"?

@trivialfis
Copy link
Member Author

"Specifying gpu-id is not yet supported." is more realistic from my perspective.

@trivialfis
Copy link
Member Author

trivialfis commented Nov 3, 2018

Due to #3794 , even if we try to support specifying gpu_id, users still need to have a copy of their dataset for each GPU.

@trivialfis trivialfis force-pushed the fix/gpu-id branch 2 times, most recently from 1a389ea to ff826b6 Compare November 5, 2018 03:57
* Remove normalised/unnormalised operatios.
* Address difference between `Index' and `Device ID'.
* Modify doc for `gpu_id'.
* Better LOG for GPUSet.
* Check specified n_gpus.
* Remove inappropriate `device_idx' term.
@trivialfis
Copy link
Member Author

@RAMitchell Ready for another review now. :)

Copy link
Member

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good, please just clarify the usage of GpuIndex.

* Hence, `DeviceId' converts a zero-based index to actual device id,
* `Index' converts a device id to a zero-based index.
*/
GpuIndex DeviceId(GpuIndex index) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest index is NOT of type GpuIndex. I assume GpuIndex is intended to refer to a physical device?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. :)

<< std::endl;
return result;
}
GpuIndex Index(GpuIndex device) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above maybe this function should return an integer and not GpuIndex? You should be clear about what this type is and not use it for both these purposes.

Copy link
Member Author

@trivialfis trivialfis Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed GpuIndex to GpuIdType, and use it along with size_t to indicate different types of indices.

@hcho3
Copy link
Collaborator

hcho3 commented Nov 6, 2018

@trivialfis I just added a new multi-GPU worker instance (so that we have two instances instead of one), but somehow I broke the old one. Let me reboot it.

@trivialfis
Copy link
Member Author

@hcho3 Thanks. :)

@hcho3
Copy link
Collaborator

hcho3 commented Nov 6, 2018

@trivialfis Both workers are now up and running.

@lock lock bot locked as resolved and limited conversation to collaborators Feb 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants