Fix specifying gpu_id, add tests. #3851

trivialfis · 2018-11-01T08:00:48Z

Addresses #3850.

I don't currently have a multi-GPU machine for testing.

hcho3 · 2018-11-01T21:57:59Z

@trivialfis We may want to list this as a known issue. Do you have a short description of the issue?

trivialfis · 2018-11-01T22:28:05Z

@hcho3 I'm working on it. Please give me some time. If I fail to produce a fix today, I will let you know.

hcho3 · 2018-11-01T22:32:07Z

@trivialfis Got it. I'll trust your judgment on this.

trivialfis · 2018-11-02T00:51:30Z

@hcho3 Tested with the R script provided by @joegaotao. Thanks. I think it should now be fixed.

@RAMitchell It's not clear to me what the point of having normalised_device_index is. Can we just use the device index obtained from CUDA?

trivialfis · 2018-11-02T01:46:44Z

@hcho3 Not really. I just talked to @RAMitchell: the current fix will disable dividing training sessions across GPUs. The script won't fail, but that defeats the purpose of specifying gpu_id.

trivialfis · 2018-11-02T01:49:38Z

Let me keep trying.

trivialfis · 2018-11-02T03:38:01Z

@RAMitchell Ready for a review. I haven't figured out a unified way to allocate GPUs, but the current fix won't add any burden to the code base. See if we can merge it.

RAMitchell · 2018-11-02T05:33:35Z

src/common/common.h

+        << "n_gpu + gpu_id should less than or equal to total number of available devices."
+        << "\n n_gpu: " << Size()
+        << ", gpu_id: " << gpu_id
+        << ", number of available devices: " << n_devices_visible;


Let's remove Unnormalised(). I think it is wrong, as the range can be larger than the number of GPUs. This unnormalised/normalised terminology is confusing, and I can see it somehow has different meanings across different parts of the code. From now on, whenever a device index is referred to, it means the physical device ordinal of the GPU.
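For illustration, the range check in the diff above can be sketched in isolation. This is a minimal sketch, not XGBoost's code: `SelectDevices` and its parameters are hypothetical names, and it assumes the selected devices are the contiguous physical ordinals starting at `gpu_id`.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of XGBoost): returns the physical device
// ordinals [gpu_id, gpu_id + n_gpus) after checking that all of them are
// among the visible devices.
inline std::vector<int> SelectDevices(int gpu_id, int n_gpus, int n_devices_visible) {
  if (gpu_id < 0 || n_gpus < 0 || gpu_id + n_gpus > n_devices_visible) {
    std::ostringstream msg;
    msg << "n_gpu + gpu_id should be less than or equal to the number of available devices."
        << " n_gpu: " << n_gpus
        << ", gpu_id: " << gpu_id
        << ", number of available devices: " << n_devices_visible;
    throw std::invalid_argument(msg.str());
  }
  std::vector<int> ordinals;
  for (int i = 0; i < n_gpus; ++i) {
    ordinals.push_back(gpu_id + i);  // physical ordinal, not a zero-based index
  }
  assert(ordinals.size() == static_cast<std::size_t>(n_gpus));
  return ordinals;
}
```

With four visible devices, `SelectDevices(1, 2, 4)` yields ordinals 1 and 2, while `SelectDevices(2, 3, 4)` throws because 2 + 3 exceeds 4.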

RAMitchell · 2018-11-02T05:35:02Z

src/common/common.h

  /*! \brief Counting from gpu_id */
  GPUSet Normalised(int gpu_id) const {
+    int n_devices_visible = AllVisible().Size();


Let's rename this function, without the unnormalised/normalised terminology.

RAMitchell · 2018-11-02T05:37:42Z

src/common/common.h

@@ -197,7 +198,9 @@ class GPUSet {
  bool IsEmpty() const { return Size() == 0; }
  /*! \brief Get un-normalised index. */


As above, let's remove the function Index().

RAMitchell · 2018-11-02T05:43:46Z

src/tree/updater_gpu_hist.cu

@@ -251,15 +251,15 @@ struct DeviceHistogram {
  thrust::device_vector<GradientPairSumT::ValueT> data;
  const size_t kStopGrowingSize = 1 << 26;  // Do not grow beyond this size
  int n_bins;
-  int device_idx;
+  int normalised_device_idx;


This should all be device_idx. As before, let's remove all normalised/unnormalised terminology.

RAMitchell · 2018-11-02T05:55:23Z

src/common/transform.h

        // Ignore other attributes of GPUDistribution for spliting index.
+        int d_unnormalised = devices.Index(device);


In situations like this, where you need an index beginning from 0, you could get it from the enclosing for loop:

    for (omp_ulong i = 0; i < devices.Size(); ++i) {
      auto device_idx = devices[i];
      size_t shard_size = GPUDistribution::Block(devices).ShardSize(range_size, i);
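A self-contained sketch of the same idea follows. It assumes an even block split of the range; `BlockShardSize`, `Shard`, and `SplitRange` are hypothetical stand-ins and do not reproduce `GPUDistribution::Block` or its `ShardSize`. The point is only that the loop counter serves as the zero-based index while the stored value is the physical device ordinal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a block distribution: size of the i-th of
// n_shards contiguous shards covering range_size elements. Shards are
// ceil-divided, so trailing shards may be shorter.
inline std::size_t BlockShardSize(std::size_t range_size, std::size_t n_shards,
                                  std::size_t i) {
  assert(n_shards > 0 && i < n_shards);
  std::size_t per_shard = (range_size + n_shards - 1) / n_shards;
  std::size_t begin = std::min(i * per_shard, range_size);
  std::size_t end = std::min(begin + per_shard, range_size);
  return end - begin;
}

struct Shard {
  int device_id;     // physical device ordinal
  std::size_t size;  // number of elements assigned to this device
};

// device_ids holds physical ordinals (for example {2, 3}); the loop counter i
// is the zero-based index used for splitting the range.
inline std::vector<Shard> SplitRange(const std::vector<int>& device_ids,
                                     std::size_t range_size) {
  std::vector<Shard> shards;
  for (std::size_t i = 0; i < device_ids.size(); ++i) {
    shards.push_back({device_ids[i],
                      BlockShardSize(range_size, device_ids.size(), i)});
  }
  return shards;
}
```

No conversion from device ordinal back to a zero-based index is needed here, which is the reason given above for dropping the lookup in this spot.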

RAMitchell · 2018-11-02T05:59:28Z

src/tree/updater_gpu_hist.cu

@@ -1336,6 +1338,7 @@ class GPUHistMaker : public TreeUpdater {
  common::Monitor monitor_;
  dh::AllReducer reducer_;
  std::vector<ValueConstraint> node_value_constraints_;
+  /*! List storing normalised device index*/


We should be able to remove this.

trivialfis · 2018-11-02T06:17:12Z

@hcho3 I can't address these problems today. I need some more time to completely remove the normalise/unnormalise code. So known issue it is.

hcho3 · 2018-11-02T06:42:46Z

@trivialfis Okay, I'll add it to the known issues. How should I summarize this issue (in a sentence)? How about "gpu_id cannot be set when one GPU is used"?

trivialfis · 2018-11-02T06:55:41Z

"Specifying gpu-id is not yet supported." is more realistic from my perspective.

trivialfis · 2018-11-03T23:51:57Z

Due to #3794 , even if we try to support specifying gpu_id, users still need to have a copy of their dataset for each GPU.

* Remove normalised/unnormalised operations.
* Address difference between `Index' and `Device ID'.
* Modify doc for `gpu_id'.
* Better LOG for GPUSet.
* Check specified n_gpus.
* Remove inappropriate `device_idx' term.

trivialfis · 2018-11-05T05:16:49Z

@RAMitchell Ready for another review now. :)

RAMitchell

PR looks good, please just clarify the usage of GpuIndex.

RAMitchell · 2018-11-06T01:04:26Z

src/common/common.h

+   * Hence, `DeviceId' converts a zero-based index to actual device id,
+   * `Index' converts a device id to a zero-based index.
+   */
+  GpuIndex DeviceId(GpuIndex index) const {


Suggest index is NOT of type GpuIndex. I assume GpuIndex is intended to refer to a physical device?

RAMitchell · 2018-11-06T01:05:16Z

src/common/common.h

+                            << std::endl;
+    return result;
+  }
+  GpuIndex Index(GpuIndex device) const {


As above, maybe this function should return an integer and not GpuIndex? You should be clear about what this type is, and not use it for both of these purposes.

Renamed GpuIndex to GpuIdType, and use it along with size_t to indicate different types of indices.
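The distinction discussed in this review can be sketched as follows. This is an illustrative simplification, not XGBoost's `GPUSet`: `DeviceRange` is a hypothetical class, and the assumption that `GpuIdType` is an alias for `int` is made here only for the example. Device ids use `GpuIdType`, zero-based indices use `std::size_t`, and the sign conversions are made explicit with `static_cast`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Assumption for this sketch: a device id is a plain int ordinal.
using GpuIdType = int;

// Hypothetical contiguous set of devices [start, start + size).
class DeviceRange {
 public:
  DeviceRange(GpuIdType start, std::size_t size) : start_(start), size_(size) {}

  std::size_t Size() const { return size_; }

  // Converts a zero-based index into the actual device id.
  GpuIdType DeviceId(std::size_t index) const {
    if (index >= size_) {
      throw std::out_of_range("index is outside the device set");
    }
    return start_ + static_cast<GpuIdType>(index);
  }

  // Converts a device id into a zero-based index.
  std::size_t Index(GpuIdType device) const {
    if (device < start_ ||
        static_cast<std::size_t>(device - start_) >= size_) {
      throw std::out_of_range("device id is not in the device set");
    }
    return static_cast<std::size_t>(device - start_);
  }

 private:
  GpuIdType start_;
  std::size_t size_;
};
```

For a set starting at device 2 with two devices, `DeviceId(0)` is 2 and `Index(3)` is 1. Keeping the two types apart means the signatures themselves show which kind of value is being passed.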

hcho3 · 2018-11-06T03:53:52Z

@trivialfis I just added a new multi-GPU worker instance (so that we have two instances instead of one), but somehow I broke the old one. Let me reboot it.

trivialfis · 2018-11-06T03:56:00Z

@hcho3 Thanks. :)

hcho3 · 2018-11-06T04:06:51Z

@trivialfis Both workers are now up and running.

trivialfis changed the title ~~Fix gpu_id in transform, add tests.~~ Fix specifying gpu_id, add tests. Nov 2, 2018

trivialfis requested a review from RAMitchell November 2, 2018 00:20

RAMitchell reviewed Nov 2, 2018


trivialfis force-pushed the fix/gpu-id branch 2 times, most recently from 1a389ea to ff826b6 Compare November 5, 2018 03:57

Rewrite gpu_id related code.

f5feeb8


trivialfis force-pushed the fix/gpu-id branch from eec547b to f5feeb8 Compare November 5, 2018 05:09

RAMitchell reviewed Nov 6, 2018


trivialfis added 3 commits November 6, 2018 14:13

Clarify GpuIdType and size_t.

dc8bc2d

Address sign conversion.

18d0795

Use static_cast.

5338448

trivialfis merged commit f1275f5 into dmlc:master Nov 6, 2018

trivialfis mentioned this pull request Nov 6, 2018

gpu_id error in R when it's non-zero(xgboost 0.81.0.1) #3850

Closed

trivialfis deleted the fix/gpu-id branch November 7, 2018 08:48

trivialfis mentioned this pull request Jan 27, 2019

Check failed: n_gpus <= n_available_devices (3 vs. 2) Starting from gpu id: 2, there are only 2 available devices, while n_gpus is set to: 3\ #4084

Closed

lock bot locked as resolved and limited conversation to collaborators Feb 5, 2019
