[BREAKING] prevent multi-gpu usage #4749
Conversation
@@ -25,7 +25,7 @@ def eprint(*args, **kwargs):
 # reduced to fit onto 1 gpu but still be large
 rows3 = 5000  # small
 rows2 = 4360032  # medium
-rows1 = 42360032  # large
+rows1 = 32360032  # large
The original size crashes on my Titan V, so lowering it a bit.
Compare 62f8718 to 2a55290
rest looks good to me...
include/xgboost/generic_parameters.h
Outdated
@@ -40,7 +40,7 @@ struct GenericParameter : public dmlc::Parameter<GenericParameter> {
       .describe("The primary GPU device ordinal.");
   DMLC_DECLARE_FIELD(n_gpus)
       .set_default(0)
-      .set_lower_bound(-1)
+      .set_lower_bound(0)
I'm hoping no one uses the moving target (the head version) in their testing, as this is a change of behavior for their configuration.
Yes, this is a breaking change if you are using multi-GPU.
Please provide some informative messages instead of just setting the range: say that it is removed in 1.0, what the alternative options are, or point to the related doc page. I think the parameter is quite popular despite its limitations. :-)
The parameter description refers to distributed training with one process per GPU. Do we have a canonical document/tutorial for that? Maybe using dask?
We have a demo, but no official document yet, as the interface is not mature.
But sure, we should mention dask at least. I will work on it after sorting out the current PRs, assuming no one else beats me to it.
Happy to point to it when you have it ready. Do you want me to change the wording in the current PR?
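To make the earlier request for an informative message concrete, here is a rough sketch of how the declaration could carry the deprecation note in its description rather than only tightening the bound; the wording and the use of set_range here are illustrative assumptions, not what the PR actually merged:

  DMLC_DECLARE_FIELD(n_gpus)
      .set_default(0)
      .set_range(0, 1)  // illustrative: reject n_gpus > 1 at configuration time
      .describe("Deprecated. Single-process multi-GPU training is removed in 1.0; "
                "for multi-GPU training, use distributed training with one process "
                "per GPU (see the dask demo).");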
Finally got the CI to pass. Please take another look. @RAMitchell @trivialfis |
LGTM, perhaps slightly change the wording of the error message.
src/learner.cc
Outdated
  generic_param_.n_gpus = 1;
}
if (generic_param_.n_gpus != 1) {
  LOG(FATAL) << "Multi-GPU training is no longer supported. "
I think this error message could mislead a user into thinking multi-GPU training is completely removed. Perhaps it is better to say that multi-GPU training using threads is no longer supported, and that this can now be achieved using distributed training instead.
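For illustration, a reworded check along the lines of that suggestion might look like the sketch below; the exact message text is an assumption, not the final wording merged in this PR:

  if (generic_param_.n_gpus != 1) {
    // Fail fast with a pointer to the supported alternative.
    LOG(FATAL) << "Single-process (threaded) multi-GPU training is no longer "
               << "supported. Use distributed training with one process per "
               << "GPU instead, e.g. via dask; see the distributed training "
               << "documentation and demos.";
  }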
Done.
Please mention distributed training with dask. :)
Reworded the messages, PTAL.
@@ -585,8 +585,9 @@ class LearnerImpl : public Learner {
   generic_param_.n_gpus = 1;
 }
 if (generic_param_.n_gpus != 1) {
Not needed for this PR, but after removing n_gpus, the UseGPU function should follow ;-).
@RAMitchell What do you think? Ready to go?
See RFC #4531
Part of 1.0.0 roadmap #4680
@RAMitchell @trivialfis @sriramch