
[BREAKING] prevent multi-gpu usage #4749

Merged: 7 commits merged into dmlc:master on Aug 12, 2019
Conversation

@rongou (Contributor) commented on Aug 7, 2019:

See RFC #4531
Part of the 1.0.0 roadmap #4680

@RAMitchell @trivialfis @sriramch

@@ -25,7 +25,7 @@ def eprint(*args, **kwargs):
 # reduced to fit onto 1 gpu but still be large
 rows3 = 5000  # small
 rows2 = 4360032  # medium
-rows1 = 42360032  # large
+rows1 = 32360032  # large
@rongou (Contributor, Author):

The original size crashes on my Titan V, so lowering it a bit.

@rongou force-pushed the sans-mgpu branch 2 times, most recently from 62f8718 to 2a55290, on August 8, 2019 02:12
@trivialfis self-requested a review on August 8, 2019 02:14
@sriramch (Contributor) left a comment:

Rest looks good to me...

tests/cpp/predictor/test_gpu_predictor.cu (outdated, resolved)
@@ -40,7 +40,7 @@ struct GenericParameter : public dmlc::Parameter<GenericParameter> {
         .describe("The primary GPU device ordinal.");
     DMLC_DECLARE_FIELD(n_gpus)
         .set_default(0)
-        .set_lower_bound(-1)
+        .set_lower_bound(0)
@sriramch (Contributor):

I'm hoping no one uses the moving target (the head version) in their testing, as this is a change in behavior in their configuration.

@rongou (Contributor, Author):

Yes, this is a breaking change if you are using multi-GPU.
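
For context, here is a hedged Python illustration of the user-visible effect. The `xgb.train` call and parameter names are real, but the exact failure behavior is an assumption based on the `set_lower_bound(0)` diff above and the `src/learner.cc` check quoted later in this thread:

```python
# Hedged sketch of the new behavior, not code from this PR. Assumes an
# XGBoost build that includes this change, running on a CUDA-capable machine.
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)
y = np.random.rand(1000)
dtrain = xgb.DMatrix(X, label=y)

# Single-GPU training is unaffected:
xgb.train({"tree_method": "gpu_hist", "n_gpus": 1}, dtrain, num_boost_round=10)

# Previously, n_gpus=-1 meant "use all GPUs". With set_lower_bound(0) it now
# fails parameter validation, and any n_gpus != 1 hits the LOG(FATAL) in
# src/learner.cc, aborting training instead of fanning out across devices.
# xgb.train({"tree_method": "gpu_hist", "n_gpus": 2}, dtrain)  # fatal error
```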

@rongou changed the title from "prevent multi-gpu usage" to "[BREAKING] prevent multi-gpu usage" on Aug 8, 2019
@trivialfis (Member):

Please provide an informative message instead of just setting the range: say that it's removed in 1.0, and give the alternative options or a reference to the related doc page. I think the parameter is quite popular despite its limitations. :-)

@rongou (Contributor, Author) commented on Aug 8, 2019:

The parameter description refers to distributed training with one process per GPU. Do we have a canonical document/tutorial for that? Maybe using dask?

@trivialfis (Member):

A demo, yes, but no official document, as the interface is not yet mature.

@trivialfis (Member):

But sure, we should mention dask at least. I will work on it after sorting out the current PRs, assuming no one else beats me to it.
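
For reference, a minimal sketch of the dask route under discussion, assuming xgboost >= 1.0 (which ships the `xgboost.dask` module) plus `dask` and `dask_cuda` installed; the cluster setup, data shapes, and parameter values here are illustrative and not part of this PR:

```python
# Multi-GPU training via distributed dask workers: one worker process per GPU,
# replacing the removed single-process n_gpus fan-out.
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # starts one worker per visible GPU

if __name__ == "__main__":
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        # Synthetic data; dask spreads the chunks across the GPU workers.
        X = da.random.random((100_000, 20), chunks=(10_000, 20))
        y = da.random.random(100_000, chunks=10_000)

        dtrain = xgb.dask.DaskDMatrix(client, X, y)
        output = xgb.dask.train(
            client,
            {"tree_method": "gpu_hist", "objective": "reg:squarederror"},
            dtrain,
            num_boost_round=50,
        )
        booster = output["booster"]  # output["history"] holds eval results
```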

@rongou (Contributor, Author):

Happy to point to it when you have it ready. Do you want me to change the wording in the current PR?

@rongou (Contributor, Author) commented on Aug 9, 2019:

Finally got the CI to pass. Please take another look. @RAMitchell @trivialfis

@RAMitchell (Member) left a comment:

LGTM, perhaps slightly change the wording of the error message.

src/learner.cc (outdated):

      generic_param_.n_gpus = 1;
    }
    if (generic_param_.n_gpus != 1) {
      LOG(FATAL) << "Multi-GPU training is no longer supported. "
@RAMitchell (Member):

I think this error message could mislead a user into thinking multi-GPU training is completely removed. Perhaps it is better to say that multi-GPU training using threads is no longer supported, and that it can now be achieved using distributed training instead.

@rongou (Contributor, Author):

Done.

@trivialfis (Member):

Please mention distributed training with dask. :)

@rongou (Contributor, Author) commented on Aug 12, 2019:

Reworded the messages, PTAL.

@@ -585,8 +585,9 @@ class LearnerImpl : public Learner {
       generic_param_.n_gpus = 1;
     }
     if (generic_param_.n_gpus != 1) {
@trivialfis (Member):

Not needed for this PR, but after removing n_gpus, the UseGPU function should follow ;-).

@trivialfis (Member):

@RAMitchell What do you think? Ready to go?

@RAMitchell merged commit c5b2296 into dmlc:master on Aug 12, 2019
@rongou deleted the sans-mgpu branch on August 13, 2019
The lock bot locked this conversation as resolved and limited it to collaborators on Nov 11, 2019