This repository has been archived by the owner on Dec 1, 2021. It is now read-only.

Adding --gpu={gpu index} flag to docker option in tutorials. #1212

Merged
merged 10 commits into master from adding_gpu_flag_to_docs
Sep 23, 2020

Conversation

ananno
Contributor

@ananno ananno commented Sep 16, 2020

What this patch does to fix the issue.

Adds the GPU flag to the docker commands in the docs so that the container can use the GPU. Without this, the instructions in the tutorials are incomplete and will not use the GPU.

Link to any relevant issues or pull requests.

@blueoil-butler blueoil-butler bot added the CI: auto-run Run CI automatically label Sep 16, 2020
@bo-code-review-bot

This PR needs Approvals as follows.

  • Ownership Approval for / from iizukak, tkng, ruimashita

Please choose reviewers and request reviews!

Click to see how to approve each review

You can approve this PR with trigger comments as follows.

  • Approve all reviews requested to you (readability and ownership) and LGTM review
    Approval, LGTM

  • Approve all ownership reviews
    Ownership Approval or OA

  • Approve all readability reviews
    Readability Approval or RA

  • Approve specified review targets

    • Example of Ownership Reviewer of /: Ownership Approval for / or OA for /
    • Example of Readability Reviewer of Python: Readability Approval for Python or RA for Python
  • Approve LGTM review
    LGTM

See all trigger comments

Please replace [Target] to review target

  • Ownership Approval
    • Ownership Approval for [Target]
    • OA for [Target]
    • Ownership Approval
    • OA
    • Approval
  • Readability Approval
    • Readability Approval for [Target]
    • RA for [Target]
    • [Target] Readability Approval
    • [Target] RA
    • Readability Approval
    • RA
    • Approval
  • LGTM
    • LGTM
    • lgtm

@CLAassistant

CLAassistant commented Sep 16, 2020

CLA assistant check
All committers have signed the CLA.

@ananno ananno self-assigned this Sep 16, 2020
@ananno ananno requested a review from suttang September 16, 2020 01:47
@ananno ananno requested a review from iizukak September 16, 2020 01:47
@iizukak
Member

iizukak commented Sep 16, 2020

@ananno
Sorry, I don't fully understand. Is CUDA_VISIBLE_DEVICES not enough for now?
If CUDA_VISIBLE_DEVICES is not working and --gpus is available, please remove the description about CUDA_VISIBLE_DEVICES.

@ananno
Contributor Author

ananno commented Sep 16, 2020

> @ananno
> Sorry, I don't fully understand. Is CUDA_VISIBLE_DEVICES not enough for now?
> If CUDA_VISIBLE_DEVICES is not working and --gpus is available, please remove the description about CUDA_VISIBLE_DEVICES.

@iizukak San:
CUDA_VISIBLE_DEVICES is an environment variable that restricts TensorFlow to particular GPU device(s). The `--gpus` flag, on the other hand, tells Docker how many GPU device(s) the container may use. CUDA_VISIBLE_DEVICES alone is not enough: without `--gpus`, Docker does not initialize CUDA at all.

I've tested different combinations of these two parameters. The GPU list is first filtered by CUDA_VISIBLE_DEVICES to select the allowed devices; Docker then selects GPU(s) from that filtered list according to `--gpus={count}`. Check the following combinations of GPU index used on lmsrv05 with 2 TitanX GPUs:

| CUDA_VISIBLE_DEVICES | --gpus | Training on GPU |
|----------------------|--------|-----------------|
| 0                    | 0      | Success         |
| 1                    | 0      | Success         |
| 0                    | 1      | Failed          |
| 1                    | 1      | Failed          |
| 0,1                  | 0      | Success         |
| 0,1                  | 1      | Success         |
| 0,1                  | 2      | Success         |
| Not provided         | 0      | Success         |
| Not provided         | 1      | Success         |
| Not provided         | 2      | Success         |

Not providing CUDA_VISIBLE_DEVICES means all available GPUs are visible.
`--gpus=0` means all available GPUs filtered by CUDA_VISIBLE_DEVICES.
Providing the number of GPUs in the `--gpus={}` parameter is required.

So it seems the GPU list is first filtered by CUDA_VISIBLE_DEVICES, and Docker then uses the given number of GPUs from that filtered list.
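As a concrete sketch of one "Success" row from the table (CUDA_VISIBLE_DEVICES not provided, `--gpus=2`), the tutorial's docker command would look roughly like this; the mounts, image name, and `{TAG}` follow the tutorial's existing pattern and are illustrative here:

```shell
# Sketch of one "Success" row: CUDA_VISIBLE_DEVICES is unset, so all host
# GPUs are visible, and Docker exposes 2 of them to the container.
# Mounts/image follow the tutorial's pattern; replace {TAG} as usual.
docker run --rm \
  --gpus=2 \
  -v $(pwd)/cifar:/home/blueoil/cifar \
  -v $(pwd)/config:/home/blueoil/config \
  -v $(pwd)/saved:/home/blueoil/saved \
  blueoil_$(id -un):{TAG} \
  blueoil train -c config/cifar10_test.py
```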

@suttang suttang requested a review from hadusam September 16, 2020 03:33
@suttang
Contributor

suttang commented Sep 16, 2020

@hadusam As I am unable to judge the validity of this content, could you please confirm it when you have time?

@joelN123
Contributor

joelN123 commented Sep 16, 2020

Oh, it's interesting. So it seems that --gpus=0 together with not providing CUDA_VISIBLE_DEVICES will mean that gpu1 is left free? So I think that's what @iizukak means: maybe the CUDA_VISIBLE_DEVICES option is actually not necessary?
p.s. maybe the --gpus flag is the newer version of --runtime=nvidia?
edit: oh, maybe --gpus device=0,1 is closer to the CUDA_VISIBLE_DEVICES="0,1" usage

@iizukak
Member

iizukak commented Sep 16, 2020

@ananno
I'm confused...
Why did training on the GPU succeed with --gpus=0? Doesn't 0 mean don't use any GPU?
And why did it fail with the --gpus=1 and CUDA_VISIBLE_DEVICES=0 or 1 pattern?

@ananno
Contributor Author

ananno commented Sep 16, 2020

> oh, it's interesting. So, it seems like --gpus=0 and not providing CUDA_VISIBLE_DEVICES will mean that gpu1 will be left free? So, I think that's what @iizukak means, that maybe the CUDA_VISIBLE_DEVICES option is actually not necessary?
> p.s. the --gpus flag is the newer version of --runtime=nvidia maybe?
> edit: oh, maybe --gpus device=0,1 is more similar to the CUDA_VISIBLE_DEVICES="0,1" usage

@joelN123 San @iizukak San:
I've tested it: --gpus=0 means all available GPUs. In this test I didn't use CUDA_VISIBLE_DEVICES, which by default exposes all available GPUs to TensorFlow. With --gpus=0, Docker runs training on all available GPUs; the effect is the same as not using CUDA_VISIBLE_DEVICES.

And to answer your question: if you do not use CUDA_VISIBLE_DEVICES on a multi-GPU server, a TensorFlow process is created on all the GPUs, each holding ~150MB of GPU memory. To avoid this and restrict the TensorFlow process to one or more specific GPUs, you have to use CUDA_VISIBLE_DEVICES. If you don't want to use the CUDA_VISIBLE_DEVICES env variable, you can use --gpus device=1 to select specific devices and restrict the TensorFlow process to GPU 1 only.
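The `--gpus device=…` alternative described above can be sketched like this (device index 1 is just an example; as noted elsewhere in this thread, some Docker versions need extra quoting, e.g. `--gpus '"device=1"'`):

```shell
# Expose only host GPU 1 to the container; no CUDA_VISIBLE_DEVICES needed,
# so TensorFlow inside the container sees exactly one GPU.
# Mounts/image follow the tutorial's pattern; replace {TAG} as usual.
docker run --rm \
  --gpus device=1 \
  -v $(pwd)/cifar:/home/blueoil/cifar \
  -v $(pwd)/config:/home/blueoil/config \
  -v $(pwd)/saved:/home/blueoil/saved \
  blueoil_$(id -un):{TAG} \
  blueoil train -c config/cifar10_test.py
```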

@joelN123
Contributor

joelN123 commented Sep 16, 2020

understood ! thank you. In that case, it might be good to have --gpus device=<gpu indices> in the documentation, and then can remove the CUDA_VISIBLE_DEVICES environment variable from the documentation? Maybe it's down to preference. Anyway, I think that is what I will be doing from now on, when running docker on gpu.

@ananno
Contributor Author

ananno commented Sep 16, 2020

> understood ! thank you. In that case, it might be good to have --gpus device=<gpu indices> in the documentation, and then can remove the CUDA_VISIBLE_DEVICES environment variable from the documentation? Maybe it's down to preference. Anyway, I think that is what I will be doing from now on, when running docker on gpu.

@joelN123 San:
I'm doing the same. I agree: let's remove CUDA_VISIBLE_DEVICES altogether and use the `--gpus device=<gpu indices>` option.

Contributor

@hadusam hadusam left a comment


@ananno
Thanks!
It looks good to me to use `--gpus` with `device=` instead of CUDA_VISIBLE_DEVICES.
I left some comments, please check them!

Comment on lines 119 to 120
> For old docker versions use `--runtime=nvidia` with `-e CUDA_VISIBLE_DEVICES=<gpu indices>` environment variable
> instead of `--gpu device=<gpu indices>`.
Contributor

Thanks, but I think we don't need these lines because our supported docker version is Docker >= 19.03.
https://docs.blue-oil.org/install/install.html#prerequisites

Contributor Author

OK. I'll remove it. Thanks for the review.

-v $(pwd)/cifar:/home/blueoil/cifar \
-v $(pwd)/config:/home/blueoil/config \
-v $(pwd)/saved:/home/blueoil/saved \
blueoil_$(id -un):{TAG} \
blueoil train -c config/cifar10_test.py

Just like init, set the value of `{TAG}` to the value obtained by `docker images`.
Change the value of `CUDA_VISIBLE_DEVICES` according to your environment.
Change the value of `--gpu device=<gpu indices>` according to your environment.
Contributor

there are some typos: --gpu -> --gpus, please check them. 🙇

Contributor Author

Done


@@ -106,15 +106,18 @@ If configuration finishes, the configuration file is generated in the `cifar10_t
Train your model by running `blueoil train` with model configuration.

$ docker run --rm \
-e CUDA_VISIBLE_DEVICES=0 \
--gpus device=0 \
Contributor

In my environment (Docker v19.03.7), this doesn't work.
I think we need to use --gpus '"device=0"'. It's not beautiful though... 😭
See: NVIDIA/nvidia-docker#1257
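For reference, the extra quoting keeps the inner double quotes intact so Docker receives the literal string `"device=0"` instead of a bare count; a sketch of the resulting command (image/tag follow the tutorial's pattern):

```shell
# The outer single quotes protect the inner double quotes from the shell;
# Docker then parses "device=0" as a device request rather than a GPU count.
docker run --rm \
  --gpus '"device=0"' \
  blueoil_$(id -un):{TAG} \
  blueoil train -c config/cifar10_test.py
```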

Contributor Author

@ananno ananno Sep 16, 2020

OK. I'm changing it.

Contributor Author

Done

@ananno ananno requested a review from hadusam September 16, 2020 04:53
@suttang suttang removed their request for review September 16, 2020 04:55
Contributor

@hadusam hadusam left a comment

@ananno
LGTM!

Could you please fix the title of this PR to use the correct option? The title will be used as the commit message of this PR.
Thanks for the nice PR!

Member

@iizukak iizukak left a comment

@ananno @joelN123
Thanks. Now I understand.
OA

@iizukak
Member

iizukak commented Sep 23, 2020

/ready

@bo-mergebot
Contributor

bo-mergebot bot commented Sep 23, 2020

⏳Merge job is queued...

@bo-mergebot bo-mergebot bot merged commit 2c78ec5 into master Sep 23, 2020
@bo-mergebot bo-mergebot bot deleted the adding_gpu_flag_to_docs branch September 23, 2020 01:34