
Questions about multi-GPU training #2

Open
xhh232018 opened this issue Jul 5, 2018 · 13 comments

Comments


xhh232018 commented Jul 5, 2018

Hi, since training takes quite a long time, how can I use keras.utils.multi_gpu_model?

@TrinhQuocNguyen

Hello xhh232018,
Have you successfully trained the network?

@xhh232018 (Author)

Hello TrinhQuoc,
I have emailed you my latest training results.

@TrinhQuocNguyen

Hi xhh232018,
Thank you. I'm currently testing it and modifying the source for my own masks 😄. It's running, but it's going to take some time to retrain the networks and validate them.

@NerminSalem

@TrinhQuocNguyen how did you train your own?


Mendel1 commented Sep 14, 2018

@xhh232018 I am trying to use multiple GPUs for training too. However, it seems to consume a lot of CPU. Did that occur in your training procedure? Also, I find it consumes much more memory on the first GPU, which makes it hard to fully use all GPUs. Could you give me any advice on that? Thank you a lot.

@Mistariano

> @xhh232018 I am trying to use multiple GPUs for training too. However, it seems to consume a lot of CPU. Did that occur in your training procedure? Also, I find it consumes much more memory on the first GPU, which makes it hard to fully use all GPUs. Could you give me any advice on that? Thank you a lot.

I met the same problem. Have you guys found the way to deal with it?


ZDD2009 commented Feb 15, 2019

@xhh232018 I am trying to use multiple GPUs for training, could you give me any advice on that? Also, when I trained on my own dataset, a ZeroDivisionError appeared along with "Found 0 images belonging to 0 classes." How can I solve this problem? I need your help, thank you!


ZDD2009 commented Feb 15, 2019

@Mistariano Have you solved your problem? I have the same problem. Thank you, I need your help!


Mistariano commented Feb 15, 2019

@ZDD2009 I tried to build the models on my CPU first and then used multi_gpu_model to create parallel models on GPUs, and it worked.

My code looks like this:

# original single-GPU version:

# model = build_model()
# model.compile(...)
# model.fit(...)

########################

# multi-gpu version
import tensorflow as tf
from keras.utils import multi_gpu_model

# build the template model on the CPU so its weights live in host memory
with tf.device('/cpu:0'):
    model = build_model()

# replicate the template model onto the available GPUs
parallel_model = multi_gpu_model(model)

parallel_model.compile(...)
parallel_model.fit(...)  # compile & fit the parallel model, so it can be trained on multiple gpus

model.save(...)  # and save the template model, not the parallel wrapper

You can perform this trick on both pconv_model and vgg. It does speed up training.

However, the first GPU still used much more memory than the others after I did that.
I set log_device_placement=True and analyzed the log, and found that model.compile placed its ops on just one GPU, so all of the losses were computed on /gpu:0.

I have no idea how to deal with the problem.
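For anyone who wants to reproduce that diagnosis, device placement logging can be enabled like this in TF1-era Keras (a configuration sketch; the exact session setup may differ in your environment):

```python
import tensorflow as tf
from keras import backend as K

# log which device each op lands on, to see where the losses are computed
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True  # don't grab all GPU memory up front
K.set_session(tf.Session(config=config))
```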


ZDD2009 commented Feb 15, 2019

@Mistariano thank you very much!
Did you get the error "Found 0 images belonging to 0 classes." when training on your own dataset? Many thanks, I need your help!

@MathiasGruber (Owner)

I've also been playing with multi-GPU implementation, but I've not been able to see any successful speedups. Seems like the VGG loss evaluations always happen on the first GPU, and so it doesn't scale well. If anyone figures out a solution for this, it'd be awesome.

@jiguanglu

> I've also been playing with multi-GPU implementation, but I've not been able to see any successful speedups. Seems like the VGG loss evaluations always happen on the first GPU, and so it doesn't scale well. If anyone figures out a solution for this, it'd be awesome.
In the pconv_model.py file, on line 22, change the gpus argument to the number of GPUs you have (e.g. gpus=8):

def __init__(self, img_rows=512, img_cols=512, vgg_weights="imagenet", inference_only=False, net_name='default', gpus=8, vgg_device=None):


ghost commented Aug 20, 2019

> @xhh232018 I am trying to use multiple GPUs for training, could you give me any advice on that? Also, when I trained on my own dataset, a ZeroDivisionError appeared along with "Found 0 images belonging to 0 classes." How can I solve this problem? I need your help, thank you!

Here is a solution.
The ZeroDivisionError is caused by a wrong path: the generator then reports "Found 0 images belonging to 0 classes." If you train on your own dataset or on ImageNet, make sure the generator can actually pick up images inside the directory you point it at.
For example code, check the "# Pick out an example" code line in notebooks/Step4, which calls next(train/val/test_generator).
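To illustrate the point (a standard-library sketch; the directory and file names are made up): Keras' flow_from_directory treats each immediate subdirectory as a class, so images placed directly in the root directory yield 0 classes and 0 images, which leads to the ZeroDivisionError.

```python
import os
import tempfile

def count_classes_and_images(root):
    """Mimic how flow_from_directory counts classes (immediate
    subdirectories) and images (files inside those subdirectories)."""
    classes = [d for d in sorted(os.listdir(root))
               if os.path.isdir(os.path.join(root, d))]
    n_images = sum(len(os.listdir(os.path.join(root, c))) for c in classes)
    return len(classes), n_images

root = tempfile.mkdtemp()

# Wrong layout: images directly under root -> 0 classes, 0 images found
open(os.path.join(root, "img1.jpg"), "w").close()
print(count_classes_and_images(root))  # (0, 0)

# Correct layout: root/<class_name>/<images>
os.mkdir(os.path.join(root, "class_a"))
open(os.path.join(root, "class_a", "img2.jpg"), "w").close()
print(count_classes_and_images(root))  # (1, 1)
```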

Best regards,
