
Questions about multi-GPU training #2

Open
xhh232018 opened this issue Jul 5, 2018 · 13 comments

Comments


xhh232018 commented Jul 5, 2018

Hi, since training takes quite a long time, how can I use keras.utils.multi_gpu_model?

@TrinhQuocNguyen

Hello xhh232018,
Have you successfully trained the network?

@xhh232018 (Author)

Hello TrinhQuoc,
I have emailed you my latest training results.

@TrinhQuocNguyen

Hi xhh232018,
Thank you. I'm currently testing it and modifying the source for my own masks 😄. It's running, but it's going to take some time to retrain the networks and validate them.

@NerminSalem

@TrinhQuocNguyen how did you train your own?


Mendel1 commented Sep 14, 2018

@xhh232018 I am trying to use multiple GPUs for training too. However, it seems to consume a lot of CPU. Did that occur in your training procedure? Also, I find it consumes much more memory on the first GPU, which makes it hard to fully use all GPUs. Could you give me any advice on that? Thank you a lot.

@Mistariano

> @xhh232018 I am trying to use multiple GPUs for training too. However, it seems to consume a lot of CPU. Did that occur in your training procedure? Also, I find it consumes much more memory on the first GPU, which makes it hard to fully use all GPUs. Could you give me any advice on that? Thank you a lot.

I met the same problem. Have you guys found the way to deal with it?


ZDD2009 commented Feb 15, 2019

@xhh232018 I am trying to use multiple GPUs for training, could you give me any advice on that? Also, when I trained on my own dataset, a ZeroDivisionError appeared along with "Found 0 images belonging to 0 classes." How can I solve this problem? I need your help, thank you!


ZDD2009 commented Feb 15, 2019

@Mistariano Have you solved your problem? I have the same problem. Thank you, I need your help!


Mistariano commented Feb 15, 2019

@ZDD2009 I tried to build the models on my CPU first and then used multi_gpu_model to create parallel models on GPUs, and it worked.

My code looks like this:

# original single-GPU version:

# model = build_model()
# model.compile(...)
# model.fit(...)

########################

# multi-gpu version
import tensorflow as tf
from keras.utils import multi_gpu_model

# build the template model on the CPU so its weights live in host memory
with tf.device('/cpu:0'):
    model = build_model()

# replicate the template model onto the available GPUs
parallel_model = multi_gpu_model(model)

parallel_model.compile(...)
parallel_model.fit(...)  # compile & fit the parallel model, so it can be trained on multiple gpus

model.save(...)  # and save the template model, not the parallel wrapper

You can perform this trick on both pconv_model and vgg. It does speed up training.

However, the first GPU still used much more memory than the others after I did that.
I set log_device_placement=True and analyzed the log, and found that model.compile placed its ops on just one GPU, so all of the losses were computed on /gpu:0.

I have no idea how to deal with the problem.
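For anyone who wants to reproduce that diagnosis, device placement logging can be enabled like this in TF1-era Keras (a configuration sketch; the exact session setup may differ in your environment):

```python
import tensorflow as tf
from keras import backend as K

# log which device each op lands on, to see where the losses are computed
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True  # don't grab all GPU memory up front
K.set_session(tf.Session(config=config))
```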


ZDD2009 commented Feb 15, 2019

@Mistariano thank you very much!
Did you get the error "Found 0 images belonging to 0 classes." when training on your own dataset? Many thanks, I need your help!

@MathiasGruber (Owner)

I've also been playing with multi-GPU implementation, but I've not been able to see any successful speedups. Seems like the VGG loss evaluations always happen on the first GPU, and so it doesn't scale well. If anyone figures out a solution for this, it'd be awesome.

@jiguanglu

> I've also been playing with multi-GPU implementation, but I've not been able to see any successful speedups. Seems like the VGG loss evaluations always happen on the first GPU, and so it doesn't scale well. If anyone figures out a solution for this, it'd be awesome.
In the pconv_model.py file, on line 22, change the gpus argument to the number of GPUs you have (e.g. gpus=8):

def __init__(self, img_rows=512, img_cols=512, vgg_weights="imagenet", inference_only=False, net_name='default', gpus=8, vgg_device=None):


ghost commented Aug 20, 2019

> @xhh232018 I am trying to use multiple GPUs for training, could you give me any advice on that? Also, when I trained on my own dataset, a ZeroDivisionError appeared along with "Found 0 images belonging to 0 classes." How can I solve this problem? I need your help, thank you!

Here is a solution.
The ZeroDivisionError is caused by a wrong path: the generator then reports "Found 0 images belonging to 0 classes." If you train on your own dataset or on ImageNet, make sure the generator can actually pick up images inside the directory you point it at.
For example code, check the "# Pick out an example" code line in notebooks/Step4, which calls next(train/val/test_generator).
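To illustrate the point (a standard-library sketch; the directory and file names are made up): Keras' flow_from_directory treats each immediate subdirectory as a class, so images placed directly in the root directory yield 0 classes and 0 images, which leads to the ZeroDivisionError.

```python
import os
import tempfile

def count_classes_and_images(root):
    """Mimic how flow_from_directory counts classes (immediate
    subdirectories) and images (files inside those subdirectories)."""
    classes = [d for d in sorted(os.listdir(root))
               if os.path.isdir(os.path.join(root, d))]
    n_images = sum(len(os.listdir(os.path.join(root, c))) for c in classes)
    return len(classes), n_images

root = tempfile.mkdtemp()

# Wrong layout: images directly under root -> 0 classes, 0 images found
open(os.path.join(root, "img1.jpg"), "w").close()
print(count_classes_and_images(root))  # (0, 0)

# Correct layout: root/<class_name>/<images>
os.mkdir(os.path.join(root, "class_a"))
open(os.path.join(root, "class_a", "img2.jpg"), "w").close()
print(count_classes_and_images(root))  # (1, 1)
```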

Best regards,
