
inference speed drop after updating mxnet from 0.10.0 to 1.0.0 #9396

Closed
nicklhy opened this issue Jan 12, 2018 · 27 comments


@nicklhy
Contributor

nicklhy commented Jan 12, 2018

Hi, I just updated MXNet today from 0.10.0 to 1.0.0 in order to use some new features. Both versions were installed with pip, e.g. pip3 install mxnet-cu80==1.0.0. However, after a detailed benchmark, I observed a significant speed drop when running ResNet inference, especially when the batch size is small. The results for ResNet152 are shown below (the network json file is downloaded from here):

MXNet version: 0.10.0
########################################################
speed test for batch size: 1
	avg forward speed: 85.146210 samples/s
	avg forward time: mean = 0.011743 s, std = 0.000341 s
########################################################
speed test for batch size: 4
	avg forward speed: 215.284806 samples/s
	avg forward time: mean = 0.018579 s, std = 0.000175 s
########################################################
speed test for batch size: 16
	avg forward speed: 297.233244 samples/s
	avg forward time: mean = 0.053827 s, std = 0.001030 s
########################################################
speed test for batch size: 64
	avg forward speed: 316.399717 samples/s
	avg forward time: mean = 0.202272 s, std = 0.001754 s
########################################################
speed test for batch size: 128
	avg forward speed: 321.620336 samples/s
	avg forward time: mean = 0.397981 s, std = 0.002363 s
MXNet version: 1.0.0
########################################################
speed test for batch size: 1
	avg forward speed: 67.866811 samples/s
	avg forward time: mean = 0.014733 s, std = 0.000391 s
########################################################
speed test for batch size: 4
	avg forward speed: 188.020417 samples/s
	avg forward time: mean = 0.021272 s, std = 0.000563 s
########################################################
speed test for batch size: 16
	avg forward speed: 286.253890 samples/s
	avg forward time: mean = 0.055892 s, std = 0.000565 s
########################################################
speed test for batch size: 64
	avg forward speed: 310.045353 samples/s
	avg forward time: mean = 0.206418 s, std = 0.004860 s
########################################################
speed test for batch size: 128
	avg forward speed: 320.566647 samples/s
	avg forward time: mean = 0.399289 s, std = 0.002575 s

PS: I noticed that when the batch size is small (e.g. batch_size=1), GPU utilization is 95~100% with mxnet 0.10.0 but only 80~83% with mxnet 1.0.0, which means the GPU is not being fully utilized.

Software env: Ubuntu 16.04, Python 3.5, CUDA 8.0, CUDNN 5.1.
GPU: GTX 1080 Ti.

I also tested on a server with a Titan XP and got a similar result. The speed test script is pasted below:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


import time
import argparse
import mxnet as mx
import numpy as np


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='speed test')
    parser.add_argument('--net', type=str, required=True,
                        help='network symbol json file')
    parser.add_argument('--size', type=int, default=224,
                        help='image size')
    parser.add_argument('--n-batch', type=int, default=100,
                        help='batch number for test')
    parser.add_argument('--gpu', type=int, default=0,
                        help='gpu device id')
    args = parser.parse_args()

    print('MXNet version: %s' % mx.__version__)

    ctx = mx.gpu(args.gpu)

    batch_size_list = [1, 4, 16, 64, 128]

    mod = mx.mod.Module(symbol=mx.sym.load(args.net),
                        context=ctx,
                        data_names=['data', ],
                        label_names=['softmax_label', ])
    mod.bind(data_shapes=[('data', (1, 3, args.size, args.size))],
             label_shapes=[('softmax_label', (1,))],
             for_training=False)
    mod.init_params(initializer=mx.init.Normal())

    for batch_size in batch_size_list:
        print('########################################################')
        print('speed test for batch size: %d' % batch_size)
        mod.reshape(data_shapes=[('data', (batch_size,
                                           3,
                                           args.size,
                                           args.size))],
                    label_shapes=[('softmax_label', (batch_size,))])

        #  pre-allocate
        batch_data = mx.nd.random_normal(0, 0.5, (batch_size, 3, args.size, args.size), ctx=ctx)

        #  warm up GPU
        for _ in range(50):
            mod.forward(mx.io.DataBatch(data=[batch_data, ],
                                        label=None),
                        is_train=False)
            out = mod.get_outputs()[0].asnumpy()

        t_start = time.time()
        t_fwd_list = []
        for _ in range(args.n_batch):
            t1 = time.time()
            mod.forward(mx.io.DataBatch(data=[batch_data, ],
                                        label=None),
                        is_train=False)
            out = mod.get_outputs()[0]
            out.wait_to_read()
            t2 = time.time()
            t_fwd_list.append(t2-t1)
        t_end = time.time()
        n_samples = args.n_batch*batch_size

        print('\tavg forward speed: %f samples/s' % (n_samples/(t_end-t_start)))
        print('\tavg forward time: mean = %f s, std = %f s' %
              (np.mean(t_fwd_list), np.std(t_fwd_list)))
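
(For reference, the script above would presumably be invoked along the lines of python3 speed_test.py --net resnet-152-symbol.json --gpu 0; the file name speed_test.py is hypothetical here, and --net points at the downloaded symbol json.)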
nicklhy changed the title from "speed drop after updating mxnet from 0.10.0 to 1.0.0" to "inference speed drop after updating mxnet from 0.10.0 to 1.0.0" on Jan 12, 2018
@eric-haibin-lin
Member

might be related to #9055

@nicklhy
Contributor Author

nicklhy commented Jan 13, 2018

@eric-haibin-lin Great to see this work. I will test the performance after it is merged.

@lupesko
Contributor

lupesko commented Jan 16, 2018

@nicklhy #9055 was merged. Can you please build the latest and re-test?

@eric-haibin-lin can you please label:

  • Pending Requester Info
  • Performance

@nicklhy
Contributor Author

nicklhy commented Jan 16, 2018

@lupesko Hi, I just ran the test again and the results are listed below:

MXNet version: 0.10.0
########################################################
speed test for batch size: 1
	avg forward speed: 84.061132 samples/s
	avg forward time: mean = 0.011895 s, std = 0.000187 s
########################################################
speed test for batch size: 4
	avg forward speed: 208.959420 samples/s
	avg forward time: mean = 0.019141 s, std = 0.000436 s
########################################################
speed test for batch size: 16
	avg forward speed: 284.647810 samples/s
	avg forward time: mean = 0.056207 s, std = 0.001876 s
########################################################
speed test for batch size: 64
	avg forward speed: 299.845795 samples/s
	avg forward time: mean = 0.213439 s, std = 0.004175 s
########################################################
speed test for batch size: 128
	avg forward speed: 306.651349 samples/s
	avg forward time: mean = 0.417408 s, std = 0.008048 s
MXNet version: 1.0.1
########################################################
speed test for batch size: 1
	avg forward speed: 76.590411 samples/s
	avg forward time: mean = 0.013055 s, std = 0.000209 s
########################################################
speed test for batch size: 4
	avg forward speed: 197.583762 samples/s
	avg forward time: mean = 0.020243 s, std = 0.000270 s
########################################################
speed test for batch size: 16
	avg forward speed: 281.058522 samples/s
	avg forward time: mean = 0.056925 s, std = 0.000418 s
########################################################
speed test for batch size: 64
	avg forward speed: 300.547282 samples/s
	avg forward time: mean = 0.212941 s, std = 0.002727 s
########################################################
speed test for batch size: 128
	avg forward speed: 309.225150 samples/s
	avg forward time: mean = 0.413934 s, std = 0.001367 s

As I mentioned here, the inference speed of ResNet152 did increase a lot when the batch size is small, but it is still a bit lower than with 0.10.0. GPU utilization is 100% in 0.10.0, 83% in 1.0.0, and 95% in 1.0.1, respectively. By analyzing the CUDA API calls, I noticed that the average cost of cudaStreamSynchronize in mxnet 1.0.1 is a bit larger than in 0.10.0.

# mxnet 0.10.0
# GPU usage: 99~100%
==21421== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 52.04%  7.76526s      3691  2.1038ms  1.4510us  9.5652ms  cudaStreamSynchronize
 21.31%  3.17943s    659205  4.8230us  3.3450us  1.2354ms  cudaLaunch
 11.92%  1.77868s         4  444.67ms  60.328us  889.60ms  cudaStreamCreate

# mxnet 1.0.1
# GPU usage: ~95%
==11655== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 53.84%  8.38694s      3695  2.2698ms  1.5780us  12.897ms  cudaStreamSynchronize
 20.01%  3.11738s    661308  4.7130us  3.3310us  2.6520ms  cudaLaunch
  7.98%  1.24364s        16  77.727ms  8.0110us  620.95ms  cudaStreamCreateWithFlags
  4.59%  715.63ms       292  2.4508ms     346ns  176.42ms  cudaFree
  4.52%  703.51ms       787  893.91us  160.25us  537.44ms  cudaMemGetInfo

When the batch size is large enough (e.g. 64 or 128), the new mxnet's performance looks good.

In conclusion, #9055 fixed a large part of this problem, but I guess there still exist some other potential "bugs" that we should work on.
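
Regarding the cudaStreamSynchronize observation above, here is a minimal sketch (not part of the original post) of separating the asynchronous forward launch time from the synchronization time with the same Module API; it assumes the mod and batch_data objects prepared in the benchmark script earlier in the thread:

# Hedged sketch: split per-batch time into "launch" (enqueue work on the
# async engine) and "sync" (wait for the GPU work to finish), assuming
# `mod` and `batch_data` are set up as in the benchmark script above.
import time
import mxnet as mx
import numpy as np

def timed_forward(mod, batch_data, n_iter=100):
    launch, sync = [], []
    for _ in range(n_iter):
        t0 = time.time()
        mod.forward(mx.io.DataBatch(data=[batch_data], label=None), is_train=False)
        t1 = time.time()                     # forward() mostly just enqueues work
        mod.get_outputs()[0].wait_to_read()  # blocks on the underlying stream sync
        t2 = time.time()
        launch.append(t1 - t0)
        sync.append(t2 - t1)
    return np.mean(launch), np.mean(sync)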

@eric-haibin-lin
Member

Hi @nicklhy what network are you using? Does it include any custom operators?

@nicklhy
Contributor Author

nicklhy commented Jan 17, 2018

@eric-haibin-lin The above results are tested with a ResNet152 json file downloaded from http://data.mxnet.io/models/imagenet/resnet/152-layers/resnet-152-symbol.json.

BTW, I also tested ResNet101 and got a similar result:

MXNet version: 0.10.0
########################################################
speed test for batch size: 1
	avg forward speed: 116.469000 samples/s
	avg forward time: mean = 0.008585 s, std = 0.000308 s
########################################################
speed test for batch size: 4
	avg forward speed: 293.537501 samples/s
	avg forward time: mean = 0.013625 s, std = 0.000497 s
########################################################
speed test for batch size: 16
	avg forward speed: 407.152972 samples/s
	avg forward time: mean = 0.039295 s, std = 0.000929 s
########################################################
speed test for batch size: 64
	avg forward speed: 432.887962 samples/s
	avg forward time: mean = 0.147841 s, std = 0.001603 s
########################################################
speed test for batch size: 128
	avg forward speed: 443.354422 samples/s
	avg forward time: mean = 0.288704 s, std = 0.003057 s
MXNet version: 1.0.1
########################################################
speed test for batch size: 1
	avg forward speed: 109.414104 samples/s
	avg forward time: mean = 0.009137 s, std = 0.000553 s
########################################################
speed test for batch size: 4
	avg forward speed: 288.660234 samples/s
	avg forward time: mean = 0.013856 s, std = 0.000231 s
########################################################
speed test for batch size: 16
	avg forward speed: 412.575750 samples/s
	avg forward time: mean = 0.038778 s, std = 0.001016 s
########################################################
speed test for batch size: 64
	avg forward speed: 447.079919 samples/s
	avg forward time: mean = 0.143147 s, std = 0.001947 s
########################################################
speed test for batch size: 128
	avg forward speed: 458.120138 samples/s
	avg forward time: mean = 0.279399 s, std = 0.004146 s

@cjolivier01
Member

Can you try setting OMP_NUM_THREADS=1 in the environment and running the test again? Does it speed up?

@cjolivier01
Member

(set before running python)
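
For reference, a minimal sketch (not from the original thread) of pinning the OpenMP thread count from inside the script rather than the shell; it only has the intended effect if the variable is set before mxnet is imported:

# Hedged sketch: the in-script equivalent of `OMP_NUM_THREADS=1 python3 speed_test.py ...`.
# Must run before the first `import mxnet`, otherwise the OpenMP runtime
# may already have chosen its thread count.
import os
os.environ['OMP_NUM_THREADS'] = '1'

import mxnet as mx
print(mx.__version__)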

@nicklhy
Contributor Author

nicklhy commented Jan 17, 2018

@cjolivier01 Same result after setting it. The GPU usage is still 93~95% when testing ResNet101 or ResNet152 (batch_size=1). However, mxnet 0.10.0 can easily reach 99~100%.

Note that there are no disk I/O or image pre-processing operations in my speed test script. The bottleneck should be on the GPU side, or more specifically, in cudaStreamSynchronize. I guess OMP_NUM_THREADS won't solve this problem?

@cjolivier01
Member

Apparently not -- I was checking whether something CPU-related was the bottleneck, but it seems not.

@eric-haibin-lin
Member

@nicklhy What happens if you call mx.nd.waitall() before all the forward calls, remove output.wait_to_read(), and call mx.nd.waitall() after all 100 batches are done?
I don't see any reason why streamSync should become more expensive ...
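
A minimal sketch of the suggested timing variant (an assumption of how it might look, reusing the mod, batch_data, args, and batch_size objects from the benchmark script earlier in the thread):

# Hedged sketch of the suggested measurement: drain the engine once before
# the loop, skip per-batch wait_to_read(), and synchronize a single time at
# the end with mx.nd.waitall().
import time
import mxnet as mx

mx.nd.waitall()                  # make sure all warm-up work has finished
t_start = time.time()
for _ in range(args.n_batch):
    mod.forward(mx.io.DataBatch(data=[batch_data], label=None), is_train=False)
mx.nd.waitall()                  # single synchronization after all batches
t_end = time.time()
print('avg forward speed: %f samples/s' % (args.n_batch * batch_size / (t_end - t_start)))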

@nicklhy
Contributor Author

nicklhy commented Jan 22, 2018

@eric-haibin-lin , Still the same result. There is always a small speed gap between mxnet 0.10.0 and the current version.

@eric-haibin-lin
Member

@nicklhy I bisected the changes between 0.10.0 and 1.0 and found the following on a p2.xlarge (K80) instance. Each commit was patched with the fix from PR 9055:

git checkout xxx
git submodule update --recursive
git cherry-pick 9cc8ea3be23fb7adf4630e4cf065a2473094fbc8 -X theirs
make

and below are the results:

ff21e1f Changed FullyConnected to use new linalg gemm, plus TensorCore if fp16 I/O. (#7505)

speed test for batch size: 1
        avg forward speed: 24.484983 samples/s
        avg forward time: mean = 0.040839 s, std = 0.000095 s
56eae58 Fixed Makefile so a null CUDA_ARCH is treated like an unset one. (#7515) - Fast

########################################################
speed test for batch size: 1
        avg forward speed: 25.461191 samples/s
        avg forward time: mean = 0.039270 s, std = 0.000095 s
########################################################

Looks like the commit ff21e1f caused the 4% slowdown during inference.

@DickJC123 were you aware of this?

@DickJC123
Contributor

DickJC123 commented Mar 15, 2018 via email

@eric-haibin-lin
Member

Let me try v100

@eric-haibin-lin
Member

Hi @DickJC123

ff21e1f with PR 9055 fix:

########################################################
speed test for batch size: 1
        avg forward speed: 58.503896 samples/s
        avg forward time: mean = 0.017091 s, std = 0.000046 s
########################################################

56eae58 with PR 9055 fix:

########################################################
speed test for batch size: 1
        avg forward speed: 62.070470 samples/s
        avg forward time: mean = 0.016109 s, std = 0.000014 s
########################################################

Looks like it is also slower on V100... @DickJC123 what should be the next step?

@eric-haibin-lin
Member

Hi @DickJC123, could you share some benchmark results for the updated FC operator in ff21e1f? That would be helpful for deciding how to fix the performance degradation for ResNet, or whether to revert the change.

@DickJC123
Contributor

DickJC123 commented Mar 23, 2018 via email

@eric-haibin-lin
Member

@DickJC123 thanks! That will be great.

@eric-haibin-lin
Member

@DickJC123 Gentle ping - any update? Were you able to reproduce the issue? Just want to check, since the next release candidate will be cut in a week or so.

@nicklhy
Contributor Author

nicklhy commented Apr 25, 2018

@eric-haibin-lin @DickJC123, any update on this?

@lupesko
Contributor

lupesko commented Jul 14, 2018

@DickJC123 bouncing again.

@vandanavk
Contributor

@DickJC123 did you get a chance to check this issue with the latest code or @eric-haibin-lin's inputs?

@piyushghai
Contributor

@DickJC123 bouncing once more...

@vrakesh
Contributor

vrakesh commented Nov 27, 2018

@DickJC123 Requesting an update regarding this issue: have recent versions of MXNet solved it for you?

@kalyc
Contributor

kalyc commented Dec 10, 2018

@lanking520 @aaronmarkham requesting to close this issue due to lack of activity

@lanking520
Member

@DickJC123 @nicklhy It looks like this has been stale for a while. Please test again and feel free to reopen this issue if you are still facing the problem. Closing for now.
