
inference speed drop after updating mxnet from 0.10.0 to 1.0.0 #9396

Closed
nicklhy opened this issue Jan 12, 2018 · 27 comments


@nicklhy
Contributor

nicklhy commented Jan 12, 2018

Hi, I just updated MXNet today from 0.10.0 to 1.0.0 in order to use some new features. Both versions were installed with pip, e.g. pip3 install mxnet-cu80==1.0.0. However, after a detailed benchmark, I observed a significant speed drop when running ResNet inference, especially when the batch size is small. The results for ResNet152 are shown below (the network json file is downloaded from here):

MXNet version: 0.10.0
########################################################
speed test for batch size: 1
	avg forward speed: 85.146210 samples/s
	avg forward time: mean = 0.011743 s, std = 0.000341 s
########################################################
speed test for batch size: 4
	avg forward speed: 215.284806 samples/s
	avg forward time: mean = 0.018579 s, std = 0.000175 s
########################################################
speed test for batch size: 16
	avg forward speed: 297.233244 samples/s
	avg forward time: mean = 0.053827 s, std = 0.001030 s
########################################################
speed test for batch size: 64
	avg forward speed: 316.399717 samples/s
	avg forward time: mean = 0.202272 s, std = 0.001754 s
########################################################
speed test for batch size: 128
	avg forward speed: 321.620336 samples/s
	avg forward time: mean = 0.397981 s, std = 0.002363 s
MXNet version: 1.0.0
########################################################
speed test for batch size: 1
	avg forward speed: 67.866811 samples/s
	avg forward time: mean = 0.014733 s, std = 0.000391 s
########################################################
speed test for batch size: 4
	avg forward speed: 188.020417 samples/s
	avg forward time: mean = 0.021272 s, std = 0.000563 s
########################################################
speed test for batch size: 16
	avg forward speed: 286.253890 samples/s
	avg forward time: mean = 0.055892 s, std = 0.000565 s
########################################################
speed test for batch size: 64
	avg forward speed: 310.045353 samples/s
	avg forward time: mean = 0.206418 s, std = 0.004860 s
########################################################
speed test for batch size: 128
	avg forward speed: 320.566647 samples/s
	avg forward time: mean = 0.399289 s, std = 0.002575 s

PS: I noticed that when the batch size is small (e.g. batch_size=1), GPU utilization is 95~100% with mxnet 0.10.0 but only 80~83% with mxnet 1.0.0, which means the GPU is not being fully utilized.

Software env: Ubuntu 16.04, Python 3.5, CUDA 8.0, CUDNN 5.1.
GPU: GTX 1080 Ti.

I also tested on a server with a Titan XP and got a similar result. The speed test script is pasted below:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


import time
import argparse
import mxnet as mx
import numpy as np


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='speed test')
    parser.add_argument('--net', type=str, required=True,
                        help='network symbol json file')
    parser.add_argument('--size', type=int, default=224,
                        help='image size')
    parser.add_argument('--n-batch', type=int, default=100,
                        help='batch number for test')
    parser.add_argument('--gpu', type=int, default=0,
                        help='gpu device id')
    args = parser.parse_args()

    print('MXNet version: %s' % mx.__version__)

    ctx = mx.gpu(args.gpu)

    batch_size_list = [1, 4, 16, 64, 128]

    mod = mx.mod.Module(symbol=mx.sym.load(args.net),
                        context=ctx,
                        data_names=['data', ],
                        label_names=['softmax_label', ])
    mod.bind(data_shapes=[('data', (1, 3, args.size, args.size))],
             label_shapes=[('softmax_label', (1,))],
             for_training=False)
    mod.init_params(initializer=mx.init.Normal())

    for batch_size in batch_size_list:
        print('########################################################')
        print('speed test for batch size: %d' % batch_size)
        mod.reshape(data_shapes=[('data', (batch_size,
                                           3,
                                           args.size,
                                           args.size))],
                    label_shapes=[('softmax_label', (batch_size,))])

        #  pre-allocate
        batch_data = mx.nd.random_normal(0, 0.5, (batch_size, 3, args.size, args.size), ctx=ctx)

        #  warm up GPU
        for _ in range(50):
            mod.forward(mx.io.DataBatch(data=[batch_data, ],
                                        label=None),
                        is_train=False)
            out = mod.get_outputs()[0].asnumpy()

        t_start = time.time()
        t_fwd_list = []
        for _ in range(args.n_batch):
            t1 = time.time()
            mod.forward(mx.io.DataBatch(data=[batch_data, ],
                                        label=None),
                        is_train=False)
            out = mod.get_outputs()[0]
            out.wait_to_read()
            t2 = time.time()
            t_fwd_list.append(t2-t1)
        t_end = time.time()
        n_samples = args.n_batch*batch_size

        print('\tavg forward speed: %f samples/s' % (n_samples/(t_end-t_start)))
        print('\tavg forward time: mean = %f s, std = %f s' %
              (np.mean(t_fwd_list), np.std(t_fwd_list)))
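
(For reference, the script above would presumably be invoked along the lines of python3 speed_test.py --net resnet-152-symbol.json --gpu 0; the file name speed_test.py is hypothetical here, and --net points at the downloaded symbol json.)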
nicklhy changed the title from "speed drop after updating mxnet from 0.10.0 to 1.0.0" to "inference speed drop after updating mxnet from 0.10.0 to 1.0.0" on Jan 12, 2018
@eric-haibin-lin
Member

might be related to #9055

@nicklhy
Contributor Author

nicklhy commented Jan 13, 2018

@eric-haibin-lin Great to see this work. I will test the performance after it is merged.

@lupesko
Contributor

lupesko commented Jan 16, 2018

@nicklhy #9055 was merged. Can you please build the latest and re-test?

@eric-haibin-lin can you please label:

  • Pending Requester Info
  • Performance

@nicklhy
Contributor Author

nicklhy commented Jan 16, 2018

@lupesko Hi, I just ran the test again and the results are listed below:

MXNet version: 0.10.0
########################################################
speed test for batch size: 1
	avg forward speed: 84.061132 samples/s
	avg forward time: mean = 0.011895 s, std = 0.000187 s
########################################################
speed test for batch size: 4
	avg forward speed: 208.959420 samples/s
	avg forward time: mean = 0.019141 s, std = 0.000436 s
########################################################
speed test for batch size: 16
	avg forward speed: 284.647810 samples/s
	avg forward time: mean = 0.056207 s, std = 0.001876 s
########################################################
speed test for batch size: 64
	avg forward speed: 299.845795 samples/s
	avg forward time: mean = 0.213439 s, std = 0.004175 s
########################################################
speed test for batch size: 128
	avg forward speed: 306.651349 samples/s
	avg forward time: mean = 0.417408 s, std = 0.008048 s
MXNet version: 1.0.1
########################################################
speed test for batch size: 1
	avg forward speed: 76.590411 samples/s
	avg forward time: mean = 0.013055 s, std = 0.000209 s
########################################################
speed test for batch size: 4
	avg forward speed: 197.583762 samples/s
	avg forward time: mean = 0.020243 s, std = 0.000270 s
########################################################
speed test for batch size: 16
	avg forward speed: 281.058522 samples/s
	avg forward time: mean = 0.056925 s, std = 0.000418 s
########################################################
speed test for batch size: 64
	avg forward speed: 300.547282 samples/s
	avg forward time: mean = 0.212941 s, std = 0.002727 s
########################################################
speed test for batch size: 128
	avg forward speed: 309.225150 samples/s
	avg forward time: mean = 0.413934 s, std = 0.001367 s

As I mentioned here, the inference speed of ResNet152 did increase a lot when the batch size is small, but it is still a bit lower than with 0.10.0. GPU utilization is 100% in 0.10.0, 83% in 1.0.0, and 95% in 1.0.1, respectively. By analyzing the CUDA API calls, I noticed that the average cost of cudaStreamSynchronize in mxnet 1.0.1 is a bit larger than in 0.10.0.

# mxnet 0.10.0
# GPU usage: 99~100%
==21421== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 52.04%  7.76526s      3691  2.1038ms  1.4510us  9.5652ms  cudaStreamSynchronize
 21.31%  3.17943s    659205  4.8230us  3.3450us  1.2354ms  cudaLaunch
 11.92%  1.77868s         4  444.67ms  60.328us  889.60ms  cudaStreamCreate

# mxnet 1.0.1
# GPU usage: ~95%
==11655== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 53.84%  8.38694s      3695  2.2698ms  1.5780us  12.897ms  cudaStreamSynchronize
 20.01%  3.11738s    661308  4.7130us  3.3310us  2.6520ms  cudaLaunch
  7.98%  1.24364s        16  77.727ms  8.0110us  620.95ms  cudaStreamCreateWithFlags
  4.59%  715.63ms       292  2.4508ms     346ns  176.42ms  cudaFree
  4.52%  703.51ms       787  893.91us  160.25us  537.44ms  cudaMemGetInfo

When the batch size is large enough (e.g. 64 or 128), the new mxnet's performance looks good.

In conclusion, #9055 fixed a large part of this problem, but I guess there still exist some other potential "bugs" that we should work on.
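
Regarding the cudaStreamSynchronize observation above, here is a minimal sketch (not part of the original post) of separating the asynchronous forward launch time from the synchronization time with the same Module API; it assumes the mod and batch_data objects prepared in the benchmark script earlier in the thread:

# Hedged sketch: split per-batch time into "launch" (enqueue work on the
# async engine) and "sync" (wait for the GPU work to finish), assuming
# `mod` and `batch_data` are set up as in the benchmark script above.
import time
import mxnet as mx
import numpy as np

def timed_forward(mod, batch_data, n_iter=100):
    launch, sync = [], []
    for _ in range(n_iter):
        t0 = time.time()
        mod.forward(mx.io.DataBatch(data=[batch_data], label=None), is_train=False)
        t1 = time.time()                     # forward() mostly just enqueues work
        mod.get_outputs()[0].wait_to_read()  # blocks on the underlying stream sync
        t2 = time.time()
        launch.append(t1 - t0)
        sync.append(t2 - t1)
    return np.mean(launch), np.mean(sync)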

@eric-haibin-lin
Member

Hi @nicklhy what network are you using? Does it include any custom operators?

@nicklhy
Contributor Author

nicklhy commented Jan 17, 2018

@eric-haibin-lin The above results are tested with a ResNet152 json file downloaded from http://data.mxnet.io/models/imagenet/resnet/152-layers/resnet-152-symbol.json.

BTW, I also tested ResNet101 and got a similar result:

MXNet version: 0.10.0
########################################################
speed test for batch size: 1
	avg forward speed: 116.469000 samples/s
	avg forward time: mean = 0.008585 s, std = 0.000308 s
########################################################
speed test for batch size: 4
	avg forward speed: 293.537501 samples/s
	avg forward time: mean = 0.013625 s, std = 0.000497 s
########################################################
speed test for batch size: 16
	avg forward speed: 407.152972 samples/s
	avg forward time: mean = 0.039295 s, std = 0.000929 s
########################################################
speed test for batch size: 64
	avg forward speed: 432.887962 samples/s
	avg forward time: mean = 0.147841 s, std = 0.001603 s
########################################################
speed test for batch size: 128
	avg forward speed: 443.354422 samples/s
	avg forward time: mean = 0.288704 s, std = 0.003057 s
MXNet version: 1.0.1
########################################################
speed test for batch size: 1
	avg forward speed: 109.414104 samples/s
	avg forward time: mean = 0.009137 s, std = 0.000553 s
########################################################
speed test for batch size: 4
	avg forward speed: 288.660234 samples/s
	avg forward time: mean = 0.013856 s, std = 0.000231 s
########################################################
speed test for batch size: 16
	avg forward speed: 412.575750 samples/s
	avg forward time: mean = 0.038778 s, std = 0.001016 s
########################################################
speed test for batch size: 64
	avg forward speed: 447.079919 samples/s
	avg forward time: mean = 0.143147 s, std = 0.001947 s
########################################################
speed test for batch size: 128
	avg forward speed: 458.120138 samples/s
	avg forward time: mean = 0.279399 s, std = 0.004146 s

@cjolivier01
Member

Can you try setting OMP_NUM_THREADS=1 in the environment and running the test again? Does it speed up?

@cjolivier01
Member

(set before running python)
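
For reference, a minimal sketch (not from the original thread) of pinning the OpenMP thread count from inside the script rather than the shell; it only has the intended effect if the variable is set before mxnet is imported:

# Hedged sketch: the in-script equivalent of `OMP_NUM_THREADS=1 python3 speed_test.py ...`.
# Must run before the first `import mxnet`, otherwise the OpenMP runtime
# may already have chosen its thread count.
import os
os.environ['OMP_NUM_THREADS'] = '1'

import mxnet as mx
print(mx.__version__)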

@nicklhy
Contributor Author

nicklhy commented Jan 17, 2018

@cjolivier01 Same result after setting it. The GPU usage is still 93~95% when testing ResNet101 or ResNet152 (batch_size=1). However, mxnet 0.10.0 can easily reach 99~100%.

Note that there are no disk I/O or image pre-processing operations in my speed test script. The bottleneck should be on the GPU side, or more specifically, in cudaStreamSynchronize. I guess OMP_NUM_THREADS won't solve this problem?

@cjolivier01
Member

Apparently not -- I was checking whether something CPU-related was the bottleneck, but it seems not.

@eric-haibin-lin
Member

@nicklhy What happens if you call mx.nd.waitall() before all the forward calls, remove output.wait_to_read(), and call mx.nd.waitall() after all 100 batches are done?
I don't see any reason why streamSync should become more expensive ...
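
A minimal sketch of the suggested timing variant (an assumption of how it might look, reusing the mod, batch_data, args, and batch_size objects from the benchmark script earlier in the thread):

# Hedged sketch of the suggested measurement: drain the engine once before
# the loop, skip per-batch wait_to_read(), and synchronize a single time at
# the end with mx.nd.waitall().
import time
import mxnet as mx

mx.nd.waitall()                  # make sure all warm-up work has finished
t_start = time.time()
for _ in range(args.n_batch):
    mod.forward(mx.io.DataBatch(data=[batch_data], label=None), is_train=False)
mx.nd.waitall()                  # single synchronization after all batches
t_end = time.time()
print('avg forward speed: %f samples/s' % (args.n_batch * batch_size / (t_end - t_start)))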

@nicklhy
Contributor Author

nicklhy commented Jan 22, 2018

@eric-haibin-lin , Still the same result. There is always a small speed gap between mxnet 0.10.0 and the current version.

@eric-haibin-lin
Member

@nicklhy I bisected the changes between 0.10.0 and 1.0 and found the following on a p2.xlarge (K80) instance. Each commit was patched with the fix from PR 9055:

git checkout xxx
git submodule update --recursive
git cherry-pick 9cc8ea3be23fb7adf4630e4cf065a2473094fbc8 -X theirs
make

and below are the results:

ff21e1f Changed FullyConnected to use new linalg gemm, plus TensorCore if fp16 I/O. (#7505)

speed test for batch size: 1
        avg forward speed: 24.484983 samples/s
        avg forward time: mean = 0.040839 s, std = 0.000095 s
56eae58 Fixed Makefile so a null CUDA_ARCH is treated like an unset one. (#7515) - Fast

########################################################
speed test for batch size: 1
        avg forward speed: 25.461191 samples/s
        avg forward time: mean = 0.039270 s, std = 0.000095 s
########################################################

Looks like the commit ff21e1f caused the 4% slowdown during inference.

@DickJC123 were you aware of this?

@DickJC123
Contributor

DickJC123 commented Mar 15, 2018 via email

@eric-haibin-lin
Member

Let me try v100

@eric-haibin-lin
Member

Hi @DickJC123

ff21e1f with PR 9055 fix:

########################################################
speed test for batch size: 1
        avg forward speed: 58.503896 samples/s
        avg forward time: mean = 0.017091 s, std = 0.000046 s
########################################################

56eae58 with PR 9055 fix:

########################################################
speed test for batch size: 1
        avg forward speed: 62.070470 samples/s
        avg forward time: mean = 0.016109 s, std = 0.000014 s
########################################################

Looks like it is also slower on V100... @DickJC123 what should be the next step?

@eric-haibin-lin
Member

Hi @DickJC123, could you share some benchmark results for the updated FC operator in ff21e1f? That would be helpful for deciding how to fix the performance degradation for ResNet, or whether to revert the change.

@DickJC123
Contributor

DickJC123 commented Mar 23, 2018 via email

@eric-haibin-lin
Member

@DickJC123 thanks! That will be great.

@eric-haibin-lin
Member

@DickJC123 Gentle ping - any update? Were you able to reproduce the issue? Just want to check, since the next release candidate will be cut in a week or so.

@nicklhy
Contributor Author

nicklhy commented Apr 25, 2018

@eric-haibin-lin @DickJC123, any update on this?

@lupesko
Contributor

lupesko commented Jul 14, 2018

@DickJC123 bouncing again.

@vandanavk
Contributor

@DickJC123 did you get a chance to check this issue with the latest code or @eric-haibin-lin's inputs?

@piyushghai
Contributor

@DickJC123 bouncing once more...

@vrakesh
Contributor

vrakesh commented Nov 27, 2018

@DickJC123 Requesting an update regarding this issue: have recent versions of MXNet solved it for you?

@kalyc
Contributor

kalyc commented Dec 10, 2018

@lanking520 @aaronmarkham requesting to close this issue due to lack of activity

@lanking520
Member

@DickJC123 @nicklhy It looks like this has been stale for a while. Please test again and feel free to reopen this issue if you are still facing the problem. Closing for now.
