inference speed drop after updating mxnet from 0.10.0 to 1.0.0 #9396
Might be related to #9055.
@eric-haibin-lin, great to see this work. I will test the performance after it is merged.
@nicklhy #9055 was merged. Can you please build the latest and re-test? @eric-haibin-lin, can you please label:
@lupesko Hi, I just made a test and the results are listed below:
As I mentioned here, the inference speed of resnet152 did increase a lot when batch size is small, but it is still a bit lower than in the 0.10.0 version. GPU usage is 100% in 0.10.0, 83% in 1.0.0, and 95% in 1.0.1, respectively. By analyzing the CUDA calls, I noticed that the avg time cost of
When batch size is large enough (i.e. 64, 128), the new mxnet's performance looks good. In conclusion, #9055 fixed a large part of this problem, but I guess there still exist some other potential "bugs" that we should work on.
Hi @nicklhy, what network are you using? Does it include any custom operators?
@eric-haibin-lin The above results were obtained with the ResNet152 json file downloaded from http://data.mxnet.io/models/imagenet/resnet/152-layers/resnet-152-symbol.json. BTW, I also tested ResNet101 and got a similar result:
Can you try setting OMP_NUM_THREADS=1 in the environment and running the test again? Does it speed up?
(Set it before running python.)
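For reference, a minimal sketch of doing the same thing from inside a Python script (setting the variable via os.environ before mxnet is imported, which should have the same effect as exporting it in the shell):

```python
import os

# Pin OpenMP to a single thread. This has to happen before the first
# `import mxnet`, since OMP_NUM_THREADS is read when the library initializes.
os.environ['OMP_NUM_THREADS'] = '1'

import mxnet as mx

print('OMP_NUM_THREADS =', os.environ['OMP_NUM_THREADS'])
print('mxnet version:', mx.__version__)
```

Exporting OMP_NUM_THREADS=1 in the shell before launching the Python process, as suggested above, is equivalent.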
@cjolivier01 Same result after setting it. The GPU usage is still 93~95% when testing ResNet101 or ResNet152 (batch_size=1). However, mxnet 0.10.0 can easily reach 99-100%. Note that there are no disk IO or image pre-processing operations in my speed test script. The bottleneck should be in the GPU, or more specifically,
Apparently not -- I was checking whether something CPU-related was the bottleneck, but it seems not.
@nicklhy what happens if you call
@eric-haibin-lin Still the same result. There is always a small speed gap between mxnet 0.10.0 and the current version.
@nicklhy I bisected the changes between 0.10.0 and 1.0 and found the following on a p2.xlarge (K80) instance. The commits were patched with the fix in PR 9055:

    git checkout xxx
    git submodule update --recursive
    git cherry-pick 9cc8ea3be23fb7adf4630e4cf065a2473094fbc8 -X theirs
    make

and below is the result:

ff21e1f Changed FullyConnected to use new linalg gemm, plus TensorCore if fp16 I/O. (#7505)

    speed test for batch size: 1
    avg forward speed: 24.484983 samples/s
    avg forward time: mean = 0.040839 s, std = 0.000095 s

56eae58 Fixed Makefile so a null CUDA_ARCH is treated like an unset one. (#7515) - Fast

    speed test for batch size: 1
    avg forward speed: 25.461191 samples/s
    avg forward time: mean = 0.039270 s, std = 0.000095 s

Looks like commit ff21e1f caused the 4% slowdown during inference. @DickJC123, were you aware of this?
I was not. Is it easy for you to run your perf test on newer architectures than Kepler?
Let me try V100.
Hi @DickJC123, ff21e1f with the PR 9055 fix:
56eae58 with the PR 9055 fix:
Looks like it is also slower on V100... @DickJC123, what should be the next step?
Hi @DickJC123, could you share some benchmark results for the updated FC operator in ff21e1f? That would be helpful information for deciding how to fix the performance degradation for resnet, or whether to revert the change.
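For illustration, one way to time the FullyConnected operator in isolation is sketched below; the shapes (batch size 1, 2048 inputs, 1000 hidden units) are assumptions roughly matching the final layer of ResNet-152, not measurements from this thread:

```python
import time
import mxnet as mx

ctx = mx.gpu(0)
batch_size, in_dim, num_hidden = 1, 2048, 1000  # assumed shapes

x = mx.nd.ones((batch_size, in_dim), ctx=ctx)
w = mx.nd.ones((num_hidden, in_dim), ctx=ctx)
b = mx.nd.zeros((num_hidden,), ctx=ctx)

# Warm-up so one-time initialization does not distort the timing.
for _ in range(10):
    mx.nd.FullyConnected(data=x, weight=w, bias=b, num_hidden=num_hidden)
mx.nd.waitall()

n = 1000
tic = time.time()
for _ in range(n):
    mx.nd.FullyConnected(data=x, weight=w, bias=b, num_hidden=num_hidden)
mx.nd.waitall()  # block until all queued GPU work has finished
print('avg FullyConnected forward time: %.6f ms' % ((time.time() - tic) / n * 1e3))
```

Running the same snippet against builds of ff21e1f and 56eae58 should indicate whether the gemm change alone accounts for the gap.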
Sorry for the delay. I've been busy with prep for an upcoming conference. I should be able to look at this perf regression next week.
@DickJC123 thanks! That will be great.
@DickJC123 gentle ping - any update? Were you able to reproduce the issue? Just want to check if there's any update, since the next release candidate will be cut in a week or so.
@eric-haibin-lin @DickJC123, any update on this?
@DickJC123 bouncing again.
@DickJC123 did you get a chance to check this issue with the latest code or @eric-haibin-lin's inputs?
@DickJC123 bouncing once more...
@DickJC123 Requesting an update on this issue: have recent versions of MXNet solved it for you?
@lanking520 @aaronmarkham requesting to close this issue due to lack of activity.
@DickJC123 @nicklhy looks like this has been stale for a while. Please test again and feel free to reopen this issue if you are still facing the failure. Closing for now.
Hi, I just updated mxnet today from 0.10.0 to 1.0.0 in order to use some new features. Both versions were installed with pip, like

    pip3 install mxnet-cu80==1.0.0

However, after a detailed benchmark test, I observed a significant speed drop when running resnet inference, especially when batch size is small. The result for resnet152 is shown below (the network json file was downloaded from here).
PS: I noticed that when batch size is small (i.e. batch_size=1), the GPU usage is 95~100% in mxnet 0.10.0 and 80-83% in mxnet 1.0.0, which means the GPU is not fully utilized.
Software env: Ubuntu 16.04, Python 3.5, CUDA 8.0, CUDNN 5.1.
GPU: GTX 1080 Ti.
I also tested on a server with a Titan XP and got a similar result. The speed test script is pasted below:
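A minimal sketch of such a forward-only benchmark (not necessarily the exact script used above; it assumes the resnet-152 symbol and params files from data.mxnet.io are present in the working directory and uses the Module API):

```python
import time
import mxnet as mx
import numpy as np

batch_size = 1
ctx = mx.gpu(0)

# Assumes ./resnet-152-symbol.json and ./resnet-152-0000.params are present
# (downloadable from http://data.mxnet.io/models/imagenet/resnet/152-layers/).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (batch_size, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

# Input kept on the GPU: no disk IO or image pre-processing involved.
data = mx.nd.ones((batch_size, 3, 224, 224), ctx=ctx)
batch = mx.io.DataBatch(data=[data], label=None)

# Warm-up runs so lazy initialization does not distort the measurement.
for _ in range(10):
    mod.forward(batch, is_train=False)
    mod.get_outputs()[0].wait_to_read()

times = []
for _ in range(100):
    tic = time.time()
    mod.forward(batch, is_train=False)
    mod.get_outputs()[0].wait_to_read()  # force the async engine to finish
    times.append(time.time() - tic)

print('speed test for batch size: %d' % batch_size)
print('avg forward speed: %f samples/s' % (batch_size / np.mean(times)))
print('avg forward time: mean = %f s, std = %f s' % (np.mean(times), np.std(times)))
```

The printed lines mirror the output format shown in the results earlier in this thread, so numbers from the two MXNet versions can be compared directly.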