Change to use apex for better fp16 and multi-gpu support #116
That's really awesome! I love the work you guys did on apex and I would be super happy to have an 'official' implementation of BERT using apex (plus it showcases all the major modules: FusedAdam, FusedLayerNorm, 16bits, distributed optimizer...). And the speed improvement is impressive, fine-tuning BERT-large on SQuAD in 1h is amazing! Just three general questions:
|
Hi @thomwolf ,
-Deyu |
update:
command used:
-Deyu |
Ok, thanks for the update! It looks good to me. I will run a few tests on various hardware and it'll be included in the new 0.4.0 release coming out today (hopefully). Congrats on the MLPerf results, by the way! |
@FDecaYed I am trying to reproduce your numbers but I can't get very close. I am using an Azure NDv2 server with 8 NVIDIA Tesla V100 NVLINK interconnected GPUs and 40 Intel Skylake cores. Switching to fp16 lowers the memory usage by half indeed but the training time stays about the same ie around (e.g. 100 seconds for I have the new release of PyTorch 1.0.0, CUDA 10 and installed apex with cpp/cuda extensions. I am using the fourth-release branch on the present repo which was rebased from master with your PR. If you have any insight I would be interested. Could the difference come from using a DGX versus an Azure server? Can you give me the exact command you used to train the |
There could be a lot of things, so let's sort them out one by one. From my past experience with cloud, the single-GPU number should not be that far from any DGX, unless you are bound by input. I doubt that's the case, based on the workload. If we are indeed running and reporting the same thing, there must be some software differences. We are still in the process of moving up to pytorch 1.0, so my test was on 0.4. I'll merge your release branch and try pytorch 1.0 on my side on DGX today. Meanwhile, this is the container I used for testing. You could try it on Azure and see if you can get my result. Note that it does not have the latest apex installed, so you need to uninstall apex and build the latest inside. -Deyu |
I tested on pytorch 1.0 and still getting the same speed up |
Ok, I got the 3-4x speed-up using the pytorch dockerhub 1.0-cuda10.0-cudnn7-devel image 🔥 I'm still wondering why I can't get these speedups outside of the docker container so I will try to investigate that a bit further (in particular since other people may start opening issues here :-). If you have any further insight, don't hesitate to share :-) |
Ok nailed it I think it was a question of not installing |
Great! It'll be great if we can later update readme to document V100 expected speed as well. |
Thanks for the nice work! @FDecaYed @thomwolf I tried fp16 training for bert-large. It has the imbalanced memory problem, which wastes gpu power a lot. The nvidia-smi results are shown as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 0000A761:00:00.0 Off | 0 |
| N/A 39C P0 124W / 250W | 15128MiB / 16130MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 0000C0BA:00:00.0 Off | 0 |
| N/A 41C P0 116W / 250W | 10012MiB / 16130MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... Off | 0000D481:00:00.0 Off | 0 |
| N/A 38C P0 80W / 250W | 10012MiB / 16130MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... Off | 0000EC9F:00:00.0 Off | 0 |
| N/A 40C P0 61W / 250W | 10012MiB / 16130MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 11870 C python 15117MiB |
| 1 11870 C python 10001MiB |
| 2 11870 C python 10001MiB |
| 3 11870 C python 10001MiB |
+-----------------------------------------------------------------------------+ |
Is this already included in pytorch-transformers? If so, how do I use it? Where should I specify that I want to use fp16 and apex, and is apex already installed along with pytorch-transformers on Anaconda 3? |
Hi there,
This PR includes changes to improve FP16 and multi-gpu performance. We get over 3.5x performance increase on Tesla V100 across all examples.
NVIDIA Apex (https://github.com/NVIDIA/apex) is added as a new dependency. It fixes issues with the existing fp16 implementation (for example, not converting loss/grad to float before scaling) and provides a more efficient implementation.
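To illustrate why converting to float before scaling matters, here is a minimal sketch using only the Python standard library. It is not Apex code and not the code from this PR: `to_fp16`, `LOSS_SCALE` and `true_grad` are hypothetical names and values chosen for illustration, and Python's double-precision float stands in for fp32.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper)."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # struct refuses values outside the fp16 range; model that as +/-inf.
        return float("inf") if x > 0 else float("-inf")


LOSS_SCALE = 1024.0  # assumed static loss scale, for illustration only
true_grad = 1e-8     # assumed gradient, too small for fp16 to represent

# Without loss scaling the gradient underflows to zero in fp16.
unscaled_fp16 = to_fp16(true_grad)

# With loss scaling the gradient survives the fp16 backward pass.
scaled_fp16 = to_fp16(true_grad * LOSS_SCALE)

# Unscaling after converting to a wider float recovers the value.
recovered_in_float = scaled_fp16 / LOSS_SCALE

# Unscaling while still in fp16 loses it again.
recovered_in_fp16 = to_fp16(scaled_fp16 / LOSS_SCALE)

# Values above the fp16 maximum (65504) overflow.
overflowed = to_fp16(70000.0)

print(unscaled_fp16, scaled_fp16, recovered_in_float, recovered_in_fp16, overflowed)
```

The same reasoning applies to the loss itself: multiplying a half-precision loss by the scale can overflow, so the multiplication is safer in a wider type.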
Below are the test results from the MRPC and SQuAD examples. All test baselines (the before numbers) are fp32, since we found it is actually the best-performing config, the reason being that the optimizer is forced onto the CPU under fp16. The after numbers are from running with --fp16 after this PR. All tests were done on a single Tesla V100 16GB.
MRPC on BERT-base:
SQuAD on BERT-base:
SQuAD on BERT-large:
The optimize_on_cpu option is also removed entirely from the code, since I can't find any situation where it is faster than gradient_accumulation_steps, assuming of course that at least a batch of 1 fits into GPU memory.
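For readers unfamiliar with gradient accumulation, the following is a minimal sketch of the idea in plain Python, not the code from the example scripts. The one-parameter model, the `grad_mse` helper and the data are hypothetical. It shows that averaging gradients over equally sized micro-batches reproduces the full-batch gradient, so only one micro-batch has to fit in memory at a time.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical (x, y) pairs and starting weight.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 0.5

# Gradient computed on the whole batch at once.
full_grad = grad_mse(w, data)

# Same gradient built up from smaller micro-batches.
accumulation_steps = 2
micro_batch_size = len(data) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    chunk = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Dividing by the number of steps keeps the result an average, not a sum.
    accumulated_grad += grad_mse(w, chunk) / accumulation_steps

print(full_grad, accumulated_grad)
```

The optimizer step would then be taken once per `accumulation_steps` micro-batches, using the accumulated gradient.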