Race Conditions on Pascal GPUs? #4858
Comments
Which example exactly?
I tried train_cifar.py and also train_imagenet.py. It doesn't matter which one; both behave identically. Is there any flag I can enable that would help figure out what is going on?
@ap-hynninen Have you seen anything like this?
Also, one more thing: it happens with the latest CUDA drivers. The old ones were fine. Not sure if the new drivers are exposing any bugs.
I haven't seen this. I run on GTX 1080, Titan X Pascal, and P100 regularly. @amithr1 Did you compile with the Pascal sm_60 flags in config.mk? Which GPUs have you tested? Can you give me your hardware and software details?
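For reference, this is roughly what the Pascal build flags look like. The exact variable names in config.mk vary between MXNet versions, so treat the excerpt below as a sketch rather than the exact file contents:

```makefile
# Hypothetical config.mk excerpt -- variable names may differ by MXNet version.
# compute_60/sm_60 targets P100; compute_61/sm_61 targets GTX 1080 / Titan X Pascal.
USE_CUDA = 1
USE_CUDA_PATH = /usr/local/cuda
CUDA_ARCH := -gencode arch=compute_60,code=sm_60 \
             -gencode arch=compute_61,code=sm_61
```

Without sm_60/sm_61 device code, the driver has to JIT-compile the kernels from embedded PTX at startup, which is slower but should still run if compatible PTX is present.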
Yes, I changed the sm_60 flags. The new driver is 375.xx.
@piiswrong I think I am facing a similar issue. The log looks like the following when I am training Inception-v3 on ImageNet:
INFO:root:Epoch[0] Batch [1400] Speed: 344.78 samples/sec Train-accuracy=0.007031
My CPU is an i7-5930. Is it enough to feed data for 4 Titan X Pascal GPUs? @amithr1 How do you enable NaiveEngine?
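For anyone else hitting this: the engine is selected with the MXNET_ENGINE_TYPE environment variable, which is read once at library initialization, so it has to be set before mxnet is imported. A minimal sketch:

```python
# Select MXNet's dependency engine before importing mxnet; the variable is read
# once at library initialization, so setting it after the import has no effect.
import os
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"  # serial engine; the default is ThreadedEnginePerDevice

import mxnet as mx
```

Equivalently, export MXNET_ENGINE_TYPE=NaiveEngine in the shell before launching train_cifar10.py or train_imagenet.py.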
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks! |
Have you already found a workaround or a solution? I encountered a similar problem when running train_cifar10.py on four P100 GPUs. I found that a build compiled with DEBUG=1 also showed a similar problem, so it is not limited to using NaiveEngine. I have created my ticket here: #10123.
Closing this issue in favor of #10123.
Hi All,
I think I may be running into some race conditions with Pascal GPUs. I hit this while running simple CIFAR tests. The test just hangs once in a while. When I enable NaiveEngine, it always passes (in the runs tested so far). However, with any of the threaded engines, it hangs. Are there any races in the threaded implementations? This only happens with Pascal GPUs.