Hi, could you please suggest some solutions? Training stopped after 3 epochs with the OOM error below, even after I reduced the batch size. Also, during evaluation both GPUs show only about 5% utilization, which is very low and makes evaluation take a long time. Thank you in advance for your help.
File "/home/user/anaconda3/envs/py311/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 290, in call
File "/home/user/anaconda3/envs/py311/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 262, in convolution_op
3 root error(s) found.
(0) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[div_no_nan_1/ReadVariableOp_2/_22]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[div_no_nan_1/_31]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[div_no_nan_1/ReadVariableOp_2/_22]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(2) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_21448231]
I also used a 2080 for training previously and it worked fine, but that was years ago... Sometimes this kind of issue is caused by the environment, such as the CUDA version.
You may try executing evals.py alone to check if something is wrong there.
Have you ever trained using float16? You may set keras.mixed_precision.set_global_policy("mixed_float16") at the beginning of your script.
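For reference, a minimal sketch of the mixed_float16 setup suggested above, assuming a standard tf.keras training script; the toy model and dummy data here are just placeholders, not the project's actual training code. The key point is to set the policy before the model is built.

```python
import numpy as np
from tensorflow import keras

# Set the policy before building the model so layers are created with
# float16 compute dtype and float32 variables; activations then take
# roughly half the GPU memory, which helps with OOM errors like the one above.
keras.mixed_precision.set_global_policy("mixed_float16")

# Toy stand-in model -- replace with your own model construction code.
model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    # Keep the final softmax in float32 for numerical stability.
    keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# Keras handles loss scaling for the float16 gradients under this policy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data just so the snippet runs end to end.
x = np.random.rand(8, 224, 224, 3).astype("float32")
y = np.random.randint(0, 10, size=(8,))
model.fit(x, y, batch_size=4, epochs=1)
```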