Looking for solutions #137

Open
ran911da opened this issue Oct 22, 2024 · 1 comment

Comments

@ran911da commented Oct 22, 2024

Hi, could you please suggest some solutions? Training stopped after 3 epochs, even when I reduced the batch size. During evaluation, both GPUs show about 5% utilization, which is very low and makes it take a long time. Thank you in advance for your help.

|   0  NVIDIA GeForce RTX 2080 Ti   Off | 00000000:17:00.0 Off |                  N/A |
| 51%   50C    P2    54W / 300W         |  8448MiB / 11264MiB  |      0%      Default |
|                                       |                      |                  N/A |
+---------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti   Off | 00000000:65:00.0  On |                  N/A |
| 51%   51C    P2    65W / 300W         |  9034MiB / 11264MiB  |      1%      Default |
|                                       |                      |                  N/A |
+---------------------------------------+----------------------+----------------------+

File "/home/user/anaconda3/envs/py311/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 290, in call

File "/home/user/anaconda3/envs/py311/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 262, in convolution_op

3 root error(s) found.
(0) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

 [[div_no_nan_1/ReadVariableOp_2/_22]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

 [[div_no_nan_1/_31]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

(1) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

 [[div_no_nan_1/ReadVariableOp_2/_22]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

(2) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_21448231]
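
For reference, a common mitigation for this kind of RESOURCE_EXHAUSTED failure is to let TensorFlow allocate GPU memory on demand and to shrink the batch size. The sketch below is only an illustration under the assumption of a TF 2.x environment; the BATCH_SIZE value is hypothetical, inferred from the failing tensor shape [64, 128, 14, 14].

    import tensorflow as tf

    # Ask TensorFlow to grow GPU memory on demand instead of reserving it
    # all up front; this must run before any op touches the GPUs.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)

    # The failing tensor shape [64, 128, 14, 14] suggests batch size 64;
    # halving it (hypothetical value) is the usual first step for OOM.
    BATCH_SIZE = 32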

@leondgarse (Owner) commented

I also used a 2080 for training previously and it worked fine, but that was years ago... Sometimes this can be caused by the environment, e.g. the CUDA version.

  • You may try executing evals.py alone to check whether something is wrong.
  • Have you ever trained using float16? You may set keras.mixed_precision.set_global_policy("mixed_float16") at the beginning of your script (see the sketch after this list).
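
A minimal sketch of the second suggestion, assuming a TensorFlow-backed Keras install; the model-building call in the comments is hypothetical:

    import keras

    # Set the policy before any layers or the model are constructed, so
    # compute runs in float16 while variables stay in float32.
    keras.mixed_precision.set_global_policy("mixed_float16")

    # Build and compile the model only after the policy is in place, e.g.:
    # model = build_model()  # hypothetical helper
    # model.compile(optimizer="adam", loss="categorical_crossentropy")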
