Looking for solutions #137

Open
ran911da opened this issue Oct 22, 2024 · 1 comment

Comments

@ran911da commented Oct 22, 2024

Hi, could you please suggest some solutions? Training stopped after 3 epochs, even when I reduced the batch size. During evaluation, both GPUs show about 5% utilization, which is very low and makes it take a long time. Thank you in advance for your help.

|   0  NVIDIA GeForce RTX 2080 Ti   Off | 00000000:17:00.0 Off |                  N/A |
| 51%   50C    P2    54W / 300W         |  8448MiB / 11264MiB  |      0%      Default |
|                                       |                      |                  N/A |
+---------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti   Off | 00000000:65:00.0  On |                  N/A |
| 51%   51C    P2    65W / 300W         |  9034MiB / 11264MiB  |      1%      Default |
|                                       |                      |                  N/A |
+---------------------------------------+----------------------+----------------------+

File "/home/user/anaconda3/envs/py311/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 290, in call

File "/home/user/anaconda3/envs/py311/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 262, in convolution_op

3 root error(s) found.
(0) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

 [[div_no_nan_1/ReadVariableOp_2/_22]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

 [[div_no_nan_1/_31]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

(1) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

 [[div_no_nan_1/ReadVariableOp_2/_22]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

(2) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,128,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/stack2_block4_deep_2_cot_embed_1_conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_21448231]
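
For reference, a common mitigation for this kind of RESOURCE_EXHAUSTED failure is to let TensorFlow allocate GPU memory on demand and to shrink the batch size. The sketch below is only an illustration under the assumption of a TF 2.x environment; the BATCH_SIZE value is hypothetical, inferred from the failing tensor shape [64, 128, 14, 14].

    import tensorflow as tf

    # Ask TensorFlow to grow GPU memory on demand instead of reserving it
    # all up front; this must run before any op touches the GPUs.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)

    # The failing tensor shape [64, 128, 14, 14] suggests batch size 64;
    # halving it (hypothetical value) is the usual first step for OOM.
    BATCH_SIZE = 32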

@leondgarse (Owner) commented

I also used a 2080 for training previously and it worked fine, but that was years ago... Sometimes this can be caused by the environment, e.g. the CUDA version.

  • You may try executing evals.py alone to check whether something is wrong.
  • Have you ever trained using float16? You may set keras.mixed_precision.set_global_policy("mixed_float16") at the beginning of your script (see the sketch after this list).
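
A minimal sketch of the second suggestion, assuming a TensorFlow-backed Keras install; the model-building call in the comments is hypothetical:

    import keras

    # Set the policy before any layers or the model are constructed, so
    # compute runs in float16 while variables stay in float32.
    keras.mixed_precision.set_global_policy("mixed_float16")

    # Build and compile the model only after the policy is in place, e.g.:
    # model = build_model()  # hypothetical helper
    # model.compile(optimizer="adam", loss="categorical_crossentropy")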
