
failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #1124

Closed
albertz opened this issue Sep 13, 2022 · 2 comments

Comments

albertz commented Sep 13, 2022

I'm posting this here because I have not seen this error before.

...
train epoch 2, step 128, cost:model_transducer_time_sync_full_sum_neg_log_prob_transducer_time_sync_full_sum_neg_log_prob 319.9956197736683, loss 2239.9692, max_size:data 1422, max_size:orth_classes 73, mem_usage:GPU:0 5.0GB, num_seqs 7, 0.819 sec/step, elapsed 0:01:08, exp. remaining 0:12:41, complete 8.28%
train epoch 2, step 129, cost:model_transducer_time_sync_full_sum_neg_log_prob_transducer_time_sync_full_sum_neg_log_prob 351.5988874315881, loss 2109.5933, max_size:data 1488, max_size:orth_classes 74, mem_usage:GPU:0 5.0GB, num_seqs 6, 0.863 sec/step, elapsed 0:01:09, exp. remaining 0:12:42, complete 8.37%
2022-09-13 18:13:47.063942: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2022-09-13 18:13:47.064019: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Fatal Python error: Aborted

...

Thread 0x0000147f0e9b3700 (most recent call first): 
  File "/u/zeyer/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1441 in _call_tf_sessionrun

albertz commented Sep 13, 2022

It might be related to the WarpRna native CUDA implementation. Maybe that messes something up.
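(For context, here is a generic sketch of how such an error usually arises in a custom CUDA kernel; this is not the actual WarpRna code, and all names in it are made up. CUDA_ERROR_ILLEGAL_ADDRESS typically means some thread indexed outside an allocated buffer, e.g. because the launch rounds the grid size up and the bounds guard is missing or wrong.)

// Generic illustration only, not the WarpRna kernel; names are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // without this guard, the surplus threads of the
                         // last block access memory past the allocation,
                         // which is exactly an "illegal memory access"
    data[i] *= factor;
}

int main() {
    const int n = 1000;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    // 1000 elements, 256 threads per block -> 4 blocks = 1024 threads;
    // the guard above filters out the extra 24 threads.
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(d);
    return 0;
}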

But when you search for the error, you find lots of people reporting it:
pytorch/pytorch#21819
tensorflow/tensorflow#50735

But reading through the comments, it sounds more like an update to cuDNN or CUDA fixed the issue for those people, or that it was a hardware problem. It seems kind of arbitrary. Nothing there really hints at a specific bug, but that does not mean the same holds here.
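(One reason these reports look so arbitrary: kernel launches are asynchronous, so an illegal access is usually only reported at some later synchronization point, here the event poll in cuda_event.cc, which can be far away from the kernel that actually caused it. Below is a minimal sketch of how to make a suspect native kernel report the error at its own launch site while debugging. It uses the standard CUDA runtime API; the kernel name and arguments are placeholders, and this is not something RETURNN does by default.)

// Debugging sketch with a placeholder kernel; standard CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void suspect_kernel(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 0.0f;
}

// Launch, then synchronize and check immediately, so an illegal access is
// reported here instead of surfacing later at an unrelated event poll.
void launch_checked(float* buf, int n) {
    suspect_kernel<<<(n + 255) / 256, 256>>>(buf, n);
    cudaError_t err = cudaGetLastError();      // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised while the kernel ran
    if (err != cudaSuccess)
        fprintf(stderr, "suspect_kernel failed: %s\n", cudaGetErrorString(err));
}

int main() {
    const int n = 1000;
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));
    launch_checked(buf, n);
    cudaFree(buf);
    return 0;
}

Tools like cuda-memcheck (or compute-sanitizer on newer CUDA versions) can also pinpoint the offending access, though they slow training down considerably.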

albertz commented Sep 16, 2022

It might be related to the WarpRna native CUDA implementation. Maybe that messes something up.

Yes, that was it. See my fixes here:
https://github.com/rwth-i6/returnn/commits/master/returnn/extern/WarpRna

Specifically:
rwth-i6/warp-rna@9e61931
rwth-i6/warp-rna@5543cd9
