[Issue]: Multiple undefined behavior errors when getting device count with pytorch #272
Comments
Hi @LunNova. An internal ticket has been created to fix this issue. Thanks!
Hi @LunNova, thanks for reaching out and reporting the errors! All three errors have been assessed, and we will be working on addressing them shortly. For the first error, it is due to using reinterpret_cast here
For the second error, it is likely a misuse of
For the third one, it is caused by passing a struct by reference without initializing it properly. It should have been fixed in a commit from 2 weeks ago, here: 441bd9f#diff-fbf39ce9f5a449510205e3e855b5e57e27194285122b7ba5b6a7b48ab54d536bR510. I will keep you updated on the progress. Thanks!
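To illustrate the third error class described above (a struct passed by reference without proper initialization), here is a minimal sketch. The names `DeviceInfo`, `fill_info`, and `query_is_gpu` are hypothetical and are not taken from ROCR-Runtime; the sketch only shows why value-initializing the struct before the call avoids reading a field that the callee never wrote.

```cpp
#include <cstdint>

// Hypothetical stand-in for a runtime info struct; not the real ROCR type.
struct DeviceInfo {
    std::uint32_t node_id;
    bool is_gpu;
};

// Hypothetical callee that writes the fields only when the device is found.
// If `found` is false, the struct is left exactly as the caller passed it.
void fill_info(DeviceInfo& info, bool found) {
    if (found) {
        info.node_id = 7;
        info.is_gpu = true;
    }
}

bool query_is_gpu(bool found) {
    // `DeviceInfo info;` would leave both fields indeterminate, and reading
    // `is_gpu` after a call that did not write it is undefined behavior.
    // Value-initialization (`{}`) zeroes the fields first.
    DeviceInfo info{};
    fill_info(info, found);
    return info.is_gpu;
}
```

With the `{}` initializer, `query_is_gpu(false)` reads a well-defined `false` instead of whatever happened to be on the stack.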
I made an attempt at fixing the second error here, in case it's useful: #274. Thanks for the fast response!
I can confirm 441bd9f fixes the third error.
Hi @LunNova, just thought I would give an update regarding the fixes, since it has been taking some time. Upon further investigation, the second error turns out to be much more complicated than simply casting the type to hsa_amd_region_info_t by mistake. It is a result of trying to bootstrap AMD-specific behavior (hsa_amd_region_info_t) onto the HSA standard (hsa_region_info_t). The ideal way to fix this is to introduce proper AMD-specific APIs to handle these behaviors, but this is not going to be a trivial fix. I have been in discussion with several senior devs about the best way to handle this. The first error is similar to the second one. The callback handles have different signatures in the AMD-specific implementation compared to the more generic HSA standard signatures, even though both are essentially wrappers around opaque C handles. Fixing this is going to be very tricky and will likely break a lot of existing code. The good news is that both behaviors are intended and have been functional for the past couple of years. The bad news is that, because they are functional, there is a lot of code that depends on the current behaviors, which makes implementing changes more complicated. At this point I am working on pushing a change for the second error, and I am not sure whether there is a way to fix the first one.
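As an illustration of the callback signature mismatch discussed above: in C++, calling a function through a pointer of a different function type is undefined behavior, even when the two handle types have the same layout, and this is the kind of indirect call a sanitizer can report. The sketch below shows one common alternative to `reinterpret_cast`, a forwarding trampoline. All names here (`GenericAgent`, `VendorAgent`, `iterate_agents`, and so on) are hypothetical stand-ins and not the real HSA or ROCR-Runtime declarations, and this is not presented as the fix the maintainers intend.

```cpp
#include <cstdint>

// Hypothetical opaque handles: same layout, but distinct types.
struct GenericAgent { std::uint64_t handle; };
struct VendorAgent  { std::uint64_t handle; };

using GenericCallback = int (*)(GenericAgent, void*);
using VendorCallback  = int (*)(VendorAgent, void*);

// Carries the vendor-typed callback and its user data through the
// generic `void*` parameter.
struct Trampoline {
    VendorCallback cb;
    void* user_data;
};

// Has exactly the generic signature, so no function pointer cast is needed.
// It converts the handle explicitly and forwards to the vendor callback.
int forward_to_vendor(GenericAgent agent, void* data) {
    auto* t = static_cast<Trampoline*>(data);
    return t->cb(VendorAgent{agent.handle}, t->user_data);
}

// Hypothetical iteration API that only accepts the generic signature.
int iterate_agents(GenericCallback cb, void* data) {
    for (std::uint64_t h = 1; h <= 3; ++h) {
        int status = cb(GenericAgent{h}, data);
        if (status != 0) return status;  // stop on the first non-zero status
    }
    return 0;
}

// Vendor-typed callback that counts the agents it is shown.
int count_agent(VendorAgent, void* data) {
    ++*static_cast<int*>(data);
    return 0;
}

int count_agents_via_trampoline() {
    int count = 0;
    Trampoline t{&count_agent, &count};
    iterate_agents(&forward_to_vendor, &t);
    return count;
}
```

The trade-off is one extra indirection per callback invocation, and it only helps where the caller controls how the callback is registered; it does not by itself resolve the compatibility concerns raised in the comment.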
Getting the device count while using a ROCR-Runtime built with -fsanitize=undefined triggers multiple undefined behavior errors.
Tested on rocm-6.3.1 built with -fsanitize=undefined and with pytorch nightly 2024-12-29.