[Issue]: Multiple undefined behavior errors when getting device count with pytorch #272

Open
LunNova opened this issue Dec 30, 2024 · 5 comments

Comments

@LunNova

LunNova commented Dec 30, 2024

Getting the device count while using a ROCR-Runtime built with -fsanitize=undefined triggers multiple undefined behavior errors.

/build/source/runtime/hsa-runtime/core/runtime/amd_cpu_agent.cpp:160:27: runtime error: call to function rocr::image::FindKernelArgPool(hsa_amd_memory_pool_s, void*) through pointer to incorrect function type 'hsa_status_t (*)(hsa_region_s, void *)'
(/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x4f76a8): note: rocr::image::FindKernelArgPool(hsa_amd_memory_pool_s, void*) defined here
    #0 0x7fff766c1041 in VisitRegion /build/source/runtime/hsa-runtime/core/runtime/amd_cpu_agent.cpp:160
    #1 0x7fff767150bb in hsa_amd_agent_iterate_memory_pools /build/source/runtime/hsa-runtime/core/runtime/hsa_ext_amd.cpp:747
    #2 0x7ffff38a81a1 in roctracer::hsa_support::detail::hsa_amd_agent_iterate_memory_pools_callback(hsa_agent_s, hsa_status_t (*)(hsa_amd_memory_pool_s, void*), void*) (/nix/store/843f8857yvnjkc9c9dyzs4w4imdx3rnq-rocm-merged/lib/libroctracer64.so.4+0x1e1a1)
    #3 0x7fff768f7a36 in rocr::image::ImageRuntime::CreateImageManager(hsa_agent_s, void*) (/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x4f7a36)
    #4 0x7fff7673dde6 in operator() /build/source/runtime/hsa-runtime/core/inc/exceptions.h:88
    #5 0x7fff7673dde6 in IterateAgent /build/source/runtime/hsa-runtime/core/runtime/runtime.cpp:332
    #6 0x7fff76702458 in rocr::HSA::hsa_iterate_agents(hsa_status_t (*)(hsa_agent_s, void*), void*) (/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x302458)
    #7 0x7fff768f7fe2 in CreateSingleton /build/source/runtime/hsa-runtime/image/image_runtime.cpp:185
    #8 0x7fff768f7e35 in instance /build/source/runtime/hsa-runtime/image/image_runtime.cpp:165
    #9 0x7fff768f52f5 in rocr::image::hsa_amd_image_get_info_max_dim(hsa_agent_s, hsa_agent_info_t, void*) (/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x4f52f5)
    #10 0x7fff76702527 in hsa_agent_get_info /build/source/runtime/hsa-runtime/core/runtime/hsa.cpp:581
    #11 0x7ffff38a277f in roctracer::hsa_support::detail::hsa_agent_get_info_callback(hsa_agent_s, hsa_agent_info_t, void*) (/nix/store/843f8857yvnjkc9c9dyzs4w4imdx3rnq-rocm-merged/lib/libroctracer64.so.4+0x1877f)
    #12 0x7fff868b4e84  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b4e84)
    #13 0x7fff868b3676  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b3676)
    #14 0x7fff868b2598  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b2598)
    #15 0x7fff8680ccab  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x40ccab)
    #16 0x7fff868a0af9  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4a0af9)
    #17 0x7fff86568c7a  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x168c7a)
    #18 0x7ffff6c9bd56 in __pthread_once_slow.isra.0 (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x9bd56) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #19 0x7ffff6c9bdd0 in ___pthread_once (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x9bdd0) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #20 0x7fff86586b03  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x186b03)
    #21 0x7fffaf358092 in device_count_impl /build/source/c10/hip/HIPFunctions.cpp:20
    #22 0x7fffaf358092 in operator() /build/source/c10/hip/HIPFunctions.cpp:102
    #23 0x7fffaf358092 in c10::hip::device_count() /build/source/c10/hip/HIPFunctions.cpp:113
    #24 0x7fffb1b5c4d8 in at::cuda::is_available() /build/source/aten/src/ATen/hip/HIPContextLight.h:70
    #25 0x7fffb1b5c4d8 in at::cuda::detail::CUDAHooks::hasCUDA() const /build/source/aten/src/ATen/hip/detail/HIPHooks.cpp:152
    #26 0x7fffb95ef5b1 in at::Context::hasCUDA() /build/source/aten/src/ATen/Context.h:128
    #27 0x7fffb95ef5b1 in at::hasCUDA() /build/source/aten/src/ATen/Context.h:480
    #28 0x7fffb95ef5b1 in at::getNumGPUs() /build/source/aten/src/ATen/Context.h:521
    #29 0x7fffb95e96f7 in operator() /build/source/aten/src/ATen/Context.cpp:308
    #30 0x7fffb95e96f7 in at::Context::blasPreferredBackend() /build/source/aten/src/ATen/Context.cpp:317
    #31 0x7ffff0534aac in operator() /build/source/torch/csrc/Module.cpp:2192
    #32 0x7ffff0534aac in call_impl<at::BlasBackend, initModule()::<lambda()>&, pybind11::detail::void_type> /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/cast.h:1631
    #33 0x7ffff0534aac in call<at::BlasBackend, pybind11::detail::void_type, initModule()::<lambda()>&> /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/cast.h:1599
    #34 0x7ffff0534aac in operator() /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/pybind11.h:279
    #35 0x7ffff0534aac in _FUN /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/pybind11.h:249
    #36 0x7fffefb59bee in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/pybind11.h:971
    #37 0x7ffff7239988 in cfunction_call (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x239988) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #38 0x7ffff723a710 in _PyObject_MakeTpCall (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x23a710) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #39 0x7ffff7325117 in _PyEval_EvalFrameDefault (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x325117) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #40 0x7ffff732e2ec in PyEval_EvalCode.localalias (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x32e2ec) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #41 0x7ffff733269f in run_eval_code_obj (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x33269f) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #42 0x7ffff73a5445 in run_mod (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x3a5445) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #43 0x7ffff7417dfc in _PyRun_SimpleFileObject (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x417dfc) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #44 0x7ffff7418c00 in _PyRun_AnyFileObject.localalias (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x418c00) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #45 0x7ffff741f35a in Py_RunMain.localalias (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x41f35a) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #46 0x7ffff6c2a1fb in __libc_start_call_main (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x2a1fb) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #47 0x7ffff6c2a2b8 in __libc_start_main@GLIBC_2.2.5 (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x2a2b8) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #48 0x401074 in _start (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/bin/python3.12+0x401074) (BuildId: a7973433d3175eb248d85d13a1b80d149f683c56)

/build/source/runtime/hsa-runtime/core/runtime/hsa.cpp:1077:30: runtime error: load of value 40961, which is not a valid value for type 'hsa_region_info_t'
    #0 0x7fff7670453e in hsa_region_get_info /build/source/runtime/hsa-runtime/core/runtime/hsa.cpp:1077
    #1 0x7fff769002d8 in Initialize /build/source/runtime/hsa-runtime/image/image_manager_kv.cpp:179
    #2 0x7fff768f78c6 in rocr::image::ImageRuntime::CreateImageManager(hsa_agent_s, void*) (/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x4f78c6)
    #3 0x7fff7673df66 in operator() /build/source/runtime/hsa-runtime/core/inc/exceptions.h:88
    #4 0x7fff7673df66 in IterateAgent /build/source/runtime/hsa-runtime/core/runtime/runtime.cpp:332
    #5 0x7fff76702458 in rocr::HSA::hsa_iterate_agents(hsa_status_t (*)(hsa_agent_s, void*), void*) (/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x302458)
    #6 0x7fff768f7fe2 in CreateSingleton /build/source/runtime/hsa-runtime/image/image_runtime.cpp:185
    #7 0x7fff768f7e35 in instance /build/source/runtime/hsa-runtime/image/image_runtime.cpp:165
    #8 0x7fff768f52f5 in rocr::image::hsa_amd_image_get_info_max_dim(hsa_agent_s, hsa_agent_info_t, void*) (/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x4f52f5)
    #9 0x7fff76702527 in hsa_agent_get_info /build/source/runtime/hsa-runtime/core/runtime/hsa.cpp:581
    #10 0x7ffff38a277f in roctracer::hsa_support::detail::hsa_agent_get_info_callback(hsa_agent_s, hsa_agent_info_t, void*) (/nix/store/843f8857yvnjkc9c9dyzs4w4imdx3rnq-rocm-merged/lib/libroctracer64.so.4+0x1877f)
    #11 0x7fff868b4e84  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b4e84)
    #12 0x7fff868b3676  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b3676)
    #13 0x7fff868b2598  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b2598)
    #14 0x7fff8680ccab  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x40ccab)
    #15 0x7fff868a0af9  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4a0af9)
    #16 0x7fff86568c7a  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x168c7a)
    #17 0x7ffff6c9bd56 in __pthread_once_slow.isra.0 (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x9bd56) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #18 0x7ffff6c9bdd0 in ___pthread_once (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x9bdd0) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #19 0x7fff86586b03  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x186b03)
    #20 0x7fffaf358092 in device_count_impl /build/source/c10/hip/HIPFunctions.cpp:20
    #21 0x7fffaf358092 in operator() /build/source/c10/hip/HIPFunctions.cpp:102
    #22 0x7fffaf358092 in c10::hip::device_count() /build/source/c10/hip/HIPFunctions.cpp:113
    #23 0x7fffb1b5c4d8 in at::cuda::is_available() /build/source/aten/src/ATen/hip/HIPContextLight.h:70

/build/source/runtime/hsa-runtime/core/runtime/runtime.cpp:984:27: runtime error: null pointer passed as argument 2, which is declared to never be null
/nix/store/babj1136imm7z22i72111h8m4fgaz5y3-gcc-prefix/lib/gcc/x86_64-unknown-linux-gnu/14.2.1/../../../../x86_64-unknown-linux-gnu/include/string.h:44:28: note: nonnull attribute specified here
    #0 0x7fff76741a9b in PtrInfo /build/source/runtime/hsa-runtime/core/runtime/runtime.cpp:984
    #1 0x7fff766f2271 in AllowAccess /build/source/runtime/hsa-runtime/core/runtime/amd_memory_region.cpp:489
    #2 0x7fff767428ad in AllowAccess /build/source/runtime/hsa-runtime/core/runtime/runtime.cpp:708
    #3 0x7fff76715364 in rocr::AMD::hsa_amd_agents_allow_access(unsigned int, hsa_agent_s const*, unsigned int const*, void const*) (/nix/store/62bsg8k600m71hih5l9fm2igx1rfcf23-rocm-runtime-6.3.1/lib/libhsa-runtime64.so.1+0x315364)
    #4 0x7ffff389fd59 in roctracer::hsa_support::(anonymous namespace)::AgentsAllowAccessIntercept(unsigned int, hsa_agent_s const*, unsigned int const*, void const*) (/nix/store/843f8857yvnjkc9c9dyzs4w4imdx3rnq-rocm-merged/lib/libroctracer64.so.4+0x15d59)
    #5 0x7ffff38a87e0 in roctracer::hsa_support::detail::hsa_amd_agents_allow_access_callback(unsigned int, hsa_agent_s const*, unsigned int const*, void const*) (/nix/store/843f8857yvnjkc9c9dyzs4w4imdx3rnq-rocm-merged/lib/libroctracer64.so.4+0x1e7e0)
    #6 0x7fff868b72dc  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b72dc)
    #7 0x7fff868d248b  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4d248b)
    #8 0x7fff868b14e4  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b14e4)
    #9 0x7fff868b37eb  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b37eb)
    #10 0x7fff868b2598  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4b2598)
    #11 0x7fff8680ccab  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x40ccab)
    #12 0x7fff868a0af9  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x4a0af9)
    #13 0x7fff86568c7a  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x168c7a)
    #14 0x7ffff6c9bd56 in __pthread_once_slow.isra.0 (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x9bd56) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #15 0x7ffff6c9bdd0 in ___pthread_once (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x9bdd0) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #16 0x7fff86586b03  (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libamdhip64.so.6+0x186b03)
    #17 0x7fffaf358092 in device_count_impl /build/source/c10/hip/HIPFunctions.cpp:20
    #18 0x7fffaf358092 in operator() /build/source/c10/hip/HIPFunctions.cpp:102
    #19 0x7fffaf358092 in c10::hip::device_count() /build/source/c10/hip/HIPFunctions.cpp:113
    #20 0x7fffb1b5c4d8 in at::cuda::is_available() /build/source/aten/src/ATen/hip/HIPContextLight.h:70
    #21 0x7fffb1b5c4d8 in at::cuda::detail::CUDAHooks::hasCUDA() const /build/source/aten/src/ATen/hip/detail/HIPHooks.cpp:152
    #22 0x7fffb95ef5b1 in at::Context::hasCUDA() /build/source/aten/src/ATen/Context.h:128
    #23 0x7fffb95ef5b1 in at::hasCUDA() /build/source/aten/src/ATen/Context.h:480
    #24 0x7fffb95ef5b1 in at::getNumGPUs() /build/source/aten/src/ATen/Context.h:521
    #25 0x7fffb95e96f7 in operator() /build/source/aten/src/ATen/Context.cpp:308
    #26 0x7fffb95e96f7 in at::Context::blasPreferredBackend() /build/source/aten/src/ATen/Context.cpp:317
    #27 0x7ffff0534aac in operator() /build/source/torch/csrc/Module.cpp:2192
    #28 0x7ffff0534aac in call_impl<at::BlasBackend, initModule()::<lambda()>&, pybind11::detail::void_type> /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/cast.h:1631
    #29 0x7ffff0534aac in call<at::BlasBackend, pybind11::detail::void_type, initModule()::<lambda()>&> /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/cast.h:1599
    #30 0x7ffff0534aac in operator() /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/pybind11.h:279
    #31 0x7ffff0534aac in _FUN /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/pybind11.h:249
    #32 0x7fffefb59bee in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /nix/store/clrlqkcvr4h67fbnv00pf29v02k4lhyr-python3.12-pybind11-2.13.6/include/pybind11/pybind11.h:971
    #33 0x7ffff7239988 in cfunction_call (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x239988) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #34 0x7ffff723a710 in _PyObject_MakeTpCall (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x23a710) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #35 0x7ffff7325117 in _PyEval_EvalFrameDefault (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x325117) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #36 0x7ffff732e2ec in PyEval_EvalCode.localalias (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x32e2ec) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #37 0x7ffff733269f in run_eval_code_obj (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x33269f) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #38 0x7ffff73a5445 in run_mod (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x3a5445) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #39 0x7ffff7417dfc in _PyRun_SimpleFileObject (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x417dfc) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #40 0x7ffff7418c00 in _PyRun_AnyFileObject.localalias (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x418c00) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #41 0x7ffff741f35a in Py_RunMain.localalias (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/lib/libpython3.12.so.1.0+0x41f35a) (BuildId: 76e44d012632c54edc4c62b5eb28bc9cf4dcb48c)
    #42 0x7ffff6c2a1fb in __libc_start_call_main (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x2a1fb) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #43 0x7ffff6c2a2b8 in __libc_start_main@GLIBC_2.2.5 (/nix/store/7k5zyk2jw8vi3ddc3j6zflxxsidqh297-rocm-hip-libraries-meta/lib/libc.so.6+0x2a2b8) (BuildId: bf320110569c8ec2425e9a0c5e4eb7e97f1fb6e4)
    #44 0x401074 in _start (/nix/store/c9m6yd8fg1flz2j5r4bif1ib5j20a0cy-python3-3.12.8/bin/python3.12+0x401074) (BuildId: a7973433d3175eb248d85d13a1b80d149f683c56)

Tested on rocm-6.3.1 built with -fsanitize=undefined and with pytorch nightly 2024-12-29

@ppanchad-amd

Hi @LunNova. Internal ticket has been created to fix this issue. Thanks!

@tcgu-amd

tcgu-amd commented Jan 3, 2025

Hi @LunNova, thanks for reaching out and reporting the errors!

All three errors have been assessed so far, and we will work on addressing them shortly.

The first error is due to the use of reinterpret_cast here:

return reinterpret_cast<const AMD::CpuAgent *>(agent)->VisitRegion(

This works in practice but is arguably not best practice. I will see whether there are other options.
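
To illustrate the class of problem the sanitizer reports (a callback invoked through a function-pointer type other than its own), here is a minimal sketch of one alternative to casting the function pointer: a small trampoline that converts the handle and forwards the call. All names below (`region_handle`, `pool_handle`, `visit_regions`, `visit_pools`, `pool_thunk`) are hypothetical stand-ins, not the real ROCR-Runtime types, and this is not necessarily the approach the maintainers would take.

```cpp
#include <cstdint>

// Hypothetical stand-ins for two opaque handle types with identical layout.
struct region_handle { std::uint64_t handle; };
struct pool_handle { std::uint64_t handle; };

using region_callback = int (*)(region_handle, void*);
using pool_callback = int (*)(pool_handle, void*);

// An iterator that only knows about region callbacks.
int visit_regions(region_callback cb, void* data) {
  for (std::uint64_t h = 1; h <= 3; ++h) {
    if (int status = cb(region_handle{h}, data)) return status;
  }
  return 0;
}

// Carries the caller's real callback and user data through the void* slot.
struct pool_thunk {
  pool_callback cb;
  void* data;
};

// Has exactly the signature visit_regions expects, so no function-pointer
// cast is needed; it converts the handle and forwards to the pool callback.
int region_to_pool_trampoline(region_handle r, void* data) {
  auto* thunk = static_cast<pool_thunk*>(data);
  return thunk->cb(pool_handle{r.handle}, thunk->data);
}

int visit_pools(pool_callback cb, void* data) {
  pool_thunk thunk{cb, data};
  return visit_regions(&region_to_pool_trampoline, &thunk);
}

// Example pool callback: accumulates the handle values it is shown.
int sum_pool_handles(pool_handle p, void* data) {
  *static_cast<std::uint64_t*>(data) += p.handle;
  return 0;
}
```

The cost is one extra indirect call per visited element, in exchange for every call going through a pointer of the callee's actual type.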

The second error is likely a misuse of hsa_region_info_t in https://github.com/ROCm/ROCR-Runtime/blob/2cc279dbbcd628116e6e28d7b9224e5e29af6861/runtime/hsa-runtime/image/image_manager_kv.cpp#L180C51-L180C68; it should be hsa_amd_region_info_t instead. This should be a simple fix.
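
The value 40961 in the sanitizer message is 0xA001, which the sanitizer treats as outside the valid range of the standard enum. As a rough sketch of one way to avoid materializing a vendor value as the standard enum type, a query function can take the attribute as a plain integer and dispatch over both sets of enumerators. The enum names and members below are hypothetical and do not mirror the real HSA headers; this is an illustration under those assumptions, not the fix adopted in ROCR-Runtime.

```cpp
#include <cstdint>

// Hypothetical stand-ins; the real HSA enums have different names and members.
enum standard_region_info : std::uint32_t { STANDARD_REGION_INFO_SIZE = 2 };
enum vendor_region_info : std::uint32_t { VENDOR_REGION_INFO_EXAMPLE = 0xA001 };

enum class attribute_family { standard, vendor, unknown };

// Takes the attribute as a raw integer, so a vendor value is never loaded
// through the standard enum type.
attribute_family classify_attribute(std::uint32_t attribute) {
  switch (attribute) {
    case STANDARD_REGION_INFO_SIZE:
      return attribute_family::standard;
    case VENDOR_REGION_INFO_EXAMPLE:
      return attribute_family::vendor;
    default:
      return attribute_family::unknown;
  }
}
```

Giving the enums a fixed underlying type, as the sketch does, also makes every value of that underlying type a valid enum value in C++, although that option may not be available for a header that must stay usable from C.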

The third error is caused by passing a struct by reference without initializing it properly. It should already be fixed by a commit from two weeks ago: 441bd9f#diff-fbf39ce9f5a449510205e3e855b5e57e27194285122b7ba5b6a7b48ab54d536bR510.
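
The sanitizer note for this error points at a nonnull attribute in string.h, so a null pointer appears to have reached a function such as memcpy. The sketch below shows the general pattern of giving a struct defined initial state and guarding the copy. It does not reproduce commit 441bd9f; `pointer_info` and `copy_payload` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical struct; default member initializers give it a defined state
// even when a caller declares it without filling it in.
struct pointer_info {
  const void* base = nullptr;
  std::size_t size = 0;
};

// Copies info.size bytes from info.base into dst, but never hands a null
// pointer to memcpy. Returns the number of bytes copied.
std::size_t copy_payload(void* dst, const pointer_info& info) {
  if (dst == nullptr || info.base == nullptr || info.size == 0) return 0;
  std::memcpy(dst, info.base, info.size);
  return info.size;
}
```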

I will keep you updated on the progress. Thanks!


@LunNova
Author

LunNova commented Jan 3, 2025

I made an attempt at fixing the second error here in case it's useful #274

Thanks for the fast response!

@LunNova
Author

LunNova commented Jan 11, 2025

I can confirm 441bd9f fixes the third error

@tcgu-amd

Hi @LunNova, I wanted to give an update on the fixes, since they have been taking some time. Upon further investigation, the second error turns out to be much more complicated than a mistaken cast that should have used hsa_amd_region_info_t. It is the result of trying to bootstrap AMD-specific behavior (hsa_amd_region_info_t) onto the HSA standard (hsa_region_info_t). The ideal fix is to introduce proper AMD-specific APIs to handle these behaviors, but that is not going to be trivial. I have been discussing the best way to handle this with several senior devs.

The first error is similar to the second one. The callback handles have different signatures in the AMD-specific implementation than in the more generic HSA standard, even though both are essentially wrappers around opaque C handles. Fixing this is going to be very tricky and will likely break a lot of existing code.

The good news is that both behaviors are intended and have been functional for the past couple of years. The bad news is that, because they are functional, a lot of code depends on the current behavior, which makes implementing changes more complicated. At this point I am working on pushing a change for the second error, and I am not sure whether there is a way to fix the first one.
