Hexagon apps are failing recently #7828

Open
steven-johnson opened this issue Sep 1, 2023 · 8 comments

@steven-johnson
Contributor

steven-johnson commented Sep 1, 2023

As of a few days ago, apps/blur and apps/camera_pipe are failing to run with host-hvx builds on the linuxbot emulators. After some debugging on those devices, it looks like something about the paths to sim_remote is broken; running the relevant camera_pipe app directly fails with a segfault.

(EDIT: previous error message reported was wrong, please ignore)

@steven-johnson
Contributor Author

Actual dump result:

Success!

Thread 1 "camera_pipe_pro" received signal SIGSEGV, Segmentation fault.
--Type <RET> for more, q to quit, c to continue without paging--c
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff56e5560 in HWInternal::ReadSymbolValue(char const*, unsigned int*) ()
   from /home/halidenightly/Qualcomm/Hexagon_SDK/4.3.0/tools/HEXAGON_Tools/8.4.11/Tools/lib/iss/libhexagonissv65.so
#2  0x00007ffff7fc0232 in send_message(int, std::vector<int, std::allocator<int> > const&) ()
   from /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-64-linux-cmake/halide-build/src/runtime/hexagon_remote/libhalide_hexagon_host.so
#3  0x00007ffff7fc0eff in halide_hexagon_remote_release_library ()
   from /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-64-linux-cmake/halide-build/src/runtime/hexagon_remote/libhalide_hexagon_host.so
#4  0x0000555555594af4 in halide_hexagon_device_release ()
#5  0x00007ffff7fe0f6b in _dl_fini () at dl-fini.c:138
#6  0x00007ffff79ce8a7 in __run_exit_handlers (status=0, listp=0x7ffff7b74718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,

@steven-johnson
Contributor Author

So this appears to be a potential order-of-teardown problem: halide_hexagon_remote_release_library() is called at static dtor time; the sim variable is not yet null (i.e. its dtor hasn't run yet), but apparently its contents are no longer usable (or perhaps the call into ReadSymbolValue() is crashing because of some other order-of-destruction issue). This seems most likely to be related to the recent changes to how this code is built (i.e. using CMake)... but AFAICT there wasn't a meaningful code change, just build rules. So maybe we were just "getting lucky" with destruction order before?
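To make the suspected failure mode concrete, here is a minimal sketch of how teardown order alone can decide whether a release hook sees a usable object. Everything in it is hypothetical (FakeSim, TeardownList, release_sees_usable_sim, and the "rpc_call" symbol name are illustrations, not the Hexagon runtime or simulator API). It only models the rule that atexit handlers and static destructors run last-registered-first, so a change in registration order (for example from different link order) changes what the release hook observes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the simulator handle. NOT the real Hexagon API.
struct FakeSim {
    bool usable = true;

    // Reports failure instead of crashing when called after teardown.
    bool read_symbol(const std::string &name, unsigned *out) const {
        assert(out != nullptr);
        if (!usable) return false;
        *out = static_cast<unsigned>(name.size());  // dummy value
        return true;
    }
};

// Runs callbacks last-registered-first, the same order the C++ runtime uses
// for atexit handlers and static destructors.
class TeardownList {
public:
    void add(std::function<void()> fn) { callbacks_.push_back(std::move(fn)); }
    void run() {
        for (auto it = callbacks_.rbegin(); it != callbacks_.rend(); ++it) (*it)();
        callbacks_.clear();
    }

private:
    std::vector<std::function<void()>> callbacks_;
};

// Returns true if the release hook ran while the simulator was still usable.
bool release_sees_usable_sim(bool register_release_first) {
    FakeSim sim;
    bool ok = false;
    TeardownList teardown;
    // "rpc_call" is a made-up symbol name used only for illustration.
    auto release = [&] { unsigned v = 0; ok = sim.read_symbol("rpc_call", &v); };
    auto destroy_sim = [&] { sim.usable = false; };
    if (register_release_first) {
        teardown.add(release);      // registered first, so it runs last
        teardown.add(destroy_sim);
    } else {
        teardown.add(destroy_sim);
        teardown.add(release);      // registered last, so it runs first
    }
    teardown.run();
    return ok;
}
```

In this model, release_sees_usable_sim(true) returns false and release_sees_usable_sim(false) returns true: the same code behaves differently purely because of registration order, which would be consistent with "getting lucky" before the build change.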

In any event, this is gonna be very hard for me to debug further since the contents of ReadSymbolValue() are in Qualcomm code I don't have the source for -- I'm gonna assign this to @pranavb-ca to investigate. In the meantime, we may want/need to turn off testing of Hexagon on the buildbots to avoid the failure reports clogging things up.

@pranavb-ca
Contributor

@steven-johnson - Thanks for the spadework so far. Looking into this.

@pranavb-ca
Contributor

pranavb-ca commented Sep 6, 2023

I thought I had reproduced the problem, but I was wrong (it was user error).

In the past we have had problems with Hexagon tools version mismatches, i.e. hexagon_sim_remote and libhalide_hexagon_host.so are built against one version of the Hexagon simulator (libwrapper.so), but a different version of libwrapper.so is available and loaded at run time. libwrapper.so is meant to be usable across versions, but around the time of Hexagon LLVM tools 8.4.x a bug broke that cross-version compatibility.

Anyway, I checked the buildbot logs and it does appear that 8.4.11 tools are used to build hexagon_sim_remote and libhalide_hexagon_host.so and LD_LIBRARY_PATH points to the same tools when the apps are run. On my machine I am not able to repro the problem.
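For anyone repeating this check, a small helper along the following lines could compare the tools version embedded in a build-time path with the one in a run-time path. It is only a sketch: the function names are hypothetical, and it assumes the version is the directory component directly after "HEXAGON_Tools/", as in the libhexagonissv65.so path shown in the backtrace above. It does not query the simulator or the loader in any way.

```cpp
#include <string>

// Returns the path component that follows "HEXAGON_Tools/" (e.g. "8.4.11"),
// or an empty string if the marker is missing or nothing follows it.
// Assumes the directory layout seen in the backtrace in this issue.
std::string tools_version_from_path(const std::string &path) {
    const std::string marker = "HEXAGON_Tools/";
    const std::string::size_type start = path.find(marker);
    if (start == std::string::npos) return "";
    const std::string::size_type begin = start + marker.size();
    const std::string::size_type end = path.find('/', begin);
    if (end == std::string::npos) return path.substr(begin);
    return path.substr(begin, end - begin);
}

// True only when both paths carry a version and the versions are identical.
bool same_tools_version(const std::string &build_path,
                        const std::string &runtime_path) {
    const std::string a = tools_version_from_path(build_path);
    const std::string b = tools_version_from_path(runtime_path);
    return !a.empty() && a == b;
}
```

Feeding it the library path from the build log and the matching entry from LD_LIBRARY_PATH gives a quick yes/no answer, though it says nothing about which libwrapper.so the dynamic loader actually picked up.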

@steven-johnson - Any chance I could get hold of hexagon_sim_remote and halide-build/src/runtime/hexagon_remote/libhalide_hexagon_host.so that the buildbot builds?

@pranavb-ca
Contributor

Also, I do not understand why the apps are failing but the correctness tests that use HVX aren't crashing in the same manner.

@abadams
Member

abadams commented Sep 6, 2023

Steven is out for the week, but I can get you stuff from the bots if that's helpful. I'm not sure what you're asking for though, because as far as I understand it, those files are checked in to the repo, rather than automatically built on the bots. Can you point me to the buildbot log where you saw them built?

@pranavb-ca
Contributor

@abadams - for target offloading on the Hexagon simulator, we need two runtime binaries: hexagon_sim_remote and libhalide_hexagon_host.so. Both of these currently have two versions. The first is checked into the repository, has been there for a long time, and is updated on an ad-hoc basis. The second is built on the fly as part of every buildbot run (#7741). This second version started getting picked up by the builder once halide/build_bot#237 was merged into the buildbot repo, and I think that's when the apps started failing.

Locally, I am building the Hexagon runtime (hexagon_sim_remote and libhalide_hexagon_host.so) along with the rest of Halide using the same Hexagon SDK and Hexagon tools as the buildbot does. However, I am unable to reproduce the failures that we see on this buildbot, so I want to give the specific binaries that the buildbot produces a go. In a buildbot run, these are at /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-64-linux-cmake/halide-build/src/runtime/hexagon_remote/hexagon/bin/hexagon_sim_remote and /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-64-linux-cmake/halide-build/src/runtime/hexagon_remote/libhalide_hexagon_host.so.

Once these issues are fixed, we plan to remove the binaries that reside in the repository.

I myself am on vacation starting tomorrow and back next Wednesday. So, I propose that I revert halide/build_bot#237 for now so that the bots pick up the checked-in binaries again. I can then come back next week and take a look at this again. Sounds good?

@abadams
Member

abadams commented Sep 6, 2023

I see, thanks. I've merged the revert PR.
