btl/ofi/mempatcher not happy with sessions when using enable-mca-dso #13021

Closed
hppritcha opened this issue Jan 6, 2025 · 4 comments

Comments

@hppritcha
Member

More problems with the OFI BTL and its use of mempatcher alongside the OFI libfabric memory monitoring system. Again, the sessions_init_twice test shows the problem:

[nid001387:227530] [ 0] /lib64/libpthread.so.0(+0x16910)[0x1545eea76910]
[nid001387:227530] [ 1] g/ompi/foobar/lib/libopen-pal.so.0(opal_patcher_base_restore_all+0x54)[0x1545ee81d914]
[nid001387:227530] [ 2] g/ompi/foobar/lib/libopen-pal.so.0(+0xbdaa8)[0x1545ee81daa8]
[nid001387:227530] [ 3] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_framework_close+0x12c)[0x1545ee79f29c]
[nid001387:227530] [ 4] g/ompi/foobar/lib/libopen-pal.so.0(+0xbb181)[0x1545ee81b181]
[nid001387:227530] [ 5] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_component_close+0x2c)[0x1545ee79c9f1]
[nid001387:227530] [ 6] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_components_close+0x57)[0x1545ee79cafd]
[nid001387:227530] [ 7] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_framework_components_close+0x2d)[0x1545ee79caa4]
[nid001387:227530] [ 8] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_framework_close+0x146)[0x1545ee79f2b6]
[nid001387:227530] [ 9] g/ompi/foobar/lib/libopen-palmca_common_ofi.so.0(+0x2b8e)[0x1545ed09bb8e]
[nid001387:227530] [10] g/ompi/foobar/lib/libopen-palmca_common_ofi.so.0(opal_common_ofi_close+0x2d)[0x1545ed09bbee]
[nid001387:227530] [11] g/ompi/foobar/lib/openmpi/mca_btl_ofi.so(+0x424e)[0x1545ed0a724e]
[nid001387:227530] [12] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_component_close+0x2c)[0x1545ee79c9f1]
[nid001387:227530] [13] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_components_close+0x57)[0x1545ee79cafd]
[nid001387:227530] [14] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_framework_components_close+0x2d)[0x1545ee79caa4]
[nid001387:227530] [15] [nid001387:227529] [ 0] /lib64/libpthread.so.0(+0x16910)[0x148dd60d2910]
[nid001387:227529] [ 1] g/ompi/foobar/lib/libopen-pal.so.0(+0xaf946)[0x1545ee80f946]
[nid001387:227530] [16] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_framework_close+0x12c)[0x1545ee79f29c]
[nid001387:227530] [17] g/ompi/foobar/lib/libopen-pal.so.0(opal_patcher_base_restore_all+0x54)[0x148dd5e79914]
[nid001387:227529] [ 2] g/ompi/foobar/lib/libopen-pal.so.0(+0xbdaa8)[0x148dd5e79aa8]
[nid001387:227529] [ 3] g/ompi/foobar/lib/libmpi.so.0(+0x12f30f)[0x1545eebd730f]
[nid001387:227530] [18] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_framework_close+0x12c)[0x1545ee79f29c]
[nid001387:227530] [19] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_framework_close+0x12c)[0x148dd5dfb29c]
[nid001387:227529] [ 4] g/ompi/foobar/lib/libopen-pal.so.0(+0xbb181)[0x148dd5e77181]
[nid001387:227529] [ 5] g/ompi/foobar/lib/libopen-pal.so.0(mca_base_component_close+0x2c)[0x148dd5df89f1]
[nid001387:227529] [ 6] g/ompi/foobar/lib/libmpi.so.0(+0x9103e)[0x1545eeb3903e]

This happens in the second call to MPI_Session_finalize.
It does not happen with a default build, only when Open MPI is configured with --enable-mca-dso.

@jsquyres
Member

jsquyres commented Jan 6, 2025

@hppritcha Could this be related to #13014?

@hppritcha
Member Author

Maybe, not sure.

hppritcha added a commit to hppritcha/ompi that referenced this issue Jan 6, 2025
Turns out that when Open MPI is configured with --enable-mca-dso
and is using the OFI MTL/BTL/common, a problem is brought out
with the patcher framework the second time through closing the
bml and hence btl frameworks.

See issue open-mpi#13021.

This patch fixes this problem.

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha
Member Author

This was a different problem, not related to #13014.

hppritcha added a commit to hppritcha/ompi that referenced this issue Jan 10, 2025
Turns out that when Open MPI is configured with --enable-mca-dso
and is using the OFI MTL/BTL/common, a problem is brought out
with the patcher framework the second time through closing the
bml and hence btl frameworks.

See issue open-mpi#13021.

This patch fixes this problem.

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 860bbd6)
@hppritcha
Member Author

Fixed via #13020 and #13034.
