-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Property to terminate tbb threads #11650
Property to terminate tbb threads #11650
Conversation
During inference done, tbb threads cannot be closed by itself, which cause memory leak and unload/lingering threads. Sometimes the tbb threads need to be terminate for resource(memory, thread) consumption This PR contains: 1. Add a new property to control whether force to terminate tbb threads. 2. Property key is "FORCE_TBB_TERMINATE", default value is false. 3. Explicitly to terminate tbb task scheduler during unload openvino dll if this property is set true. e.g: core.set_property(device, ov::force_tbb_terminate(true)); 4. If not set FORCE_TBB_TERMINATE, there will be no any additional tbb operations. Change-Id: I32dc0ba122bb19a9dbf3ba12fdd596aad9ac54b4
Change executorManager from static to be dynamic, the test case should fit this change.
Make frontendManger to be non-static instance. We should guard it is not released before Model, due to Model will use the mem allocated by frontendManger. So put frontendManager reference in ov::Model to make it work.
1. Add basic test case for ov::force_tbb_terminate property 2. set ov::force_tbb_terminate to be false
There is segmentfault issue during Constant destruction, which is caused by some shared memory is double free Test case is: ie = IECore() net = ie.read_network(model=test_net_xml, weights=test_net_bin) query_res = ie.query_network(net, device) func_net = ng.function_from_cnn(net) ops_net = func_net.get_ordered_ops() ie and net will be released before ops_net destruction, so Constant will free the shared memory that has been freed
3a030e5
to
31e5178
Compare
Tbb blocking_terminate calling will cause some segmentfault during run some special models, the reason may comes from block_terminate cause current thread block here to wait for tbb exit, but cannot handle some resource dependencies. After adopt terminate(), the dependencies can be resolved and no segmentfault any more. Change-Id: I0b920630a25cd3fd2747c57ec71ca749ba35573b
src/core/src/op/constant.cpp
Outdated
@@ -14,6 +14,7 @@ | |||
#include "ngraph/log.hpp" | |||
#include "ngraph/op/util/attr_types.hpp" | |||
#include "ngraph/util.hpp" | |||
#include "openvino/frontend/manager.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This single include creates a new dependency between the Core and the frontends module and I strongly think they should remain separated. It's ok if the other modules depend on the Core objects but not the other way around.
There's another way this could be done but without introducing such a dependency.
https://github.com/openvinotoolkit/openvino/pull/11236/files#diff-39eca8b24df9cf0ff779cc909e0c48937bc04e48613e9b2a99542f309065faa7R123-R124
In this solution the pointer has to be set "from outside" which is less convenient but doesn't introduce a whole new dependency in the architecture of the whole framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense, actually I have adopt this solution for ov::model, but not notice here. I will follow your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -716,6 +716,7 @@ class OPENVINO_API Constant : public Op { | |||
element::Type m_element_type; | |||
Shape m_shape{}; | |||
std::shared_ptr<ngraph::runtime::AlignedBuffer> m_data; | |||
std::shared_ptr<ov::frontend::FrontEndManager> m_femgr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be moved to the Node
class and additionally the pointer doesn't have to be a particular type like this one. A shared pointer to void should be sufficient to solve the issue this PR solves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I only found op::constant will trigger segment fault during its destruction, so only put the dependency into op::constant. Ok let me check whether it can work if move to node:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have applied void for it: std::shared_ptr m_femgr;
src/inference/src/ie_core.cpp
Outdated
@@ -34,6 +35,7 @@ | |||
#include "ngraph/opsets/opset.hpp" | |||
#include "ngraph/pass/constant_folding.hpp" | |||
#include "openvino/core/except.hpp" | |||
#include "openvino/frontend/manager.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This include also introduces a dependency of the Core object on an object from the frontends module. I think this should be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -488,7 +483,7 @@ CNNNetwork details::ReadNetwork(const std::string& modelPath, | |||
#endif | |||
|
|||
// Try to load with FrontEndManager | |||
auto& manager = get_frontend_manager(); | |||
auto manager = ov::frontend::get_frontend_manager(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change the get_frontend_manager
method fro this class is not used any more(in this file at least). Can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it cannot, this method is located in namespace ov::frontend. If we didn't explicitly specify its namespace, it cannot be removed. We can still keep it as the same as the original code.
|
||
class FrontEndManagerHolder { | ||
std::mutex m_mutex; | ||
std::weak_ptr<FrontEndManager> m_manager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a weak pointer intentionally? Should the FrontEndManager objects created in this pattern be able to be destroyed and then replaced with other instances if needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is designed as a weak pointer so that it can be destroyed during openvino dll/lib unloaded, ov::core also take a reference of it so that it will not be released before ov::core destruction.
* - False will not involve additional tbb operations when core destruction | ||
* @ingroup ov_runtime_cpp_prop_api | ||
*/ | ||
static constexpr Property<bool, PropertyMutability::RW> force_tbb_terminate{"FORCE_TBB_TERMINATE"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add bindings to it here:
https://github.com/openvinotoolkit/openvino/blob/master/src/bindings/python/src/pyopenvino/core/properties/properties.cpp
Without it Python API will lack new property. Tests for it could be useful as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will do it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
As CVS-68982 description, we should disable the test case which will load dynamic library in openvino static library compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python side is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OV Core
should be independent to any FrontEnd
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at this point I'm ok with the solution. Any possible future problems can be fixed iteratively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok as of fixing this particular problem.
@@ -114,6 +114,9 @@ class OPENVINO_API Model : public std::enable_shared_from_this<Model> { | |||
/// \brief Clones the original model | |||
std::shared_ptr<ov::Model> clone() const; | |||
|
|||
/// \brief Reference to frontend maanager | |||
std::shared_ptr<void> m_femgr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used to keep frontend manager instance is not released before ov::Model, due to frontend manager has been work as a dynamic instance rather than static instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we don't need to have FR manager here. We need to keep only one frontend which created this particular model. And it already should be done. Here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to keep frontend manager here, ov::Model already keeps a reference to a particular frontend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we don't need to have FR manager here. We need to keep only one frontend which created this particular model. And it already should be done. Here:
It cannot work, we will encounter problem in some CI test, sometimes frontend manager will released before ov::Model, which cause segment fault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to keep frontend manager here, ov::Model already keeps a reference to a particular frontend
I add such reference at ov::Model is to fix segment fault in CI test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we don't need to have FR manager here. We need to keep only one frontend which created this particular model. And it already should be done. Here:
https://github.com/openvinotoolkit/openvino/blob/master/src/core/include/openvino/core/model.hpp#L46It cannot work, we will encounter problem in some CI test, sometimes frontend manager will released before ov::Model, which cause segment fault.
Can we investigate these issues in order to understand how frontend manager affects ov::Model
? Frontend manager doesn't create ov::Model
, it creates frontends and frontend creates ov::Model
, it means that ov::Model
depends on frontend but doesn't depend on FE manager.
@@ -716,6 +716,7 @@ class OPENVINO_API Constant : public Op { | |||
element::Type m_element_type; | |||
Shape m_shape{}; | |||
std::shared_ptr<ngraph::runtime::AlignedBuffer> m_data; | |||
std::shared_ptr<void> m_femgr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used to keep frontend manager instance is not released before ov::op::Constant which has shared memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ov::op::Constant
should be hold by ov::Model
I don't understand why we need to duplicate frontends here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is duplicated, because we first resolve the segment fault for ov::op::Constant caused by dynamic frontend manager, and then found ov::Model also exist such problem, so add reference to ov::Model. I will check/verify whether can remove reference in ov::op::Constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilya-lavrenov @ilyachur I have done some test to remove femgr reference from ov::op::constants, see test PR:#12166
Test result shows that we still need keep this reference in ov::op::constants, some test failed, such as below 2 cases:
- test_IENetwork.py
- test_NGraph.py
You can reference to PR12166 CI result.
Anyhow I reproduce the same failure in local, such as test_NGraph.py
ov2022@ov2022-Z370-AORUS-Gaming-5:~/workspace/capi/openvino/src/bindings/python/tests_compatibility$ python3 -m pytest -s test_inference_engine/test_NGraph.py -v
========================================================================================= test session starts ==========================================================================================
platform linux -- Python 3.8.10, pytest-4.6.9, py-1.8.1, pluggy-0.13.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /home/ov2022/workspace/capi/openvino/src/bindings/python
plugins: anyio-3.5.0
collected 6 itemstest_inference_engine/test_NGraph.py::test_create_IENetwork_from_nGraph PASSED
test_inference_engine/test_NGraph.py::test_get_IENetwork_from_nGraph PASSED
test_inference_engine/test_NGraph.py::test_get_ops_from_IENetwork Segmentation fault (core dumped)
The failure test item is:
def test_get_ops_from_IENetwork():
ie = IECore()
net = ie.read_network(model=test_net_xml, weights=test_net_bin)
func = ng.function_from_cnn(net)
ops = func.get_ordered_ops()
ops_names = [op.friendly_name for op in ops]
assert len(ops_names) != 0
assert 'data' in ops_names
It's obvious that ov::Model has been released before ops, I dump the call trace as below:
test_inference_engine/test_NGraph.py ..
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fff939b03ba in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x1266dc0) at /usr/include/c++/9/bits/shared_ptr_base.h:155
155 _M_dispose();
(gdb) bt
#0 0x00007fff939b03ba in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x1266dc0) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#1 0x00007fff939a9b95 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count (this=0x7fffffff7908, __in_chrg=) at /usr/include/c++/9/bits/shared_ptr_base.h:730__shared_ptr (this=0x7fffffff7900, __in_chrg=)
#2 0x00007fff92924f10 in std::__shared_ptr<ngraph::runtime::AlignedBuffer, (__gnu_cxx::_Lock_policy)2>::
at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#3 0x00007fff92940612 in std::__shared_ptr<ngraph::runtime::AlignedBuffer, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x1267760, __r=...) at /usr/include/c++/9/bits/shared_ptr_base.h:1265
#4 0x00007fff929382fa in std::shared_ptrngraph::runtime::AlignedBuffer::operator= (this=0x1267760, __r=...) at /usr/include/c++/9/bits/shared_ptr.h:335
#5 0x00007fff9292f5e0 in ov::op::v0::Constant::Constant (this=0x1267570, __in_chrg=) at /home/ov2022/workspace/capi/openvino/src/core/src/op/constant.cpp:243__shared_count (this=0x7fff90685880, __in_chrg=) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#6 0x00007fff923ba647 in __gnu_cxx::new_allocatorov::op::v0::Constant::destroyov::op::v0::Constant (this=0x1267570, __p=0x1267570) at /usr/include/c++/9/ext/new_allocator.h:152
#7 0x00007fff923b9c31 in std::allocator_traits<std::allocatorov::op::v0::Constant >::destroyov::op::v0::Constant (__a=..., __p=0x1267570) at /usr/include/c++/9/bits/alloc_traits.h:496
#8 0x00007fff923b916d in std::_Sp_counted_ptr_inplace<ov::op::v0::Constant, std::allocatorov::op::v0::Constant, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x1267560)
at /usr/include/c++/9/bits/shared_ptr_base.h:557
#9 0x00007fff939b03c6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x1267560) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#10 0x00007fff939a9b95 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::
#11 0x00007fff93a5563a in std::__shared_ptr<ov::op::v0::Constant, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff90685878, __in_chrg=)
at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#12 0x00007fff93a5565a in std::shared_ptrov::op::v0::Constant::~shared_ptr (this=0x7fff90685878, _in_chrg=) at /usr/include/c++/9/bits/shared_ptr.h:103
#13 0x00007fff93a5ca2c in pybind11::class<ov::op::v0::Constant, std::shared_ptrov::op::v0::Constant, ov::Node>::dealloc (v_h=...)
at /home/ov2022/workspace/capi/openvino/src/bindings/python/thirdparty/pybind11/include/pybind11/pybind11.h:1612
#14 0x00007fff9398f7a1 in pybind11::detail::clear_instance (self=<_pyngraph.op.Constant at remote 0x7fff90685860>)
at /home/ov2022/workspace/capi/openvino/src/bindings/python/thirdparty/pybind11/include/pybind11/detail/class.h:399
#15 0x00007fff9398f8a3 in pybind11::detail::pybind11_object_dealloc (self=<_pyngraph.op.Constant at remote 0x7fff90685860>)
at /home/ov2022/workspace/capi/openvino/src/bindings/python/thirdparty/pybind11/include/pybind11/detail/class.h:419
#16 0x00000000005ee3fb in _Py_Dealloc (op=) at ../Objects/object.c:2215
#17 _Py_DECREF (filename=, lineno=541, op=) at ../Include/object.h:478
#18 _Py_XDECREF (op=) at ../Include/object.h:541
#19 list_dealloc (op=0x7fff9060eb40) at ../Objects/listobject.c:372
#20 0x00000000005ee8c0 in _Py_Dealloc (op=) at ../Objects/object.c:2215
#21 _Py_DECREF (filename=0x88c960 "../Objects/frameobject.c", lineno=430, op=) at ../Include/object.h:478
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not an issue of constant op. You can have the same problem if frontend creates ov::Tensor
. we need to fix the issue for common case, but it can require deeper refactoring of existed structures.
// guarantee m_data is released before femgr | ||
m_data = nullptr; | ||
m_femgr = nullptr; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @ilyachur
@@ -113,6 +117,7 @@ function(set_ie_threading_interface_for TARGET_NAME) | |||
if (TBB_FOUND) | |||
set(IE_THREAD_DEFINE "IE_THREAD_TBB") | |||
ie_target_link_libraries(${TARGET_NAME} ${LINK_TYPE} ${TBB_IMPORTED_TARGETS}) | |||
target_compile_definitions(${TARGET_NAME} ${COMPILE_DEF_TYPE} TBB_PREVIEW_WAITING_FOR_WORKERS=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we should duplicate this to ie_parallel.hpp
since if the customer does not use cmake, such definition is lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let me check how to solve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just put definition to ie_parallel.hpp
as well :)
src/tests/functional/plugin/shared/include/behavior/ov_plugin/core_integration.hpp
Show resolved
Hide resolved
if (name == ov::force_tbb_terminate.name()) { | ||
const auto flag = executorManager()->getTbbFlag(); | ||
return decltype(ov::force_tbb_terminate)::value_type(flag); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose for such needs we need to introduce get_property
without device name, since such property is Core specific, not device-specific.
Can we do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, It has been resolved in #12022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that PR you use set_property
with no device name, but I think we can also add get_property
with no device name to have mirrored API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can add such new interface.
namespace ov { | ||
namespace frontend { | ||
class FrontEndManager; | ||
std::shared_ptr<FrontEndManager> get_frontend_manager(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am I right that because of static frontend manager, we have issue with dynamically loaded OpenVINO.so:
- FE stores reference to OpenVINO.so
- OpenVINO stores references to frontend libraries
And because of that, we have cyclic dependencies and libraries are not unloaded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is the reason that why libopenvino_ir_frontend.so cannot be unloaded. But for libopenvino.so unload issue, there are more static variables and other dependencies/reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But for libopenvino.so unload issue, there are more static variables and other dependencies/reference.
Could you please provide more examples of such issues where static variable holds another DLL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are quit a lot such datas, not only static variable but also dependences data, I have ever setup a debug environment but not save the data, maybe I can re-setup and share the data to you later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilya-lavrenov I sent some details info to you by email, please check it, thanks!
@@ -114,6 +114,9 @@ class OPENVINO_API Model : public std::enable_shared_from_this<Model> { | |||
/// \brief Clones the original model | |||
std::shared_ptr<ov::Model> clone() const; | |||
|
|||
/// \brief Reference to frontend maanager | |||
std::shared_ptr<void> m_femgr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have this definition in public place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be move to protected or private place.
@@ -716,6 +716,7 @@ class OPENVINO_API Constant : public Op { | |||
element::Type m_element_type; | |||
Shape m_shape{}; | |||
std::shared_ptr<ngraph::runtime::AlignedBuffer> m_data; | |||
std::shared_ptr<void> m_femgr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ov::op::Constant
should be hold by ov::Model
I don't understand why we need to duplicate frontends here
Details:
Tickets: