tt-mlir uplifts of tt-metal have been blocked since Friday because MeshDevice::close() fails with a segfault / the error below. The cause is work_executor_ having an uninitialized work_executor_mode: it sometimes holds a garbage value that reads as WorkExecutorMode::ASYNCHRONOUS, which triggers a failure in the newly added call to work_executor_.reset() -> WorkExecutor::stop_worker() when this->worker_state = WorkerState::TERMINATE. The failure happens during teardown (after all tests finish) and starts with this tt-metal commit (thx @brataTT for helping bisect to this commit):
2025-01-24 2c2110c778 by Joseph Chu (Author [email protected]) : #0: Hoist SubDeviceManager/Lock-Step Allocator to MeshDevice
Passes one commit before this change. More details below and proposed fix.
To Reproduce
Sorry, I don't have a tt-mlir or ttnn repro, only a tt-forge-fe repro right now. I'm including instructions for it so we can test a fix if one comes (I already have a suggested fix). I don't necessarily expect anyone to use these instructions, but I'm recording them for safekeeping.
A special docker container is needed, along with the tt-forge-fe branch I created that points to the problematic tt-metal version above.
We can actually see the cause of the problem at runtime with some local debug prints at the end of the MeshDevice constructor, which show work_executor_mode as uninitialized/random. When it lands on ASYNCHRONOUS, the problem occurs during teardown.
Metal | INFO | KCM MeshDevice::create() - starting
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 1 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create() - calling initialize
Metal | INFO | KCM MeshDevice::initialize() - mesh_id: 1 set worker_mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 2 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 2
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 3 created work_executor_ with mode: WorkExecutorMode::
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 3
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 4 created work_executor_ with mode: WorkExecutorMode::ASYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 4
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 5 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 5
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 6 created work_executor_ with mode: WorkExecutorMode::
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 6
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 7 created work_executor_ with mode: WorkExecutorMode::ASYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 7
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 8 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 8
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 9 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 9
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 10 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 10
<snip>
============================== 9 passed in 13.14s ==============================
2025-01-26 18:51:23.252 | INFO | TorchDevice - KCM starting TTSystem destructor calling close_devices()
Metal | INFO | KCM starting MeshDevice::close() for id: 1 with submeshes_.size(): 9 tid: 788731338
Metal | INFO | KCM start closing submesh: 2
Metal | INFO | KCM starting MeshDevice::close() for id: 2 with submeshes_.size(): 0 tid: 788731338
Metal | INFO | KCM Before work_executor_.reset() for Mesh id: 2 work_executor_is_valid: true get_worker_mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM finished MeshDevice::close() for id: 2
Metal | INFO | KCM done closing submesh: 2
Metal | INFO | KCM start closing submesh: 3
Metal | INFO | KCM starting MeshDevice::close() for id: 3 with submeshes_.size(): 0 tid: 788731338
Metal | INFO | KCM Before work_executor_.reset() for Mesh id: 3 work_executor_is_valid: true get_worker_mode: WorkExecutorMode::
Metal | INFO | KCM finished MeshDevice::close() for id: 3
Metal | INFO | KCM done closing submesh: 3
Metal | INFO | KCM start closing submesh: 4
Metal | INFO | KCM starting MeshDevice::close() for id: 4 with submeshes_.size(): 0 tid: 788731338
Metal | INFO | KCM Before work_executor_.reset() for Mesh id: 4 work_executor_is_valid: true get_worker_mode: WorkExecutorMode::ASYNCHRONOUS
Metal | INFO | KCM WorkExecutor::stop_worker() - starting. worker_state: 1
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
tt_forge_signal_handler - signal: 6 (abort)
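The `std::system_error` / `Invalid argument` abort above is consistent with calling `join()` on a `std::thread` that was never started, which the C++ standard library reports with exactly that error. The following is a minimal sketch under that assumption, not tt-metal's actual `WorkExecutor`; `Mode`, `Worker`, and `stop_throws_invalid_argument` are hypothetical names used only for illustration.

```cpp
#include <system_error>
#include <thread>

// Hypothetical stand-in for an executor mode; not the tt-metal enum.
enum class Mode { Synchronous, Asynchronous };

// Hypothetical worker that only owns a thread in asynchronous mode.
class Worker {
public:
    explicit Worker(Mode mode) : mode_(mode) {}

    // Starts the background thread only when running asynchronously.
    void start() {
        if (mode_ == Mode::Asynchronous) {
            thread_ = std::thread([] { /* background work would go here */ });
        }
    }

    // Synchronous workers have nothing to stop. An asynchronous worker
    // joins its thread; if the thread was never started, join() throws
    // std::system_error with std::errc::invalid_argument.
    void stop() {
        if (mode_ == Mode::Synchronous) {
            return;
        }
        thread_.join();
    }

private:
    Mode mode_;
    std::thread thread_;
};

// Returns true if stop() fails the way the log above shows.
bool stop_throws_invalid_argument(Worker& worker) {
    try {
        worker.stop();
    } catch (const std::system_error& e) {
        return e.code() == std::errc::invalid_argument;
    }
    return false;
}
```

In this sketch, a worker whose mode says asynchronous but whose thread was never started fails in `stop()`, while a synchronous worker returns early. That matches the pattern in the log, where only the submesh reporting ASYNCHRONOUS reaches `stop_worker()`.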
One log placement in MeshDevice::close():
log_info(tt::LogMetal, "KCM Before work_executor_.reset() for Mesh id: {} work_executor_is_valid: {} get_worker_mode: {}", this->id(), work_executor_ != nullptr, work_executor_->get_worker_mode());
work_executor_.reset();
This seems like a reasonable fix: call the same 2 functions that would normally be called in MeshDevice::initialize(), which is skipped for sub-MeshDevice creation. Add these to the bottom of MeshDevice::MeshDevice:
Then we see proper values when the MeshDevice constructor finishes and again during teardown, and whatever issue was exposed in stop_worker() with WorkerState::TERMINATE is avoided.
Metal | INFO | KCM MeshDevice::create() - starting
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 1 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create() - calling initialize
Metal | INFO | KCM MeshDevice::initialize() - mesh_id: 1 set worker_mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 2 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 2
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 3 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 3
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 4 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 4
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 5 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 5
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 6 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 6
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 7 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 7
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 8 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 8
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 9 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 9
Metal | INFO | KCM MeshDevice::MeshDevice() - mesh_id: 10 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
Metal | INFO | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 10
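The general pattern behind this kind of fix can be sketched as follows: any state that a separate initialize step would normally set must also be set by the constructor, because some creation paths never call that step. This is a simplified illustration under stated assumptions, not the actual tt-metal change; `ExecMode`, `Executor`, `Mesh`, and their members are hypothetical names.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for an executor mode; not the tt-metal enum.
enum class ExecMode { Synchronous, Asynchronous };

class Executor {
public:
    void set_mode(ExecMode mode) { mode_ = mode; }
    ExecMode mode() const { return mode_; }

private:
    // A default member initializer guarantees a defined value even if
    // no caller ever sets the mode explicitly.
    ExecMode mode_ = ExecMode::Synchronous;
};

class Mesh {
public:
    Mesh() : executor_(std::make_unique<Executor>()) {
        // Set the mode in the constructor so objects created without
        // initialize() (such as submeshes here) still get a valid mode.
        executor_->set_mode(ExecMode::Synchronous);
    }

    // Only the parent mesh goes through this step in the sketch.
    void initialize() { executor_->set_mode(ExecMode::Synchronous); }

    // Submeshes are constructed directly and skip initialize().
    Mesh& create_submesh() {
        submeshes_.push_back(std::make_unique<Mesh>());
        return *submeshes_.back();
    }

    ExecMode executor_mode() const { return executor_->mode(); }

private:
    std::unique_ptr<Executor> executor_;
    std::vector<std::unique_ptr<Mesh>> submeshes_;
};
```

The default member initializer and the constructor assignment overlap on purpose in this sketch: either one alone would give every submesh a defined mode, and a defined mode is what prevents the nondeterministic values seen in the first log.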
…OUS in MeshDevice ctor
- Solves uninitialized work_executor from MeshDevice::create_submesh()
calls leading to ND values at runtime and segfault at
MeshDevice::close() in tt-forge-fe due to tt-metal/2c2110c778
I merged the PR that solved the issue exposed by tt-forge-fe regression, but leaving this bug open for @cfjchu to hopefully add a test to fill coverage hole.
Expected behavior
Tests pass, no failures or segfault.