Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Report] MeshDevice::close() fails for sub MeshDevice due to uninitialized work_executor_ (randomly ASYNC) after commit 2c2110c778 #17119

Open
kmabeeTT opened this issue Jan 26, 2025 · 1 comment
Assignees
Labels
bug Something isn't working forge P0

Comments

@kmabeeTT
Copy link
Contributor

proposed fix included.

Describe the bug

tt-mlir uplifts of tt-metal are blocked since Friday due to MeshDevice::close() failing with segfault / error below due to work_executor_ having uninitialized work_executor_mode (sometimes garbage value seen as WorkExecutorMode::ASYNCHRONOUS which triggers failire in newly added call to work_executor_.reset() -> WorkExecutor::stop_worker() path when this->worker_state = WorkerState::TERMINATE) during teardown (after all tests finish) after this tt-metal commit (thx @brataTT for helping bisect to this comit):

2025-01-24 2c2110c778 by Joseph Chu (Author [email protected]) : #0: Hoist SubDeviceManager/Lock-Step Allocator to MeshDevice

Passes one commit before this change. More details below and proposed fix.

To Reproduce

Sorry, I don't have tt-mlir or ttnn repro, only tt-forge-fe repo right now. Including instructions there, we can test out fix if one comes (I have a suggested fix already). Don't necessarily expect anyone to use these instructions, but recording for safe keeping.

Need special docker container, and project tt-forge-fe branch I created to point to problematic tt-metal version above.

% ird reserve --timeout 3-0 --docker-image ghcr.io/tenstorrent/tt-forge-fe/tt-forge-fe-ird-ubuntu-22-04:latest wormhole_b0 
% git clone [email protected]:tenstorrent/tt-forge-fe.git -b kmabee/tt_metal_jan24_mesh_device_issue

% sudo apt-get update
% sudo apt install -y libgl1 libglx-mesa0

// Build - takes 20 minutes:
% cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Debug |& tee cmake_cfg.log
% cmake --build build |& tee build.log

// Run - Takes 10 seconds:
% pytest -svv forge/test/mlir/llama/tests/test_specific_ops_llama32.py::test_multiply -m "push"  |& tee 9_tests.log
<see error below>

Expected behavior

Tests pass, no failures or segfault.

Debug Details

We can actually see the cause of the problem during runtime with some local debug prints at end of MeshDevice constructor showing work_executor_mode uninitialized/random. When it lands on ASYNCRONOUS problem will occur during teardown.

                  Metal | INFO     | KCM MeshDevice::create() - starting
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 1 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create() - calling initialize
                  Metal | INFO     | KCM MeshDevice::initialize() - mesh_id: 1 set worker_mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 2 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 2
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 3 created work_executor_ with mode: WorkExecutorMode::
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 3
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 4 created work_executor_ with mode: WorkExecutorMode::ASYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 4
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 5 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 5
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 6 created work_executor_ with mode: WorkExecutorMode::
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 6
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 7 created work_executor_ with mode: WorkExecutorMode::ASYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 7
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 8 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 8
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 9 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 9
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 10 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 10
<snip>
============================== 9 passed in 13.14s ==============================
2025-01-26 18:51:23.252 | INFO     | TorchDevice     - KCM starting TTSystem destructor calling close_devices()
                  Metal | INFO     | KCM starting MeshDevice::close() for id: 1 with submeshes_.size(): 9 tid: 788731338
                  Metal | INFO     | KCM start closing submesh: 2
                  Metal | INFO     | KCM starting MeshDevice::close() for id: 2 with submeshes_.size(): 0 tid: 788731338
                  Metal | INFO     | KCM Before work_executor_.reset() for Mesh id: 2 work_executor_is_valid: true get_worker_mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM finished MeshDevice::close() for id: 2
                  Metal | INFO     | KCM done closing submesh: 2
                  Metal | INFO     | KCM start closing submesh: 3
                  Metal | INFO     | KCM starting MeshDevice::close() for id: 3 with submeshes_.size(): 0 tid: 788731338
                  Metal | INFO     | KCM Before work_executor_.reset() for Mesh id: 3 work_executor_is_valid: true get_worker_mode: WorkExecutorMode::
                  Metal | INFO     | KCM finished MeshDevice::close() for id: 3
                  Metal | INFO     | KCM done closing submesh: 3
                  Metal | INFO     | KCM start closing submesh: 4
                  Metal | INFO     | KCM starting MeshDevice::close() for id: 4 with submeshes_.size(): 0 tid: 788731338
                  Metal | INFO     | KCM Before work_executor_.reset() for Mesh id: 4 work_executor_is_valid: true get_worker_mode: WorkExecutorMode::ASYNCHRONOUS
                  Metal | INFO     | KCM WorkExecutor::stop_worker() - starting. worker_state: 1
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
tt_forge_signal_handler - signal: 6 (abort)

One log placement in MeshDevice::close():

    log_info(tt::LogMetal, "KCM Before work_executor_.reset() for Mesh id: {} work_executor_is_valid: {} get_worker_mode: {}", this->id(), work_executor_ != nullptr, work_executor_->get_worker_mode());
    work_executor_.reset();

One log placement in MeshDevice::MeshDevice():

    work_executor_ = std::make_unique<WorkExecutor>(0 /* worker_core */, mesh_id_);
    log_info(tt::LogMetal, "KCM MeshDevice::MeshDevice() - mesh_id: {} created work_executor_ with mode: {}", mesh_id_, work_executor_->get_worker_mode());
}

Proposed Fix

This seems like reasonable fix. Call the same 2 functions that would normally be done in MeshDevice::Initialize() which is skipped for sub MeshDevice creation. Add these to bottom of MeshDevice::MeshDevice:

diff --git a/tt_metal/distributed/mesh_device.cpp b/tt_metal/distributed/mesh_device.cpp
index 1ce83e3c72..a5dfb03dd5 100644
--- a/tt_metal/distributed/mesh_device.cpp
+++ b/tt_metal/distributed/mesh_device.cpp
@@ -129,6 +129,8 @@ MeshDevice::MeshDevice(
     mesh_id_(generate_unique_mesh_id()),
     parent_mesh_(std::move(parent_mesh)) {
     work_executor_ = std::make_unique<WorkExecutor>(0 /* worker_core */, mesh_id_);
+    work_executor_->initialize();
+    work_executor_->set_worker_mode(WorkExecutorMode::SYNCHRONOUS);
 }

Then we see proper values when MeshDevice constructor finishes, and the same during teardown and whatever issue was exposed in stop_worker() with WorkerState::TERMINATE is avoided.

                  Metal | INFO     | KCM MeshDevice::create() - starting
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 1 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create() - calling initialize
                  Metal | INFO     | KCM MeshDevice::initialize() - mesh_id: 1 set worker_mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 2 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 2
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 3 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 3
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 4 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 4
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 5 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 5
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 6 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 6
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 7 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 7
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 8 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 8
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 9 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 9
                  Metal | INFO     | KCM MeshDevice::MeshDevice() - mesh_id: 10 created work_executor_ with mode: WorkExecutorMode::SYNCHRONOUS
                  Metal | INFO     | KCM MeshDevice::create_submesh() - created submesh with mesh_id: 10
@kmabeeTT kmabeeTT added bug Something isn't working forge labels Jan 26, 2025
@kmabeeTT kmabeeTT added the P0 label Jan 26, 2025
kmabeeTT added a commit that referenced this issue Jan 26, 2025
…OUS in MeshDevice ctor

 - Solves uninitialized work_executor from MeshDevice::create_submesh()
   calls leading to ND values at runtime and segfault at
   MeshDevice::close() in tt-forge-fe due to tt-metal/2c2110c778
kmabeeTT added a commit that referenced this issue Jan 27, 2025
…OUS in MeshDevice ctor

 - Solves uninitialized work_executor from MeshDevice::create_submesh()
   calls leading to ND values at runtime and segfault at
   MeshDevice::close() in tt-forge-fe due to tt-metal/2c2110c778
@kmabeeTT
Copy link
Contributor Author

I merged the PR that solved the issue exposed by tt-forge-fe regression, but leaving this bug open for @cfjchu to hopefully add a test to fill coverage hole.

williamlyTT pushed a commit that referenced this issue Jan 30, 2025
…OUS in MeshDevice ctor

 - Solves uninitialized work_executor from MeshDevice::create_submesh()
   calls leading to ND values at runtime and segfault at
   MeshDevice::close() in tt-forge-fe due to tt-metal/2c2110c778
yieldthought pushed a commit that referenced this issue Jan 31, 2025
…OUS in MeshDevice ctor

 - Solves uninitialized work_executor from MeshDevice::create_submesh()
   calls leading to ND values at runtime and segfault at
   MeshDevice::close() in tt-forge-fe due to tt-metal/2c2110c778
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working forge P0
Projects
None yet
Development

No branches or pull requests

2 participants