vt attempts migration to self #503

Closed
nlslatt opened this issue Oct 15, 2019 · 4 comments

@nlslatt
Collaborator

nlslatt commented Oct 15, 2019

Describe the bug
With HierarchicalLB, I've seen vt attempt to migrate a collection element from one node to the very same node. This shows up in debug output, e.g.:

vt: [1] vcc: migrateOut: col_proxy=7, this_node=1, dest=1, idx=[1,2]

vt has a check for this, but it is disabled by a preprocessor macro, so the error is not caught by default. When the check is enabled at compile time, it does detect the error. vt's stack dump is below:

vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1] -------------------------------------------- Dump Stack Backtrace on Node 1 --------------------------------------------
vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1] 0   18  0x7dbc76f     vt::debug::stack::dumpStack[abi:cxx11](int) + 56
vt: [1] 1   18  0x7dba25e     vt::runtime::Runtime::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool) + 1278
vt: [1] 2   18  0x7ca38a2     vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 198
vt: [1] 3   18  0x7ca2b4f     vt::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 114
vt: [1] 4   18  0x6bbd1ff     std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::tuple<>&&) + 273
vt: [1] 5   18  0x6d5c49c     vt::vrt::collection::MigrateStatus vt::vrt::collection::CollectionManager::migrateOut<empire::pic::ParticleListsVTBackend<empire::MeshTraits<shards::Tetrahedron<4u>, 1> >, vt::index::DenseIndexArray<int, (signed char)2> >(unsigned long const&, vt::index::DenseIndexArray<int, (signed char)2> const&, short const&) + 2116
vt: [1] 6   18  0x6d00de0     vt::vrt::collection::CollectionManager::migrate<empire::pic::ParticleListsVTBackend<empire::MeshTraits<shards::Tetrahedron<4u>, 1> > >(vt::vrt::collection::VrtElmProxy<empire::pic::ParticleListsVTBackend<empire::MeshTraits<shards::Tetrahedron<4u>, 1> >, empire::pic::ParticleListsVTBackend<empire::MeshTraits<shards::Tetrahedron<4u>, 1> >::IndexType>, short const&)::{lambda()#1}::operator()() const + 78
vt: [1] 7   18  0x6dde8da     std::_Function_handler<void (), vt::vrt::collection::CollectionManager::migrate<empire::pic::ParticleListsVTBackend<empire::MeshTraits<shards::Tetrahedron<4u>, 1> > >(vt::vrt::collection::VrtElmProxy<empire::pic::ParticleListsVTBackend<empire::MeshTraits<shards::Tetrahedron<4u>, 1> >, empire::pic::ParticleListsVTBackend<empire::MeshTraits<shards::Tetrahedron<4u>, 1> >::IndexType>, short const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 32
vt: [1] 8   18  0x6bc005c     std::function<void ()>::operator()() const + 50
vt: [1] 9   18  0x7dab4a0     bool vt::vrt::collection::CollectionManager::scheduler<void>() + 114
vt: [1] 10  18  0x7da9e9d     vt::sched::Scheduler::schedulerImpl() + 93
vt: [1] 11  18  0x7da9f7d     vt::sched::Scheduler::scheduler() + 39
vt: [1] 12  18  0x7daa845     vt::runScheduler() + 17
vt: [1] 13  18  0x7ca58b9     vt::collective::barrier::Barrier::waitBarrier(std::function<void ()>, unsigned long const&, bool) + 235
vt: [1] 14  18  0x73d14bd     vt::collective::barrier::Barrier::barrier(std::function<void ()>, unsigned long const&) + 65
vt: [1] 15  18  0x73cd52b     RuntimeManager::loadBalance() + 659
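
For readers unfamiliar with this pattern, here is a minimal sketch of a self-migration check gated by a preprocessor macro. It is not vt's code: the macro name `EXAMPLE_CHECK_SELF_MIGRATION`, the helper `isSelfMigration`, and the routine `migrateOutSketch` are all hypothetical, and the only detail taken from the backtrace is that the destination node is passed as a `short`.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical macro name for illustration; the report does not name vt's macro.
#ifndef EXAMPLE_CHECK_SELF_MIGRATION
#define EXAMPLE_CHECK_SELF_MIGRATION 0  // off by default, matching the report
#endif

using NodeType = std::int16_t;  // assumption: node IDs fit in a short

// True when the destination is the node that already owns the element.
inline bool isSelfMigration(NodeType this_node, NodeType dest) {
  return this_node == dest;
}

// Simplified stand-in for a migrate-out routine (hypothetical).
// Returns true if the migration would proceed.
inline bool migrateOutSketch(NodeType this_node, NodeType dest) {
#if EXAMPLE_CHECK_SELF_MIGRATION
  // Only compiled in when the macro is set to a nonzero value.
  if (isSelfMigration(this_node, dest)) {
    throw std::logic_error(
      "migration to self on node " + std::to_string(this_node));
  }
#endif
  (void)this_node;  // silence unused-parameter warnings when the check is off
  (void)dest;
  return true;
}
```

With the macro left at its default of 0, `migrateOutSketch(1, 1)` proceeds silently, which mirrors the behavior described above; building with `-DEXAMPLE_CHECK_SELF_MIGRATION=1` turns the same call into an error.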

To Reproduce
So far, I've only reached this state using bdot, which is not part of the vt test suite. I was running on Kahuna with 8 processes and 8 colors. I was using vt's develop branch at commit b9ec92a.

@nlslatt
Collaborator Author

nlslatt commented Oct 15, 2019

@nlslatt
Collaborator Author

nlslatt commented Oct 15, 2019

@lifflander, you can close this if you think it's correct behavior.

@PhilMiller
Member

In production execution, such a migration would be a performance bug, while it's useful in debugging settings (#476 / #430). I don't think we want this behavior out of arbitrary LB strategies, though, only out of ones intended for testing purposes, so I think it's a bug in HierarchicalLB that it generates such migration instructions rather than filtering them out.
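
As an illustration of the filtering suggested here, a load balancer could drop self-migrations before issuing its instructions. The sketch below is an assumption about how that might look, not HierarchicalLB's actual code: the `Migration` struct, its fields, and `dropSelfMigrations` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using NodeType = std::int16_t;  // assumption: node IDs fit in a short

// Hypothetical migration instruction; not vt's actual type.
struct Migration {
  std::uint64_t obj_id;  // identifier of the object to move
  NodeType dest;         // node the LB strategy wants it moved to
};

// Return the instructions with every migration whose destination is the
// current node removed. The relative order of the remaining entries is kept.
std::vector<Migration> dropSelfMigrations(
  std::vector<Migration> migrations, NodeType this_node
) {
  migrations.erase(
    std::remove_if(
      migrations.begin(), migrations.end(),
      [this_node](Migration const& m) { return m.dest == this_node; }
    ),
    migrations.end()
  );
  return migrations;
}
```

A strategy intended for testing could skip this step deliberately, while production strategies would apply it to every node's instruction list.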

@lifflander
Collaborator

This can no longer happen given the current logic.
