Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1395 Fix GreedyLB Bug #1470

Merged
merged 8 commits into from
Aug 23, 2021
Merged

1395 Fix GreedyLB Bug #1470

merged 8 commits into from
Aug 23, 2021

Conversation

lifflander
Copy link
Collaborator

@lifflander lifflander commented Jun 10, 2021

Fixes #1395

  • I've started by trying to identify where the bug is, and it seems anecdotally to be in the scatter.
  • I determined this by implementing alternative strategies to the scatter (pt2pt send, and broadcast), which both seem to "fix" the problem. After 1000 runs it doesn't occur with both the send and broadcast, whereas before it always triggered with that many runs when scatter is enabled.
  • The data indicates some bad data that occasionally shows up when scatter is enabled. However, even when it fails, address sanitizer is clean, so it's not memory corruption/overwriting
vt: [1] (t) lb: migrateObjectTo, obj_id=25770721283, home=-24827, from=1, to=25687, found=false
vt: [1] (t) lb: transferMigrations: obj_id=(25770721283,-24827,1), to_node=25687

See how the home is corruption (negative value), which is never allowed.

  • I think I have identified the edge case, which is when the scatter has zero elements in it. This might be what is causing the problem. I'm writing a new scatter unit test to provide coverage over this devolved use case.

  • It turns out that the bug is unrelated to the scatter implementation which works perfectly fine. But the zero element edge case is the one causing problems due to a bug in the computed max_recs when no data exists. I have solved the bug, and run ctest/gtest 1000s of times with the fix and it never fails anymore with this fix.

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (gcc-5, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (clang-5.0, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (clang-3.9, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (gcc-9, ubuntu, mpich, zoltan)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (gcc-6, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (nvidia cuda 11.0, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (nvidia cuda 10.1, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@codecov
Copy link

codecov bot commented Jun 10, 2021

Codecov Report

Merging #1470 (79efa85) into develop (cf4bd16) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1470      +/-   ##
===========================================
- Coverage    83.00%   82.99%   -0.01%     
===========================================
  Files          783      783              
  Lines        29465    29505      +40     
===========================================
+ Hits         24457    24488      +31     
- Misses        5008     5017       +9     
Impacted Files Coverage Δ
src/vt/vrt/collection/balance/greedylb/greedylb.cc 97.63% <100.00%> (+0.26%) ⬆️
src/vt/vrt/collection/balance/greedylb/greedylb.h 100.00% <100.00%> (ø)
...vt/vrt/collection/balance/greedylb/greedylb_msgs.h 100.00% <100.00%> (ø)
tests/unit/collection/test_lb.extended.cc 89.06% <100.00%> (+0.44%) ⬆️
src/vt/topos/location/location.impl.h 87.42% <0.00%> (-5.83%) ⬇️
src/vt/vrt/collection/manager.impl.h 96.46% <0.00%> (+0.88%) ⬆️
src/vt/vrt/collection/manager.h 100.00% <0.00%> (+27.27%) ⬆️

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (gcc-7, ubuntu, mpich, trace runtime, LB)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (intel 18.03, ubuntu, mpich)

Build for 79efa85

Build failed for unknown reason. Check build logs


Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (clang-9, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (intel 19, ubuntu, mpich)

Build for 79efa85

Build failed for unknown reason. Check build logs


Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (clang-10, ubuntu, mpich)

Build for 79efa85

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Jun 10, 2021

PR tests (clang-10, alpine, mpich)

Build for 79efa85



The following tests FAILED:
  440 - vt:TestMpiAccessGuardDeathTest.test_mpi_access_prevented_proc_2 (Failed)

Build log

#define BROADCAST 0

#if SCATTER
std::size_t max_recs = 1, max_bytes = 0;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual bug fix. max_recs must be at least 1 due to how the data is read (the first record is meta-data that contains the number of records)

@lifflander lifflander marked this pull request as ready for review June 10, 2021 22:20
@PhilMiller
Copy link
Member

Could you trim this down to just the bug fix? The alternative communication schemes should only be added if they serve an independent purpose.

@nlslatt
Copy link
Collaborator

nlslatt commented Jun 14, 2021

Could you trim this down to just the bug fix? The alternative communication schemes should only be added if they serve an independent purpose.

I think I agree with Phil here. Do you forsee using the other options?

@lifflander lifflander force-pushed the 1395-fix-greedylb-bug branch from 44bd509 to 55d6f79 Compare June 29, 2021 23:43
@lifflander
Copy link
Collaborator Author

@nlslatt Please re-review this!

@lifflander lifflander requested a review from nlslatt July 1, 2021 20:24
Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to provide a hash function for the enum now that you're using the LBArgsEnumConverter. After that fix and one recommended by Codecy, this should be ready to go.

@lifflander lifflander force-pushed the 1395-fix-greedylb-bug branch from fbda3c1 to 79efa85 Compare August 23, 2021 18:24
@lifflander lifflander requested a review from nlslatt August 23, 2021 18:25
Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@lifflander lifflander merged commit 7a596d3 into develop Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_load_balancer_keep_last_elm test fails occasionally
3 participants