Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1279 gossiplb inform and decide have bugs #1348

Merged
merged 30 commits into from
Apr 9, 2021

Conversation

nlslatt
Copy link
Collaborator

@nlslatt nlslatt commented Mar 26, 2021

Includes many bug fixes and improvements to GossipLB. The following LB args were added:

  • trials (default 3): how many times to repeat the requested number of iterations, hoping to find a better imbalance; helps if it’s easy to get stuck in a local minimum
  • deterministic (default false): for debugging purposes, make the migration decision deterministic assuming deterministic loads (for testing on deterministic loads, consider using the driver in development under Implement general test driver for LB testing #1265)
  • inform (default SyncInform): choice of gossiping approach
    • SyncInform (0): synchronous propagates after all recvs for a round, but has sync cost (matches LBAF approach)
    • AsyncInform (1): asynchronous propagates after the first recv for a round, but avoids sync cost
  • ordering (default Marginal): order in which to evaluate local objects for migration
    • Arbitrary (0): use the unordered_map iteration order
    • ElmID (1): order by ascending element ID
    • Marginal (2): order by descending load starting with the object of marginal load, then order ascending for larger loads
  • cmf (default NormByMax): the algorithm used for computing the CMF
    • Original (0): remove processors from the CMF as soon as they exceed the target (e.g., processor-average) load; use a CMF factor of 1.0/x, where x is the target load
    • NormByMax (1): do not remove processors from the CMF that exceed the target load until the next iteration; use a CMF factor of 1.0x, where x is the maximum of the target load and the most loaded processor in the CMF
    • NormBySelf (2): do not remove processors from the CMF that exceed the target load until the next iteration; use a CMF factor of 1.0x, where x is the load of the processor that is computing the CMF
  • rollback (default true): whether to roll back to an earlier iteration if it had the best imbalance
  • targetpole (default false): whether to replace the processor-average load with the max of that and the maximum object load, effectively redefining overloaded/underloaded based on the longest pole load when it exceeds the processor-average load

Note that I have a both a develop-based branch (this PR; 1279-gossiplb-inform-and-decide-have-bugs) and a release-based branch (1279-gossiplb-inform-and-decide-have-bugs-release).

Closes #1279

@nlslatt nlslatt marked this pull request as ready for review March 26, 2021 19:19
@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (gcc-9, ubuntu, mpich, zoltan)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (gcc-6, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (clang-3.9, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (nvidia cuda 11.0, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (gcc-5, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (clang-8, alpine, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (nvidia cuda 10.1, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (clang-5.0, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (intel 18.03, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (intel 19, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Mar 26, 2021

PR tests (gcc-7, ubuntu, mpich, trace runtime, LB)

Build for 242282c

Compilation - successful

Testing - passed

Build log

Copy link
Collaborator

@lifflander lifflander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think it's fine if we merge this now. For develop, we could improve the documentation better (doxygen comments) and need to fix the prints since that API was refactored last week.

int n_transfers_ = 0;
};

struct GossipRejectionStatsMsg : collective::ReduceTMsg<RejectionStats> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lifflander Would it be possible to avoid the need to define types like this, by just passing through constructor arguments to the encapsulated type T in ReduceTMsg<T>?

@nlslatt nlslatt force-pushed the 1279-gossiplb-inform-and-decide-have-bugs branch from c8b5260 to 242282c Compare April 9, 2021 18:51
@github-actions
Copy link

github-actions bot commented Apr 9, 2021

PR tests (clang-10, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@github-actions
Copy link

github-actions bot commented Apr 9, 2021

PR tests (clang-9, ubuntu, mpich)

Build for 242282c

Compilation - successful

Testing - passed

Build log

@codecov
Copy link

codecov bot commented Apr 9, 2021

Codecov Report

Merging #1348 (242282c) into develop (cf9e1b3) will increase coverage by 0.09%.
The diff coverage is 68.31%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1348      +/-   ##
===========================================
+ Coverage    82.50%   82.60%   +0.09%     
===========================================
  Files          759      759              
  Lines        28279    28506     +227     
===========================================
+ Hits         23333    23546     +213     
- Misses        4946     4960      +14     
Impacted Files Coverage Δ
...rc/vt/vrt/collection/balance/gossiplb/gossip_msg.h 56.81% <0.00%> (+56.81%) ⬆️
src/vt/vrt/collection/balance/gossiplb/gossiplb.h 100.00% <ø> (ø)
src/vt/vrt/collection/balance/gossiplb/gossiplb.cc 71.49% <72.88%> (+18.91%) ⬆️
src/vt/topos/location/location.impl.h 90.49% <0.00%> (-3.69%) ⬇️
src/vt/collective/barrier/barrier.cc 76.13% <0.00%> (+2.27%) ⬆️
src/vt/vrt/collection/balance/gossiplb/criterion.h 56.25% <0.00%> (+56.25%) ⬆️

@nlslatt nlslatt merged commit dee7a1a into develop Apr 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GossipLB inform and decide have bugs
3 participants