#2098: runtime: add code to generate node IDs #2109

Merged
merged 8 commits into from
Mar 28, 2023


lifflander
Collaborator

Fixes #2098

@lifflander lifflander linked an issue Mar 16, 2023 that may be closed by this pull request
@lifflander lifflander force-pushed the 2098-add-physical-node-to-lb-data-file branch from 3076672 to 5b1a93c Compare March 16, 2023 22:22
@lifflander
Collaborator Author

lifflander commented Mar 16, 2023

I think this is a much better approach than calling MPI_Get_processor_name and then either running a potentially expensive MPI_Allgather on a bunch of strings or sorting a bunch of strings. It does rely on the MPI 3 standard.
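
The comment above does not spell out the mechanism. As a rough sketch only, assuming the MPI 3 feature in question is a shared-memory communicator split (`MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`, which is an assumption and not confirmed in this thread), the bookkeeping reduces to integers rather than host-name strings. The plain C++ model below makes no MPI calls; `PhysicalNodeInfo`, `assignNodeIDs`, and `domain_of_rank` are hypothetical names used for illustration, and numbering nodes by their lowest global rank is likewise an assumed convention.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Per-rank result, mirroring the four quantities discussed in this PR.
struct PhysicalNodeInfo {
  int node_id = -1;    // index of the node hosting this rank
  int num_nodes = -1;  // number of distinct nodes in the job
  int node_size = -1;  // number of ranks on this rank's node
  int node_rank = -1;  // this rank's index within its node
};

// Hypothetical model: domain_of_rank[r] is an opaque integer naming the
// shared-memory domain of global rank r (a stand-in for whatever grouping
// the MPI library would decide). Nodes are numbered in order of their
// lowest global rank (an assumed convention).
std::vector<PhysicalNodeInfo>
assignNodeIDs(std::vector<int> const& domain_of_rank) {
  std::unordered_map<int, int> id_of_domain;  // domain key -> node ID
  std::vector<int> size_of_node;              // node ID -> ranks seen so far
  std::vector<PhysicalNodeInfo> out(domain_of_rank.size());

  for (std::size_t rank = 0; rank < domain_of_rank.size(); rank++) {
    auto it = id_of_domain.find(domain_of_rank[rank]);
    if (it == id_of_domain.end()) {
      // First rank seen in this domain: it defines the next node ID.
      int const new_id = static_cast<int>(size_of_node.size());
      it = id_of_domain.emplace(domain_of_rank[rank], new_id).first;
      size_of_node.push_back(0);
    }
    int const id = it->second;
    out[rank].node_id = id;
    out[rank].node_rank = size_of_node[id]++;  // position within the node
  }

  // Totals are only known after every rank has been visited.
  for (auto& info : out) {
    info.num_nodes = static_cast<int>(size_of_node.size());
    info.node_size = size_of_node[info.node_id];
  }
  return out;
}
```

In a real MPI program each rank would compute only its own entry through collective calls; the model computes all entries at once so the numbering can be checked on one process.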

Member

@PhilMiller PhilMiller left a comment


This looks decent, though it would still be nice to have something like the host name, to be able to identify slow nodes.

src/vt/runtime/runtime.cc
@github-actions

github-actions bot commented Mar 16, 2023

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-9, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-7, ubuntu, mpich, trace runtime, LB)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-11, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-10, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 11.0, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-12, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich, json schema test)

Build for 0968638 (2023-03-27 22:41:20 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpc, ubuntu, mpich)

Build for 0968638 (2023-03-27 22:41:20 UTC)

icpc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
intel-cc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
==> And there is more. Read log. <==

Build log


PR tests (nvidia cuda 11.2, ubuntu, mpich)

Build for ( UTC)

Compilation - successful

Testing - passed

Build log


@nlslatt
Collaborator

nlslatt commented Mar 21, 2023

@lifflander I can confirm that this worked in the following configurations:

(a) 48 ranks per node using 8 nodes, where the first node gets ranks 0-47, the second gets ranks 48-95, etc.;
(b) similar to (a) but where I limit the total number of processes so that there are only 48*8-2 of them (which puts only 46 on node 7); and
(c) similar to (b) but using --bynode, where the ranks get assigned round robin except that node 7 still only gets 46 ranks (confirmed by sshing into node 7 and counting the running processes).
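
The rank counts in configurations (a) and (b) can be checked with a small sketch, assuming ranks are placed in contiguous blocks of 48 per node as described above. `blockPlacementCounts` is a hypothetical helper, not part of this PR. Configuration (c) is deliberately not modeled, because the launcher's exact round-robin fill order is not given in this thread.

```cpp
#include <stdexcept>
#include <vector>

// Number of ranks on each node when num_ranks are placed in contiguous
// blocks of slots_per_node (rank r lands on node r / slots_per_node).
std::vector<int>
blockPlacementCounts(int num_ranks, int num_nodes, int slots_per_node) {
  if (num_ranks < 0 || num_nodes <= 0 || slots_per_node <= 0 ||
      num_ranks > num_nodes * slots_per_node) {
    throw std::invalid_argument("ranks do not fit on the given nodes");
  }
  std::vector<int> counts(num_nodes, 0);
  for (int rank = 0; rank < num_ranks; rank++) {
    counts[rank / slots_per_node]++;
  }
  return counts;
}
```

With `blockPlacementCounts(48 * 8 - 2, 8, 48)`, nodes 0 through 6 each hold 48 ranks and node 7 holds the remaining 46, matching the count reported for (b).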

@lifflander lifflander marked this pull request as ready for review March 21, 2023 22:32
@lifflander lifflander force-pushed the 2098-add-physical-node-to-lb-data-file branch from 5b1a93c to f33bf5c Compare March 21, 2023 22:33
@lifflander lifflander requested a review from nlslatt March 21, 2023 22:55
Collaborator

@nlslatt nlslatt left a comment


The schema validator is failing in one of the pipelines. The switch from a simple string schema type to hard-coded JSON snippets in several places (including tests) feels hackish. Would it really be that hard to fix the interface so that we can use the JSON utility to write all the JSON?

src/vt/vrt/collection/balance/node_lb_data.cc Outdated
@lifflander
Collaborator Author

Looks great to me!

@lifflander lifflander force-pushed the 2098-add-physical-node-to-lb-data-file branch from 93ec3f1 to 9edba6e Compare March 23, 2023 16:51
@ppebay
Contributor

ppebay commented Mar 24, 2023

As a side note, this branch's JSON schema validator is passing for both the "toy user defined" problem:

[LBAF_app] Executing LBAF version 0.1.0rc1
[LBAF_app] Executing with Python 3.8.16
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.0.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.1.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.2.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.3.json VT object map
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.1.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.2.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.3.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-toy-problem/toy_mem.0.json

as well as for the "challenging problem with fewer tasks":

[LBAF_app] Executing LBAF version 0.1.0rc1
[LBAF_app] Executing with Python 3.8.16
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.0.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.1.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.2.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.3.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.4.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.5.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.6.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.7.json VT object map
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.0.json
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.8.json VT object map
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.2.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.1.json
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.9.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.10.json VT object map
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.3.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.7.json
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.11.json VT object map
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.6.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.5.json
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.12.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.13.json VT object map
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.4.json
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.14.json VT object map
[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.15.json VT object map
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.8.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.11.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.10.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.9.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.13.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.15.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.12.json
[lbsVTDataReader] Valid JSON schema in /Users/pppebay/Documents/Git/LB-analysis-framework/data/challenging_toy_fewer_tasks/toy.14.json

I would like to suggest @nlslatt that the version of the validator also be reported in the logger output

Member

@PhilMiller PhilMiller left a comment


I'm happy with the node identification code. Maybe Philippe should review the json bits.

int physical_node_id = -1;
int physical_num_nodes = -1;
int physical_node_size = -1;
int physical_node_rank = -1;
Member


Should these all be suffixed with underscores?

@nlslatt nlslatt force-pushed the 2098-add-physical-node-to-lb-data-file branch from 96607e3 to 0968638 Compare March 27, 2023 22:41
@nlslatt
Collaborator

nlslatt commented Mar 27, 2023

I would like to suggest @nlslatt that the version of the validator also be reported in the logger output

As I don't think we have any version number for this right now, let's defer this to another issue.

Collaborator

@nlslatt nlslatt left a comment


I confirmed that the code to analyze the physical nodes works and that the resulting JSON is as expected.

@nlslatt nlslatt merged commit ce79d08 into develop Mar 28, 2023
Successfully merging this pull request may close these issues.

Add physical node to LB data file
4 participants