MTMG multi node #3932
Conversation
cpp/CMakeLists.txt
src/detail/shuffle_vertex_pairs.cu
src/detail/shuffle_vertex_pairs_int32_int32.cu
src/detail/shuffle_vertex_pairs_int32_int64.cu
src/detail/shuffle_vertex_pairs_int64_int64.cu
What's the rationale behind this? Is this to cut compile time?
We should be more consistent on this to keep our codebase maintainable.
Yes, this was to cut compile time. It was the one file that was constantly being recompiled while I was testing this, and it took a long time.
I have been splitting these files up as I come across them. I think ultimately we need to re-evaluate how we are doing type dispatching so that we can generally improve compile time.
Yeah... we should re-evaluate this, but AFAIK this isn't the file with the longest compile time, and just splitting this one file looks a bit odd. No problem doing this for local development, but pushing this change to the main repo (without really agreeing on how to tackle the compile-time issue) sounds a bit premature.
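For reference, the split being discussed follows the usual explicit-instantiation pattern: the shared template definition moves into a .cuh include, and each new .cu translation unit instantiates one vertex/edge type combination so the combinations compile (and recompile) independently. A minimal illustrative sketch, not the actual cuGraph sources:
// shuffle_vertex_pairs_impl.cuh: shared template definition, included by each split unit
#include <cstdint>
#include <vector>

template <typename vertex_t, typename edge_t>
std::vector<vertex_t> shuffle_vertex_pairs(std::vector<vertex_t> const& majors,
                                           std::vector<vertex_t> const& minors)
{
  // ... the expensive-to-compile implementation lives here ...
  return majors;
}

// shuffle_vertex_pairs_int32_int32.cu: one explicit instantiation per translation unit
template std::vector<int32_t> shuffle_vertex_pairs<int32_t, int32_t>(
  std::vector<int32_t> const&, std::vector<int32_t> const&);

// shuffle_vertex_pairs_int64_int64.cu would hold the int64_t instantiation, and so on.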
@@ -70,6 +70,8 @@ class resource_manager_t {
  {
    std::lock_guard<std::mutex> lock(lock_);

    CUGRAPH_EXPECTS(remote_rank_set_.find(rank) == remote_rank_set_.end(),
Just to double check, is rank still local? I assume remote_rank_set_ stores global ranks, and can we query remote_rank_set_ with a local rank?
Or, from now on, rank is global and device_id might serve as the local GPU rank?
In my implementation here (trying to somewhat mimic MPI), rank is global and device_id is local.
Yeah.. in that case, we need to review the documentation.
/**
* @brief add a local GPU to the resource manager.
*
* @param rank The rank to assign to the local GPU
* @param device_id The device_id corresponding to this rank
*/
It is not easy to figure out that rank is global and device_id is local just from the documentation.
Updated the parameter names and documentation to be global_rank and local_device_id. Added a small improvement in the class documentation to identify the global rank concept.
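For illustration, the revised documentation might read along these lines; the function name and parameter types below are assumptions, not necessarily what the PR contains:
#include <rmm/cuda_device.hpp>

/**
 * @brief Add a local GPU to the resource manager.
 *
 * @param global_rank     The global rank (unique across all nodes, MPI-style)
 *                        to assign to this local GPU
 * @param local_device_id The CUDA device id on this node that backs
 *                        @p global_rank
 */
void register_local_gpu(int global_rank, rmm::cuda_device_id local_device_id);  // hypothetical signature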
} else {
  local_ranks_to_include.push_back(rank);
}
});
I think using std::copy_if with std::back_inserter better documents the intention.
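A minimal sketch of the suggested form; the surrounding names are assumptions based on the quoted diff, not the PR's actual code:
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Illustrative only: copy_if + back_inserter states the intent (keep every rank that
// is not in the remote set) without an explicit push_back branch.
std::vector<int> select_local_ranks(std::vector<int> const& ranks,
                                    std::set<int> const& remote_rank_set)
{
  std::vector<int> local_ranks_to_include;
  std::copy_if(ranks.begin(),
               ranks.end(),
               std::back_inserter(local_ranks_to_include),
               [&remote_rank_set](int rank) { return remote_rank_set.count(rank) == 0; });
  return local_ranks_to_include;
}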
I will look at this change.
Updated in next push.
//
// This is intended to mimic a multi-node host application (non-MPI) integrating
// with MTMG library. This is a simple implementation using a shared file system
// to pass configuration messages. Terribly inefficient, but should mimic
// expected behavior.
//
What do you think about this vs. using MPI but creating just one process per node? We are already using MPI for testing, so the latter won't require a new execution environment. The former assumes the availability of some sort of NFS, and users need to set up the directory path correctly. This is not necessary if we just re-use the existing MPI test environment and create one process per node.
I was trying to avoid depending on MPI since we want the interface to work without MPI.
We can certainly create an MPI-based test that doesn't use MPI to initialize the comms. That would be a bit easier to use. I'd be concerned that we might come to rely on MPI in some subtle way.
I was thinking more of updating
#define CUGRAPH_MG_TEST_PROGRAM_MAIN() \
int main(int argc, char** argv) \
{ \
cugraph::test::initialize_mpi(argc, argv); \
auto comm_rank = cugraph::test::query_mpi_comm_world_rank(); \
auto comm_size = cugraph::test::query_mpi_comm_world_size(); \
int num_gpus_per_node{}; \
RAFT_CUDA_TRY(cudaGetDeviceCount(&num_gpus_per_node)); \
RAFT_CUDA_TRY(cudaSetDevice(comm_rank % num_gpus_per_node)); \
::testing::InitGoogleTest(&argc, argv); \
auto const cmd_opts = parse_test_options(argc, argv); \
auto const rmm_mode = cmd_opts["rmm_mode"].as<std::string>(); \
auto resource = cugraph::test::create_memory_resource(rmm_mode); \
rmm::mr::set_current_device_resource(resource.get()); \
cugraph::test::g_perf = cmd_opts["perf"].as<bool>(); \
cugraph::test::g_rmat_scale = \
(cmd_opts.count("rmat_scale") > 0) \
? std::make_optional<size_t>(cmd_opts["rmat_scale"].as<size_t>()) \
: std::nullopt; \
cugraph::test::g_rmat_edge_factor = \
(cmd_opts.count("rmat_edge_factor") > 0) \
? std::make_optional<size_t>(cmd_opts["rmat_edge_factor"].as<size_t>()) \
: std::nullopt; \
cugraph::test::g_test_file_name = \
(cmd_opts.count("test_file_name") > 0) \
? std::make_optional<std::string>(cmd_opts["test_file_name"].as<std::string>()) \
: std::nullopt; \
\
auto ret = RUN_ALL_TESTS(); \
cugraph::test::finalize_mpi(); \
return ret; \
}
and create CUGRAPH_MTMG_TEST_PROGRAM_MAIN().
We can just broadcast the NCCL unique ID here, and we won't touch anything related to MPI after this (we can even call cugraph::test::finalize_mpi() right after broadcasting the NCCL unique ID).
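Roughly, the suggestion amounts to something like the sketch below. This is not the macro that was ultimately merged; it assumes the same test utility headers as CUGRAPH_MG_TEST_PROGRAM_MAIN plus <mpi.h> and <nccl.h>, and g_nccl_unique_id is the hypothetical test-only global discussed later in this thread. The command-line/RMM setup from CUGRAPH_MG_TEST_PROGRAM_MAIN is omitted for brevity.
#define CUGRAPH_MTMG_TEST_PROGRAM_MAIN()                                              \
  int main(int argc, char** argv)                                                     \
  {                                                                                   \
    cugraph::test::initialize_mpi(argc, argv);                                        \
    auto comm_rank = cugraph::test::query_mpi_comm_world_rank();                      \
    ncclUniqueId nccl_id{};                                                           \
    if (comm_rank == 0) RAFT_NCCL_TRY(ncclGetUniqueId(&nccl_id));                     \
    RAFT_MPI_TRY(MPI_Bcast(&nccl_id, sizeof(nccl_id), MPI_CHAR, 0, MPI_COMM_WORLD));  \
    cugraph::test::g_nccl_unique_id = nccl_id; /* hypothetical test-only global */    \
    cugraph::test::finalize_mpi(); /* no MPI usage past this point */                 \
    ::testing::InitGoogleTest(&argc, argv);                                           \
    return RUN_ALL_TESTS();                                                           \
  }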
Got the MPI configuration working in the next push. I did not abstract it into a reusable function/macro; I can look at that when we add a second test (perhaps in a PR adding Louvain or Jaccard).
LGTM in general.
I added a few comments for potential code improvements (better hiding the NCCL & MPI dependencies).
@@ -89,7 +100,6 @@ class instance_manager_t {
  std::vector<std::unique_ptr<raft::handle_t>> raft_handle_{};
  std::vector<std::unique_ptr<ncclComm_t>> nccl_comms_{};
Why are we directly managing NCCL? Shouldn't this be an implementation detail of raft?
We can look at updating raft for this. I'll talk with Corey.
if (g_node_rank == 0) RAFT_NCCL_TRY(ncclGetUniqueId(&instance_manager_id));

RAFT_MPI_TRY(
  MPI_Bcast(&instance_manager_id, sizeof(instance_manager_id), MPI_CHAR, 0, MPI_COMM_WORLD));
If we want to hide the existence of MPI in individual tests, we may add
// valid only for MTMG tests, updated by CUGRAPH_MTMG_TEST_PROGRAM_MAIN()
std::optional<ncclUniqueId> g_nccl_unique_id{std::nullopt};
after
// these variables are updated by command line arguments
static bool g_perf{false};
static std::optional<size_t> g_rmat_scale{std::nullopt};
static std::optional<size_t> g_rmat_edge_factor{std::nullopt};
static std::optional<std::string> g_test_file_name{std::nullopt};
https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/tests/utilities/base_fixture.hpp#L119
A bit annoying to add a global variable (only valid for MTMG tests) to all C++ tests, but I think this is OK as this is only for testing and the variable is std::optional (so having invalid values for non-MTMG tests is somewhat acceptable).
Then, we can expose our dependency on MPI only within CUGRAPH_MTMG_TEST_PROGRAM_MAIN(), and we can finalize MPI before invoking individual tests.
I was planning on visiting this issue in depth when I add Louvain. I was also thinking of adding some test methods to hide some of these things.
I tried creating g_nccl_unique_id as a global and initializing it in the main program. As currently constructed, each call to create_instance_manager needs its own unique id, so I couldn't reuse a single g_nccl_unique_id. Each test could potentially use a different GPU configuration (something I eventually want to add), so I needed to move the creation of the NCCL unique id back into the test.
What I'm imagining is creating an MTMG test function that will create and broadcast the NCCL unique id, so that the use of MPI is restricted there. It does require that we not finalize MPI before we run the tests. But if we capture this in helper methods, it should be easy to identify whether we are using MPI calls that could be a problem within one of the MTMG tests.
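A sketch of the kind of helper being described; the function name and namespace are assumptions, and error checking (e.g. the RAFT_*_TRY macros used elsewhere in the tests) is omitted for brevity:
#include <mpi.h>
#include <nccl.h>

namespace cugraph::test {

// Hypothetical helper: each MTMG test calls this to obtain a fresh NCCL unique id,
// keeping all MPI usage confined to this one function.  MPI therefore has to stay
// initialized until every test has run.
inline ncclUniqueId create_and_broadcast_nccl_unique_id()
{
  int comm_rank{};
  MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
  ncclUniqueId id{};
  if (comm_rank == 0) { ncclGetUniqueId(&id); }
  MPI_Bcast(&id, sizeof(id), MPI_CHAR, 0, MPI_COMM_WORLD);
  return id;
}

}  // namespace cugraph::test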
#include <gtest/gtest.h>

#include <nccl.h>
And I am not sure exposing NCCL in cuGraph is a good thing. Up to this point, we have largely been hiding the existence of NCCL under raft; now we are explicitly dependent on NCCL. Hiding NCCL for MTMG may require an additional raft function, and that can be a FIXME, but I feel like hiding the explicit NCCL dependency is a good design choice.
I will address this also in the Louvain PR. I can actually pick this dependency up through a raft header, so if we address your other comment (about ncclComm_t and migrating some functionality into raft), that should clean this up as well.
… in cmake to ensure MPI is available
/merge
This PR extends the MTMG framework to support a multi-node configuration.
Multi-node configuration of MTMG assumes some externally available mechanism for communicating configuration parameters and the NCCL unique id between the processes that will participate. For testing purposes we assume a shared file system where we can create a directory and read/write configuration information rather than constructing a more complex communication mechanism (NFS is sufficient for testing).
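For illustration, a shared-file-system handoff of the NCCL unique id might look roughly like the sketch below; the path, file name, and function name are made up, and a real implementation would want a separate completion marker to avoid reading a partially written file:
#include <nccl.h>

#include <chrono>
#include <filesystem>
#include <fstream>
#include <thread>

// Sketch only: rank 0 publishes the NCCL unique id to a file on the shared file
// system; every other node polls for the file and reads the id back.
ncclUniqueId exchange_unique_id_via_shared_fs(std::filesystem::path const& shared_dir,
                                              int node_rank)
{
  auto const id_file = shared_dir / "nccl_unique_id.bin";
  ncclUniqueId id{};
  if (node_rank == 0) {
    ncclGetUniqueId(&id);  // error checking omitted for brevity
    std::ofstream os(id_file, std::ios::binary);
    os.write(reinterpret_cast<char const*>(&id), sizeof(id));
  } else {
    while (!std::filesystem::exists(id_file)) {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    std::ifstream is(id_file, std::ios::binary);
    is.read(reinterpret_cast<char*>(&id), sizeof(id));
  }
  return id;
}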
Closes https://github.com/rapidsai/graph_dl/issues/329