-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] neighbor sampling in COO/CSR format #1982
[FEA] neighbor sampling in COO/CSR format #1982
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general with some minor complaints.
And can we start with just the COO version first?
CSR version may require more thinking as the input graph (in case of SG) can be either CSR or CSC. Not sure how this will actually be implemented, but returning a CSR output for the CSC input graph can be tricky.
And we may use edgelist
instead of coo
in the public API name. cugraph has from_cudf_edgeelist
function so it might be more consistent and edgelist
sounds more generic to me than coo
.
Thanks for all the comments @seunghwak . I added some of my own comments inline. In general, I'd prefer to restrict input graph to CSR for now (similar to And I agree that we can use |
cpp/include/cugraph/algorithms.hpp
Outdated
typename graph_t::vertex_type const* ptr_d_start, | ||
index_t num_vertices, | ||
index_t sampling_size, | ||
ops::sampling::SampleTypeT sampling_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume sampling_type
captures how the sampling is done. This should probably be injected as a template policy / strategy (probably a functor: say, sampling_strategy_t
), rather than an enum
, but that would complicate dispatching from C-API / Python.
I assume, then there would be a detail::<name>_impl
that has that template policy, and this API would just instantiate the _impl
with the proper sampling_strategy_t
template argument. Is my assumption correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this captures how sampling is done.
There are several dimensions for how sampling is done actually:
- The sampling algorithm being used to select the indexes: this will have an effect in terms of performance (possibly depending on input data), but it will not affect the functionality.
- The kind of sampling in terms of functionality: for example, whether we use weights or not. I didn't plan to add this in the API for now, but with a functor (
sampling_strategy_t
kind of thing as you suggested, we may be able to add this already). This complicates things a little bit because we would have to read in data based on whether the functor requires it, but I'll see if I can think of an easy way to do that. - Whether to extract other data along with the sampled IDs: this will be useful for heterogeneous graphs, but it isn't addressed in this API at all since we would have to return more than just the sampled IDs. So this will be another overload in the future.
And yes, your assumption is correct that there will be a sampling_strategy_t
in the implementation which will be selected based on the enum
value.
For now, we don't need to care so much about C-API / Python since the integration of this with DGL will happen at the C++ level AFAIU, but the reason why we had an enum
in cugnn is because it facilitates integration with Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So all of this to say that I'm open to changing this to a template type if it makes more sense to you, but basically, I'd consider this template type more of an implementation detail which the user shouldn't have to include.
Additionally, users will (in general) not have access to the implementation of that functor since cugraph-ops is closed-source. This means that there isn't much of an added value to the user to have such a functor interface, IMO.
At some point in the future, we may want to change this and allow users to add their own functors but this will most likely by as part of a revised/new API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to create a GitHub issue with the larger picture of the goals that these pieces of API are trying to achieve (and link to that issue in this PR).
Specifically, I would like to see if I can, for example, use this for the purposes of MNMG MFG, or if there is some overlap. So that we can avoid code / effort duplication. Because I'm also looking into a generic MNMG neighborhood retrieval
; by that I mean the process of scanning the neighborhood of a (set of) vertex (vertices) and retrieving some vertices / edges in that neighborhood according with an injected how criteria (again, most likely a static strategy = template policy). I'm specifically avoiding here the term neighborhood sampling
because that is too closely tied with G(C)NN's purposes and random processes. As I see it, neighborhood sampling
are just special cases of neighborhood retrieval
that are dedicated to some specific G(C)NN purposes.
What I'm after is the more generic neighborhood retrieval
one (I assumed BFS is that, but per recent conversations that might be too expensive; so something more "customizable" is needed here, where a retrieval policy can be injected, which knows how to take advantage of shortcuts and not "oversolve", like BFS does; that being said, a customized multi-seed BFS might be how this could be implemented, etc.). So then the question is: is there any overlap between this idea of generic neighborhood retrieval
and the implementation(s) you are proposing here? If so, let's identify the overlap and plan accordingly to not duplicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I forgot to link the issue, I commented below and edited the description above as well.
I agree that you'll be able to use part of the implementation of this API in order to implement other things, like MNMG neighborhood sampling or possibly a more generic neighborhood retrieval
.
But what is the purpose of making the public-facing API more generic?
Do you have specific applications of the more generic neighborhood retrieval
in mind?
Do we know whether the users want this generic neighborhood retrieval
API over a set of more specific APIs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to this, even though the implementation can be re-used for selecting neighbors in other ways, it will be mostly tuned for uniform neighborhood sampling.
Thus, it's likely that other implementations are better for different ways of selecting neighbors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, the suggested API with output template<typename retrieval_strategy_t,...>
is highly problematic since the implementation within cugraph-ops will be closed source, at least for now.
Thus, you would have to give me an exact list of instantiations of this retrieval_strategy_t
in order for us not to expose the implementation.
Do you have this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that it's SG doesn't seem to be mentioned in the linked issue (unless I missed it). If it's not, then probably it should. In fact all this comment above is much more descriptive than the content in the issue, I think.
Also, MFG is mentioned in the issue. It would probably be helpful if we don't both work on implementing overlapping functionality. However, given the specifics of MNMG implementation, there probably is not a real overlap. We need to revisit this later and possibly refactor. So, in light of the SG spec for this I will then assume there isn't any overlap and the 2 implementations, although both targetting MFG but in different contexts, don't have common parts (yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how you plan to provide an open source API for which some implementation details are closed source. This intention along with some information about what exactly is closed source and how (to the extent that that information is shareable...you've already shared such information above on this public forum, so I assume some of that info is shareable) should be added into the description of the issue and / or this PR. Not become uncovered in PR comments.
The more we discuss this the more gets revealed. At this point I am confused about the scope of this PR. It seems that a lot of questions are still un-answered about that. Perhaps answering them before pushing an API signature might make more sense.
I don't have an exact list of possible retrieval_strategy_t
, by virtue of it being generic. The immediate needs for this would have been, again, MFG and RW (MNMG, both). I don't know what other possible future uses of it might be, yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've edited the issue and I'm mentioning cugraph-ops in the issue description now, which will provide the implementation.
I don't have an exact list of possible
retrieval_strategy_t
, by virtue of it being generic.
In this case, we cannot use your suggested API. What we could do is to allow users to provide an opaque structure representing an operator, similar to what cusparse does in SpMMOp (https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm-op).
Edit: I am aware that this is a C API, but C++ will not allow us to use compile-time typing here either since that obviously involves access to the sources of cugraph-ops.
Because of the complications this involves and because there is no indication that we will use this for anything other than random selection at the moment, I'd suggest not to opt for that right away.
We can always provide such an API later on if necessary.
Sounds fine to me. Actually, CSC-ish formats are used only for a limited set of algorithms (PageRank, Katz Centrality, and HITS) and we use CSR-ish (-ish as it is no-longer a simple CSR in multi-GPU) formats by default. |
Forgot to link the issue here, I also edited the description above: #1978 |
Coming back to this, for the CSR interface, it actually doesn't make much sense to return an Plus, in the mid-term, we definitely want to use the CSR-based method since both DGL and PyG use CSR ultimately. |
I agree that we should return COO or CSR. I am just complaining about function names. I believe edge list and COO basically mean the same thing and adjacency list and CSR are (roughly) same as well, but COO & CSR are more rigorous technical terminologies for hardcore computer scientists while data scientists may feel more familiar with edge list or adjacency list (please correct me this understanding is not right). NetworkX has from_pandas_edgelist and from_pandas_adjacency but AFAIK, NetworkX doesn't use CSR or COO in their public API. We have used COO & CSR in our legacy graph classes, but we're sort of moving away from these terminologies in our new public API. So, I am a bit uncomfortable about using COO or CSR in our public API, but pretty much agree that we should return COO or CSR. |
I do agree with this, I think the reason why I used COO/CSR is because DGL & PyG already use those terms and they are of course the primary target for this API. |
Either is fine but for some reason, we have been using @ChuckHastings Any thoughts?
|
This PR has been labeled |
@seunghwak @aschaffer I addressed your comments above and added an implementation based on cugraph-ops. In terms of the interface, I mainly changed the names and added a |
…des must be available at runtime)
@@ -285,6 +285,7 @@ target_include_directories(cugraph | |||
# - link libraries ------------------------------------------------------------- | |||
target_link_libraries(cugraph | |||
PUBLIC | |||
cugraphops::cugraphops |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As cugraphops is part of the public api for cugraph, it needs to be properly exported.
This is causing the CI failures as consumers of the cugraph-config.cmake
have no way of finding cugraphops::cugraphops
. This causes the below error:
11:32:16 Target "cugraph_etl" links to target "cugraphops::cugraphops" but the
11:32:16 target was not found. Perhaps a find_package() call is missing for an
11:32:16 IMPORTED target, or an ALIAS target is missing?
https://github.com/rapidsai/cugraph/blob/branch-22.04/cpp/cmake/thirdparty/get_libcugraphops.cmake
needs to add BUILD and INSTALL export set entries to the rapids_find_generate_module
and rapids_find_package
lines ( cugraph-exports
is the export set ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertmaynard if you see the similar setup in cuML for the libcumlprims, it seems to not follow your suggestion, but that's working for cuML. Any ideas why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because cumlprims_mg
is a private dependency of cuml
and therefore doesn't need to be found by consumers of cuml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks Robert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving ops-codeowner
file changes
@gpucibot merge |
Don't have a strong preference. I agree that networkx is inconsistent. I also don't think it's necessary for us to follow networkx at the C++ level. It is sometimes convenient. But I expect as we start integrating with more graph frameworks we will find different organizational structures and naming conventions that will be different. But for now I'd say let's go with what's in the PR. We can always rename things later if we settle on a different naming convention. |
@gpucibot merge |
This PR tracks work on MNMG Neighborhood Sampling, for G(C)NN needs. Dependencies: 1. #1982 2. #2064 3. Integration of rapidsai/cugraph-ops#24 into `cugraph` Authors: - Andrei Schaffer (https://github.com/aschaffer) Approvers: - Seunghwa Kang (https://github.com/seunghwak) - Kumar Aatish (https://github.com/kaatish) - Chuck Hastings (https://github.com/ChuckHastings) - Rick Ratzel (https://github.com/rlratzel) URL: #2073
This pull request adds neighborhood sampling, as needed by GNN frameworks (DGL, PyTorch-Geometric).
Since I did not hear back on most of the other issues that need to be addressed before this, I am continuing with my plan of first opening a PR with just the API. Once we agree on the final API, and once a minimal version of cugraph-ops is integrated, we can add the implementation of this API.
In particular, for now I am suggesting that the sampling type is exposed in the public API (it does not exist yet in cugraph-ops since that has not been integrated yet). This must be decided ahead of sampling for best performance (either by the end user or some automatic heuristic on the original graph), which is why it makes sense to have as a separate parameter for this API.
EDIT: link to issue #1978