Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] neighbor sampling in COO/CSR format #1982

Merged
merged 14 commits into from
Feb 22, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 58 additions & 0 deletions cpp/include/cugraph/algorithms.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -1416,6 +1416,64 @@ random_walks(raft::handle_t const& handle,
bool use_padding = false,
std::unique_ptr<sampling_params_t> sampling_strategy = nullptr);

/**
* @brief generate sub-sampled graph in CSR format given input graph, list of vertices
* and sample size per vertex. The output graph consists of the given vertices
* with each vertex having at most `sample_size` neighbors from the original graph
*
* @tparam graph_t Type of input graph/view (typically, graph_view_t).
* @tparam index_t Type used to store indexing and sizes.
* @param handle RAFT handle object to encapsulate resources (e.g. CUDA stream, communicator, and
* handles to various CUDA libraries) to run graph algorithms.
* @param graph Graph (view )object to sub-sample.
* @param ptr_d_start Device pointer to set of starting vertex indices for the sub-sampling.
* @param num_vertices = number(vertices) to use for sub-sampling.
* @param sampling_size = max number of neighbors per output vertex.
* @param sampling_type = the sampling type (algo R/algo L/etc.) used to produce outputs.
* @return std::tuple<rmm::device_uvector<typename graph_t::edge_type>,
* rmm::device_uvector<typename graph_t::vertex_type>>
* Tuple consisting of two arrays representing the offsets and indices of
* the sub-sampled graph.
*/
template <typename graph_t, typename index_t>
std::tuple<rmm::device_uvector<typename graph_t::edge_type>,
rmm::device_uvector<typename graph_t::vertex_type>>
neighbor_sampling_csr(raft::handle_t const& handle,
graph_t const& graph,
typename graph_t::vertex_type const* ptr_d_start,
index_t num_vertices,
MatthiasKohl marked this conversation as resolved.
Show resolved Hide resolved
index_t sampling_size,
MatthiasKohl marked this conversation as resolved.
Show resolved Hide resolved
ops::sampling::SampleTypeT sampling_type);

/**
* @brief generate sub-sampled graph in COO format given input graph, list of vertices
* and sample size per vertex. The output graph consists of the given vertices
* with each vertex having at most `sample_size` neighbors from the original graph
*
* @tparam graph_t Type of input graph/view (typically, graph_view_t).
* @tparam index_t Type used to store indexing and sizes.
* @param handle RAFT handle object to encapsulate resources (e.g. CUDA stream, communicator, and
* handles to various CUDA libraries) to run graph algorithms.
* @param graph Graph (view )object to sub-sample.
* @param ptr_d_start Device pointer to set of starting vertex indices for the sub-sampling.
* @param num_vertices = number(vertices) to use for sub-sampling.
* @param sampling_size = max number of neighbors per output vertex.
* @param sampling_type = the sampling type (algo R/algo L/etc.) used to produce outputs.
* @return std::tuple<rmm::device_uvector<typename graph_t::edge_type>,
* rmm::device_uvector<typename graph_t::vertex_type>>
* Tuple consisting of two arrays representing the source and destination nodes of
* the sub-sampled graph.
*/
template <typename graph_t, typename index_t>
MatthiasKohl marked this conversation as resolved.
Show resolved Hide resolved
std::tuple<rmm::device_uvector<typename graph_t::vertex_type>,
rmm::device_uvector<typename graph_t::vertex_type>>
neighbor_sampling_coo(raft::handle_t const& handle,
graph_t const& graph,
typename graph_t::vertex_type const* ptr_d_start,
index_t num_vertices,
MatthiasKohl marked this conversation as resolved.
Show resolved Hide resolved
index_t sampling_size,
ops::sampling::SampleTypeT sampling_type);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume sampling_type captures how the sampling is done. This should probably be injected as a template policy / strategy (probably a functor: say, sampling_strategy_t ), rather than an enum, but that would complicate dispatching from C-API / Python.

I assume, then there would be a detail::<name>_impl that has that template policy, and this API would just instantiate the _impl with the proper sampling_strategy_t template argument. Is my assumption correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this captures how sampling is done.

There are several dimensions for how sampling is done actually:

  1. The sampling algorithm being used to select the indexes: this will have an effect in terms of performance (possibly depending on input data), but it will not affect the functionality.
  2. The kind of sampling in terms of functionality: for example, whether we use weights or not. I didn't plan to add this in the API for now, but with a functor (sampling_strategy_t kind of thing as you suggested, we may be able to add this already). This complicates things a little bit because we would have to read in data based on whether the functor requires it, but I'll see if I can think of an easy way to do that.
  3. Whether to extract other data along with the sampled IDs: this will be useful for heterogeneous graphs, but it isn't addressed in this API at all since we would have to return more than just the sampled IDs. So this will be another overload in the future.

And yes, your assumption is correct that there will be a sampling_strategy_t in the implementation which will be selected based on the enum value.

For now, we don't need to care so much about C-API / Python since the integration of this with DGL will happen at the C++ level AFAIU, but the reason why we had an enum in cugnn is because it facilitates integration with Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So all of this to say that I'm open to changing this to a template type if it makes more sense to you, but basically, I'd consider this template type more of an implementation detail which the user shouldn't have to include.

Additionally, users will (in general) not have access to the implementation of that functor since cugraph-ops is closed-source. This means that there isn't much of an added value to the user to have such a functor interface, IMO.

At some point in the future, we may want to change this and allow users to add their own functors but this will most likely by as part of a revised/new API.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to create a GitHub issue with the larger picture of the goals that these pieces of API are trying to achieve (and link to that issue in this PR).

Specifically, I would like to see if I can, for example, use this for the purposes of MNMG MFG, or if there is some overlap. So that we can avoid code / effort duplication. Because I'm also looking into a generic MNMG neighborhood retrieval; by that I mean the process of scanning the neighborhood of a (set of) vertex (vertices) and retrieving some vertices / edges in that neighborhood according with an injected how criteria (again, most likely a static strategy = template policy). I'm specifically avoiding here the term neighborhood sampling because that is too closely tied with G(C)NN's purposes and random processes. As I see it, neighborhood sampling are just special cases of neighborhood retrieval that are dedicated to some specific G(C)NN purposes.

What I'm after is the more generic neighborhood retrieval one (I assumed BFS is that, but per recent conversations that might be too expensive; so something more "customizable" is needed here, where a retrieval policy can be injected, which knows how to take advantage of shortcuts and not "oversolve", like BFS does; that being said, a customized multi-seed BFS might be how this could be implemented, etc.). So then the question is: is there any overlap between this idea of generic neighborhood retrieval and the implementation(s) you are proposing here? If so, let's identify the overlap and plan accordingly to not duplicate code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I forgot to link the issue, I commented below and edited the description above as well.

I agree that you'll be able to use part of the implementation of this API in order to implement other things, like MNMG neighborhood sampling or possibly a more generic neighborhood retrieval.

But what is the purpose of making the public-facing API more generic?
Do you have specific applications of the more generic neighborhood retrieval in mind?
Do we know whether the users want this generic neighborhood retrieval API over a set of more specific APIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to this, even though the implementation can be re-used for selecting neighbors in other ways, it will be mostly tuned for uniform neighborhood sampling.
Thus, it's likely that other implementations are better for different ways of selecting neighbors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, the suggested API with output template<typename retrieval_strategy_t,...> is highly problematic since the implementation within cugraph-ops will be closed source, at least for now.

Thus, you would have to give me an exact list of instantiations of this retrieval_strategy_t in order for us not to expose the implementation.
Do you have this list?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that it's SG doesn't seem to be mentioned in the linked issue (unless I missed it). If it's not, then probably it should. In fact all this comment above is much more descriptive than the content in the issue, I think.

Also, MFG is mentioned in the issue. It would probably be helpful if we don't both work on implementing overlapping functionality. However, given the specifics of MNMG implementation, there probably is not a real overlap. We need to revisit this later and possibly refactor. So, in light of the SG spec for this I will then assume there isn't any overlap and the 2 implementations, although both targetting MFG but in different contexts, don't have common parts (yet).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how you plan to provide an open source API for which some implementation details are closed source. This intention along with some information about what exactly is closed source and how (to the extent that that information is shareable...you've already shared such information above on this public forum, so I assume some of that info is shareable) should be added into the description of the issue and / or this PR. Not become uncovered in PR comments.

The more we discuss this the more gets revealed. At this point I am confused about the scope of this PR. It seems that a lot of questions are still un-answered about that. Perhaps answering them before pushing an API signature might make more sense.

I don't have an exact list of possible retrieval_strategy_t, by virtue of it being generic. The immediate needs for this would have been, again, MFG and RW (MNMG, both). I don't know what other possible future uses of it might be, yet.

Copy link
Contributor Author

@MatthiasKohl MatthiasKohl Dec 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've edited the issue and I'm mentioning cugraph-ops in the issue description now, which will provide the implementation.

I don't have an exact list of possible retrieval_strategy_t, by virtue of it being generic.

In this case, we cannot use your suggested API. What we could do is to allow users to provide an opaque structure representing an operator, similar to what cusparse does in SpMMOp (https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm-op).
Edit: I am aware that this is a C API, but C++ will not allow us to use compile-time typing here either since that obviously involves access to the sources of cugraph-ops.
Because of the complications this involves and because there is no indication that we will use this for anything other than random selection at the moment, I'd suggest not to opt for that right away.
We can always provide such an API later on if necessary.


/**
* @brief Finds (weakly-connected-)component IDs of each vertices in the input graph.
*
Expand Down