-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support heterogenous fanout type #4608
support heterogenous fanout type #4608
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts on changing the API a bit.
raft::random::RngState& rng_state, | ||
bool return_hops, | ||
bool with_replacement = true, | ||
prior_sources_behavior_t prior_sources_behavior = prior_sources_behavior_t::DEFAULT, | ||
bool dedupe_sources = false, | ||
bool do_expensive_check = false); | ||
|
||
#if 0 | ||
/* FIXME: | ||
There are two options to support heterogeneous fanout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's another option to explore.
Create a new function called neighbor_sample. Create it off of the biased sampling API, but with the following changes:
- the biases become optional instead of required. Then it can do either uniform or biased in the same call just by whether the biases are included or not
- the fanout and heterogeneous fanout as you have defined. Or we might explore using std::variant, where it would either take host_span or tuple of host span and make the right choice internally
- Move the
rng_state
parameter to be right after the handle (before the graph_view). This feels like a better standard place for the parameter.
We can then mark the existing uniform_neighbor_sample and biased_neighbor_sample as deprecated. When we implement, the internal C++ implementation can just call the new neighbor_sample with the parameters properly configured. This makes it a non-breaking change (eventually we'll drop the old functions), but still keeps the code reuse increased.
Thoughts @seunghwak ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the biases become optional instead of required. Then it can do either uniform or biased in the same call just by whether the biases are included or not
=> In this case, we may update the existing non-heterogeneous fanout type sampling functions as well. i.e. combine the uniform & biased sampling functions. Not sure about the optimal balancing point between creating too many functions vs creating a function with too many input parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... I guess we should avoid creating a too busy function (one function handling all different types of sampling based on the input arguments excessively using std::variant & std::optional) but we should also avoid creating too many functions... Not sure what's the optimal balancing point...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, adding new parameters exponentially increase code complexity (too handle all possible combinations of optional parameters), we should better create separate functions. If supporting an additional optional parameter requires only a minor change in the API and implementation, we may create one generic function (or we may create one complex function that handles all different options in the detail namespace and multiple public functions calling this if this helps in reducing code replication).
@@ -368,6 +410,7 @@ cugraph_error_code_t cugraph_uniform_neighbor_sample( | |||
const cugraph_type_erased_device_array_view_t* label_to_comm_rank, | |||
const cugraph_type_erased_device_array_view_t* label_offsets, | |||
const cugraph_type_erased_host_array_view_t* fan_out, | |||
const cugraph_sample_heterogeneous_fanout_t* heterogeneous_fanout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we take the same approach here. Create a new C API function called neighbor_sample, following the biased function definition. Add this parameter. Deprecate the other functions. In the implementation we can just check for nullptr (NULL).
@@ -150,7 +173,7 @@ neighbor_sample_impl( | |||
|
|||
std::vector<size_t> level_sizes{}; | |||
int32_t hop{0}; | |||
for (auto&& k_level : fan_out) { | |||
for (auto&& k_level : (*fan_out)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't actually sufficient yet... but I'm more worried about the API right now.
This loop will need, in the case of heterogeneous sampling, to have 2 levels of for loop. An outer loop iterating by hop and an inner loop iterating by type.
I'd be inclined to add a setup loop that iterates over the types and generates the masks - and perhaps identifies the maximum number of hops to drive the outer loop. You'll need to get k_level from the right type/hop combination... so this for construct won't work at all, it will need to look different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I only added it for it to compile. I will revisit this approach once we lock the API's interface. It is only supporting non heterogeneous type for now
@@ -192,7 +215,7 @@ neighbor_sample_impl( | |||
if (labels) { (*level_result_label_vectors).push_back(std::move(*labels)); } | |||
|
|||
++hop; | |||
if (hop < fan_out.size()) { | |||
if (hop < (*fan_out).size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fan_out size will (potentially) vary by type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I only added it for it to compile. I will revisit this approach once we lock the API's interface. It is only supporting non heterogeneous type for now
# FIXME: Add expensive check to ensure all dict values are lists | ||
# Convert to a tuple of sequence (edge type size and fanout values) | ||
edge_type_size = [] | ||
[edge_type_size.append(len(s)) for s in list(fanout_vals.values())] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this iterate over the edge types in the dictionary in order? We need to make sure that this is constructed with edge type 0 first, followed by edge type 1, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I converted the heterogeneous fanout type to a sorted ordered dictionary.
edge_type_size = [] | ||
[edge_type_size.append(len(s)) for s in list(fanout_vals.values())] | ||
edge_type_fanout_vals = list(chain.from_iterable(list(fanout_vals.values()))) | ||
fanout_vals = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my earlier suggestions, I think we want this to be a CSR structure, so converting from a list of sizes to a list of offsets is perhaps best done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We changed this back to a dense structure... so I think this code isn't right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still seems wrong to me. If you want to support fanout_vals as a dictionary I think we need to convert it to a dense array to get the right values. Do you have a python test for this path that we can verify?
@@ -314,8 +316,21 @@ def uniform_neighbor_sample( | |||
fanout_vals = fanout_vals.get().astype("int32") | |||
elif isinstance(fanout_vals, cudf.Series): | |||
fanout_vals = fanout_vals.values_host.astype("int32") | |||
elif isinstance(fanout_vals, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as above
…0_support_heterogeneous_fanout
…0_support_heterogeneous_fanout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR adds deprecated features, which I don't think we should be doing. Those should be omitted from the new APIs in PLC.
We also need to expose the edge renumber map to Python so we can recover the original edge ids, allowing us to select the correct values from the feature tensor.
python/pylibcugraph/pylibcugraph/heterogeneous_biased_neighbor_sample.pyx
Outdated
Show resolved
Hide resolved
num_edge_types, | ||
bool_t with_replacement, | ||
bool_t do_expensive_check, | ||
with_edge_properties=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this was deprecated in uniform_neighbor_sample
and I think we should just leave it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
python/pylibcugraph/pylibcugraph/homogeneous_biased_neighbor_sample.pyx
Outdated
Show resolved
Hide resolved
h_fan_out, | ||
bool_t with_replacement, | ||
bool_t do_expensive_check, | ||
with_edge_properties=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here; we don't need this flag and it is deprecated in uniform_neighbor_sample
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
python/pylibcugraph/pylibcugraph/homogeneous_uniform_neighbor_sample.pyx
Outdated
Show resolved
Hide resolved
python/pylibcugraph/pylibcugraph/heterogeneous_biased_neighbor_sample.pyx
Outdated
Show resolved
Hide resolved
…0_support_heterogeneous_fanout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we shorten the function names for getting the edge renumber map and offsets? I think it's self-explanatory that they have been renumbered and sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These too
python/pylibcugraph/pylibcugraph/internal_types/sampling_result.pyx
Outdated
Show resolved
Hide resolved
python/pylibcugraph/pylibcugraph/internal_types/sampling_result.pyx
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I did notice a couple minor typos in the new examples, but I'll approve now with the assumption those will be fixed.
>>> sampling_results = pylibcugraph.homogeneous_biased_neighbor_sample( | ||
... resource_handle, G, start_vertices, starting_vertex_label_offsets, | ||
... h_fan_out, False, True) | ||
>>> >>> sampling_results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor copy/paste error?
>>> >>> sampling_results | |
>>> sampling_results |
>>> sampling_results = pylibcugraph.homogeneous_uniform_neighbor_sample( | ||
... resource_handle, G, start_vertices, starting_vertex_label_offsets, | ||
... h_fan_out, False, True) | ||
>>> >>> sampling_results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
closes rapidsai#4589 closes rapidsai#4591 Authors: - Joseph Nke (https://github.com/jnke2016) - Chuck Hastings (https://github.com/ChuckHastings) Approvers: - Seunghwa Kang (https://github.com/seunghwak) - Chuck Hastings (https://github.com/ChuckHastings) - Alex Barghi (https://github.com/alexbarghi-nv) - Rick Ratzel (https://github.com/rlratzel) URL: rapidsai#4608
closes #4589
closes #4591