Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update C API graph creation function signatures #3982

Merged

Conversation

ChuckHastings
Copy link
Collaborator

@ChuckHastings ChuckHastings commented Nov 7, 2023

Updating the C API graph creation functions to support the following:

  • Add support for isolated vertices
  • Add MG optimization to support multiple device arrays per rank as input and concatenate them internally
  • Add MG optimization to internally compute the number of edges via allreduce rather than requiring it as an input parameter (this can be expensive to compute in python)

This PR implements these features. Some simple tests exist to check for isolate vertices (by running pagerank which generates a different result if the graph has isolated vertices). A simple test for multiple input arrays exists for the MG case.

Closes #3947
Closes #3974

Comment on lines +85 to +87
* @param [in] vertices Optional device array containing a list of vertex ids
* (specify NULL if we should create vertex ids from the
* unique contents of @p src and @p dst)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does NULL vertices behave with and w/o renumbering? What if edge data only contain unique ids of e.g. 1, 3, and 5? Are 0, 2, and 4 implied if renumber is False?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If vertices is NULL, this will provide the same behavior as today. That is:

  • Renumber enabled will ignore vertices 0, 2, 4... they will not exist in the graph. We will create a graph with 3 vertices which will be internally numbered 0, 1, 2. Any function that is called will translate any vertex id inputs to the mapping 0, 1, 2 (e.g. if you call BFS and specify a seed of 1, it will be mapped to whichever id that vertex 1 was mapped to), and any vertex ids outputs will be renumbered from the internal mapping (0, 1, 2) to the external mapping (1, 3, 5)
  • Renumber disabled will use the vertex ids exactly as specified, so vertices 0, 2 and 4 will be isolated vertices in the graph. However, if your graph also had a vertex 6... that vertex would be lost even with renumber disabled, since there is no mechanism to discover that vertex 6 exists in your graph with the current (and proposed API).

* integer values from 0 to num_vertices.
* @param [in] do_expensive_check If true, do expensive checks to validate the input data
* is consistent with software assumptions. If false bypass these checks.
* @param [in] properties Properties of the graph
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

properties is defined twice for this function (and also twice for the function below).

* @param [in] edge_type_ids Device array containing the edge types for each edge. Optional
argument that can be NULL if edge types are not used.
* @param [in] store_transposed If true create the graph initially in transposed format
* @param [in] renumber If true, renumber vertices to make an efficient data structure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does renumber=True work for CSR? Wouldn't offsets already imply 0..N-1 vertex ids?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renumber=True would still work for CSR input. Yes, offsets does imply vertex ids in the range [0, N). But our renumbering logic adds some performance optimizations by putting the vertex ids in an order that makes the CUDA kernels more efficient. So there is value in specifying renumber=True even if the vertex ids are densely packed.

I will be fixing the C API implementation of this function so that it will properly reflect the isolated vertices that can be implied in CSR format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should update the documentation to better reflect that renumbering is more than just packing integers and highly recommended in most practical use cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated documentation for each of the renumber parameters in this file in the next push.

@ChuckHastings ChuckHastings self-assigned this Nov 13, 2023
@ChuckHastings ChuckHastings added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 13, 2023
@ChuckHastings ChuckHastings added this to the 23.12 milestone Nov 13, 2023
* @param [in] store_transposed If true create the graph initially in transposed format
* @param [in] renumber If true, renumber vertices to make an efficient data structure.
* If false, do not renumber. Renumbering is required if the vertices are not sequential
* integer values from 0 to num_vertices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this documentation is a bit misleading. In cuGraph, renumbering is more than just mapping integers to consecutive integers starting from 0. cuGraph renumbers vertex IDs in a specific way for various optimizations. This documentation may be interpreted that renumbering is unnecessary if vertex IDs are consecutive integers starting from 0. AFAIK, renumber = false's only remaining use case is for debugging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated documentation for each of the renumber parameters in this file in the next push.

cpp/include/cugraph_c/graph.h Show resolved Hide resolved
* @param [in] edge_type_ids Device array containing the edge types for each edge. Optional
argument that can be NULL if edge types are not used.
* @param [in] store_transposed If true create the graph initially in transposed format
* @param [in] renumber If true, renumber vertices to make an efficient data structure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should update the documentation to better reflect that renumbering is more than just packing integers and highly recommended in most practical use cases.

*
* @param [in] graph A pointer to the graph object to destroy
*/
void cugraph_graph_free(cugraph_graph_t* graph);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this function works for both SG&MG graphs, right? Then, should we place this function after SG & MG graph creation functions? I was searching for a new graph deletion function for MG (replacing the deprecated function) and it took sometime to find that this function works for both SG & MG.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all of the free functions so that they are together in the file.

@@ -96,49 +135,27 @@ struct create_graph_functor : public cugraph::c_api::abstract_functor {
edge_type_id_t>>
new_edge_types{std::nullopt};

rmm::device_uvector<vertex_t> edgelist_srcs(src_->size_, handle_.get_stream());
rmm::device_uvector<vertex_t> edgelist_dsts(dst_->size_, handle_.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something outside the scope of this PR, but are we planning to shuffle edges in the C-API in the future? If we perform shuffle in the C-API, should we still need to take multiple arrays from the python layer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do shuffle the edges in the C API currently.

Perhaps @VibhuJawa can offer some insight into the need for multiple arrays, but I believe this is an artifact of dask and how the arrays are distributed on the workers. It is possible for there to be multiple device arrays on the same worker. This new mechanism allows the concatenation of these multiple arrays to occur in the C API rather than in dask/cudf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do it in dask-cudf too once we remove the num_edges signature (which this PR does).

If we don't have to get num_edges , we can just do this in cuDF/cuPY one array at a time and the resulting overhead will be 2x(array_memory) instead of 2x(dataframe_memory) .

cpp/src/c_api/graph_mg.cpp Show resolved Hide resolved

// FIXME: Need to handle the case where a GPU gets a NULL pointer but other GPUs
// Get values. Override vertex_type/weight_type/edge_id_type but don't
// fail.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should handle this instead of leaving this as FIXME. This indeed happened in Louvain in the past.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will address this as an update to this PR. I wanted to get the PR up for review rather than waiting for a little more validation.

@ChuckHastings ChuckHastings requested a review from a team as a code owner November 16, 2023 02:24
cpp/src/c_api/graph_mg.cpp Show resolved Hide resolved
CAPI_EXPECTS(
std::count_if(vertex_types.begin(),
vertex_types.end(),
[vertex_type](auto t) { return vertex_type != static_cast<int>(t); }) == 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor point, but is std::all_of more specific for this? (https://en.cppreference.com/w/cpp/algorithm/all_any_none_of). I guess all_of may better document the intention.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address when I get to the FIXME, which might change this logic anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push.

[weight_type](auto t) { return weight_type != static_cast<int>(t); }) == 0,
CUGRAPH_INVALID_INPUT,
"different weight type used on different GPUs",
*error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor point, but is std::all_of more specific for this? (https://en.cppreference.com/w/cpp/algorithm/all_any_none_of). I guess all_of may better document the intention.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

Comment on lines 184 to 216
if (drop_multi_edges_) {
std::cout << "calling drop_multi_edges" << std::endl;
raft::print_device_vector(
" edgelist_srcs", edgelist_srcs.data(), edgelist_srcs.size(), std::cout);
raft::print_device_vector(
" edgelist_dsts", edgelist_dsts.data(), edgelist_dsts.size(), std::cout);

std::tie(
edgelist_srcs, edgelist_dsts, edgelist_weights, edgelist_edge_ids, edgelist_edge_types) =
cugraph::sort_and_remove_multi_edges(handle_,
std::move(edgelist_srcs),
std::move(edgelist_dsts),
std::move(edgelist_weights),
std::move(edgelist_edge_ids),
std::move(edgelist_edge_types));

std::cout << "after drop_multi_edges" << std::endl;
raft::print_device_vector(
" edgelist_srcs", edgelist_srcs.data(), edgelist_srcs.size(), std::cout);
raft::print_device_vector(
" edgelist_dsts", edgelist_dsts.data(), edgelist_dsts.size(), std::cout);
}

if (drop_self_loops_) {
std::tie(
edgelist_srcs, edgelist_dsts, edgelist_weights, edgelist_edge_ids, edgelist_edge_types) =
cugraph::remove_self_loops(handle_,
std::move(edgelist_srcs),
std::move(edgelist_dsts),
std::move(edgelist_weights),
std::move(edgelist_edge_ids),
std::move(edgelist_edge_types));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nitpicking but if drop_self_loops appear before drop_multi_edges in the input parameters, should we better perform drop_self_loops() before drop_multi_edges()? This consistency may make the code easier to read and slightly easier to maintain.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address in next push

{
auto edge_first =
thrust::make_zip_iterator(thrust::make_tuple(edgelist_srcs.begin(), edgelist_dsts.begin()));
thrust::sort(handle.get_thrust_policy(), edge_first, edge_first + edgelist_srcs.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will require 2x the size of edge list and will increase the peak memory requirement. Our goal is to keep the peak memory footprint in graph creation to 1.5x edge list + per vertex data.

See the code in

https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/src/structure/detail/structure_utils.cuh#L164

We can call mem_frugal_partitoin (which requires 0.5x the input array) and process the resulting two partitions one by one if the input array size is larger than the certain threshold.

thrust::remove_if(handle.get_thrust_policy(),
edge_first,
edge_first + edgelist_srcs.size(),
[] __device__(auto e) { return thrust::get<0>(e) == thrust::get<1>(e); })),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am 70% sure thrust::remove_if requires 2x the input array size memory. If this becomes a memory bottleneck (e.g. the input edge list size is large compared to the device memory size, see https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/src/structure/create_graph_from_edgelist_impl.cuh#L523C23-L523C23), we may create a bitmap storing whether to remove an edge or not. And based on this bitmap, we can call remove_if on srcs, dsts, weights, edge_ids, and edge_types, separately. Then, we can keep the peak memory requirement under our goal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would also eliminate the need for the ugliness you mention below. I'll refactor.

void remove_self_loops(raft::handle_t const& handle,
rmm::device_uvector<vertex_t>& edgelist_srcs /* [INOUT] */,
rmm::device_uvector<vertex_t>& edgelist_dsts /* [INOUT] */,
rmm::device_uvector<A>& edgelist_a /* [INOUT] */)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we need three separate functions that takes 1, 2, or 3 additional variables in addition to edgelist_srcs & edgelist_dsts.

See https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/src/structure/detail/structure_utils.cuh#L148

Then, we can set

edge_first = thrust::make_zip_iterator(edgelist_srcs.begin(), edgelist_dsts.begin(), get_dataframe_buffer_begin());

And the remaining code will be same. Same for the sort_and_remove_multi_edges function.

This will be more important if we bring additional code to limit peak memory footprint.

…memory footprint when graphs are large relative to GPU memory
// FIXME: This should probably just be cugraph_graph_free
// but didn't want to confuse with original cugraph_free_graph
void cugraph_sg_graph_free(cugraph_graph_t* graph);
cugraph_error_code_t cugraph_graph_create_sg_from_csr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since cugraph_graph_create_sg supports flags to drop both self loops and multi edges through the methods cugraph::remove_self_loops and cugraph::sort_and_remove_multi_edges, shouldn't cugraph_graph_create_sg_from_csr also support those flags? In fact, cugraph_graph_create_sg_from_csr extracts the COO from the CSR before calling cugraph::create_graph_from_edgelist therefore, we can drop self loops and multi edges (if set to True) right before the graph question.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add that feature. I was assuming that if you already have a CSR as input then you have already done whatever ETL steps are desired.

"Invalid input arguments: all vertex types must match",
*error);

CAPI_EXPECTS(p_weights[i]->type_ == weight_type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot to check if weights was not a nullptr otherwise we get a segfault. I changed it locally to this

CAPI_EXPECTS((weights == nullptr) || (p_weights[i]->type_ == weight_type),
                 CUGRAPH_INVALID_INPUT,
                 "Invalid input arguments: all weight types must match",
                 *error);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

std::optional<rmm::device_uvector<weight_t>>,
std::optional<rmm::device_uvector<edge_t>>,
std::optional<rmm::device_uvector<edge_type_t>>>
sort_and_remove_multi_edges(raft::handle_t const& handle,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized this but now it seems like "sort" here is a misnomer. We are not really globally sorting anymore, so I think we'd better rename this function to remove_multi_edges.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And may better document that this function is not stable (which means this does not preserve the order in contrast to std::remove or thrust::remove).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

std::optional<rmm::device_uvector<edge_type_t>>&& edgelist_edge_types);

/**
* @brief Sort the edges and remove all but one edge when a multi-edge exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is no longer correct.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

T const* array_;

T operator() __host__ __device__(size_t index) { return array_[index]; }
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use

template <typename index_t, typename Iterator>
struct indirection_t {
  Iterator first{};

  __device__ typename thrust::iterator_traits<Iterator>::value_type operator()(index_t i) const
  {
    return *(first + i);
  }
};

instead?

https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/include/cugraph/utilities/device_functors.cuh#L70

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

Comment on lines 516 to 531
thrust::fill(handle.get_thrust_policy(),
remove_flags.begin(),
remove_flags.end(),
cugraph::packed_bool_empty_mask());

size_t remove_count = thrust::count_if(
handle.get_thrust_policy(),
thrust::make_counting_iterator(size_t{0}),
thrust::make_counting_iterator(src.size()),
[comparison, d_remove_flags = remove_flags.data()] __device__(size_t i) {
if (comparison(i)) {
atomicOr(d_remove_flags + cugraph::packed_bool_offset(i), cugraph::packed_bool_mask(i));
return true;
}
return false;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we better call this remove_flags.size()? We can avoid atomics and no need to call thrust::fill.

Something like

thrust::tabulate(
  handle.get_thrust_policy(),
  remove_flags.begin(),
  remove_flags.end(),
  [edge_first, comparison]__device__(auto i) {
    auto word = packed_bool_empty_mask();
    ...
    return word;
  });

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rework this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

template <typename T>
rmm::device_uvector<T> remove_flagged_elements(raft::handle_t const& handle,
rmm::device_uvector<T>&& vector,
rmm::device_uvector<uint32_t> const& remove_flags,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use raft::device_span for consistency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better remove sort_ from the file name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better remove sort_ from the file name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

Comment on lines 89 to 109
// Tuning parameter to address high frequency multi-edges
size_t num_hash_buckets{16};

auto hash_counts = compute_hash_sizes(pair_first,
pair_first + edgelist_srcs.size(),
num_hash_buckets,
hash_src_dst_pair<vertex_t>{},
handle.get_stream());

auto pivot =
static_cast<int32_t>(thrust::distance(hash_counts.begin(),
thrust::lower_bound(handle.get_thrust_policy(),
hash_counts.begin(),
hash_counts.end(),
edgelist_srcs.size() / 2)));

auto second_first = detail::mem_frugal_partition(pair_first,
pair_first + edgelist_srcs.size(),
hash_src_dst_pair<vertex_t>{},
pivot,
handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/include/cugraph/utilities/shuffle_comm.cuh#L782 with num_groups = 2. Then you just need to provide a key_to_group_id functor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with setting num_groups = 2 was that we might get a pretty bad skew in the multi-edges. Consider an rmat graph. We're going to get certain edges that occur with great frequency, and other edges that occur with lesser frequency. I thought that by first grouping by some larger number and then finding the median break point we'll be closer to a 50/50 split at the pivot.

I certainly could use groupby_and_count instead of recreating it here (my compute_hash_sizes function is essentially the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean there can be millions (or billions) of edges between a unique (src, dst) pair? Yeah... this might be possible if there are two really big institutes and there are a huge number of transactions between the two institutes.

But I don't think this will happen in synthetic R-mat graphs (we are not partitioning based on just src or dst but src, dst pair, so high degree vertices won't have much impact on partitioning quality).

Yeah... and calling with num_groups = 2 vs some large number can be a compute vs solution robustness trade-offs. num_goups = 2 will be cheaper than than num_groups = 128 (2^1 vs 2^7). And in the theoretical worst case (i.e. all the edges are between the same src, dst pair), we will need 2x memory of the edge list vs 1.5x. And if there are very high multiplicity edge pairs, we may be sometimes able to combine those edges and create a weighted graph. Just a thought material. Not sure what's the right trade-off point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked in detail. My recollection of our RMAT generator is that we will get a higher multiplicity of edge pairs in certain quadrants of the graph. So there will be many more instances of (0,1), (0,2), (0,3) created than vertex pairs with high vertex id values. I agree that after we scramble, the partitioning will take care of much of this. But we're still likely to end up with vertex pairs on each GPU that have high multiplicity and vertex pairs that have low multiplicity.

I think I will leave this code in there to support this issue, but I'll default the number of groups to 2 and just leave a FIXME for us to investigate it later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

Comment on lines 147 to 167
size_t num_hash_buckets{16};

auto hash_counts = compute_hash_sizes(pair_first,
pair_first + edgelist_srcs.size(),
num_hash_buckets,
hash_src_dst_pair<vertex_t>{},
handle.get_stream());

auto pivot =
static_cast<int32_t>(thrust::distance(hash_counts.begin(),
thrust::lower_bound(handle.get_thrust_policy(),
hash_counts.begin(),
hash_counts.end(),
edgelist_srcs.size() / 2)));

auto second_first = detail::mem_frugal_partition(pair_first,
pair_first + edgelist_srcs.size(),
get_dataframe_buffer_begin(edgelist_values),
hash_src_dst_pair<vertex_t>{},
pivot,
handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in next push

Comment on lines 118 to 125
thrust::unique(handle.get_thrust_policy(),
pair_first,
pair_first + edgelist_srcs.size(),
[] __device__(auto lhs, auto rhs) {
return (thrust::get<0>(lhs) == thrust::get<0>(rhs)) &&
(thrust::get<1>(lhs) == thrust::get<1>(rhs));
})),
handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked thrust::unique does not use a temporary memory buffer (size = original data size). We can test this by creating an array with random data close to 70% of the total memory size and this unique call succeeds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I'll run that test. I have an idea to work around that. I'll work on it over the weekend.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this code to use the same technique as the remove self-loops logic: Create a bit mask of entries to delete. Then delete them one array at a time. Should use at most 1.55x memory footprint.

@naimnv
Copy link
Contributor

naimnv commented Nov 18, 2023

/ok to test

Copy link
Contributor

@jnke2016 jnke2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ChuckHastings
Copy link
Collaborator Author

/merge

@rapids-bot rapids-bot bot merged commit 3f1c7b5 into rapidsai:branch-23.12 Nov 21, 2023
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
6 participants