Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ktruss implementation #4059

Merged
merged 169 commits into from
Mar 26, 2024
Merged

Conversation

jnke2016
Copy link
Contributor

@jnke2016 jnke2016 commented Dec 13, 2023

Implements SG and MG ktruss using graph primitives and drop cuHornet.

Closes #3447
Closes #3448
Closes #3449
Closes #3450
Closes #3451
Closes #3452
Closes #3453

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review part 1/2.

cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved
@@ -1978,6 +1978,26 @@ void triangle_count(raft::handle_t const& handle,
raft::device_span<edge_t> counts,
bool do_expensive_check = false);

/*
* @brief Compute ktruss.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ktruss -> K-truss

cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved
cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved
cpp/src/community/ktruss_impl.cuh Outdated Show resolved Hide resolved
Comment on lines 476 to 482
rmm::device_uvector<size_t> intersection_offsets(size_t{0}, handle.get_stream());
rmm::device_uvector<vertex_t> intersection_indices(size_t{0}, handle.get_stream());
rmm::device_uvector<edge_t> r_nbr_intersection_property_values0(size_t{0}, handle.get_stream());
rmm::device_uvector<edge_t> r_nbr_intersection_property_values1(size_t{0}, handle.get_stream());

// FIXME: Initially each edge should have an edge property of 0
std::tie(intersection_offsets, intersection_indices) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto [intersection_offsets, intersection_indices] = detail::nbr_intersection(...);

And you can cut your code size a bit. And better define r_nbr_intersection_property_values0 or 1 right before using them (minimize the scope of a variable).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We explicitly define them when we are taking returned variables with std::tie and use std::ignore as we can't use std::ignore with auto [...]. We are not using std::ignore here, so no need to explicitly define variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am only using intersection_offsets and intersection_indices so I will remove r_nbr_intersection_property_values0 and r_nbr_intersection_property_values1

Comment on lines 495 to 496
auto vertex_pair_buffer = allocate_dataframe_buffer<thrust::tuple<vertex_t, vertex_t>>(
num_vertex_pairs, handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not pre-allocate memory too much in advance. This unnecessarily increase peak memory usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to populate this vertex_pair_buffer with thrust::tabulate am I not required to know its size upfront?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto vertex_pair_buffer_tmp = allocate_dataframe_buffer<thrust::tuple<vertex_t, vertex_t>>(
          intersection_indices.size() * 3, handle.get_stream());

The above pre-allocated memory for (p, q), (p, r) and (q, r) but I can do one group at a time and resize to reduce peak memory usage.

Comment on lines 553 to 565
rmm::device_uvector<vertex_t> num_triangles_(3 * intersection_indices.size(), handle.get_stream());

thrust::fill(handle.get_thrust_policy(), num_triangles_.begin(), num_triangles_.end(), size_t{1});

rmm::device_uvector<vertex_t> num_triangles(num_vertex_pairs, handle.get_stream());

thrust::reduce_by_key(handle.get_thrust_policy(),
get_dataframe_buffer_begin(vertex_pair_buffer_),
get_dataframe_buffer_end(vertex_pair_buffer_),
num_triangles_.begin(),
get_dataframe_buffer_begin(vertex_pair_buffer),
num_triangles.begin(),
thrust::equal_to<thrust::tuple<vertex_t, vertex_t>>{});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the code from https://github.com/rapidsai/cugraph/blob/branch-24.02/cpp/src/sampling/sampling_post_processing_impl.cuh

    auto num_unique_labels = thrust::count_if(
      handle.get_thrust_policy(),
      thrust::make_counting_iterator(size_t{0}),
      thrust::make_counting_iterator((*renumber_map_label_indices).size()),
      detail::is_first_in_run_t<label_index_t const*>{(*renumber_map_label_indices).data()});
    rmm::device_uvector<label_index_t> unique_label_indices(num_unique_labels, handle.get_stream());
    rmm::device_uvector<vertex_t> vertex_counts(num_unique_labels, handle.get_stream());
    thrust::reduce_by_key(handle.get_thrust_policy(),
                          (*renumber_map_label_indices).begin(),
                          (*renumber_map_label_indices).end(),
                          thrust::make_constant_iterator(size_t{1}),
                          unique_label_indices.begin(),
                          vertex_counts.begin());

You can use the same logic to perform this with less memory.

printf("\nbefore sorting\n");
raft::print_device_vector("src", std::get<0>(vertex_pair_buffer_).data(), std::get<0>(vertex_pair_buffer_).size(), std::cout);
raft::print_device_vector("dst", std::get<1>(vertex_pair_buffer_).data(), std::get<1>(vertex_pair_buffer_).size(), std::cout);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to shuffle the resulting vertex pairs before reducing to be multi-GPU ready.

num_triangles.begin(),
thrust::equal_to<thrust::tuple<vertex_t, vertex_t>>{});


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you can run thrust::partition here (Step 7 from #3446 (comment)) to find the edges to unroll.

@seunghwak
Copy link
Contributor

@jnke2016

I thought more about implementing #3446 (comment); especially the steps 6, 7, and 8.

After step 6, sort the (edge source, edge destination, triangle count) triplets using edge destination as the primary key and edge source as the secondary key; this is to implement step 8-2.

  1. Run thrust::stable_partition to place the edges with triangle counts smaller than K-2 at the end. These edges should be "unrolled" and removed (unrolled means undoing their contributions to triangle counts, see the Pierce & Sanders for additional details)<= run thrust::stable_partition instead of thrust::partition, so each partition is still sorted.
  2. While (# edges to be remove > 0)
    8-1. Find (p, q, intersection of p&q's neighbor lists) for each edge to be removed. For each vertex r in the intersection, enumerate (p,r,-1) and (q,r,-1)<=call detail::nbr_intersection again but only for the edges to be removed. This covers the case the edge to be removed is a p->q edge. Note that we should eventually run step 6 in chunks as the aggregate size of the intersection indices can far exceed the number of edges. So, we cannot store all the intersection indices and we need to re-compute. This is a memory footprint vs compute trade-off.
    8-2. For each edge to be removed (say (q,r)), find all the incoming edges of r, then query the existence of (p,q) assuming p is an incoming neighbor of q. Accumulate (p,q,-1) and (p,r,-1) if (p,q) exists. <= This covers the case the edge to be removed is a q->r edge. We can run thrust::lower_bound & thrust::upper_bound on (s, d, triangle count) triplets to find all the incoming edges of q. Note that we need to run searches on both partitions of the (edge source, edge destination, triangle count) triplet array. In multi-GPU, we need to inspect all the GPUs that can possibly have r as an edge destination. This may sound a bit complicated unless you fully understand the partitioning and you may just add if constexpr (multi_gpu) { CUGRAPH_FAIL("unimplemented."); } for the initial SG implementation. We can come back to this once we validated the SG implementation.
    8-3. Similarly for each edge to be removed (say (p,r)), find all the incoming edges of r, then query the existence of (p,q) assuming q is an incoming neighbor of r. Accumulate (p,q,-1) and (q,r,-1). <= This covers the case the edge to be removed is a p->r edge. I assume there will be significant overlap in code for 8-2 & 8-3.
    8-4. Shuffle similar to 6-3 in multi-GPU.
    8-5. Update triangle counts based on triplets (similar to 6-4 & 6-5).
    8-6. Mask out the edges to be deleted. Also, resize the (edge source, edge destination, triangle count) triplet array so the deleted edges will no longer be considered.
    8-7. Identify the new set of edges to be deleted by running thrust::stable_partition again on the edge list based on newly update triangle counts.

Please read this and let me know when you are ready to sync again.

Copy link
Collaborator

@ChuckHastings ChuckHastings left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will wait for @seunghwak to approve before merging, but LGTM.

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is now nearly in mergable state. Let's quickly address few minor issues and let this PR pass the CI.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <community/edge_triangle_count_impl.cuh>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to address this.

cpp/src/community/k_truss_impl.cuh Outdated Show resolved Hide resolved
cpp/src/community/k_truss_impl.cuh Outdated Show resolved Hide resolved
return dist_valid;
});
thrust::exclusive_scan(
handle.get_thrust_policy(), prefix_sum.begin(), prefix_sum.end(), prefix_sum.begin());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you can provide thrust::make_transform_iterator(thrust::make_counting_iterator(size_t{0}), [query_vertices, num_edges = ..., sorted_veritces = ...]__device__(auto idx) { ... }); to thrust::exclusive_sum.

In the current code, you need to write dist values to prefix_sum; read dist values from prefix_sum; and compute prefix sum and write the results back to prefix_sum. (2 writes and 1 read). If you use transform_iterator, 1 write will be sufficient. This is a big cut in memory bandwidth and also reduces kernel launch overhead (2 kernel launches => 1 kernel launch).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it and thanks for the background information

}

// FIXME: Remove because it yields to incorrect results
// 3. Find (k+1)-core and exclude edges that do not belong to (k+1)-core
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(k+1)=>(k-1) and re-enable this. This shouldn't yield to incorrect results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried and it still yield incorrect results. I tried even k-2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a fixme for this


num_invalid_edges = static_cast<size_t>(
thrust::distance(invalid_transposed_edge_triangle_count_first,
transposed_edge_triangle_count_pair_first + edgelist_srcs.size()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edge_t num_invalid_edges{0};
      num_invalid_edges = ...

=>

auto num_invalid_edges = ...;

#include <rmm/mr/device/cuda_memory_resource.hpp>

#include <gtest/gtest.h>
#include <utilities/base_fixture.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to address this.

cpp/tests/community/k_truss_test.cpp Outdated Show resolved Hide resolved
cpp/tests/community/k_truss_test.cpp Outdated Show resolved Hide resolved
Comment on lines 218 to 246
std::optional<cugraph::graph_t<vertex_t, edge_t, false, false>> modified_graph{std::nullopt};

std::optional<
cugraph::edge_property_t<cugraph::graph_view_t<vertex_t, edge_t, false, false>, weight_t>>
modified_edge_weight{std::nullopt};
std::tie(*modified_graph, modified_edge_weight, std::ignore, std::ignore, std::ignore) =
cugraph::
create_graph_from_edgelist<vertex_t, edge_t, weight_t, edge_t, int32_t, false, false>(
handle,
std::nullopt,
std::move(d_cugraph_src),
std::move(d_cugraph_dst),
std::move(d_cugraph_wgt),
std::nullopt,
std::nullopt,
cugraph::graph_properties_t{true, false},
renumber);

// Convert cugraph results to CSR
auto [h_cugraph_offsets, h_cugraph_indices, h_cugraph_values] =
cugraph::test::graph_to_host_csr(
handle,
(*modified_graph).view(),
modified_edge_weight ? std::make_optional((*modified_edge_weight).view()) : std::nullopt,
std::optional<raft::device_span<vertex_t const>>(std::nullopt));

// Remove isolated vertices.
h_cugraph_offsets.erase(std::unique(h_cugraph_offsets.begin() + 1, h_cugraph_offsets.end()),
h_cugraph_offsets.end()); // CSR start from 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really do this? Can we just compare cugraph COO output and reference COO output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya but that means I will convert the reference ktruss to COO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will have to rely on csr comparison because my cugraph results are not sorted. I get the same concern I had last week regarding sorting edges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. I extended the sorting utility functions in thrust_wrapper like you suggested.

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and Thanks for all the hard work!!!

Comment on lines +75 to +77
k_truss_reference(std::vector<vertex_t> h_offsets,
std::vector<vertex_t> h_indices,
std::optional<std::vector<weight_t>> h_values,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to address this in this PR, but just fyi,

You're passing h_offsets, h_indices, and h_values by value and based on the size of the vectors, this can be expensive. You can pass a const reference (const &) instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I will resolve it in my next PR

@rlratzel
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit f753e51 into rapidsai:branch-24.04 Mar 26, 2024
147 of 153 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cuGraph improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
5 participants