updated text
BradReesWork committed Jun 29, 2022
1 parent 399c10c commit b55921a
Showing 1 changed file with 59 additions and 9 deletions.
68 changes: 59 additions & 9 deletions notebooks/applications/CostMatrix.ipynb
@@ -4,26 +4,70 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to compute a cost matrix by replicating data\n",
"# How to compute a _Cost Matrix_ by replicating data\n",
"\n",
"cuGraph currently does not have a All-Source Shortest Path (ASSP) algorithm. One is on the roadmap, but that doesn't help us today. If the graph to be processed is small, then it is possible to run ASSP by creating a lot of copies of the graph and running the Single Source Shortest Path (SSSP) on one seed per graph copy. The ASSP with edge weights summed along each path results in the cost to move between any two nodes in the graph. That is the cost matrix.\n",
"### Approach\n",
"A simple approach to creating a cost matrix is to run All-Source Shortest Path (ASSP); however, cuGraph currently does not have an ASSP algorithm. One based on Floyd-Warshall is on the roadmap, but that doesn't help us today. Luckily, there is a workaround if the graph to be processed is small: emulate ASSP by creating many copies of the graph and running Single Source Shortest Path (SSSP) on one seed per graph copy. Since each SSSP runs within its own disjoint component, there are no path collisions between seeds.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Organization\n",
"The first portion of the notebook discusses each step independently. It gives insight into what is going on and how long each step takes.\n",
"\n",
"The second section puts all the steps together in a single function and times how long it takes to compute the matrix.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data\n",
"\n",
"In this notebook we will use the email-Eu-core dataset.\n",
"\n",
"* Number of Vertices: 1,005\n",
"* Number of Edges: 25,571\n",
"\n",
"We are using this dataset since it is small with a few communities, meaning that there are paths to be found."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Revisions\n",
"\n",
"| Author Credit | Date | Update | cuGraph Version | Test Hardware |\n",
"| --------------|------------|------------------|-----------------|----------------|\n",
"| Brad Rees | 06/21/2022 | created | 22.08 | V100 w 32 GB, CUDA 11.5\n",
"| Don Acosta | 06/28/2022 | modified | 22.08 | V100 w 32 GB, CUDA 11.5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References\n",
"\n",
"* https://www.sciencedirect.com/topics/mathematics/cost-matrix\n",
"* https://en.wikipedia.org/wiki/Shortest_path_problem\n",
"\n",
"Dataset\n",
"* Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. Local Higher-order Graph Clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.\n",
"\n",
"* J. Leskovec, J. Kleinberg and C. Faloutsos. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 1(1), 2007. http://www.cs.cmu.edu/~jure/pubs/powergrowth-tkdd.pdf\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# system and other\n",
"import gc\n",
"import os\n",
"import time\n",
"from time import perf_counter\n",
"import math\n",
@@ -38,7 +82,7 @@
"metadata": {},
"source": [
"-----\n",
"# The first section discuss each step independently. The second section puts it all together\n",
"# Reading the data\n",
"\n",
"Let's start by reading the data"
]
@@ -76,7 +120,7 @@
"metadata": {},
"source": [
"### Read the data and verify that it is zero based (i.e. first vertex is 0)\n",
"**IMPORTANT:** The node numbering must be zero based or else the single-source seed (offset) in each copy of the graph will not be correct and there will not be all-source coverage in the cost matrix."
"**IMPORTANT:** The node numbering must be zero based. We set the starting index of the replicated graph to be one larger than the number of vertices. If the original numbering does not start at zero, then the graph copies will overlap in index space and will not be independent (disjoint). "
]
},
{
@@ -493,7 +537,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now do it all in a single function"
"----\n",
"# Section 2: Do it all in a single function"
]
},
{
@@ -502,6 +547,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Set the number of replications - 10 will produce 1,024 graphs\n",
"N = 10"
]
},
@@ -514,15 +560,19 @@
"def build_cost_matrix(_gdf):\n",
" data = make_data(_gdf, N)\n",
" gdf_with_ghost, ghost_id = add_ghost_node(data, N)\n",
" \n",
" G = cugraph.Graph(directed=True)\n",
" G.from_cudf_edgelist(gdf_with_ghost, source='src', destination='dst', renumber=False)\n",
" \n",
" X = cugraph.sssp(G, ghost_id)\n",
" \n",
" X = X[X['predecessor'] != ghost_id]\n",
" X = cugraph.filter_unreachable(X)\n",
" X['distance'] -= 1\n",
" X['seed'] = (X['vertex'] / offset).astype(int)\n",
" X['v2'] = X['vertex'] - (X['seed'] * offset)\n",
" cost = X.drop(columns=['vertex', 'predecessor'])\n",
" \n",
" return cost"
]
},
@@ -563,9 +613,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.13 ('cugraph_dev')",
"display_name": "cugraph_dev",
"language": "python",
"name": "python3"
"name": "cugraph_dev"
},
"language_info": {
"codemirror_mode": {
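
The replication idea described in the notebook text can be sketched on the CPU in plain Python. This is only an illustration under stated assumptions, not the notebook's implementation: the bodies of `make_data` and `add_ghost_node` are not shown in this diff, so the helper names `sssp` and `cost_matrix_by_replication`, the choice of one graph copy per seed, the offset equal to the vertex count, and the ghost-edge weight of 1 are all hypothetical. A hand-written Dijkstra stands in for `cugraph.sssp`.

```python
import heapq


def sssp(adj, source):
    """Dijkstra over an adjacency dict {u: [(v, w), ...]}; returns {vertex: distance}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def cost_matrix_by_replication(edges, num_vertices):
    """Hypothetical sketch: one disjoint copy of the graph per seed, plus a ghost node.

    Assumes zero-based vertex IDs in `edges` (a list of (src, dst, weight)).
    """
    offset = num_vertices                # assumed width of each copy in index space
    ghost = num_vertices * num_vertices  # first ID past all copies
    adj = {}
    for seed in range(num_vertices):
        base = seed * offset
        # Copy every edge into this seed's private index range.
        for u, v, w in edges:
            adj.setdefault(base + u, []).append((base + v, w))
        # Connect the ghost node to this copy's seed with an assumed weight of 1.
        adj.setdefault(ghost, []).append((base + seed, 1.0))

    # A single SSSP from the ghost node reaches every seed in its own copy.
    dist = sssp(adj, ghost)

    cost = {}
    for vertex, d in dist.items():
        if vertex == ghost:
            continue
        seed, v2 = divmod(vertex, offset)  # decode (copy index, original vertex)
        cost[(seed, v2)] = d - 1.0         # remove the ghost edge's contribution
    return cost


# Small directed example: 0 -> 1 -> 2 is cheaper than the direct edge 0 -> 2.
edges = [(0, 1, 2.0), (1, 2, 3.0), (0, 2, 10.0)]
cost = cost_matrix_by_replication(edges, 3)
```

Unreachable pairs simply do not appear in `cost`, which plays the same role as the `filter_unreachable` step in `build_cost_matrix`. Unlike that function, this sketch keeps each seed's zero-cost entry to itself rather than dropping rows whose predecessor is the ghost node.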
