Kùzu is an embedded graph database built for query speed and scalability. It implements the property graph data model and is optimized for handling complex join-heavy analytical workloads on very large graphs. At the end of this workshop, you'll take away some insights into how to use Kùzu and its query language, Cypher, for graph analysis and also see how easy it is to use in combination with the rest of your stack!
This workshop will be divided into two parts.
We will be working on a simple dataset of a financial network. The aim is to write Cypher queries to do a graph analysis. To contrast it with a SQL-based analysis using a relational database, we will also answer the same questions that are asked via Cypher with SQL queries in DuckDB, an embedded relational database. Along the way, it will become clear that some kinds of questions are actually really hard (or borderline impossible) to answer in SQL, but are natural and easy in Cypher.
This section builds on the previous one. We will showcase the interoperability of Kùzu with the
Python data science ecosystem. We will use the existing financial network to run a graph algorithm,
betweenness_centrality
, using NetworkX, a popular Python library for graph analysis.
We will largely be interacting with Kùzu using its web-based UI, Kùzu Explorer. Here are the steps to run Kùzu Explorer.
- You need to first ensure that Docker is installed on your machine:
- Once you have Docker installed, please start Docker. This is usually done by launching Docker Desktop application.
- Next, you can download the specified image of Kùzu Explorer from DockerHub provided using the provided
docker-compose.yml
file. To do that, you can run the following commands in the directory where thedocker-compose.yml
is:
docker compose up
The above command might take a while but at the end of it will run Kùzu Explorer on your http://localhost:8000/.
Alternatively, you can type in the following command in your terminal:
docker run -p 8000:8000 \
-v ./ex_kuzu_db:/database \
-v ./data:/data \
-e MODE=READ_WRITE \
--rm kuzudb/explorer:0.7.0
This subsection is totally optional -- you can install the CLI tools for Kùzu and DuckDB on your machine to also run the Cypher/SQL queries via their CLI interfaces.
If you want to run the provided Python notebook to run graph algorithms, you will need to install Python and a local virtual environment.
We will be using the following tools:
We will define the problem statement for this workshop as follows:
Imagine you are an investigator at an organization that tracks financial crimes. Two email addresses have been flagged by law enforcement agencies, and both are now under suspicion for their potential involvement in fraudulent activities.
Your task is to analyze some data that consists of money transfers between individuals to assist in the ongoing investigation. The findings from this analysis will be crucial for an upcoming court case. To achieve this, you will delve into the dataset to uncover hidden patterns, connections, and insights into the transactions between individuals.
A financial network dataset of persons, accounts, addresses and transfers between accounts is provided. Its schema can be represented as shown below.
A summary of the dataset is provided below:
- 21 nodes of type
Person
- 21 nodes of type
Account
(eachPerson
has exactly one account) - 15 nodes of type
Address
- 22 relationships of type
Transfer
, where the transfers are directed from a source accounts
that has transferred money to a destination accountd
.
The graph schema shown above is based on the following input files.
person.csv
account.csv
transfer.csv
From a relational database perspective, we just have three tables. The person table has an address column and the account table is connected to the person table via the person ID. The transfer table contains the source and destination account IDs, and the amount transferred.
From the three tables we begin with, we are able to separate out the required columns for our node and relationship tables in Kùzu, to give us the following six tables for our graph analysis:
- Node table:
Person
- Node table:
Address
- Node table:
Account
- Relationship table:
Owns
(betweenPerson
andAccount
) - Relationship table:
LivesIn
(betweenPerson
andAddress
) - Relationship table:
Transfer
(betweenAccount
andAccount
)
The DDL commands are provided in the ddl
directory. Copy-paste the ./ddl/*.cypher
file
into their respective interfaces to populate the data in the required Kùzu database.
If you want to also run the SQL queries in DuckDB, you can do so by running the ./ddl/insert_data_duckdb.sql
file in the DuckDB CLI.
The resulting graph from this dataset has interesting structures, and is small enough to visualize all at once in Kùzu explorer. You can get the below visualization in Kùzu Explorer with the following query:
MATCH (a)-[b]->(c)
RETURN *
LIMIT 200
Write the above query in the shell panel of Kùzu Explorer and click the green play button to execute it. Kùzu Explorer will then display the results as a graph visualization.
The goal of this workshop is to write Cypher queries to answer the questions provided below. Along the way, we will visualize all query results in Kùzu Explorer and gain a deeper understanding of the data.
Query | Description |
---|---|
1a | Find the name of the Person node p1 whose email property is [email protected] Hint: Specify an explicit pattern in your MATCH clause that respects the schema, use a WHERE predicate to filter the target person by their email , and then RETURN the name of the person. The predicates can be put both in WHERE and inside the node pattern. Once done, you'll find that the name of the person p1 is "George". |
1b | Find all Account nodes, a1 owned by George. Hint: Extend the previous query with a longer MATCH pattern that connects/joins the Person node to its Account nodes that it Owns . |
1c | Find the names of all Persons who have made a direct Transfer to the Account owned by George. Hint: Extend the previous query to find all the Account nodes a2 that made a Transfer to George's accounts and the Person node that Owns a2 . |
2 | Find all possible direct or indirect Transfer flows/paths from an account owned by George to to an account owned by [email protected] (This person's name is "Edward"). For indirect transfers, only find those up to length 5. Hint: Specify variable-length or recursive relationships in Cypher using the Kleene star operator * followed by the min and max length for the paths. |
3a | Find a shortest Transfer path from George's account to Edward's account. Hint: Kùzu's Cypher dialect has a native clause to match a single shortest path. |
3b | Find all shortest Transfer paths from George's account to Edward's account. Hint: Kùzu's Cypher dialect has a native clause to match all shortest paths. |
4 | Find all direct and indirect connections of any type between the Person nodes representing George and Edward up to length 5. We are searching for any possible paths, i.e., the labels of the edges do not have to be only Owns or Transfer ; they can also be LivesIn as well. That is, the path between the two people can consist of any sequence of any labels. Further, we also do not need the connections to be in a particular direction. Hint: Use Cypher's flexible relationship matching using multiple labels or any labels. Also use the undirected relationship patterns for the -[:Transfer]- to avoid getting an empty result. |
5 | a) Find the account that has the highest number of incoming transactions. Hint: Use group by and aggregate to count of incoming edges. For reference, all possible aggregate functions are here. b) Find an important account that has received the most dollars. Hint: Do a group by and aggregate to sum of the amounts on the incoming edges. |
6 | Find the accounts that are the "most central" using a graph algorithm. We will use the notion of highest “betweenness centrality” (BC). Note: This part will be done in Python via the NetworkX library. |
Note
Betweenness centrality is a measure of the number of shortest paths that pass through a node. It is calculated as the number of shortest paths that pass through a node divided by the total number of shortest paths between all pairs of nodes.
The SQL queries that are possible to write for each corresponding Cypher query are provided in the sql
directory.
it is not required to run these queries for the workshop -- they are provided for reference.