
Call tree analysis with Neo4j #2957

Closed — galderz wanted to merge 9 commits from the t_call_tree_cypher_v2 branch

Conversation

@galderz (Contributor) commented Nov 2, 2020

In order to improve debugging of analysis issues, I've been exploring producing Neo4j graphs instead of text-based call graphs. This PR is an effort to do that and should be considered a draft.

Functionality-wise, it should be complete and equivalent to the existing call tree output. It includes:

  • A VM entry point node.
  • A Method node for each distinct direct, virtual and overridden method.
  • Each Method includes the method name and the complete signature.
  • Entry relationships between VM and entry method nodes.
  • Direct and virtual relationships between method nodes include bytecode index information.
  • Bytecode index information expresses the possibility of a method calling another one from different locations within the origin method.
  • Overridden relationships between virtual and override methods. These relationships have no bytecode index information.
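To make the model concrete, here is a small hand-written Cypher sketch of the kind of graph described above. The method names, signatures, bytecode indices and the ENTRY relationship type are made up for illustration; only DIRECT, VIRTUAL and OVERRIDDEN_BY appear in the actual output discussed below.

```cypher
// Hypothetical slice of the call tree graph: the VM entry point,
// an entry method, a direct and a virtual call (with bytecode
// index, bci), and an overridden-by relationship (no bci).
CREATE (vm:VM {name: 'VM'})
CREATE (main:Method {name: 'main', signature: 'HelloWorld.main(String[]):void'})
CREATE (println:Method {name: 'println', signature: 'PrintStream.println(String):void'})
CREATE (toString:Method {name: 'toString', signature: 'Object.toString():String'})
CREATE (impl:Method {name: 'toString', signature: 'String.toString():String'})
CREATE (vm)-[:ENTRY]->(main)
CREATE (main)-[:DIRECT {bci: '5'}]->(println)
CREATE (main)-[:VIRTUAL {bci: '9'}]->(toString)
CREATE (toString)-[:OVERRIDDEN_BY]->(impl);
```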

The idea is to produce a Cypher script that would load the equivalent of the existing call tree printer output into a Neo4j db. This adds no extra dependencies; it just produces an equivalent text file that can be loaded into Neo4j via:

cat reports/call_tree_helloworld_20201102_115920.cypher | /opt/neo4j/bin/cypher-shell -u neo4j -p neo

It uses the techniques explained in this article to efficiently batch data and take advantage of script caching. In spite of these efforts, loading data is not very fast right now. The hello world example has a total of ~8k method nodes (8869 to be precise, including virtual ones) and ~27k relationships. Loading that takes ~4m on my home server. For reference, my home server takes 12s to generate a hello world native image. I'm hoping @michael-simons or other neo4j experts (e.g. @DavideD) can help shape this into performing better.

Attaching a sample neo4j graph and a rough screenshot of the area it'd represent. You can find the full cypher output for hello world here.

Implementation-wise, it simply reuses the logic in CallTreePrinter to generate the graph, and it uses the types referenced there.

TODO:

  • Decide on the granularity of elements on Methods. Right now each has a name and signature, but the signature could be split into class name, parameters and returns, and each of the types there could be nodes themselves, linked together. The Cypher would need to load faster for that to be usable.
  • Improve performance.
  • Decide on configuration options to get the output. Right now it just piggybacks on the print analysis call option.

(Attached screenshots: Screen Shot 2020-11-02 at 12 06 28, Screen Shot 2020-11-02 at 12 06 10)

@galderz (Contributor, Author) commented Nov 2, 2020

p.s. Just got passed a link to the article Import 10M Stack Overflow Questions into Neo4j In Just 3 Minutes. I'll look into that to see if import can be faster :)

@michael-simons

Thanks for pinging me! I'll have a look at your Cypher.

@michael-simons

So, a couple of things. First of all, I find the use of so many labels quite irritating. I think the intention of @conker84 behind this is to make use of the implicit index on different labels. Maybe it is because the import he described in the Medium post should be generic in the end; I don't know. Furthermore - and that's what is making your import dead slow - the label is far from unique.

Have a close look at this piece:

:begin
UNWIND $rows as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id})
  SET n += row.properties SET n:Method;
:commit

The label is always exactly UNIQUE IMPORT LABEL. You cannot create dynamic labels based on parameters in Neo4j, neither in Browser nor in Shell.

So when creating the relationships after having imported all the nodes, you always do a full node scan when matching on the label and the UNIQUE IMPORT ID property, since the label is not unique and no index exists for the property.

To fix this, you need to create a uniqueness constraint:

CREATE CONSTRAINT unique_import_id ON (n:`UNIQUE IMPORT LABEL`) ASSERT n.`UNIQUE IMPORT ID` IS UNIQUE;

However, the import will then fail, as it turns out that the id used is not unique.
If it describes the same method calls, it is safe to replace the following blocks

:begin
UNWIND $rows as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id})
  SET n += row.properties SET n:Method;
:commit

with

:begin
UNWIND $rows as row
MERGE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id})
  SET n += row.properties SET n:Method;
:commit

This will create the nodes with the given pattern if they do not yet exist; otherwise it will just match them, and the properties will be overwritten. Whether that is the case I don't know and cannot judge here, but if it is, this is the way forward.

With the changes in place, the script runs against 4.0 community edition on my machine in about 9 seconds.

If your use case is still fulfilled, I recommend not using that naming at all but doing this:

  1. Assume this unique constraint. You could, for example, put the code for it in a comment atop the file:
CREATE CONSTRAINT unique_method_id ON (n:Method) ASSERT n._id IS UNIQUE;
  2. Do this for the VM node:
:param rows => [{_id:-1, properties:{name:'VM'}}]
:begin
UNWIND $rows as row
MERGE (n:Method{_id: row._id})
  SET n += row.properties SET n:VM;
:commit
  3. Do this for all the method nodes:
:begin
UNWIND $rows as row
MERGE (n:Method{_id: row._id})
  SET n += row.properties;
:commit
  4. Do this for the relationships:
:begin
UNWIND $rows as row
MATCH (start:Method {_id: row.start._id})
MATCH (end:Method {_id: row.end._id})
CREATE (start)-[r:OVERRIDDEN_BY]->(end);
:commit

If you need further help with the uniqueness of method nodes and their properties, just give me a sign.

@galderz (Contributor, Author) commented Nov 2, 2020

Thx for the hints @michael-simons. Having read the link above about the import command with CSV entries: is that a better approach than the current Cypher script approach for importing huge amounts of nodes and relationships?

@michael-simons

Hi. I don't think so. The major difference is in what I said about uniqueness of labels and indices. If you don't use any indices and do a full store scan, then both imports will be slow.

Probably the CSV one is a tad faster, but less convenient than just piping it through Cypher shell.

I would run your thing on a bigger example and see how the difference evolves.

@zakkak (Collaborator) commented Nov 2, 2020

Probably the CSV one is a tad faster, but less convenient than just piping it through Cypher shell.

On the other hand, CSV is a universal format, so exporting the data as CSV would give users more flexibility; it might still be worth considering.

@michael-simons

If you like, attach a CSV file with the same rows and properties as your Cypher. I can help you with the necessary import statements.

@galderz (Contributor, Author) commented Nov 3, 2020

@zakkak @michael-simons and everyone else interested, I've started a discussion on CSV vs Cypher on the graalvm slack channel, see here. That'll give us a more dynamic place to share ideas and experiment results without overloading this PR with that info.

@conker84 commented Nov 3, 2020

Hi everybody. As @michael-simons said, that article's goal was to show the performance improvements we made in the apoc.export.cypher procedure, while here the code seems handwritten. IMHO Michael covered every possible area of improvement in @galderz's script.

@galderz (Contributor, Author) commented Dec 11, 2020

@michael-simons @zakkak I've experimented with CSV and created a branch with it. A sample report folder can be found here. I can see what @michael-simons meant about the inconvenience of CSV. To load all that CSV, you have to copy the CSV files to the Neo4j installation's import folder, and then it can be loaded with the cypher in there. The result is more cumbersome to load into Neo4j compared to having it all in Cypher and just piping it, where you don't need to start copying files around.

Performance-wise, it took ~3m30s with the CSV method, whereas my original approach was taking 4m. This still feels like a lot for not that much data: ~8k method entries and 27k relationships. I've tried different heap sizes (up to 16gb) and it didn't make any difference. I also tried different USING PERIODIC COMMIT values (up to 4k) and that didn't change much either. The neo4j blog post here barely has details on the env (all it says is that it's a regular laptop with an SSD), but it seems to take roughly the same time to load several orders of magnitude more data (31m nodes, 77m relationships).
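For context, the CSV variant goes through LOAD CSV from Neo4j's import folder. A minimal sketch of such a load statement follows; the file name and column names here are assumptions, not the actual branch's output:

```cypher
// Hypothetical sketch of importing method nodes from a CSV file placed
// in Neo4j's import folder; commits in batches of 1000 rows.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///call_tree_methods.csv' AS row
MERGE (m:Method {methodId: toInteger(row.methodId)})
SET m.name = row.name, m.signature = row.signature;
```

Without an index or uniqueness constraint on methodId, the MERGE in each row still does a full scan, which is the same pitfall as in the Cypher script approach.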

@zakkak (Collaborator) commented Jan 7, 2021

@galderz adding

CREATE CONSTRAINT unique_method_id ON (m:Method) ASSERT m.methodId IS UNIQUE;

(as suggested by @michael-simons in #2957 (comment))
at the beginning of cypher_call_tree_helloworld_20201211_112253.cypher brought the import from 4m30s down to 7s for me.

Before

$ cypher-shell -u neo4j -p neo4j -f cypher_call_tree_helloworld_20201211_112253.cypher
+----------+
| count(v) |
+----------+
| 1        |
+----------+

1 row available after 196 ms, consumed after another 123 ms
Added 1 nodes, Set 2 properties, Added 1 labels
+----------+
| count(m) |
+----------+
| 7509     |
+----------+

1 row available after 73 ms, consumed after another 13227 ms
Added 7509 nodes, Set 37545 properties, Added 7509 labels
+----------+
| count(m) |
+----------+
| 1354     |
+----------+

1 row available after 94 ms, consumed after another 6463 ms
Added 1354 nodes, Set 6770 properties, Added 1354 labels
+----------+
| count(*) |
+----------+
| 544      |
+----------+

1 row available after 174 ms, consumed after another 2300 ms
Created 544 relationships
+----------+
| count(*) |
+----------+
| 18554    |
+----------+

1 row available after 80 ms, consumed after another 173587 ms
Created 18554 relationships, Set 18554 properties
+----------+
| count(*) |
+----------+
| 2819     |
+----------+

1 row available after 72 ms, consumed after another 24848 ms
Created 2819 relationships
+----------+
| count(*) |
+----------+
| 5533     |
+----------+

1 row available after 66 ms, consumed after another 47594 ms
Created 5533 relationships, Set 5533 properties

After

$ cypher-shell -u neo4j -p neo4j -f cypher_call_tree_helloworld_20201211_112253.cypher
0 rows available after 508 ms, consumed after another 0 ms
Added 1 constraints
+----------+
| count(v) |
+----------+
| 1        |
+----------+

1 row available after 225 ms, consumed after another 36 ms
Added 1 nodes, Set 2 properties, Added 1 labels
+----------+
| count(m) |
+----------+
| 7509     |
+----------+

1 row available after 119 ms, consumed after another 771 ms
Added 7509 nodes, Set 37545 properties, Added 7509 labels
+----------+
| count(m) |
+----------+
| 1354     |
+----------+

1 row available after 81 ms, consumed after another 125 ms
Added 1354 nodes, Set 6770 properties, Added 1354 labels
+----------+
| count(*) |
+----------+
| 544      |
+----------+

1 row available after 173 ms, consumed after another 229 ms
Created 544 relationships
+----------+
| count(*) |
+----------+
| 18554    |
+----------+

1 row available after 76 ms, consumed after another 1240 ms
Created 18554 relationships, Set 18554 properties
+----------+
| count(*) |
+----------+
| 2819     |
+----------+

1 row available after 60 ms, consumed after another 109 ms
Created 2819 relationships
+----------+
| count(*) |
+----------+
| 5533     |
+----------+

1 row available after 80 ms, consumed after another 235 ms
Created 5533 relationships, Set 5533 properties

@galderz (Contributor, Author) commented Jan 13, 2021

@zakkak Oh wow 😀 . I'll explore that option further in both approaches and see what it looks like.

Also @zakkak, any thoughts on a preference for Cypher or CSV? I think Cypher, although less universal, is much easier to use with Neo4j compared with CSV (see my previous comment). I'll tidy up this branch, which uses Cypher, in the next few days.

@galderz (Contributor, Author) commented Jan 13, 2021

@zakkak Oh and thx for trying things!

@zakkak (Collaborator) commented Jan 13, 2021

Also @zakkak, any thoughts on a preference for Cypher or CSV? I think Cypher, although less universal, is much easier to use with Neo4j compared with CSV (see my previous comment). I'll tidy up this branch, which uses Cypher, in the next few days.

Although I agree that having to copy the files to neo4j's import folder is a bit cumbersome, I personally think that CSV is the right (more generic) approach to follow since it makes it easier to:

  1. manipulate the output (even with shell commands) before importing it to a visualization tool
  2. import the data in different tools
  3. extend GraalVM to support more tools for visualization/exploration

Sidenote

In my experiments I use the neo4j container image and mount the directory with the csv files using -v to avoid copying files (and having to purge the database).

  1. To start the server I use:
podman run --rm --name=graal-neo4j \
    --user $(id -u):$(id -u) --userns=keep-id \
    -p7474:7474 -p7687:7687 \
    -e NEO4J_AUTH=neo4j/s3cr3t \
    -v /home/zakkak/Downloads/reports:/var/lib/neo4j/import:z \
    neo4j
  2. To import the data I use:
podman exec -ti graal-neo4j \
    bin/cypher-shell \
    -u neo4j -p s3cr3t \
    -f /var/lib/neo4j/import/cypher_call_tree_helloworld_20201211_112253.cypher
  3. To explore the graph I visit http://localhost:7474/

@galderz (Contributor, Author) commented Jan 14, 2021

@zakkak Good points! Just added 2 constraints (for vm and method id) in the csv branch and the time went down to 2.5 secs 🚀🚀🚀

@galderz (Contributor, Author) commented Jan 14, 2021

I've done more testing of loading times:

  • Quarkus getting started quickstart: 5.5 seconds
  • Infinispan Quarkus server: 18 seconds (~150k nodes, 450k relationships)

I'm doing further improvements to the functionality:

  1. Separate package into package and className.
  2. Add a display-friendly package name field. For, say, package org.graalvm... it'd show o.g.....
  3. Then, that can be combined with class name and method name in the neo4j css to show all that info in the node right away.
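Such a display-friendly field could also be derived in Cypher once the split fields exist. A rough sketch follows; the package, className and name property names are assumptions from the list above, not fields in the current output:

```cypher
// Hypothetical: build a shortened display name, e.g. 'o.g.Foo.bar',
// by abbreviating each package segment to its first letter.
MATCH (m:Method) WHERE m.package IS NOT NULL
SET m.display = reduce(acc = '', seg IN split(m.package, '.') | acc + left(seg, 1) + '.')
                + m.className + '.' + m.name;
```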

Finally, the type of cypher queries that would be useful here would be:

  1. Find a method by name together with all its incoming relationships. This is to find out what brings a given method into the universe:
match (m:Method {name: "cloneKey"}) <- [r] - (o) return *
  2. Find a method's incoming relationships 2 layers deep:
match (m:Method {name: "cloneKey"}) <- [r*1..2] - (o) return *
  3. Find a method with its incoming relationships, filtered to only its DIRECT and VIRTUAL incoming relationships:
match (m:Method {name: "cloneKey"}) <- [:DIRECT|:VIRTUAL] - (o) return *
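One more query along those lines that might be handy (a suggestion, not from the PR): tracing a single chain of calls from the VM entry point down to a given method, which answers "why is this method in the image?" in one path instead of expanding layer by layer:

```cypher
// Hypothetical: one shortest chain of relationships from the VM entry
// point node to a given method.
match p = shortestPath((vm:VM) - [*] -> (m:Method {name: "cloneKey"}))
return p
```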

@galderz (Contributor, Author) commented Jan 14, 2021

And thx @michael-simons for neo4j browser and query tips!

@michael-simons

TBH, I only gave some tips. The browser is done by a couple of great colleagues.

@galderz (Contributor, Author) commented Jan 18, 2021

Closing this PR. A new version of this PR is available here.

@galderz galderz closed this Jan 18, 2021
@galderz galderz deleted the t_call_tree_cypher_v2 branch October 21, 2022 09:56