
Call tree analysis with Neo4j #2957

Closed — galderz wanted to merge 9 commits from the t_call_tree_cypher_v2 branch

Conversation

@galderz (Contributor) commented Nov 2, 2020

In order to improve debugging of analysis issues, I've been exploring producing Neo4j graphs instead of text-based call graphs. This PR is an effort to do that and should be considered a draft.

Functionality-wise, it should be complete and equivalent to the existing call tree output. It includes:

  • A VM entry point node.
  • A Method node for each distinct direct, virtual and overridden method.
  • Each Method includes the method name and the complete signature.
  • Entry relationships between VM and entry method nodes.
  • Direct and virtual relationships between method nodes include bytecode index information.
  • Bytecode index information expresses the possibility of a method calling another one from different locations within the origin method.
  • Overridden relationships between virtual and override methods. These relationships have no bytecode index information.
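To make the model concrete, here is a small hand-written Cypher sketch of the kind of graph described above. The method names, signatures, bytecode indices and the ENTRY relationship type are made up for illustration; only DIRECT, VIRTUAL and OVERRIDDEN_BY appear in the actual output discussed below.

```cypher
// Hypothetical slice of the call tree graph: the VM entry point,
// an entry method, a direct and a virtual call (with bytecode
// index, bci), and an overridden-by relationship (no bci).
CREATE (vm:VM {name: 'VM'})
CREATE (main:Method {name: 'main', signature: 'HelloWorld.main(String[]):void'})
CREATE (println:Method {name: 'println', signature: 'PrintStream.println(String):void'})
CREATE (toString:Method {name: 'toString', signature: 'Object.toString():String'})
CREATE (impl:Method {name: 'toString', signature: 'String.toString():String'})
CREATE (vm)-[:ENTRY]->(main)
CREATE (main)-[:DIRECT {bci: '5'}]->(println)
CREATE (main)-[:VIRTUAL {bci: '9'}]->(toString)
CREATE (toString)-[:OVERRIDDEN_BY]->(impl);
```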

The idea is to produce a Cypher script that would load the equivalent of the existing call tree printer output into a Neo4j db. This adds no extra dependencies; it just produces an equivalent text file that can be loaded into Neo4j via:

cat reports/call_tree_helloworld_20201102_115920.cypher | /opt/neo4j/bin/cypher-shell -u neo4j -p neo

It uses the techniques explained in this article to efficiently batch data and take advantage of script caching. In spite of these efforts, loading data is not very fast right now. The hello world example has a total of ~8k method nodes (8869 to be precise, including virtual ones) and ~27k relationships. Loading that takes ~4m on my home server. For reference, my home server takes 12s to generate a hello world native image. I'm hoping @michael-simons or other neo4j experts (e.g. @DavideD) can help shape this into performing better.

Attaching a sample neo4j graph and a rough screenshot of the area it'd represent. You can find the full cypher output for hello world here.

Implementation-wise, it simply reuses the logic in CallTreePrinter to generate the graph, and it uses the types referenced there.

TODO:

  • Decide on the granularity of elements on Methods. Right now each has a name and signature, but the signature could be split into class name, parameters and returns, and each of the types there could be nodes themselves, linked together. The Cypher would need to load faster for that to be usable.
  • Improve performance.
  • Decide on configuration options to get the output. Right now it just piggybacks on the print analysis call option.

(Attached screenshots: Screen Shot 2020-11-02 at 12 06 28, Screen Shot 2020-11-02 at 12 06 10)

@galderz (Contributor, Author) commented Nov 2, 2020

p.s. Just got passed a link to the article Import 10M Stack Overflow Questions into Neo4j In Just 3 Minutes. I'll look into that to see if import can be faster :)

@michael-simons

Thanks for pinging me! I'll have a look at your Cypher.

@michael-simons

So, a couple of things. First of all, I find the use of so many labels quite irritating. I think the intention of @conker84 behind this is to make use of the implicit index on different labels. Maybe it is because the import he described in the Medium post should be generic in the end; I don't know. Furthermore - and that's what is making your import dead slow - the label is far from unique.

Have a close look at this piece:

:begin
UNWIND $rows as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id})
  SET n += row.properties SET n:Method;
:commit

The label is always exactly UNIQUE IMPORT LABEL. You cannot create dynamic labels based on parameters in Neo4j, neither in Browser nor in Shell.

So when creating the relationships after having imported all the nodes, you always do a full node scan when matching on the label and the UNIQUE IMPORT ID property, since the label is not unique and no index exists for the property.

To fix this, you need to create a uniqueness constraint:

CREATE CONSTRAINT unique_import_id ON (n:`UNIQUE IMPORT LABEL`) ASSERT n.`UNIQUE IMPORT ID` IS UNIQUE;

However, the import will then fail, as it turns out that the id used is not unique.
If it describes the same method calls, it is safe to replace the following blocks

:begin
UNWIND $rows as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id})
  SET n += row.properties SET n:Method;
:commit

with

:begin
UNWIND $rows as row
MERGE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id})
  SET n += row.properties SET n:Method;
:commit

This will create the nodes with the given pattern if they do not yet exist; otherwise it will just match them, and the properties will be overwritten. Whether that is the case I don't know and cannot judge here, but if it is, this is the way forward.

With the changes in place, the script runs against 4.0 community edition on my machine in about 9 seconds.

If your use case is still fulfilled, I recommend not using that naming at all but doing this:

  1. Assume this unique constraint. You could, for example, put the code for it in a comment atop the file:
CREATE CONSTRAINT unique_method_id ON (n:Method) ASSERT n._id IS UNIQUE;
  2. Do this for the VM node:
:param rows => [{_id:-1, properties:{name:'VM'}}]
:begin
UNWIND $rows as row
MERGE (n:Method{_id: row._id})
  SET n += row.properties SET n:VM;
:commit
  3. Do this for all the method nodes:
:begin
UNWIND $rows as row
MERGE (n:Method{_id: row._id})
  SET n += row.properties;
:commit
  4. Do this for the relationships:
:begin
UNWIND $rows as row
MATCH (start:Method {_id: row.start._id})
MATCH (end:Method {_id: row.end._id})
CREATE (start)-[r:OVERRIDDEN_BY]->(end);
:commit

If you need further help with the uniqueness of method nodes and their properties, just give me a sign.

@galderz (Contributor, Author) commented Nov 2, 2020

Thx for the hints @michael-simons. Having read the link above about the import command with CSV entries: is that a better approach than the current Cypher script approach for importing huge amounts of nodes and relationships?

@michael-simons

Hi. I don't think so. The major difference is in what I said about uniqueness of labels and indices. If you don't use any indices and do a full store scan, then both imports will be slow.

Probably the CSV one is a tad faster, but less convenient than just piping it through Cypher shell.

I would run your thing on a bigger example and see how the difference evolves.

@zakkak (Collaborator) commented Nov 2, 2020

Probably the CSV one is a tad faster, but less convenient than just piping it through Cypher shell.

On the other hand, CSV is a universal format, so exporting the data as CSV would give users more flexibility; it might still be worth considering.

@michael-simons

If you like, attach a CSV file with the same rows and properties as your Cypher. I can help you with the necessary import statements.

@galderz (Contributor, Author) commented Nov 3, 2020

@zakkak @michael-simons and everyone else interested, I've started a discussion on CSV vs Cypher on the graalvm slack channel, see here. That'll give us a more dynamic place to share ideas and experiment results without overloading this PR with that info.

@conker84 commented Nov 3, 2020

Hi everybody. As @michael-simons said, that article's goal was to show the performance improvements we made in the apoc.export.cypher procedure, while here the code seems handwritten. IMHO Michael covered every possible area of improvement in @galderz's script.

@galderz (Contributor, Author) commented Dec 11, 2020

@michael-simons @zakkak I've experimented with CSV and created a branch with it. A sample report folder can be found here. I can see what @michael-simons meant about the inconvenience of CSV. To load all that CSV, you have to copy the CSV files to the Neo4j installation's import folder, and then it can be loaded with the cypher in there. The result is more cumbersome to load into Neo4j compared to having it all in Cypher and just piping it, where you don't need to start copying files around.

Performance-wise, it took ~3m30s with the CSV method, whereas my original approach was taking 4m. This still feels like a lot for not that much data: ~8k method entries and 27k relationships. I've tried different heap sizes (up to 16gb) and it didn't make any difference. I also tried different USING PERIODIC COMMIT values (up to 4k) and that didn't change much either. The neo4j blog post here barely has details on the env (all it says is that it's a regular laptop with an SSD), but it seems to take roughly the same time to load several orders of magnitude more data (31m nodes, 77m relationships).
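For context, the CSV variant goes through LOAD CSV from Neo4j's import folder. A minimal sketch of such a load statement follows; the file name and column names here are assumptions, not the actual branch's output:

```cypher
// Hypothetical sketch of importing method nodes from a CSV file placed
// in Neo4j's import folder; commits in batches of 1000 rows.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///call_tree_methods.csv' AS row
MERGE (m:Method {methodId: toInteger(row.methodId)})
SET m.name = row.name, m.signature = row.signature;
```

Without an index or uniqueness constraint on methodId, the MERGE in each row still does a full scan, which is the same pitfall as in the Cypher script approach.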

@zakkak (Collaborator) commented Jan 7, 2021

@galderz adding

CREATE CONSTRAINT unique_method_id ON (m:Method) ASSERT m.methodId IS UNIQUE;

(as suggested by @michael-simons in #2957 (comment))
at the beginning of cypher_call_tree_helloworld_20201211_112253.cypher brought the import from 4m30s down to 7s for me.

Before

$ cypher-shell -u neo4j -p neo4j -f cypher_call_tree_helloworld_20201211_112253.cypher
+----------+
| count(v) |
+----------+
| 1        |
+----------+

1 row available after 196 ms, consumed after another 123 ms
Added 1 nodes, Set 2 properties, Added 1 labels
+----------+
| count(m) |
+----------+
| 7509     |
+----------+

1 row available after 73 ms, consumed after another 13227 ms
Added 7509 nodes, Set 37545 properties, Added 7509 labels
+----------+
| count(m) |
+----------+
| 1354     |
+----------+

1 row available after 94 ms, consumed after another 6463 ms
Added 1354 nodes, Set 6770 properties, Added 1354 labels
+----------+
| count(*) |
+----------+
| 544      |
+----------+

1 row available after 174 ms, consumed after another 2300 ms
Created 544 relationships
+----------+
| count(*) |
+----------+
| 18554    |
+----------+

1 row available after 80 ms, consumed after another 173587 ms
Created 18554 relationships, Set 18554 properties
+----------+
| count(*) |
+----------+
| 2819     |
+----------+

1 row available after 72 ms, consumed after another 24848 ms
Created 2819 relationships
+----------+
| count(*) |
+----------+
| 5533     |
+----------+

1 row available after 66 ms, consumed after another 47594 ms
Created 5533 relationships, Set 5533 properties

After

$ cypher-shell -u neo4j -p neo4j -f cypher_call_tree_helloworld_20201211_112253.cypher
0 rows available after 508 ms, consumed after another 0 ms
Added 1 constraints
+----------+
| count(v) |
+----------+
| 1        |
+----------+

1 row available after 225 ms, consumed after another 36 ms
Added 1 nodes, Set 2 properties, Added 1 labels
+----------+
| count(m) |
+----------+
| 7509     |
+----------+

1 row available after 119 ms, consumed after another 771 ms
Added 7509 nodes, Set 37545 properties, Added 7509 labels
+----------+
| count(m) |
+----------+
| 1354     |
+----------+

1 row available after 81 ms, consumed after another 125 ms
Added 1354 nodes, Set 6770 properties, Added 1354 labels
+----------+
| count(*) |
+----------+
| 544      |
+----------+

1 row available after 173 ms, consumed after another 229 ms
Created 544 relationships
+----------+
| count(*) |
+----------+
| 18554    |
+----------+

1 row available after 76 ms, consumed after another 1240 ms
Created 18554 relationships, Set 18554 properties
+----------+
| count(*) |
+----------+
| 2819     |
+----------+

1 row available after 60 ms, consumed after another 109 ms
Created 2819 relationships
+----------+
| count(*) |
+----------+
| 5533     |
+----------+

1 row available after 80 ms, consumed after another 235 ms
Created 5533 relationships, Set 5533 properties

@galderz (Contributor, Author) commented Jan 13, 2021

@zakkak Oh wow 😀 . I'll explore that option further in both approaches and see what it looks like.

Also @zakkak, any thoughts on a preference for Cypher or CSV? I think Cypher, although less universal, is much easier to use with Neo4j compared with CSV (see my previous comment). I'll tidy up this branch, which uses Cypher, in the next few days.

@galderz (Contributor, Author) commented Jan 13, 2021

@zakkak Oh and thx for trying things!

@zakkak (Collaborator) commented Jan 13, 2021

Also @zakkak, any thoughts on a preference for Cypher or CSV? I think Cypher, although less universal, is much easier to use with Neo4j compared with CSV (see my previous comment). I'll tidy up this branch, which uses Cypher, in the next few days.

Although I agree that having to copy the files to neo4j's import folder is a bit cumbersome, I personally think that CSV is the right (more generic) approach to follow since it makes it easier to:

  1. manipulate the output (even with shell commands) before importing it to a visualization tool
  2. import the data in different tools
  3. extend GraalVM to support more tools for visualization/exploration

Sidenote

In my experiments I use the neo4j container image and mount the directory with the csv files using -v to avoid copying files (and having to purge the database).

  1. To start the server I use:
podman run --rm --name=graal-neo4j \
    --user $(id -u):$(id -u) --userns=keep-id \
    -p7474:7474 -p7687:7687 \
    -e NEO4J_AUTH=neo4j/s3cr3t \
    -v /home/zakkak/Downloads/reports:/var/lib/neo4j/import:z \
    neo4j
  2. To import the data I use:
podman exec -ti graal-neo4j \
    bin/cypher-shell \
    -u neo4j -p s3cr3t \
    -f /var/lib/neo4j/import/cypher_call_tree_helloworld_20201211_112253.cypher
  3. To explore the graph I visit http://localhost:7474/

@galderz (Contributor, Author) commented Jan 14, 2021

@zakkak Good points! Just added 2 constraints (for vm and method id) in the csv branch and the time went down to 2.5 secs 🚀🚀🚀

@galderz (Contributor, Author) commented Jan 14, 2021

I've done more testing of loading times:

  • Quarkus getting started quickstart: 5.5 seconds
  • Infinispan Quarkus server: 18 seconds (~150k nodes, 450k relationships)

I'm doing further improvements to the functionality:

  1. Separate package into package and className.
  2. Add a display-friendly package name field. For, say, package org.graalvm... it'd show o.g.....
  3. Then, that can be combined with class name and method name in the neo4j css to show all that info in the node right away.
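Such a display-friendly field could also be derived in Cypher once the split fields exist. A rough sketch follows; the package, className and name property names are assumptions from the list above, not fields in the current output:

```cypher
// Hypothetical: build a shortened display name, e.g. 'o.g.Foo.bar',
// by abbreviating each package segment to its first letter.
MATCH (m:Method) WHERE m.package IS NOT NULL
SET m.display = reduce(acc = '', seg IN split(m.package, '.') | acc + left(seg, 1) + '.')
                + m.className + '.' + m.name;
```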

Finally, the type of cypher queries that would be useful here would be:

  1. Find a method by name together with all its incoming relationships. This is to find out what brings a given method into the universe:
match (m:Method {name: "cloneKey"}) <- [r] - (o) return *
  2. Find a method's incoming relationships 2 layers deep:
match (m:Method {name: "cloneKey"}) <- [r*1..2] - (o) return *
  3. Find a method with its incoming relationships, filtered to only its DIRECT and VIRTUAL incoming relationships:
match (m:Method {name: "cloneKey"}) <- [:DIRECT|:VIRTUAL] - (o) return *
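One more query along those lines that might be handy (a suggestion, not from the PR): tracing a single chain of calls from the VM entry point down to a given method, which answers "why is this method in the image?" in one path instead of expanding layer by layer:

```cypher
// Hypothetical: one shortest chain of relationships from the VM entry
// point node to a given method.
match p = shortestPath((vm:VM) - [*] -> (m:Method {name: "cloneKey"}))
return p
```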

@galderz (Contributor, Author) commented Jan 14, 2021

And thx @michael-simons for neo4j browser and query tips!

@michael-simons

TBH, I only gave some tips. The browser is done by a couple of great colleagues.

@galderz (Contributor, Author) commented Jan 18, 2021

Closing this PR. A new version of this PR is available here.

@galderz galderz closed this Jan 18, 2021
@galderz galderz deleted the t_call_tree_cypher_v2 branch October 21, 2022 09:56