Call tree analysis with Neo4j #2957
Conversation
* Bytecode indexes are part of the relationships.
* A method can interact with others multiple times, with each bytecode index indicating the origin of the call within the origin method.
* Bytecode indexes moved to the virtual relationships.
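To make the relationship-property idea above concrete, here is a minimal Cypher sketch, assuming hypothetical names (a Method label, a CALLS relationship type, and methodId/bytecodeIndexes properties) that are not necessarily the ones used by the generated script:

MERGE (caller:Method {methodId: 1})
MERGE (callee:Method {methodId: 2})
// One relationship per caller/callee pair; the bytecode indexes of the
// individual call sites are collected into a list property on it.
MERGE (caller)-[c:CALLS]->(callee)
SET c.bytecodeIndexes = [5, 42];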
p.s. Just got passed a link to the article Import 10M Stack Overflow Questions into Neo4j In Just 3 Minutes. I'll look into that to see if the import can be faster :)
Thanks for pinging me! I'll have a look at your Cypher.
So, a couple of things. First of all, I find the use of so many labels quite irritating. I think the intention of @conker84 behind this is to make use of the implicit index on different labels. Maybe it is because the import he described in the medium post should be generic in the end, I don't know. Furthermore - and that's what is making your import dead slow - the label is by far not unique. Have a close look at this piece:
The label is always exactly So when creating the relationships after having imported all the nodes, you always do a full node scan when matching on the label and the To fix this, you need to create a unique index
However, the import will fail then as it turns out that the used id is not unique.
with
This will create the nodes with the given pattern if they do not yet exist, otherwise just match them, and the properties will be overwritten. Whether that is acceptable I don't know and cannot judge here, but if it is, this is the way forward. With the changes in place, the script runs against 4.0 community edition on my machine in about 9 seconds. If your use case is still fulfilled, I recommend not using the naming at all but doing this:
If you need further help with the uniqueness of method nodes and their properties, just give me a sign.
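As a hedged illustration of the MERGE-based import described in the comments above (the actual snippets from the discussion are not reproduced in this capture; the Method label and the methodId, name and signature properties are assumptions):

// Create the node only if no node with this methodId exists yet;
// otherwise match the existing node. Either way the properties are (over)written.
MERGE (m:Method {methodId: 1234})
SET m.name = 'main',
    m.signature = 'HelloWorld.main(java.lang.String[])';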
Thx for the hints @michael-simons. Having read the link above about the import command with CSV entries, is that a better approach compared with the current cypher script approach for importing huge amounts of nodes and relationships?
Hi. I don't think so. The major difference is in what I said about the uniqueness of labels and indices. If you don't use any indices and do a full store scan, then both imports will be slow. The CSV one is probably a tad faster, but less convenient than just piping it through Cypher shell. I would run your thing on a bigger example and see how the difference evolves.
On the other hand, CSV is a universal format, so exporting the data in CSV format would give users more flexibility; it might still be worth considering.
If you like, attach a CSV file with the same rows and properties as your Cypher. I can help you with the necessary import statements.
@zakkak @michael-simons and everyone else interested, I've started a discussion on CSV vs Cypher on the graalvm slack channel, see here. That'll give us a more dynamic place to share ideas and experiment results without overloading this PR with that info.
Hi everybody, as @michael-simons said, that article had the goal to show the performance improvements that we had in the
@michael-simons @zakkak I've experimented with CSV and created a branch with it. A sample report folder can be found here. I can see what @michael-simons meant about the inconvenience of CSV: to load all those CSV files, you have to copy them into the Neo4j installation's import folder, and only then can they be loaded with the cypher in there. The result is more cumbersome to load into neo4j compared to having it all in cypher and then just piping it, where you don't need to start copying files around. Performance wise, it took ~3m30s with the CSV method, whereas my original approach was taking 4m. This still feels like a lot for not that much data: ~8k method entries and ~27k relationships. I've tried different heap sizes (up to 16gb) and it didn't make any difference. I also tried different
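For readers who have not used the CSV route mentioned above, a minimal sketch of such an import, assuming the files have already been copied into Neo4j's import folder; the file name and column names are made up for illustration and are not necessarily those used in the linked branch:

// methods.csv is assumed to have Id, Name and Signature columns.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///methods.csv' AS row
MERGE (m:Method {methodId: toInteger(row.Id)})
SET m.name = row.Name, m.signature = row.Signature;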
@galderz adding
(as suggested by @michael-simons in #2957 (comment)) Before
After
@zakkak Oh wow 😀. I'll explore that option further in both approaches and see what it looks like. Also @zakkak, any thoughts on a preference for cypher or csv? I think cypher, although less universal, is much easier to use with neo4j compared with csv (see my previous comment). I'll tidy up this branch, which uses cypher, in the next few days.
@zakkak Oh and thx for trying things!
Although I agree that having to copy the files to neo4j's import folder is a bit cumbersome, I personally think that CSV is the right (more generic) approach to follow since it makes it easier to:
Sidenote: In my experiments I use the neo4j container image and mount the directory with the csv files using
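# Start Neo4j in a container, mounting the directory with the generated files as its import folder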
podman run --rm --name=graal-neo4j \
--user $(id -u):$(id -u) --userns=keep-id \
-p7474:7474 -p7687:7687 \
-e NEO4J_AUTH=neo4j/s3cr3t \
-v /home/zakkak/Downloads/reports:/var/lib/neo4j/import:z \
neo4j
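# Load the generated cypher script with cypher-shell inside the running container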
podman exec -ti graal-neo4j \
bin/cypher-shell \
-u neo4j -p s3cr3t \
-f /var/lib/neo4j/import/cypher_call_tree_helloworld_20201211_112253.cypher
@zakkak Good points! Just added 2 constraints (for vm and method id) in the csv branch and the time went down to 2.5 secs 🚀🚀🚀 |
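The two constraints mentioned above would look roughly like the following; the VM and Method labels and the vmId/methodId property names are assumptions made here for illustration (this is the constraint syntax accepted by Neo4j 4.0):

CREATE CONSTRAINT ON (v:VM) ASSERT v.vmId IS UNIQUE;
CREATE CONSTRAINT ON (m:Method) ASSERT m.methodId IS UNIQUE;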
I've done more testing of loading times:
I'm doing further improvements to the functionality:
Finally, the type of cypher queries that would be useful here would be:
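The original list of queries is not captured here, but as one hypothetical example of the kind of query such a graph enables (labels, relationship type and property names are assumptions):

// Why is this method reachable? Show call paths of up to 10 hops
// from the entry point to the method of interest.
MATCH p = (entry:Method {name: 'main'})-[:CALLS*..10]->(target:Method)
WHERE target.signature CONTAINS 'String.format'
RETURN p
LIMIT 25;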
And thx @michael-simons for neo4j browser and query tips!
TBH, I only gave some tips. The browser is done by a couple of great colleagues.
Closing this PR. A new version of this PR is available here.
In order to improve debugging of analysis issues, I've been exploring producing neo4j graphs instead of text based call graphs. This PR is an effort to do that and should be considered a draft.
Functionality wise, it should be complete and equivalent to the existing call tree output. It includes:
* A Method node for each distinct direct, virtual and overridden method.
* Method includes the method name and the complete signature.
The idea is to produce a cypher script that would load the equivalent of the existing call tree printer into a neo4j db. This adds no extra dependencies; it just produces an equivalent text file that can be loaded into neo4j via:
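The exact command from the original description is not shown in this capture; based on the cypher-shell usage elsewhere in the thread, it would look roughly like this (file name and credentials are placeholders):

cat call_tree_hello_world.cypher | bin/cypher-shell -u neo4j -p s3cr3t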
It uses the techniques explained in this article to efficiently batch data and take advantage of script caching. In spite of these efforts, loading data is not very fast right now. The hello world example has a total of ~8k method nodes (8869 to be precise, including virtual ones) and ~27k relationships. Loading that takes ~4m on my home server. For reference, my home server takes 12s to generate a hello world native image. I'm hoping @michael-simons or other neo4j experts (e.g. @DavideD) can help shape this into performing better.
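As a rough sketch of the batching idea referred to above (the parameter name and node properties are assumptions; the generated script may structure its batches differently), data is passed as a parameter and unwound, so the query text stays constant and its execution plan can be cached:

:param rows => [{id: 1, name: 'main'}, {id: 2, name: 'println'}]
UNWIND $rows AS row
MERGE (m:Method {methodId: row.id})
SET m.name = row.name;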
Attaching a sample neo4j graph and a rough screenshot of the area it'd represent. You can find the full cypher output for hello world here.
Implementation wise, it simply reuses the logic in CallTreePrinter to generate the graph and it uses the types referenced there.
TODO: