
Refactor indexes creation #335

Closed

Yomguithereal opened this issue Jul 8, 2014 · 5 comments

@Yomguithereal
Collaborator

On huge graphs (e.g. 3,000 nodes, 50,000 edges), the initial graph processing is way too slow, and the bottleneck is the creation of the neighbour indexes.

@apitts
Contributor

apitts commented Feb 13, 2017

@Yomguithereal just wondering if you have done any performance testing on the above? I just experimented with changing sigma's three NeighborIndexes to use only the id rather than the full edge and could not see a significant improvement in either memory or timing with a ~20,000-edge graph. I noticed, though, that you are using only ids in graphology... so I'm curious what your experience has been, if any?
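
For concreteness, this is roughly the kind of change I tried; the index shape below is a simplification for illustration, not sigma's exact internals:

```js
// Simplified illustration only, not sigma's exact internals: a neighbor index
// keyed by source node id, then target node id, then edge id.
var neighborIndex = {};

function indexEdge(edge) {
  neighborIndex[edge.source] = neighborIndex[edge.source] || {};
  neighborIndex[edge.source][edge.target] =
    neighborIndex[edge.source][edge.target] || {};

  // Before: keep a reference to the full edge object in the index.
  // neighborIndex[edge.source][edge.target][edge.id] = edge;

  // After: keep only the id and look the edge up in the main edge store when needed.
  neighborIndex[edge.source][edge.target][edge.id] = edge.id;
}
```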

@Yomguithereal
Collaborator Author

One of the major issues with sigma's indexing is that it stores three huge indices built on nested objects: one undirected index mapping nodeA -> nodeB (plus nodeB -> nodeA) -> edge, plus two directed ones (in & out). This means memory usage that is far beyond reasonable and that is, what's more, very costly to write. graphology uses several strategies to optimize this, such as storing the indices directly in the nodes' register. I don't believe that using ids instead of references does much to improve performance by itself, even if it might give a boost on some JavaScript engines.
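
To make the shape concrete, here is a rough sketch of what those three nested indices amount to (schematic, not sigma's literal code):

```js
// Schematic only, not sigma's literal code: three nested indices, each mapping
// node id -> node id -> edge id -> edge, so every edge gets written several
// times and every nesting level allocates its own object.
var inIndex = {};  // target -> source -> edges pointing at the target
var outIndex = {}; // source -> target -> edges leaving the source
var allIndex = {}; // undirected view: both directions for every edge

function indexEdge(edge) {
  addToIndex(outIndex, edge.source, edge.target, edge);
  addToIndex(inIndex, edge.target, edge.source, edge);
  addToIndex(allIndex, edge.source, edge.target, edge);
  addToIndex(allIndex, edge.target, edge.source, edge);
}

function addToIndex(index, a, b, edge) {
  index[a] = index[a] || {};
  index[a][b] = index[a][b] || {};
  index[a][b][edge.id] = edge;
}
```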

However, if you really want to see performance drop when instantiating sigma, try a graph of 50k nodes and 2M edges: you should see the browser hang synchronously for a long time while sigma loads the graph.

On a side note, do you know of any way other than this one to index a graph's structure (i.e. to be able to retrieve the relevant edges & neighbors in constant time)?

@apitts
Contributor

apitts commented Feb 14, 2017

Thanks for those comments @Yomguithereal! I agree that more efficient data structures than those currently used by sigma.js should certainly be possible. I have spent a bit of time reviewing the indices in graphology so far and it looks promising.

Your last question is a very interesting one. I cannot claim to be an expert on the topic, but I think framing it in terms of sparse matrices is possible and potentially helpful. We have an adjacency matrix that, for large networks, is usually sparse. The question is how to store and access it efficiently. Sigma.js effectively takes a list-of-lists (LIL) approach to the sparse matrix problem (albeit with some duplication of data). The alternative would be something like CSR (compressed sparse row) format, which would in most instances be more space efficient but come at the cost of access speed relative to LIL. That said, there appears to be some interesting work on hashtables vs CSR, e.g. https://pdfs.semanticscholar.org/e284/4574e260ec83437e95c133c0dc56469f8357.pdf. Not sure that going down that path is necessarily a good idea though...
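
Just to make the contrast concrete, here is a toy comparison of the two layouts (illustrative only, not a proposal for sigma's actual code):

```js
// Toy example: the same 4-node directed graph with edges 0->1, 0->2 and 2->3.

// LIL: one list of neighbors per node. Cheap to build and to mutate, but each
// row is its own array, which costs memory and pointer chasing.
var lil = [
  [1, 2], // neighbors of node 0
  [],     // neighbors of node 1
  [3],    // neighbors of node 2
  []      // neighbors of node 3
];

// CSR: two flat typed arrays. Node i's neighbors live in colIdx between
// rowPtr[i] and rowPtr[i + 1]. Very compact, but costly to update once built.
var rowPtr = new Uint32Array([0, 2, 2, 3, 3]);
var colIdx = new Uint32Array([1, 2, 3]);

function csrNeighbors(i) {
  return Array.prototype.slice.call(colIdx, rowPtr[i], rowPtr[i + 1]);
}
```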

@Yomguithereal
Collaborator Author

The structure used by graphology is actually a sparse matrix (sigma also uses sparse matrices, but you could say it uses three of them instead of one). Hashtables and CSR are interesting approaches, but I would not use them in graphology's reference implementation (in a variant implementation, why not?) because it must remain as fast as possible, keeping operations in constant time as often as possible.

@Yomguithereal
Collaborator Author

v2 now uses graphology, so this issue is obsolete and will be closed. For reference, graphology's standard implementation does not internally rely on a LIL or CSR but rather on a custom DoD (dictionary of dictionaries), since this is the best trade-off to get constant-time access for all operations without hurting memory too much. CSR and LIL just cannot offer the same all-around read performance, but they would be interesting backends for specialized graphology implementations.
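
Schematically, the DoD idea looks like this (a rough sketch of the general technique, not graphology's actual source):

```js
// Rough sketch of a dictionary-of-dictionaries adjacency, not graphology's
// actual source: each node entry carries its own neighbor maps, so adding an
// edge or listing a node's neighbors never requires scanning the whole graph.
var nodes = {};
var edges = {};

function addNode(id) {
  nodes[id] = {attributes: {}, out: {}, in: {}};
}

function addDirectedEdge(id, source, target) {
  edges[id] = {source: source, target: target, attributes: {}};

  // Register the edge id under both endpoints' neighbor maps.
  nodes[source].out[target] = nodes[source].out[target] || {};
  nodes[source].out[target][id] = true;

  nodes[target].in[source] = nodes[target].in[source] || {};
  nodes[target].in[source][id] = true;
}

addNode('a');
addNode('b');
addDirectedEdge('e1', 'a', 'b');
```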
