Optimisation for large scale NBLAST #20
Comments
Hi Greg, thanks for reaching out. I'm not quite at feature parity yet but I think it's approaching being of some use! Implemented:
Experimental and/or untested:
This implementation does check both of those boxes. My strategy has been to construct an "arena" which stores one query tree per neuron (along with their tangents and so on). As each tree is built once and never mutated, it can make use of the R* tree's bulk loading for improved query performance, and the layout of that tree is optimised for that point cloud. When a point cloud is added to the arena, the library returns an index to it, and any subsequent calls just use that index to minimise the amount of data being passed around. Because this lives entirely within rust, you don't have to cross the border at any point in the midst of an NxM query. Even if you're making multiple dips between them (as I do sometimes for progress checking purposes), you have to pass in very little data for each request. This tactic could feasibly get prohibitive in the RAM cost of storing all of the trees, but I haven't noticed much of an issue so far on my laptop - if I get round to compiling this to webassembly for use in the frontend it may be more pressing.
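The arena idea described above can be sketched in a few lines. This is only an illustration of the index-handle pattern, not the nblast-rs API: the class name `Arena`, its methods, and the brute-force nearest-neighbour search (standing in for a bulk-loaded R* tree) are all assumptions, and the value returned is a plain mean nearest-neighbour distance rather than an NBLAST score.

```python
import math


class Arena:
    """Hypothetical arena: stores each point cloud once, hands back an index."""

    def __init__(self):
        self._clouds = []  # one immutable point cloud per neuron

    def add_points(self, points):
        """Store a point cloud and return its index (the only handle callers keep)."""
        self._clouds.append(tuple(tuple(float(c) for c in p) for p in points))
        return len(self._clouds) - 1

    def mean_nn_distance(self, query_idx, target_idx):
        """Mean distance from each query point to its nearest target point.

        Brute force is used here; a real implementation would query a
        spatial tree built once when the target was added.
        """
        query = self._clouds[query_idx]
        target = self._clouds[target_idx]
        total = sum(min(math.dist(q, t) for t in target) for q in query)
        return total / len(query)

    def all_by_all(self):
        """N x N matrix computed without passing any point data back to the caller."""
        n = len(self._clouds)
        return [[self.mean_nn_distance(q, t) for t in range(n)] for q in range(n)]


arena = Arena()
a = arena.add_points([(0, 0, 0), (1, 0, 0)])
b = arena.add_points([(0, 0, 1)])
matrix = arena.all_by_all()  # callers only ever exchange small integers and results
```

The point of the pattern is that after `add_points`, every later request carries two integers rather than two point clouds, which is what keeps the cost of crossing the language boundary low.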
This isn't something I've looked at - as you have to iterate through every point in the query anyway, I don't have much of an intuition for how a spatial index over the query would be helpful. Anyway, some numbers. I was using the kcs20 data set for all-by-all queries, and ChaMARCM-F000586_seg002 to FruMARCM-F000085_seg001 for single queries. On my laptop with 16 cores, 32GB RAM, benchmarking the rust directly:
"geom" standing for geometric mean forward/backward normalisation, "norm" being self-hit normalisation. Counterintuitively, "rstarpt_construction_with_tangents" means the tangents are passed in, where "rstarpt_construction" calculates the tangents internally. In real-world usage, I ran an all-by-all from python against the 2500ish larval brain neurons, resampled to 1um (mean 481 points). The resampling, R* construction and tangent calculation was pretty quick, even in serial. The all-by-all query took 6 minutes when parallelised over 16 cores, using under 1GB of RAM total. RAM usage for the trees should be linear in N, and compute time (plus RAM usage for the results) quadratic. I haven't looked into any alternative spatial tree implementations yet, which is certainly a possibility going forward. It's currently using 64-bit floats internally but I can test against a lower bit depth fairly easily (a bit more work to make it generic). I'd be happy to give it a go against one of your big neuron lists if you could give me the resampled dotprops objects (parquet format, maybe?), smat, and reference raw scores to check against. I can also put together an ipython notebook or similar with an example (until I get some actual python documentation up). |
Thanks, @clbarnes! That If I understand correctly Here's some benchmark data that is probably a bit more representative – 250 largish neurons (order 1K nodes). It takes about 16s on my laptop (6 cores).
|
|
Using 6 threads, fib250 all-by-all is taking between 16 and 25 seconds (mainly 18-19s), over 90% of which is spent in the spatial query library. Any gains from the improved tree layout, caching the spatial tree, lower-level parallelisation, and avoiding bounces between high and low level languages are more than wiped out by a slower spatial query. |
#21 took it down to ~14s using 6 threads (thanks @aschampion !). |
A batch of optimizations giving another ~1.4x speedup beyond that is in Stoeoef/rstar#35.
I have a very messy version doing something like this. It uses the separate R* trees we already built for the query and target neurons and does a DFS through the query R* tree. At each query node, it prunes every candidate node of the target R* tree whose min bounding box distance is greater than the min max distance from the query bounding box to the target bounding boxes. (That is, the min over target bounding boxes of the max over the query bounding box's corners of the usual R* tree nearest neighbor min max pruning metric.) Leaf query nodes then only need to start their search from the target candidates of their parent bounding box node. This gives another small speedup of about 1.3x in random R* trees (it will take some additional work to try it in nblast itself):
I know the description is vague, but does that sound like what you had in mind here @jefferis? |
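To make the pruning rule a little more concrete, here is a minimal sketch of one pruning step for a single query bounding box against a list of target bounding boxes. It is not the rstar code: the function names are hypothetical, boxes are plain `(lo, hi)` tuples, all distances are squared, and the min max metric is assumed to be the standard MINMAXDIST from the R-tree nearest neighbour literature.

```python
from itertools import product


def min_dist2_box_box(a, b):
    """Squared minimum distance between two axis-aligned boxes given as (lo, hi)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    total = 0.0
    for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi):
        gap = max(bl - ah, al - bh, 0.0)  # zero when the boxes overlap on this axis
        total += gap * gap
    return total


def minmax_dist2_point_box(p, box):
    """Squared MINMAXDIST: upper bound on the distance from p to the nearest
    object inside box, assuming every face of the box touches some object."""
    lo, hi = box
    near = [l if x <= (l + h) / 2 else h for x, l, h in zip(p, lo, hi)]
    far = [l if x >= (l + h) / 2 else h for x, l, h in zip(p, lo, hi)]
    far_sq = [(x - f) ** 2 for x, f in zip(p, far)]
    far_total = sum(far_sq)
    # For each axis k: nearer face along k, farther face along every other axis.
    return min(far_total - fs + (x - n) ** 2 for x, n, fs in zip(p, near, far_sq))


def prune_targets(query_box, target_boxes):
    """Keep only target boxes that could still hold a nearest neighbour
    of some point in query_box, following the rule described above."""
    if not target_boxes:
        return []
    corners = list(product(*zip(*query_box)))
    bound = min(
        max(minmax_dist2_point_box(c, t) for c in corners) for t in target_boxes
    )
    return [t for t in target_boxes if min_dist2_box_box(query_box, t) <= bound]


query = ((0.0, 0.0), (1.0, 1.0))
near_box = ((2.0, 0.0), (3.0, 1.0))
far_box = ((10.0, 10.0), (11.0, 11.0))
survivors = prune_targets(query, [near_box, far_box])  # far_box is discarded
```

In a full traversal the surviving list would be passed down to the children of the query node, so that each level starts from a smaller candidate set than its parent.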
The algorithm I described is PR'd as Stoeoef/rstar#41. |
Responding to @jefferis comment at rstar here:
No worries, I don't think this is urgent for anyone. I just wanted to PR it before it went stale in my brain.
I am likewise underwhelmed, but the benchmark numbers I've given so far are just for random trees in rstar and not yet the neuron data from nblast-rs. There are still incremental improvements to the algorithm possible, such as smarter choice of which subtrees to expand for potential pruning (i.e. should it be those with maximum minimum distance to the query node so their outliers can be discarded earlier, or those providing the minimum pruning distance so that the pruning distance can be tighter), but those would be small changes in performance. This was mostly a one-to-two-day distraction, so I've not spent time closely investigating the algorithm's behavior.
My intuition is that the density and distribution properties of neuron data will perform better under these optimizations than the random point cloud data from the benchmarks, since the tree structure is better segregated and thus more can be pruned (and more otherwise redundant computation avoided). However, I doubt that will lead to order of magnitude changes and is likely to be another small linear scaling factor change.
Once I got rid of allocation overhead in a previous PR, on last dtrace profiling it's spending the vast majority of time in distance calculations as one would expect of an efficient implementation. The best way to improve that would be SIMD optimization, which is not yet as stabilized in Rust as in C++, so the C++ lib underlying your R implementation may be making up for a less optimal algorithm with more optimal linalg. The way Another overarching limit to our current performance could be that our trees aren't actually that big (they could fit in L2), there's just a lot of them. Hence improvements like this that can change some O factors for individual trees don't have huge practical improvements for us because the constant factors are still outside those loops (and thus aren't reduced) and the overhead of tree construction, etc., becomes a larger and larger proportion of the remaining wall time. |
Also to note if this did become a bottleneck, there's lots of interesting lit on GPU algorithms for point cloud kNN. |
At this point I would try to benchmark actual performance with a large number of parallel threads/processes for jobs taking >= several minutes. I don't have super parallel speedup right now (between 4-12 threads are useful from laptop to largeish server) because the standard parallel implementation backend is based on fork and somehow some objects in memory don't seem to get shared between processes. Then you can quite quickly get into issues with swapping etc. In an (NxN) all by all NBLAST I would be very surprised if tree construction ends up being a major fraction of compute time once N>a few hundred. However I have long been meaning to optimise memory issues by changing the pattern in which I work for matrices with thousands of neurons. This would be to do blocks of neurons rather than whole columns or rows. |
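The quadratic growth of the all-by-all step can be put into rough numbers using the figures reported elsewhere in this thread (roughly 2500 neurons, 6 minutes of wall time, 16 cores). The helper names below are made up for illustration, and the extrapolation assumes similar neuron sizes, the same hardware, and perfectly quadratic scaling, so treat the result as an order-of-magnitude estimate only.

```python
def pairwise_cost(n_neurons, wall_seconds, n_cores):
    """Return (wall seconds per pair, core seconds per pair) for an N x N run."""
    pairs = n_neurons ** 2
    wall_per_pair = wall_seconds / pairs
    return wall_per_pair, wall_per_pair * n_cores


def extrapolate_wall_seconds(n_new, n_ref, wall_ref_seconds):
    """Scale a reference all-by-all timing to a new N, assuming cost ~ N^2."""
    return wall_ref_seconds * (n_new / n_ref) ** 2


# Reference run from the thread: ~2500 neurons, 6 minutes, 16 cores.
wall_per_pair, core_per_pair = pairwise_cost(2500, 6 * 60, 16)

# The N > 10K use case is 4x the neurons, so 16x the comparisons.
estimate_minutes = extrapolate_wall_seconds(10_000, 2500, 6 * 60) / 60
```

Under those assumptions each pairwise comparison costs tens of microseconds of wall time, and going from 2500 to 10,000 neurons multiplies the run time by 16.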
@clbarnes I assume you'll run this on fib250 once
Good point, a typical cache-oblivious recursive blocking should be easy here. |
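A minimal sketch of what recursive blocking of the score matrix could look like is below. It is an assumption about the shape of the solution rather than anything implemented in the thread: `recursive_tiles` and `max_side` are hypothetical names, and the explicit cutoff makes this a tuned variant of the idea (a strictly cache-oblivious version would recurse without a size parameter).

```python
def recursive_tiles(row_start, row_end, col_start, col_end, max_side):
    """Yield (row_start, row_end, col_start, col_end) tiles covering the
    half-open rectangle, splitting the longer side until both fit max_side."""
    if max_side < 1:
        raise ValueError("max_side must be at least 1")
    n_rows = row_end - row_start
    n_cols = col_end - col_start
    if n_rows <= 0 or n_cols <= 0:
        return
    if n_rows <= max_side and n_cols <= max_side:
        yield (row_start, row_end, col_start, col_end)
        return
    if n_rows >= n_cols:
        mid = row_start + n_rows // 2
        yield from recursive_tiles(row_start, mid, col_start, col_end, max_side)
        yield from recursive_tiles(mid, row_end, col_start, col_end, max_side)
    else:
        mid = col_start + n_cols // 2
        yield from recursive_tiles(row_start, row_end, col_start, mid, max_side)
        yield from recursive_tiles(row_start, row_end, mid, col_end, max_side)


# Each tile only needs the query neurons in its rows and the targets in its
# columns to be resident, instead of a whole row or column of the matrix.
tiles = list(recursive_tiles(0, 5, 0, 7, 2))
```

Because consecutive tiles share either their rows or their columns, the neurons touched by one tile are likely still in memory (or cache) when the next tile is processed.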
Notes from our brief discussion last week to keep this issue in sync:
|
Worth noting that there is now a rust implementation of nabo (the ANN library used by the reference NBLAST implementation), plus a fork of it supporting different boundary conditions, as well as a different kNN implementation which claims to beat it (although the gains may come from parallelisation, which may not help very much in our case). Worth investigating. |
This might be interesting too: https://github.com/cavemanloverboy/FNNTW Written in Rust and pretty new. Benchmarks look good but I haven't tried it myself. |
Should have been clearer, that was the project I linked to under "different kNN implementation" above! |
My bad - Should have checked myself. Funny that we came across it independently though 😄 |
No significant speedups using the rust nabo implementation, unfortunately. There's probably some low-hanging fruit for optimisations in that implementation as it does quite a lot of extra array copies to cope with different lifetime restrictions in the two implementations. Fnntw is even more awkward lifetime-wise but also puts a lot of work into optimising build time with very minor gains in query time, so may not be much use to us anyway. On the plus side, javascript bindings to webassembly are working! |
Good to know! The kiddo v2 seems interesting too; annoying that rstar, kiddo v2, and bosque haven't ended up in a table together yet... |
Hi @clbarnes,
@schlegelp and @NikDrummond reminded me about this effort today. This is not really an issue, but I figure the discussion may as well live with the repo. I was curious about where you had got with optimisation and whether your progress is likely to be of interest for us.
The main use case that causes us trouble at the moment is when we are dealing with N>10K neurons with a median of 500 nodes each in "dotprops" format. My analysis is that, because the number of comparisons scales as N^2, the costs of the preprocessing steps (making the dotprops objects etc., which scale as ~N) are largely irrelevant (and can often be cached).
Therefore the area to optimise is the actual comparison step. The R implementation typically spends 75+% of its time inside the nabor::knn function. This wraps an efficient C++ nearest neighbour library. There are a few possible ways that things could be improved. A really efficient implementation could potentially achieve a ~2x speedup if the build could be made one time only (1/3 of 75% = 25% of total) and the fat associated with the calling R code be removed (25% of total). There could be an additional speed gain if the points could be converted to 32 bit or even 16 bit ints.
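The ~2x estimate is an Amdahl-style calculation, spelled out below. The function name and parameters are made up for illustration, and the sketch assumes the best case in which a one-time build makes the per-comparison build cost negligible and the R calling overhead is removed entirely.

```python
def estimated_speedup(knn_fraction=0.75, build_share_of_knn=1 / 3, r_overhead=0.25):
    """Speedup if the per-call build and the R overhead both drop to zero.

    knn_fraction:       share of total run time spent inside the kNN call
    build_share_of_knn: share of that kNN time spent building the search structure
    r_overhead:         share of total run time spent in the calling R code
    """
    build_fraction = knn_fraction * build_share_of_knn  # 1/3 of 75% = 25% of total
    remaining = 1.0 - build_fraction - r_overhead       # the actual queries
    return 1.0 / remaining


speedup = estimated_speedup()
```

With the figures quoted above, half of the run time remains after both savings, which is where the factor of about two comes from. Any gain from narrower coordinate types would come on top of this and is not modelled here.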