Performance on high-dimensional arrays #765
Update: you can replace the assignment with linear indexing.
In this case, you get linear scaling (although about 3x slower than Matlab):
julia> test2()
julia> test3()
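For concreteness, here is a minimal sketch of what such a linear-indexing assignment could look like. It is not the code from the comment (which isn't shown here); the function name is made up, and it assumes the tile dimensions come first so that each tile occupies a contiguous block of TA in column-major order:

```julia
# Hypothetical sketch: replace the 6-d indexed assignment
#     TA[:, :, :, j1, j2, j3] = tmp
# with writes through a single linear index into the contiguous destination block.
function assign_tile_linear!(TA, tmp, j1, j2, j3)
    nblock = size(TA, 1) * size(TA, 2) * size(TA, 3)          # elements per tile
    blk = (j1 - 1) + size(TA, 4) * ((j2 - 1) + size(TA, 5) * (j3 - 1))
    offset = blk * nblock                                      # zero-based start of this tile
    for k = 1:nblock
        TA[offset + k] = tmp[k]                                # linear indexing only
    end
    return TA
end
```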
I think there are some inner loops where we have special cases for ndims<=3, which probably explains this.
Also, I think just dropping the [:] from the assignment should help.
Independently of the particular application, I raised this because it indicates that there must be some issue with indexing that makes its performance worse on higher-dimensional arrays (at least on assign, maybe not on ref). For the 3d/6d case, it's a 13-fold gap with Matlab, which I'd argue is something we should fix someday. For this specific case, one could indeed construct a matrix of pointers and then allocate a big storage block for the "inner" arrays, and use the new wrap-pointer-as-array functionality, so you'd have an array of small arrays all backed by one contiguous block of storage. I don't yet understand how gen_cartesian_map works, but presumably the answer lies there. Eventually I'll spend some time poking around there, but it may be a bit---grant-writing duties are calling, insistently. But at least the problem is documented.
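A rough sketch of that "one big block plus wrapped pointers" idea follows. It is not the elided code from the comment; it uses today's unsafe_wrap spelling rather than the 2012 wrap-pointer-as-array function, and the sizes and names are made up:

```julia
# Illustrative only: one contiguous storage block, with each tile wrapped as a
# small Array that aliases part of it (no copying). Writing through a tile
# writes into `storage`; `storage` must stay alive while the tiles are in use.
tilesz = (8, 8, 8)
ntiles = 64
storage = zeros(Float64, prod(tilesz) * ntiles)
tiles = [unsafe_wrap(Array, pointer(storage, (k - 1) * prod(tilesz) + 1), tilesz)
         for k = 1:ntiles]
tiles[5][2, 3, 1] = 1.0      # lands directly in the big block
```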
You're right, getting rid of the [:] shaved a bunch off the time:
julia> test1()
julia> test2()
julia> test3()
This is close enough to Matlab for my tastes. So it's only the case where linear indexing is not used that needs to concern us.
Oh, I totally agree that this is a performance problem that needs to be fixed. I just decided to talk about your particular application as well.
Also, issue #190 should help close the gap with Matlab.
Good point! And thanks for suggesting a workaround; now that I understand your intention, it seems like a fine idea.
Another update. I realized that some of the iterators I've been defining for images/grids might be applicable here. I uploaded "fsaiter.jl" (warning: may not work for general ranges, I have not tried) and a new version of the test script that uses it, "test_tarray_iter.jl". These are both in that same gist. Timing results:
julia> test1()
julia> test2()
julia> test3()
So it's faster in 6d, about the same in 4d, and slower for 2d. I'd be interested in hearing your insights about the performance, and why it's not faster. On average each "next" call should be just 2 adds and one compare, and if you include the increment and compare in the for loop, it's 3 adds and 2 compares. I have a hard time seeing how one could do it in much less (in the general case, that is). Is it the array references, e.g., iter.inc[idim]?
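For readers following along, here is a stripped-down sketch of that kind of iterator. It is not the fsaiter.jl code (which isn't shown here), the names are made up, and it uses present-day syntax; it does show why the common-case step is two adds and one compare:

```julia
# Sketch only: walk the linear indices of A[ranges...] in column-major order.
# pos/len are per-dimension counters; inc[d] is the precomputed jump in the
# linear index when dimension d advances and dimensions 1..d-1 reset.
function subiter_setup(dims, ranges)
    N = length(dims)
    strd = ones(Int, N)
    for d = 2:N
        strd[d] = strd[d-1] * dims[d-1]           # column-major strides
    end
    len = Int[length(r) for r in ranges]
    inc = similar(strd)
    for d = 1:N
        back = 0
        for j = 1:d-1
            back += (len[j] - 1) * strd[j]         # undo the lower dimensions
        end
        inc[d] = strd[d] - back
    end
    index = 1
    for d = 1:N
        index += (first(ranges[d]) - 1) * strd[d]  # linear index of the first element
    end
    return ones(Int, N), len, inc, index
end

function subiter_next!(pos, len, inc, index)
    for d = 1:length(pos)
        if pos[d] < len[d]
            pos[d] += 1                 # one add ...
            return index + inc[d]       # ... one more add, after one compare
        end
        pos[d] = 1                      # dimension d exhausted: carry to d+1
    end
    return 0                            # no elements left
end

# Usage: sum the elements of A[2:3, 1:3] by walking the iterator.
function sum_sub(A, ranges)
    pos, len, inc, idx = subiter_setup(size(A), ranges)
    s = zero(eltype(A))
    for _ = 1:prod(len)
        s += A[idx]
        idx = subiter_next!(pos, len, inc, idx)
    end
    return s
end

A = reshape(collect(1.0:12.0), 4, 3)
sum_sub(A, (2:3, 1:3)) == sum(A[2:3, 1:3])   # true
```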
Of course, it could also be the allocation of the ForwardSubarrayIterator...
...or if the call to next is not inlined...
OK, another update. Since I don't know of a way to test in Julia whether a function is inlined or what the compiled code looks like, to make sure I was testing just the core algorithm itself I wrote the "assign" statement in C. The code is in the gist, https://gist.github.com/2500065. I compiled it into a shared library ("libfsaiter"). Testing:
julia> load("test_tarray_iter.jl")
julia> libfsaiter = dlopen("libfsaiter")
julia> fsaiter_assign = dlsym(libfsaiter, :fsaiter_assign)
julia> test1()
julia> test2()
julia> test3()
I haven't had time to look into why test1() is slower, but presumably that's a simple fix. Aside from this case, this is getting into the ballpark of Matlab performance, even without any real optimization (using memcpy where possible, etc.). At the moment this is still merely a "study"---if there is a way to code this in Julia which gets us the performance we need, so much the better. But if not, then we might want to consider specialized ref/assign functions for these common cases (when indices are ints or ranges, rather than arbitrary vectors). That's it for today :-).
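The C signature of fsaiter_assign isn't shown here, so as a generic illustration of the same ccall mechanism (and of the memcpy idea mentioned above), here is a small stand-alone sketch in present-day syntax; everything in it is a stand-in, not the gist's code:

```julia
# Stand-in only: copy one Float64 buffer into another by calling libc's memcpy
# directly through ccall (in 2012 Julia, Cvoid was spelled Void).
src = collect(1.0:8.0)
dst = zeros(8)
nbytes = length(src) * sizeof(Float64)
ccall(:memcpy, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), dst, src, nbytes)
dst == src   # true
```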
Oh, one more thought: I don't know Python, but if someone does and is interested in this issue, it might be instructive to see how NumPy performs. If it's very good, we might learn something from their code. |
OK, in the gist you'll find fsaiter2.jl, which is getting close to something I could imagine committing to array.jl and becoming the main workhorse for ref & assign for Arrays. This is clearly a big change, and to keep continuity with this thread I'm asking for comments here even before I turn this into a pull request. Before getting serious about committing, it needs more validity and performance testing. However, on this test case it seems to work, it's fast, and in any event it is a complete implementation of the best design I can think of. Currently this is a pure-Julia implementation, although as shown above there might be some advantages in making the "next" function a ccall. Features:
Performance testing:
julia> load("test_tarray_iter.jl")
julia> test1()
julia> test2()
julia> test3()
Even using memcpy, it's a little slower than Matlab, but it is uniformly faster than the implementation in array.jl (by a factor of 10 in the case of test3). There may also be more optimization possible; for example, I noticed that min & max had not been defined on ranges, and doing so (see the top of fsaiter2.jl) led to a huge speedup for the test1 case, where min & max were actually taking most of the time. My main remaining concern is simply the allocation and construction of the ForwardSubarrayIterator object, which might lead to noticeable overhead for small arrays. (It's clearly a win for larger arrays.) You can see that the design strategy was to do computation at construction time, so that "next" is simple and very fast (faster yet if implemented as a ccall). One other point: is it possible/desirable to have an "assignref" function that handles the case where the right-hand side is itself an indexing expression, so the temporary never gets created?
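On the min & max point: for a range the extrema are just the endpoints, so they can be computed without scanning the elements, which is presumably where the speedup comes from. A sketch of the idea with made-up names (today's Julia already specializes minimum/maximum for ranges):

```julia
# Illustrative only: O(1) extrema for a range, instead of scanning every element.
range_min(r) = step(r) > 0 ? first(r) : last(r)
range_max(r) = step(r) > 0 ? last(r) : first(r)

range_min(3:10), range_max(3:10)         # (3, 10)
range_min(10:-2:1), range_max(10:-2:1)   # (2, 10)
```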
If you use
The last thing you mention is a special case of stream fusion. It's probably the easiest special case to cover, though I'm not sure of the overall impact on real code--usually I have other operations on the RHS. YMMV, of course, and there may be some benefit to implementing it just to see how it goes.
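To make the fusion idea concrete, here is the simplest possible version of such an assignref, for vectors only. The name and scope are illustrative; the real payoff would come from doing the same for general N-d range indexing:

```julia
# Sketch of the fused form of A[I] = B[J]: copy element by element, so the
# temporary array B[J] is never materialized.
function assignref!(A::AbstractVector, I, B::AbstractVector, J)
    length(I) == length(J) || error("index sets must have equal length")
    for k = 1:length(I)
        A[I[k]] = B[J[k]]
    end
    return A
end

A = zeros(10); B = collect(1.0:10.0)
assignref!(A, 2:4, B, 8:10)   # same result as A[2:4] = B[8:10], without the temporary
```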
Hmm, in attempting to insert this into array.jl, I'm discovering issues with the bootstrap process. It's not clear whether these will be resolvable. Until I know more, I am removing the RFC tag from the title of this issue.
Very nice, Tim! We could probably do a quick check to see how many elements are being copied, and use ForwardSubarrayIterator over some threshold, and also avoid memcpy when min and max for Ranges can be checked in right away. I think
While I'm working on this, I pushed "arrayperf.jl" to test/, with the notion that we'll want to track performance across a range of operations. One interesting point: it reveals a curious asymmetry between the performance of ref and assign for dimensions 5 and above.
I pushed a new branch, "arrays," containing the first steps towards incorporating this. Right now the changes are minimal, and cover only a small subset of cases (basically, 1d and 2d). Yet in at least a few of these covered cases, the changes provide substantial speed improvements. You can test this by running JULIAHOME/test/arrayperf.jl (for performance) and JULIAHOME/test/arrayops.jl (for correctness). The recommended procedure is to run the tests from a build from master first, then switch to the arrays branch, rebuild, and then re-run the tests. I'm in no hurry to merge this, so anyone interested in testing it might want to just switch to the "arrays" branch and do normal work, and see what breaks :-). If we get to the point where we can work in the arrays branch without noticing problems, then it might be worth considering a merge. With regards to extending this to higher dimensions, the biggest issue is balancing the overhead of the general algorithm (a modified version of what's in the gist's fsaiter2.jl) against the complexity of providing specializations.
What remains to be done on this branch before it can be merged? If there are performance improvements to be gained, let's chase them down and close this! I am happy to help to the extent possible.
Let's get the 2d case done first, and take the nd case later.
I'm sorry this has taken me so long. With the demands of my real job, I'm trying to discipline myself to no more than 1-2 hours of Julia programming per day (this is surprisingly difficult; I love this stuff...), and I had about a month where I couldn't allocate any time to Julia. So this hasn't gone quickly, for which I apologize. Here's the basic outline of the current status:
I really appreciate your offer to help out, and would be glad to accept anything you can do. My next step was going to be to write a specialized ref along these lines. And all the work I've done for ref also has to be done for assign. I suspect that we may not need so many special cases for up to 4 dimensions; these might have been added to work around the shortcomings of the current approach.
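The ref alluded to above isn't shown here, but for the 2-d, ranges-only case a specialization of this kind looks roughly as follows ("ref" is what getindex was called at the time; the function name is made up):

```julia
# Sketch only: a getindex ("ref") specialized for two ranges. The column offset
# is computed once per column, so the inner loop is a straight strided copy.
function ref_ranges(A::AbstractMatrix, I::UnitRange{Int}, J::UnitRange{Int})
    B = similar(A, length(I), length(J))
    m = size(A, 1)
    jb = 0
    for j in J
        jb += 1
        offA = (j - 1) * m               # hoisted out of the inner loop
        ib = 0
        for i in I
            ib += 1
            B[ib, jb] = A[offA + i]      # linear index into A
        end
    end
    return B
end

A = reshape(collect(1.0:20.0), 4, 5)
ref_ranges(A, 2:3, 1:4) == A[2:3, 1:4]   # true
```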
OK, finally some real progress to report. Executive summary:
Here's the theory:
Now, A[i1,i2,i3] gets converted to a linear index indx, and then A[indx] is accessed. An obvious problem is that this conversion has to be redone from scratch on every pass through the inner loop, despite the fact that i2 and i3 aren't changing. So, I just committed code that hoists the parts of the index computation that depend only on the outer indices out of the inner loop.
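The committed code isn't reproduced here, but the idea can be sketched. The standard column-major conversion is indx = i1 + (i2-1)*size(A,1) + (i3-1)*size(A,1)*size(A,2), and the fix is to hoist the pieces that depend only on i2 and i3:

```julia
# Illustrative sketch (not the committed code): naive vs. hoisted index
# computation for walking a 3-d array A in column-major order.
function touch_naive(A)
    s = zero(eltype(A))
    for i3 = 1:size(A, 3), i2 = 1:size(A, 2), i1 = 1:size(A, 1)
        indx = i1 + (i2 - 1) * size(A, 1) + (i3 - 1) * size(A, 1) * size(A, 2)
        s += A[indx]                      # full recomputation on every iteration
    end
    return s
end

function touch_hoisted(A)
    s = zero(eltype(A))
    m, n = size(A, 1), size(A, 2)
    for i3 = 1:size(A, 3)
        off3 = (i3 - 1) * m * n           # depends only on i3
        for i2 = 1:n
            off2 = off3 + (i2 - 1) * m    # depends only on i2 and i3
            for i1 = 1:m
                s += A[off2 + i1]         # a single add in the inner loop
            end
        end
    end
    return s
end
```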
Testing:
The changes are:
I think this is a step forward. Of course, I am very open to suggestions for better implementations than I've provided.
Another update:
Great work!
Merged. Since this resolves the issue about high-dimensional arrays, I'm closing this issue. #190 is still worth implementing, so I'll leave that open.
I've been playing around with tiled arrays---these might be useful for optimizing matrix multiplication (e.g., http://tinyurl.com/7g6uq5c), and are definitely useful for the "bricked" image processing I am implementing (https://github.com/JuliaLang/julia/wiki/Computer-Vision). I discovered what seems to be a pretty serious performance problem with Julia's multidimensional arrays (though hopefully it will not be difficult to fix).
I've created a gist to store the code:
git://gist.github.com/2500065.git
The code takes a 1-, 2-, or 3-dimensional array and splits it up into tiles, creating a 2-, 4-, or 6-dimensional array, respectively. The sizes of the arrays are chosen so that the total number of elements, and the number of tiles, is the same in all three cases. Therefore, time spent looping, etc., should be well-matched for all 3 choices of dimension.
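The gist's code isn't reproduced here, but a rough sketch of what the 3-d case is doing (tile dimensions first; the function name and tile size are made up) looks like this:

```julia
# Rough sketch only, not the gist's code: split a 3-d array A into t×t×t tiles,
# producing a 6-d array whose first three dimensions index within a tile and
# whose last three index the tile grid. Assumes size(A) is divisible by t.
function tile3(A, t)
    n1, n2, n3 = size(A)
    TA = similar(A, t, t, t, div(n1, t), div(n2, t), div(n3, t))
    for j3 = 1:div(n3, t), j2 = 1:div(n2, t), j1 = 1:div(n1, t)
        tmp = A[(j1-1)*t+1:j1*t, (j2-1)*t+1:j2*t, (j3-1)*t+1:j3*t]   # the "ref"
        TA[:, :, :, j1, j2, j3] = tmp                                # the "assign"
    end
    return TA
end

A = rand(8, 8, 8)
TA = tile3(A, 4)    # a 4×4×4×2×2×2 array
```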
The summary is that Matlab gives the expected near-constant time across all three cases, but Julia's execution time grows quickly with dimensionality.
Timing results in Julia:
julia> load("test_tarray.jl")
julia> test1()
elapsed time: 0.34828901290893555 seconds
elapsed time: 0.3329780101776123 seconds
elapsed time: 0.33298707008361816 seconds
elapsed time: 0.333798885345459 seconds
julia> test2()
elapsed time: 1.0149970054626465 seconds
elapsed time: 0.9985160827636719 seconds
elapsed time: 0.9988420009613037 seconds
elapsed time: 0.998661994934082 seconds
julia> test3()
elapsed time: 4.635593891143799 seconds
elapsed time: 4.619307994842529 seconds
elapsed time: 4.612743139266968 seconds
elapsed time: 4.629526853561401 seconds
Timing results in Matlab (started with -singleCompThread):
Now here's the funny part: if you comment out the assignment lines in tile1, tile2, and tile3, here is what you get:
julia> load("testtarray.jl") # with the TA[...] = tmp assignments commented out
julia> test1()
elapsed time: 0.19736981391906738 seconds
elapsed time: 0.17229199409484863 seconds
elapsed time: 0.1692979335784912 seconds
elapsed time: 0.16942596435546875 seconds
julia> test2()
elapsed time: 0.24941301345825195 seconds
elapsed time: 0.2330920696258545 seconds
elapsed time: 0.22670793533325195 seconds
elapsed time: 0.22788786888122559 seconds
julia> test3()
elapsed time: 0.2933318614959717 seconds
elapsed time: 0.2685282230377197 seconds
elapsed time: 0.2673368453979492 seconds
elapsed time: 0.2678871154785156 seconds
So assignment is taking the vast majority of the time. The weird thing about this is that the chunk of memory that is being assigned to is contiguous---it's the "ref" operation that should be problematic from a cache perspective. This makes me think it's a simple bug.