Performance on high-dimensional arrays #765

Closed · timholy opened this issue Apr 26, 2012 · 26 comments
Labels: performance (Must go faster)

@timholy (Member) commented Apr 26, 2012

I've been playing around with tiled arrays---these might be useful for optimizing matrix multiplication (e.g., http://tinyurl.com/7g6uq5c), and are definitely useful for the "bricked" image processing I am implementing (https://github.com/JuliaLang/julia/wiki/Computer-Vision). I discovered what seems to be a pretty serious performance problem with Julia's multidimensional arrays (though hopefully it will not be difficult to fix).

I've created a gist to store the code:
git://gist.github.com/2500065.git
The code takes a 1-, 2-, or 3-dimensional array and splits it into tiles, creating a 2-, 4-, or 6-dimensional array, respectively. The array sizes are chosen so that the total number of elements, and the number of tiles, is the same in all three cases; time spent looping, etc., should therefore be well matched across the three dimensionalities.
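For readers without the gist handy, here is a minimal sketch of the 3-d pattern in current Julia syntax (the names and the divisibility assumption are mine; the actual code is in the gist):

# Split a 3-d array into tiles of size tsz, producing a 6-d array indexed
# as TA[j1,j2,j3,i1,i2,i3]; assumes size(A) is divisible by tsz.
function tile3(A::Array{Float64,3}, tsz::NTuple{3,Int})
    n = map(div, size(A), tsz)            # number of tiles per dimension
    TA = Array{Float64}(undef, tsz..., n...)
    for i3 in 1:n[3], i2 in 1:n[2], i1 in 1:n[1]
        tmp = A[(i1-1)*tsz[1]+1:i1*tsz[1],
                (i2-1)*tsz[2]+1:i2*tsz[2],
                (i3-1)*tsz[3]+1:i3*tsz[3]]
        TA[:, :, :, i1, i2, i3] = tmp     # the assignment whose cost is at issue
    end
    return TA
end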

The summary is that Matlab gives the expected near-constant time on tasks for all 3 dimensions, but Julia's execution time grows quickly with dimensionality.

Timing results in Julia:

julia> load("test_tarray.jl")

julia> test1()
elapsed time: 0.34828901290893555 seconds
elapsed time: 0.3329780101776123 seconds
elapsed time: 0.33298707008361816 seconds
elapsed time: 0.333798885345459 seconds

julia> test2()
elapsed time: 1.0149970054626465 seconds
elapsed time: 0.9985160827636719 seconds
elapsed time: 0.9988420009613037 seconds
elapsed time: 0.998661994934082 seconds

julia> test3()
elapsed time: 4.635593891143799 seconds
elapsed time: 4.619307994842529 seconds
elapsed time: 4.612743139266968 seconds
elapsed time: 4.629526853561401 seconds

Timing results in Matlab (started with -singleCompThread):

test_tarray
One-dimensional tiling:
Elapsed time is 0.272575 seconds.
Elapsed time is 0.276140 seconds.
Elapsed time is 0.277794 seconds.
Elapsed time is 0.276627 seconds.
Two-dimensional tiling:
Elapsed time is 0.210037 seconds.
Elapsed time is 0.217636 seconds.
Elapsed time is 0.216659 seconds.
Elapsed time is 0.215701 seconds.
Three-dimensional tiling:
Elapsed time is 0.331167 seconds.
Elapsed time is 0.337266 seconds.
Elapsed time is 0.336038 seconds.
Elapsed time is 0.337088 seconds.

Now here's the funny part: if you comment out the assignment lines in tile1, tile2, and tile3, here is what you get:

julia> load("testtarray.jl") # with the TA[...] = tmp assignments commented out

julia> test1()
elapsed time: 0.19736981391906738 seconds
elapsed time: 0.17229199409484863 seconds
elapsed time: 0.1692979335784912 seconds
elapsed time: 0.16942596435546875 seconds

julia> test2()
elapsed time: 0.24941301345825195 seconds
elapsed time: 0.2330920696258545 seconds
elapsed time: 0.22670793533325195 seconds
elapsed time: 0.22788786888122559 seconds

julia> test3()
elapsed time: 0.2933318614959717 seconds
elapsed time: 0.2685282230377197 seconds
elapsed time: 0.2673368453979492 seconds
elapsed time: 0.2678871154785156 seconds

So assignment is taking the vast majority of the time. The weird thing about this is that the chunk of memory that is being assigned to is contiguous---it's the "ref" operation that should be problematic from a cache perspective. This makes me think it's a simple bug.

@timholy (Member Author) commented Apr 26, 2012

Update: you can replace the assignment with linear indexing, e.g.,

s = cumprod([tilesize...])   # s[end] == prod(tilesize), the number of elements per tile
# and later...
#            TA[:,:,:,i1,i2,i3] = tmp
            offset = sum(s .* [i1-1,i2-1,i3-1])   # starting linear offset from the tile coordinates
            TA[offset+1:offset+s[end]] = tmp[:]   # one contiguous linear-index assignment

In this case the execution time is nearly constant across dimensions (although about 3x slower than Matlab):
julia> test1()
elapsed time: 0.6524319648742676 seconds
elapsed time: 0.6320919990539551 seconds
elapsed time: 0.6342530250549316 seconds
elapsed time: 0.6361970901489258 seconds

julia> test2()
elapsed time: 0.6833369731903076 seconds
elapsed time: 0.6477699279785156 seconds
elapsed time: 0.6429779529571533 seconds
elapsed time: 0.6513869762420654 seconds

julia> test3()
elapsed time: 0.7397019863128662 seconds
elapsed time: 0.7116470336914062 seconds
elapsed time: 0.712367057800293 seconds
elapsed time: 0.7170429229736328 seconds

@JeffBezanson (Member) commented:

I think there are some inner loops where we have special cases for ndims<=3, which probably explains this.
Is there a real advantage to using a big 6-d array instead of splitting it into multiple lower-dimensional arrays?

@JeffBezanson (Member) commented:

Also I think just tmp will work instead of tmp[:].

@timholy (Member Author) commented Apr 26, 2012

Independently of the particular application, I raised this because it indicates that there must be some issue with indexing that makes its performance worse on higher-dimensional arrays (at least on assign, maybe not on ref). For the 3d/6d case, it's a 13-fold gap with Matlab, which I'd argue is something we should fix someday.

For this specific case, one could indeed construct a matrix of pointers and then allocate a big storage block for the "inner" arrays, and use the new wrap-pointer-as-array functionality. So you'd have
Ainner = pointer_to_array(Aouter[coord1,coord2,coord3],dims)
However, in theory this seems like it should be slower, because every time you access data in the inner array you first have to allocate a temporary. In other words, using a 6-d array seems like it should be a more efficient way to implement this lookup.

I don't yet understand how gen_cartesian_map works, but presumably the answer lies there. Eventually I'll spend some time poking around there, but it may be a bit---grant-writing duties are calling, insistently. But at least the problem is documented.

@timholy (Member Author) commented Apr 26, 2012

You're right, getting rid of the [:] shaved a bunch off the time:

julia> test1()
elapsed time: 0.35790300369262695 seconds
elapsed time: 0.36237382888793945 seconds
elapsed time: 0.35125184059143066 seconds
elapsed time: 0.342911958694458 seconds

julia> test2()
elapsed time: 0.3813440799713135 seconds
elapsed time: 0.34334683418273926 seconds
elapsed time: 0.34883618354797363 seconds
elapsed time: 0.35132598876953125 seconds

julia> test3()
elapsed time: 0.4148139953613281 seconds
elapsed time: 0.3868069648742676 seconds
elapsed time: 0.39541006088256836 seconds
elapsed time: 0.3959071636199951 seconds

This is close enough to Matlab for my tastes. So it's only the case where linear indexing is not used that needs to concern us.

@JeffBezanson (Member) commented:

Oh, I totally agree that this is a performance problem that needs to be fixed. I just decided to talk about your particular application as well.
This is not what pointer_to_array is for; that function is for accessing mystery pointers you get from C libraries. I'm talking about storing the slices in a cell array, where the pointers are object references managed by julia.
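To make that concrete, a minimal sketch in current Julia syntax (sizes illustrative): each tile is an ordinary contiguous low-dimensional array, stored in an array of object references, so retrieving a tile allocates nothing.

n1, n2, n3 = 4, 4, 4    # tiles per dimension (illustrative)
t1, t2, t3 = 8, 8, 8    # tile size (illustrative)
# A 3-d "cell array": each element is its own contiguous 3-d tile.
tiles = [Array{Float64}(undef, t1, t2, t3) for _ in 1:n1, _ in 1:n2, _ in 1:n3]
tiles[2, 1, 3][5, 5, 5] = 1.0   # pick a tile, then index within it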

@JeffBezanson (Member) commented:

Also, issue #190 should help close the gap with Matlab.

@timholy (Member Author) commented Apr 26, 2012

I just decided to talk about your particular application as well.... I'm talking about storing the slices in a cell array, where the pointers are object references managed by julia.

Good point! And thanks for suggesting a workaround, now that I understand your intention it seems like a fine idea.

@timholy (Member Author) commented Apr 27, 2012

Another update. I realized that some of the iterators I've been defining for images/grids might be applicable here. I uploaded "fsaiter.jl" (warning: may not work for general ranges, I have not tried) and a new version of the test script that uses it, "test_tarray_iter.jl". These are both in that same gist.

Timing results:
julia> load("test_tarray_iter.jl")

julia> test1()
elapsed time: 1.741914987564087 seconds
elapsed time: 1.682088851928711 seconds
elapsed time: 1.7417919635772705 seconds
elapsed time: 1.7512109279632568 seconds

julia> test2()
elapsed time: 1.0312600135803223 seconds
elapsed time: 1.0411438941955566 seconds
elapsed time: 1.0047879219055176 seconds
elapsed time: 1.0408740043640137 seconds

julia> test3()
elapsed time: 1.129045009613037 seconds
elapsed time: 1.0856060981750488 seconds
elapsed time: 1.0949370861053467 seconds
elapsed time: 1.0981900691986084 seconds

So it's faster in 6d, about the same in 4d, and slower for 2d.

I'd be interested in hearing your insights about the performance, and why it's not faster. On average each "next" call should be just 2 adds and one compare, and if you include the increment and compare in the for loop, it's 3 adds and 2 compares. I have a hard time seeing how one could do it in much less (in the general case, that is). Is it the array references, e.g., iter.inc[idim]?
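For reference, a stripped-down sketch of the add-with-carry scheme (a simplified illustration in current syntax, not the gist's ForwardSubarrayIterator; the increments are precomputed at construction so the common path is the advertised two adds and one compare):

mutable struct CarryIter
    pos::Vector{Int}     # current cartesian position
    start::Vector{Int}   # first index of each range
    stop::Vector{Int}    # last index of each range
    inc::Vector{Int}     # precomputed linear-index jump when dimension d advances
    linear::Int          # current linear index
end

function next!(it::CarryIter)
    d = 1
    while d < length(it.pos) && it.pos[d] == it.stop[d]
        it.pos[d] = it.start[d]   # carry: wrap this dimension around...
        d += 1                    # ...and advance the next one instead
    end
    it.pos[d] += 1
    it.linear += it.inc[d]        # inc[d] folds in the wrap-arounds below d
    return it.linear
end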

@timholy (Member Author) commented Apr 27, 2012

Of course, it could also be the allocation of the ForwardSubarrayIterator...

@timholy (Member Author) commented Apr 27, 2012

...or if the call to next is not inlined...

@timholy (Member Author) commented Apr 28, 2012

OK, another update. Since I don't know of a way in Julia to check whether a function is inlined, or to inspect the compiled code, I wrote the "assign" statement in C to make sure I was testing just the core algorithm itself. The code is in the gist, https://gist.github.com/2500065. I compiled the library as
gcc -shared -fPIC -o libfsaiter.so fsaiter_cfuncs.c

Testing:
julia> load("fsaiter.jl")

julia> load("test_tarray_iter.jl")

julia> libfsaiter = dlopen("libfsaiter")
Ptr{Void} @0x00000000046bfd20

julia> fsaiter_assign = dlsym(libfsaiter, :fsaiter_assign)
Ptr{Void} @0x00007fa326527570

julia> test1()
elapsed time: 1.042712926864624 seconds
elapsed time: 1.0036790370941162 seconds
elapsed time: 1.0113070011138916 seconds
elapsed time: 1.0004301071166992 seconds

julia> test2()
elapsed time: 0.42513489723205566 seconds
elapsed time: 0.391679048538208 seconds
elapsed time: 0.390841007232666 seconds
elapsed time: 0.39109015464782715 seconds

julia> test3()
elapsed time: 0.5570559501647949 seconds
elapsed time: 0.43977904319763184 seconds
elapsed time: 0.43453097343444824 seconds
elapsed time: 0.4356818199157715 seconds

I haven't had time to look into why test1() is slower, but presumably that's a simple fix. Aside from this case, this is getting into the ballpark of Matlab performance, even without any real optimization (using memcpy where possible, etc.).
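For concreteness, invoking the kernel through the function pointer looks something like this in current ccall syntax (the argument list here is hypothetical; the real signature of fsaiter_assign is whatever fsaiter_cfuncs.c declares):

# Hypothetical signature: copy n doubles from src into dest.
dest = zeros(1000)
src  = rand(500)
ccall(fsaiter_assign, Cvoid, (Ptr{Float64}, Ptr{Float64}, Csize_t),
      dest, src, length(src))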

At the moment this is still merely "study"---if there is a way to code this in Julia which gets us the performance we need, so much the better. But if not, then we might want to consider specialized ref/assign functions in these common cases (when indices are ints or ranges, rather than arbitrary vectors).

That's it for today :-).

@timholy (Member Author) commented Apr 28, 2012

Oh, one more thought: I don't know Python, but if someone does and is interested in this issue, it might be instructive to see how NumPy performs. If it's very good, we might learn something from their code.

@timholy (Member Author) commented Apr 30, 2012

OK, in the gist you'll find fsaiter2.jl, which is getting close to something I could imagine committing to array.jl and becoming the main workhorse for ref & assign for Arrays. This is clearly a big change, and to keep continuity with this thread I'm asking for comments here even before I turn this into a pull request. Before getting serious about committing, it needs more validity and performance testing. However, on this test case it seems to work, it's fast, and in any event it is a complete implementation of the best design I can think of.

Currently this is a pure-Julia implementation, although as shown above there might be some advantages in making the "next" function a ccall. Features:

  1. It uses memcpy to handle blocks of memory. It is smart enough to "concatenate" dimensions in cases where you're doing something like A[:,:,14:16,4:8] and can handle the first three dimensions with a single memcpy (a sketch of the dimension-merging test follows this list). So this will close issue #190 (use copy_to in assign() and ref() where possible), too.
  2. At least in 2d, I've verified that it works for general ranges (not just Range1s), even when the step is negative.
  3. It's a little more strict than our current code. I've noticed, for example, that on an array A of size (10,12), one can say A[11,2] with impunity, which I think is a little troublesome. This version should throw an error.
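A sketch of that dimension-merging test (an illustrative helper assuming unit-step ranges, not the gist's actual code):

# Leading colons, optionally followed by one unit-step range, merge into a
# single contiguous copy length.
function contiguous_copy_len(asize::Dims, inds)
    len = 1
    for d in 1:length(inds)
        if inds[d] isa Colon
            len *= asize[d]               # full dimension: the run keeps extending
        elseif inds[d] isa UnitRange{Int}
            return len * length(inds[d])  # one trailing unit-step range also merges
        else
            return len                    # anything else ends the contiguous run
        end
    end
    return len
end

contiguous_copy_len((10, 10, 20, 9), (:, :, 14:16, 4:8))   # 300 elements per memcpy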

Performance testing:
julia> load("fsaiter2.jl")

julia> load("test_tarray_iter.jl")

julia> test1()
elapsed time: 0.33489394187927246 seconds
elapsed time: 0.2693660259246826 seconds
elapsed time: 0.2693319320678711 seconds
elapsed time: 0.26902008056640625 seconds

julia> test2()
elapsed time: 0.4216599464416504 seconds
elapsed time: 0.3640289306640625 seconds
elapsed time: 0.3632080554962158 seconds
elapsed time: 0.3660759925842285 seconds

julia> test3()
elapsed time: 0.5998151302337646 seconds
elapsed time: 0.4285140037536621 seconds
elapsed time: 0.4304778575897217 seconds
elapsed time: 0.4293508529663086 seconds

Even using memcpy, it's a little slower than Matlab, but it is uniformly faster than the implementation in array.jl (by a factor of 10 in the case of test3). There may also be more optimization possible; for example, I noticed that min & max had not been defined on ranges, and doing so (see the top of fsaiter2.jl) led to a huge speedup for the test1 case, where min & max were actually taking most of the time.

My main remaining concern is simply the allocation and construction of the ForwardSubarrayIterator object, which might lead to noticeable overhead for small arrays. (It's clearly a win for larger arrays.) You can see that the design strategy was to do computation at construction time, so that "next" is simple and very fast (faster yet if implemented as a ccall).

One other point: is it possible/desirable to have an "assignref" function that handles the case
A[aind1,aind2,...] = X[xind1,xind2,...]
with a single function? Currently this gets executed as
tmp = X[xind1,xind2,...]
A[aind1,aind2,...] = tmp
Having "assignref" could conceivably avoid creating a temporary, which might make things even faster.
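As a proof of concept, a naive fused version might look like this (hypothetical name and API; a real implementation would drive it with the iterator machinery rather than Iterators.product):

# Walk both index sets in lockstep so tmp is never materialized.
function assignref!(A, ainds, X, xinds)
    for (ia, ix) in zip(Iterators.product(ainds...), Iterators.product(xinds...))
        A[ia...] = X[ix...]
    end
    return A
end

A = zeros(4, 4); X = rand(4, 4)
assignref!(A, (1:2, 3:4), X, (2:3, 1:2))   # A[1:2,3:4] = X[2:3,1:2], no temporary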

@pao (Member) commented Apr 30, 2012

If you use hub, you can attach the pull request to this issue (git pull-request -i 765) to maintain continuity. That will also let people pull the patchset and test with it.

The last thing you mention is a special case of stream fusion. It's probably the easiest special case to cover, though I'm not sure of the overall impact on real code---usually I have other operations on the RHS. YMMV, of course, and there may be some benefit to implementing it just to see how it goes.

@timholy (Member Author) commented Apr 30, 2012

Hmm, in attempting to insert this into array.jl, I'm discovering issues with the bootstrap process. It's not clear whether these will be resolvable. Until I know more, I am removing the RFC tag from the title of this issue.

@JeffBezanson (Member) commented:

Very nice, Tim!

We could probably do a quick check to see how many elements are being copied, and use ForwardSubarrayIterator over some threshold, and also avoid memcpy when copy_len is small.

min and max for Ranges can be checked in right away.

I think next should be called next! here since it advances the iterator by mutation.

assignref is probably a good idea; we can provide a fallback implementation so nobody has to know it exists :) Then we should do an efficient implementation of assign where the right-hand side is a SubArray.

@timholy (Member Author) commented May 1, 2012

While I'm working on this, I pushed "arrayperf.jl" to test/, with the notion that we'll want to track performance across a range of operations.

One interesting point: it reveals a curious asymmetry between the performance of ref and assign for dimensions 5 and above.

@timholy (Member Author) commented May 2, 2012

I pushed a new branch, "arrays," containing the first steps towards incorporating this. Right now the changes are minimal, and cover only a small subset of cases (basically, 1d and 2d). Yet in at least a few of these covered cases, the changes provide substantial speed improvements.

You can test this by running JULIAHOME/test/arrayperf.jl (for performance) and JULIAHOME/test/arrayops.jl (for correctness). The recommended procedure is to run the tests from a build from master first, then switch to the arrays branch, rebuild, and then re-run the tests.

I'm in no hurry to merge this, so anyone interested in testing it might want to just switch to the "arrays" branch and do normal work, and see what breaks :-). If we get to the point where we can work in arrays without noticing problems, then it might be worth considering a merge.

With regards to extending this to higher dimensions, the biggest issue is balancing the overhead of the general algorithm (a modified version of what's in the gist's fsaiter2.jl) against the complexity of providing specializations.

@ViralBShah (Member) commented:

What remains to be done on this branch before it can be merged? If there are performance improvements to be gained, let's chase them down and close this issue! I am happy to help to the extent possible.

@ViralBShah (Member) commented:

Let's get the 2d case done first, and take the nd case later.

@timholy (Member Author) commented Jun 20, 2012

I'm sorry this has taken me so long. With the demands of my real job, I'm trying to discipline myself to no more than 1-2 hrs of Julia programming per day (this is surprisingly difficult, I love this stuff...), and I had ~1month where I couldn't allocate any time to Julia. So this hasn't gone quickly, for which I apologize.

Here's the basic outline of the current status:

  1. Performance gains: these depend strongly on dimensionality, but I have a real-world case (in 8 dimensions) where there should be a ~40-fold performance improvement. So not small, especially for high dimensions. In low dimensions we're already much better optimized, but I've found cases where the use of memcpy still provides a 2-5 fold improvement. Obviously this applies only to ref and assign themselves; it will not affect code that doesn't use them heavily.
  2. The approach outlined above (fsaiter2) works well for large matrices, but not so great for small ones---there's too much overhead. I think the right solution is the one that I've implemented in the arrays branch for ref up through 2d indexing: use the new Colon object to short-circuit the tests for whether entire dimensions can be handled by memcpy.
  3. The problem with "doing 2d and then handling the nd case" is that, with this approach, there will be a moment of truth: the moment we change the interpreter to stop expanding A[:,3] into A[1:size(A,1),3], and instead just pass A[Colon,3]. So until every example of ref and assign is ready to handle a "Colon" input, we'll break stuff.

I really appreciate your offer to help out, and would be glad to accept anything you can do. My next step was going to be to write a ref like this:
ref_ncopy(A, n_elements_to_memcpy, indices_that_you_can't_memcpy...)
and then write the higher-dimensional special cases like
ref(A, ::Type{Colon}, ::Type{Colon}, I::Indices...)
ref(A, ::Type{Colon}, ::Type{Colon}, ::Type{Colon}, I::Indices...)
etc. Most of the magic will be in ref_ncopy. I suspect the "add with carry" ideas in fsaiter might be useful.
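A rough sketch of how that could fit together (in current Julia the colon arrives as the value Colon() rather than ::Type{Colon}; myref and this ref_ncopy are illustrative stand-ins, not the planned implementation):

# Two leading colons: the first size(A,1)*size(A,2) elements of each slab are
# contiguous, so delegate to a bulk-copying kernel.
myref(A::Array, ::Colon, ::Colon, I...) = ref_ncopy(A, size(A, 1) * size(A, 2), I...)

# One contiguous copy of n elements per combination of the remaining indices;
# returns the data shaped as (n, remaining lengths...).
function ref_ncopy(A::Array{T}, n::Int, I...) where {T}
    out = Array{T}(undef, n, map(length, I)...)
    st = strides(A)                        # element strides of A
    for (k, idx) in enumerate(Iterators.product(I...))
        offset = sum(((idx[d] - 1) * st[d + 2] for d in 1:length(idx)); init = 0)
        copyto!(out, (k - 1) * n + 1, A, offset + 1, n)
    end
    return out
end

A = rand(8, 8, 6)
B = myref(A, :, :, 2:4)   # three 64-element copies instead of 192 scalar refs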

And all the work I've done for ref also has to be done for assign. We may not actually need so many special cases for up to 4 dimensions; I suspect those were added to work around the shortcomings of the current approach.

@timholy (Member Author) commented Jun 29, 2012

OK, finally some real progress to report. Executive summary:

  1. Currently implemented only for ref
  2. This is independent of memcpy, i.e., issue #190 (use copy_to in assign() and ref() where possible)
  3. The savings are to be had in pre-calculating coordinatewise offsets, rather than the current strategy of forcing a fresh calculation of the linear index on each ref operation.
  4. The savings are significant, ranging from 1- to 6-fold.

Here's the theory:
For higher-dimensional arrays, we use gen_cartesian_map, which creates functions that have a body like this (3d example):

for i3 = ind3
    for i2 = ind2
        for i1 = ind1
            X[storeind] = A[i1,i2,i3]
            storeind += 1
        end
    end
end

Now, A[i1,i2,i3] gets converted to a linear index

indx = i1 + size(A,1)*(i2-1 + size(A,2)*(i3-1))

and then A[indx] is accessed.

An obvious problem is that this computation has to be done from scratch for each inner loop, despite the fact that i2 and i3 aren't changing.

So, I just committed code that does this instead:

offset3 = 0
stride1 = 1
stride2 = stride1 * size(A,1)
stride3 = stride2 * size(A,2)
for i3 = ind3
    offset2 = offset3 + (i3-1)*stride3
    for i2 = ind2
        offset1 = offset2 + (i2-1)*stride2
        for i1 = ind1
            linearind = offset1 + i1
            X[storeind] = A[linearind]
            storeind += 1
        end
    end
end
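(The point of the hoisting is that offset2 and offset1 are recomputed only when their own loop variable changes, leaving a single integer add per element in the innermost loop.)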

Testing:
First run a benchmark in the current master (must be recently updated): go to the test/ directory and type load("arrayperf.jl")
Then, check out and make the arrays_linearindex branch. Run the same test. Compare.

The changes are:
abstractarray.jl: make_arrayind_loop_nest and gen_array_index_map
array.jl: change to call gen_array_index_map, changing the arguments appropriately
I thought about trying to use the array output of genbodies in gen_cartesian_map, but then I realized that this function doesn't allocate enough symbols to store the strides and offsets.

I think this is a step forward. Of course, I am very open to suggestions for better implementations than I've provided.

@timholy (Member Author) commented Jun 29, 2012

Another update:

  1. I went ahead and implemented assign, too. There, I've seen improvements as big as 19-fold.
  2. Take the "random operations" numbers with a grain of salt---the particular assign/ref statements vary (randomly) from run to run.

@JeffBezanson (Member) commented:

Great work! The gen_array_index_map stuff should be merged right away, I'd say.

@timholy (Member Author) commented Jun 30, 2012

Merged. Since this resolves the issue about high-dimensional arrays, I'm closing this issue. #190 is still worth implementing, so I'll leave that open.
