
3x slowdown after 1st run of readcsv #3441

Closed
quinnj opened this issue Jun 18, 2013 · 21 comments
Labels
performance Must go faster

Comments

@quinnj
Member

quinnj commented Jun 18, 2013

I've been trying out the new readcsv changes this morning and noticed something funny:

julia> timing()
elapsed time: 3.146338111 seconds
elapsed time: 9.017399806 seconds
elapsed time: 9.482192787 seconds
elapsed time: 9.330088505 seconds
elapsed time: 9.368527561 seconds
elapsed time: 9.520508683 seconds
elapsed time: 9.536091891 seconds
elapsed time: 9.311094323 seconds
elapsed time: 9.302004193 seconds
elapsed time: 9.412307084 seconds

Runs subsequent to the 1st are 3x slower; anyone know what's going on here? @tanmaykm?

This is great progress though. In my personal benchmarks, R's plain read.csv takes 1.9x as long as we do, and even with certain optimizations to read.csv, we're still 1.4x faster. Great stuff!

@dmbates
Member

dmbates commented Jun 18, 2013

Are you garbage collecting between trials? The fact that the first run is fast and the later runs are slower seems like a memory allocation issue.
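
A minimal sketch of that check (hypothetical driver code, not from the thread): force a full collection between trials so each run starts from a clean heap. In 2013 this was plain gc(); in modern Julia it is GC.gc().

```julia
# Hypothetical benchmark driver: collect between trials so garbage from one
# run can't inflate the timing of the next.
function timing_with_gc(work, n)
    for i = 1:n
        GC.gc()       # full collection; was gc() in 2013-era Julia
        @time work()
    end
end

# Example workload standing in for readcsv("sales2.csv")
timing_with_gc(() -> sum(rand(10^6)), 3)
```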

@quinnj
Member Author

quinnj commented Jun 18, 2013

It does indeed seem to be a memory allocation issue.

function timing()
    gc_disable()
    for i = 1:5
        @time readcsv("sales2.csv")
    end
    gc_enable()
end

Results

julia> timing()
elapsed time: 3.064959336 seconds
elapsed time: 2.929288328 seconds
elapsed time: 3.025489937 seconds
elapsed time: 3.119300652 seconds
elapsed time: 3.113658087 seconds

@ViralBShah
Member

Can we close this?

@quinnj
Member Author

quinnj commented Jun 18, 2013

What would be the solution here? I guess I'm confused how a user is supposed to consistently have performance without being aware of the gc_enable/disable functions.

@StefanKarpinski
Member

Why would we close this? The issue doesn't seem addressed at all. Using mmap might help this.

@JeffBezanson
Member

It's well known that disabling the GC will get you better performance, right up until you run out of memory and crash.

@tanmaykm
Member

This would be typical of any operation that taxes the gc by allocating and releasing lots of memory/objects. Memory mapping may help here, as @StefanKarpinski pointed out. If the cells are numeric, it may also help to clean the data and explicitly set the data type in readcsv.
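
A sketch of the explicit-type suggestion (assumed, not code from the thread): with a purely numeric file, passing the element type lets readdlm fill a Float64 matrix directly instead of inferring a type per cell. The thread's sales2.csv isn't available here, so a tiny stand-in file is used.

```julia
using DelimitedFiles   # a stdlib in modern Julia; readdlm lived in Base in 2013

# Tiny numeric stand-in for the thread's sales2.csv
path = tempname() * ".csv"
write(path, "1.0,2.0\n3.0,4.0\n")

# Explicit eltype: no per-cell type inference, and the result is a dense
# Float64 matrix rather than a heterogeneous Any matrix.
data = readdlm(path, ',', Float64)
```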

@JeffBezanson
Member

I think I can do a better job tuning the gc. There are some knobs to twiddle.

@timholy
Member

timholy commented Jun 18, 2013

@karbarcca, it may not be applicable to your case here, but I've also found the profiler can be extremely useful for figuring out gc issues (of course, maybe you're already doing that...). For example, just today I was writing some code that, over the course of a couple hours, I managed to make 300x faster. Some of the bottlenecks were gc's triggered from unexpected places. The majority of the time in my code was spent multiplying lots of 3x3 matrices by vectors, so it proved to be completely worth writing my own specialized function:

function gemv3!(res, A, x)
    res[1] = A[1]*x[1] + A[4]*x[2] + A[7]*x[3]
    res[2] = A[2]*x[1] + A[5]*x[2] + A[8]*x[3]
    res[3] = A[3]*x[1] + A[6]*x[2] + A[9]*x[3]
    res
end

In my initial version of this function, it lacked the last line (returning res), and surprisingly gc was getting called. That showed up as excessive time spent on the third row multiplication; since the profiler drills down into the C code it was easy to see that gc was responsible. Inserting the return of res eliminated gc from being executed on this line; by itself this one optimization netted a sizable performance boost. Eventually I managed to expunge all gc calls from the code (as well as make other important improvements).
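
That workflow can be sketched roughly as follows (hypothetical driver code around the gemv3! from the comment above; the Profile module is a stdlib in modern Julia, while in 2013 profiling lived in Base):

```julia
using Profile

# gemv3! from the comment above: 3x3 matrix-vector product into res
function gemv3!(res, A, x)
    res[1] = A[1]*x[1] + A[4]*x[2] + A[7]*x[3]
    res[2] = A[2]*x[1] + A[5]*x[2] + A[8]*x[3]
    res[3] = A[3]*x[1] + A[6]*x[2] + A[9]*x[3]
    res
end

M = rand(3, 3); x = rand(3); res = zeros(3)
@profile for i = 1:10^6
    gemv3!(res, M, x)
end
Profile.print()   # GC frames appear in the C-level part of the call tree
```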

@StefanKarpinski
Member

These kinds of things always give me a mix of horror and delight – horror that you have to write out that 3x3 matmul like that, but delight that you can do that and get it to be as fast as you need. We'll get there.

@JeffBezanson
Member

We do need more small-matrix optimizations. One problem especially is that the logic and code flow in matmul.jl has gotten really complex, and there are several layers of function calls.

@timholy
Member

timholy commented Jun 19, 2013

I fall mostly on the delight side. It took me, what, 2 minutes to write and wire into the various places in my code? Understanding the gc on the return value was more subtle and cost me more time, but nothing outrageous. The workflow "quickly write the simple version first" (which is really pleasant thanks to Julia's design and the nice library that everyone has been contributing to) -> "run it" -> "ugh, too slow" -> "profile it" -> "fix a few problems in places where it actually matters" -> "ah, that's much nicer!" is quite satisfying. If I had to write everything in C from the ground up it would be far more tedious. And doing a little hand-optimization gives us geeks something to make us feel like we're earning our keep.

The only time I get frustrated is when I can't get the performance I know I deserve :-). But that's getting rarer all the time.

I guess the major negative is that knowing what to do to solve performance problems does take some experience. Think of it as training---it makes us all stronger. I know I've gotten better thanks to the example code in julia and the help from the community.

@StefanKarpinski
Member

Well, that's a pleasant view of it. I guess the key points are that

  1. you can write the easy version that works, and
  2. you can usually get it fast with a bit more work.

@timholy
Member

timholy commented Jun 19, 2013

Back to the original issue, does anyone besides me think it's suspicious that it's a 6s increment after the first run? In saying this, I'm presuming that:

  • a single pass of gc() takes vastly less than 6s to run after the first iteration (although it might explain everything if it did take 6s). For the purposes of being able to make a specific argument, let's say one call to gc() takes < 0.1s.
  • any call to gc() clears all the garbage accumulated up to the current time.

If those two assumptions are true, why doesn't the second run take 0.1s longer, rather than 6s longer, than the first? I'm expecting that the first call to gc() that occurs after the 2nd call to readcsv should effectively take us back to the original state. Are we seeing some kind of "destructive interference" in the precise timing with which the gc gets automatically called?
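
One way to test the first assumption (a sketch with a made-up workload, not code from the thread): allocate something large, drop the reference, and time a single full collection. If one pass is well under 0.1s, a lone collection can't account for the 6s gap. (gc() from 2013 is GC.gc() in modern Julia.)

```julia
# Stand-in for the heap state left behind by a readcsv run
tmp = [rand(100) for _ in 1:10^4]
tmp = nothing                 # drop the reference so the arrays are garbage
t = @elapsed GC.gc()          # time one full collection
println("one full collection took $t seconds")
```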

@JeffBezanson
Member

It has a lot to do with the GC's heap growth schedule, which can introduce a sliver of super-linear run time behavior. If the heap grows too slowly, it has to do many unnecessary collections as this very large data structure is filled in.
#261 will help a lot with this.

JeffBezanson added a commit that referenced this issue Jun 20, 2013
- make it an error to resize an array with shared data (fixes #3430)
- now able to use realloc to grow arrays (part of #3440, helps #3441)
- the new scheme is simpler. one Array owns the data, instead of
  tracking the buffers separately as mallocptr_t
- Array data can be allocated inline, with malloc, or from a pool
@timholy
Member

timholy commented Jun 20, 2013

#261 does look awesome. Aside from the (much more important) performance improvements, presumably it would also make many finalizers run at more predictable times, which perhaps could be a good thing.

Now it doesn't sound relevant, but in cases where "slivers" of poor performance appear for very specific parameter combinations, is it worth asking whether introducing some element of randomness could be your friend?

@JeffBezanson
Member

@karbarcca can you try this again and see if there's any improvement?

@quinnj
Member Author

quinnj commented Jul 2, 2013

Unfortunately I can't get readcsv to run now; I think it has to do with some mmap stuff that was added? I'm on Windows 8 64-bit native build as of this morning (cd35600701).

julia> @time readcsv("C:/Users/karbarcca/Google Drive/Dropbox/Dropbox/Sears/Teradata/Sears-Julia/sales2.csv")
symbol could not be found fcntl (-1): The specified procedure could not be found.
symbol could not be found lseek (-1): The specified procedure could not be found.
symbol could not be found lseek (-1): The specified procedure could not be found.
symbol could not be found pwrite (-1): The specified procedure could not be found.
symbol could not be found lseek (-1): The specified procedure could not be found.
symbol could not be found mmap (-1): The specified procedure could not be found.
ERROR: ccall: could not find function fcntl
 in mmap_stream_settings at mmap.jl:74 (repeats 26666 times)

I'll try to dig into it some more to get it to run.

@ViralBShah
Member

I think this issue should be closed, and the mmap should be a new issue.

@quinnj
Member Author

quinnj commented Jul 2, 2013

Ok, got it to work with the following, and with multiple runs I get consistent times. Is use_mmap supposed to default to true? It seems like it might be better to default to false until it's working on all platforms.

@time readdlm("C:/Users/karbarcca/Google Drive/Dropbox/Dropbox/Sears/Teradata/Sears-Julia/sales2.csv",',',Float64;use_mmap=false)

@tanmaykm
Member

tanmaykm commented Jul 2, 2013

It seems like mmap support is not complete on Windows. Can you try passing use_mmap=false to readcsv?

@quinnj quinnj closed this as completed Jul 2, 2013

7 participants