Using GAP.jl makes julia slow (unusable?) #266

Closed
thofma opened this issue Jun 19, 2019 · 20 comments

@thofma
Contributor

thofma commented Jun 19, 2019

Pretty cool that GAP.jl is working! Unfortunately we were seeing severe performance regressions after doing using GAP (although the code did not use any GAP functionality). We boiled it down to the following example. This is with a fresh install of GAP.jl and GAPTypes.jl:

julia> using BenchmarkTools

julia> function g(n)
         z = [BigInt(1)]
         for i in 1:n
           z = [z[] * i]
         end
         return z
       end

julia> @benchmark g(200000)
BenchmarkTools.Trial: 
  memory estimate:  35.99 GiB
  allocs estimate:  800003
  --------------
  minimum time:     4.931 s (8.00% GC)
  median time:      5.022 s (7.10% GC)
  mean time:        5.022 s (7.10% GC)
  maximum time:     5.113 s (6.24% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> using GAP
Adding path /home/adsad/.julia/packages/GAP/aMcJN/gap/.libs to DL_LOAD_PATH
 ┌───────┐   GAP 4.10dev-2113-gf7f36d7 of today
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: x86_64-pc-linux-gnu-julia64-kv6
 Configuration:  gmp 6.1.2, Julia GC, Julia 1.1.0
 Loading the library and packages ...
 Packages:   GAPDoc 1.6.2, PrimGrp 3.3.2, SmallGrp 1.3, TransGrp 2.0.4
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'

julia> @benchmark g(200000)
BenchmarkTools.Trial: 
  memory estimate:  35.99 GiB
  allocs estimate:  800003
  --------------
  minimum time:     14.477 s (73.01% GC)
  median time:      14.477 s (73.01% GC)
  mean time:        14.477 s (73.01% GC)
  maximum time:     14.477 s (73.01% GC)
  --------------
  samples:          1
  evals/sample:     1

So, a roughly 3x slowdown just from doing using GAP. I have other examples where, after using GAP, a function spends 99% of its time in the GC. I don't know if we can use GAP.jl in this form.

@CarloSircana @fieker

@mohamed-barakat
Contributor

I can confirm the behavior with GAP.jl alone. This is a serious issue. Thanks for reporting.

@fingolfin
Member

@thofma thanks for the report. That is of course unacceptable!

@rbehrends can you look into this? We first need to establish the exact cause. The immediate suspect is, of course, the stack scanning. Which makes me wonder whether we can't implement an immediate workaround for the case where GAP is launched from Julia (the one that is relevant here): namely, if we are not inside GAP code, we don't need to scan the stack at all. Of course this needs to take nesting into account... Luckily we have already added provisions for tracking whether GAP is "active" or not to GAP's libgap-api; we just aren't using them in GAP.jl yet, but I am sure @sebasguts and I can change that quickly.

Assuming stack scanning is the main problem, this should immediately help in this particular case. Of course we'd still have a performance issue while GAP code is actually running; for that, we can also use the libgap-api code mentioned above to restrict the range of the stack we need to scan, which should help there as well.

Of course that won't help if GAP is the main process and it starts Julia, but that is less important for OSCAR (though of course it'd be nice to speed that up, too)

fingolfin added the kind: bug Something isn't working label Jun 19, 2019
@fingolfin
Member

Just to make it explicit, I am talking about GAP_EnterStack_() and GAP_LeaveStack_(). Though I am not sure whether we can use them from pure Julia, as we'd need the equivalent of __builtin_frame_address(0), I think. On the Julia C kernel level there is the jl_get_frame_addr macro, and also gc_get_stack_ptr; perhaps we could use the latter.
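To illustrate the pattern under discussion, here is a minimal conceptual sketch of such enter/leave bracketing: record the current C stack frame when GAP code becomes active, so that conservative scanning can be limited to the region above that marker and skipped entirely while GAP is inactive. All names below (SKETCH_ENTER_GAP, gap_stack_top, gap_enter_depth) are hypothetical placeholders, not the actual libgap-api implementation.

/* Conceptual sketch only -- placeholder names, not the real libgap-api code. */
#include <stddef.h>

/* Hypothetical globals: the innermost C stack frame at which GAP code
   became active, and the nesting depth of Julia -> GAP transitions. */
static void *gap_stack_top   = NULL;
static int   gap_enter_depth = 0;

/* Record the current frame address when entering GAP code.  Conservative
   scanning then only needs to cover the region between gap_stack_top and
   the bottom of the stack, and can be skipped entirely while depth == 0. */
#define SKETCH_ENTER_GAP()                                      \
    do {                                                        \
        if (gap_enter_depth++ == 0)                             \
            gap_stack_top = __builtin_frame_address(0);         \
    } while (0)

#define SKETCH_LEAVE_GAP()                                      \
    do {                                                        \
        if (--gap_enter_depth == 0)                             \
            gap_stack_top = NULL;                               \
    } while (0)

With markers like these in place, the root scanner could start its conservative scan at gap_stack_top instead of the stack base, and return immediately whenever gap_enter_depth is zero.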

@rbehrends
Contributor

I can kind of reproduce it, though in my case I "only" see a ~33% overhead; this may be due to architecture-specific differences. The culprit is GapTaskScanner(), the function responsible for scanning task stacks (but not the actual main stack, which is where the work happens).

Normally, scanning task stacks should not be that much of a problem. For starters, due to generational GC, unused task stacks should not be scanned every single time (or Julia would have a problem with highly recursive functions on those, too).

Even so, I'm not sure why there's such an inordinate amount of time spent scanning these; we do scan the main stack each time without ill effects.

I need to dig further into that to find out the actual underlying cause. (Architectural differences may be involved as macOS and Linux use slightly different task stack implementations, which could account for the differences in overhead that we are seeing.)
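For context, a rough sketch of how such scanners plug into Julia's GC: GAP registers root- and task-scanner callbacks through Julia's GC extension interface (julia_gcext.h). The sketch below assumes the jl_gc_set_cb_root_scanner / jl_gc_set_cb_task_scanner entry points from that header and uses placeholder callback bodies; it is a simplified illustration, not GAP's actual kernel code.

/* Simplified illustration of registering GC callbacks via julia_gcext.h.
   The callback bodies are placeholders, not GAP's real scanners. */
#include <julia.h>
#include <julia_gcext.h>

static void MyRootScanner(int full)
{
    /* Conservatively scan the main C stack for pointers into the GAP
       object heap and mark whatever is found (details omitted). */
    (void)full;
}

static void MyTaskScanner(jl_task_t *task, int full)
{
    /* Do the same for the saved C stack of a Julia task. */
    (void)task; (void)full;
}

void register_gc_hooks(void)
{
    jl_gc_set_cb_root_scanner(MyRootScanner, 1);
    jl_gc_set_cb_task_scanner(MyTaskScanner, 1);
}

Since these callbacks run on every collection, any work in them that is not proportional to what actually changed shows up as GC overhead even in Julia code that never touches GAP, which matches the symptom reported above.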

@rbehrends
Contributor

The biggest issue right now appears to be that task stacks are scanned on every minor collection, regardless of whether they have been touched in the meantime, rather than only on major collections. This may be an issue for Julia itself with large task stacks, too, though I'll have to investigate that more closely.
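The shape of the improvement being described is to make the task scanner cheap on minor collections: skip the stacks of tasks that have not run since they were last scanned. A purely illustrative sketch of that idea follows; the struct, the dirty flag, and the helper function are made-up placeholders, not the committed fix.

/* Illustrative sketch only: placeholder types and names, not GAP's code. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void *stack_lo;   /* saved C stack region of the task */
    void *stack_hi;
    bool  dirty;      /* set again whenever the task runs */
} sketch_task_t;

static void scan_range_conservatively(void *lo, void *hi)
{
    /* conservatively look for pointers into the GAP heap (omitted) */
    (void)lo; (void)hi;
}

static void sketch_task_scanner(sketch_task_t *task, int full)
{
    /* On a minor (non-full) collection, skip a task stack that has not
       been touched since the last scan; its contents cannot have gained
       references to objects allocated after that scan. */
    if (!full && !task->dirty)
        return;
    scan_range_conservatively(task->stack_lo, task->stack_hi);
    task->dirty = false;
}

Note that in a generational collector this alone is not quite enough: young objects found in an earlier scan must still be kept alive (promoted or remembered) while the untouched stack references them, which is presumably the "generational issues" caveat mentioned with the quick fix below.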

@fingolfin
Member

Which makes me wonder: did we ever test what happens if GAP is called from a task? Is this even supposed to work right now? If so, we should definitely add test cases for this.

@rbehrends
Contributor

rbehrends commented Jun 19, 2019

GAP is currently being called from a task when loaded as a Julia module, as far as I can tell. And yes, it is supposed to work; we have specific logic in GapRootScanner() to deal with that case.

Which doesn't mean that some other things can't go wrong, but in principle that use case is accounted for.

@rbehrends
Contributor

I've committed a quick fix here. I'm not yet positive that it handles the generational issues cleanly (it may require more tweaking for that), so I'm not going to turn it into a PR yet, but it does fix the performance regression for me.

@rbehrends
Contributor

I've updated the fix (same branch as above) to be more robust. However, in its current version it is implemented in C++, and we need to think about whether that is okay for GAP (other parts of GAP already use C++, but I'm not sure what the constraints are) and whether we need more testing for it due to some non-trivial changes. For now, though, it should work for anybody who cannot wait for a patch to arrive officially.

@fingolfin
Member

@thofma @mohamed-barakat The fix has now been merged into the GAP master branch. It would be great if you could verify that it fixes the regression for you. Then we can get it into GAP 4.11, which I hope we'll be able to release in August.

@fingolfin
Member

With the fix in place, I am now getting this:

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]

julia> function g(n)
         z = [BigInt(1)]
         for i in 1:n
           z = [z[] * i]
         end
         return z
       end
g (generic function with 1 method)

julia> @benchmark g(200000)
BenchmarkTools.Trial:
  memory estimate:  35.99 GiB
  allocs estimate:  800003
  --------------
  minimum time:     4.900 s (7.79% GC)
  median time:      4.923 s (8.66% GC)
  mean time:        4.923 s (8.66% GC)
  maximum time:     4.945 s (9.52% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> using GAP
Adding path /home/horn/Projekte/OSCAR/GAP.jl/gap/.libs to DL_LOAD_PATH
 ┌───────┐   GAP 4.dev of today
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: x86_64-pc-linux-gnu-julia64-kv7
 Configuration:  gmp 6.1.2, Julia GC, Julia 1.1.1, readline
 Loading the library and packages ...
 Packages:   GAPDoc 1.6.2, PrimGrp 3.3.2, SmallGrp 1.3, TransGrp 2.0.4
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'

julia> @benchmark g(200000)
BenchmarkTools.Trial:
  memory estimate:  35.99 GiB
  allocs estimate:  800003
  --------------
  minimum time:     5.264 s (12.15% GC)
  median time:      5.264 s (12.15% GC)
  mean time:        5.264 s (12.15% GC)
  maximum time:     5.264 s (12.15% GC)
  --------------
  samples:          1
  evals/sample:     1

So there is still a slowdown, but a rather minor one. It would still be good to know how, e.g., the tests @thofma mentioned (where GC took 99% of the time) now fare.

@rbehrends
Contributor

Hmm, this is a bigger slowdown than I experienced when I last tested it (where it was difficult to even reproduce a slowdown). I'll have to look at why there may still be residual issues.

@thofma
Contributor Author

thofma commented Jul 18, 2019

It is better. (I can't find the example with the 99% anymore).

@mohamed-barakat
Contributor

> @thofma @mohamed-barakat The fix has now been merged into the GAP master branch.

Works for me with no measurable regression.

@fingolfin
Member

@mohamed-barakat so you are saying that you get identical benchmark results before and after using GAP? Interesting?

@rbehrends can you reproduce the remaining slowdown I reported? (I had it on two machines, both on my MacBook and also on a Ubuntu server)

@rbehrends
Contributor

I'm getting completely different results (though my version of Julia is a couple of months old). I'm wondering if there are different versions of GMP involved? Anyhow, here is what I'm getting:

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.0-DEV.0
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |

julia> include("bigfun.jl")
g (generic function with 1 method)

julia> @benchmark g(200000)
BenchmarkTools.Trial: 
  memory estimate:  35.99 GiB
  allocs estimate:  800003
  --------------
  minimum time:     22.148 s (1.38% GC)
  median time:      22.148 s (1.38% GC)
  mean time:        22.148 s (1.38% GC)
  maximum time:     22.148 s (1.38% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> using GAP
Adding path /Users/behrends/develop/gap-dev/master/.libs to DL_LOAD_PATH
 ┌───────┐   GAP 4.dev of today
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: x86_64-apple-darwin18.6.0-julia64-kv7
 Configuration:  gmp 6.1.2, Julia GC, Julia 1.3.0-DEV
 Loading the library and packages ...
 Packages:   GAPDoc 1.6.2, PrimGrp 3.3.2, SmallGrp 1.3, TransGrp 2.0.4
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'

julia> @benchmark g(200000)
BenchmarkTools.Trial: 
  memory estimate:  35.99 GiB
  allocs estimate:  800003
  --------------
  minimum time:     22.151 s (1.95% GC)
  median time:      22.151 s (1.95% GC)
  mean time:        22.151 s (1.95% GC)
  maximum time:     22.151 s (1.95% GC)
  --------------
  samples:          1
  evals/sample:     1

This is a pretty fast machine, so I'm not sure why it would take so much longer per run. It may also mean that significant relative differences in GC performance are hidden because most of the time is spent doing calculations.

I'll finally add that even with generational GC, part of the remaining performance difference may simply be that we're dealing with a significantly bigger heap.

@mohamed-barakat
Contributor

> @mohamed-barakat so you are saying that you get identical benchmark results before and after using GAP? Interesting?

Yes, but maybe my tests do not exercise the GC heavily; at least, I was unable to measure a regression. I'll try to make these tests part of the Oscar CI.

@rbehrends
Contributor

After looking at it in a profiler, I primarily see a bit more time being spent in sweep_pool_pagetable0, which is linear in the number of pool pages in use. This would indicate that the performance difference is indeed largely caused by the larger heap.

@fingolfin
Member

@rbehrends can this issue be closed now?

@rbehrends
Contributor

@fingolfin Yes, I believe we are good.
