Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loading CxxWrap.jl causes slowdown in GAP.jl #258

Closed
fingolfin opened this issue Jul 17, 2020 · 10 comments
Closed

loading CxxWrap.jl causes slowdown in GAP.jl #258

fingolfin opened this issue Jul 17, 2020 · 10 comments

Comments

@fingolfin
Copy link
Contributor

The package GAP.jl does not use CxxWrap.jl at all, but is part of an umbrella project (Oscar.jl), which happens to include two other packages which both use CxxWrap.jl; as a result, GAP.jl and CxxWrap.jl are often loaded at the same time.

I recently discovered that just the act of loading CxxWrap.jl will substantially slow down GAP.jl. For example, take this snippet (taken from oscar-system/GAP.jl#485):

using GAP, BenchmarkTools
function test_simple()
  for i = 1:1000000
    x = GAP.Globals.IsAbelian
  end
end
@btime test_simple()

This takes 48 ms for me in Julia 1.4.2 and CxxWrap v0.10.2. However, if I then do using CxxWrap, the same test suddenly takes 238ms. With v0.11.0 it seems to have gotten a bit worse and now takes ~290 ms. I see no difference with Julia 1.5, I think (but i have to carefully remeasure everything if anybody thinks that the Julia version might matter).

@rbehrends looked briefly into it and reported the following:

Some profiling indicates that the extra time is spent in jl_typemap_level_assoc_exact(), called from jl_lookup_generic_(). For the slow case it is being called a lot, for the fast case it doesn't even seem to show up on the profile.

I am going to try to reduce this to a smaller example, ideally one that doesn't involve GAP.jl; but in the meantime, I thought I should already post this, in case anything of the above rings a bell for anybody.

@fingolfin
Copy link
Contributor Author

I minimized the example in the sense that while it still relies on GAP.jl, it does so in a minimal fashion: via ccall:

using GAP, BenchmarkTools

function func(ptr::Ptr{Cvoid})::Any
    return ccall(:julia_gap, Any, (Ptr{Cvoid},), ptr)
end

function test_foo()
  for i = 1:1000000
    v = ccall(:GAP_ValueGlobalVariable, Ptr{Cvoid}, (Ptr{UInt8},), "IsAbelian")
    x = func(v)
  end
end
@btime test_foo()

However, I failed reduce this further, by replacing func by the following, then the resulting snippet has the same speed regardless of whether CxxWrap is loaded or not:

function func(ptr::Ptr{Cvoid})::Any
    return ccall(:strlen, Cint, (Ptr{Cvoid},), ptr)
end

function test_foo()
  for i = 1:1000000
    v = ccall(:strchr, Ptr{Cvoid}, (Ptr{UInt8}, Cint), "test", 't')
    x = func(v)
  end
end
@btime test_foo()

Note that the C function julia_to_gap is defined in a dlopen'ed loadable module, not a shared library (specifically, GAP.jl loads libgap.so; then calls the init function of that shared library, which in turn ultimately loads a loadable module JuliaInterface.so, which contains julia_to_gap. So now I wonder if this plays a role (and which)...

@barche
Copy link
Collaborator

barche commented Jul 18, 2020

Do you have a stack trace of how it gets to jl_typemap_level_assoc_exact? My first hunch is that because CxxWrap is adding a lot of types and/or methods that might make some Julia lookup table slower, causing the slowdown?

@rbehrends
Copy link

Here are two profiles from running the original test under @btime, the first with CxxWrap, the second without. As you can see, jl_typemap_level_assoc_exact() is called from jl_lookup_generic_(). From the looks of it, the underlying reason seems to be the call_cache lookups failing.

1. Code

function test()
  for i = 1:1000000
    x = GAP.Globals.IsAbelian
  end
end

@btime test()

2. Profile with CxxWrap

image

3. Profile without CxxWrap

image

@barche
Copy link
Collaborator

barche commented Jul 19, 2020

CxxWrap adds a lot (more than 300) of convert methods, maybe these interfere here somehow. You can check if that is the problem by commenting out the evals that define convert here: https://github.com/JuliaInterop/CxxWrap.jl/blob/master/src/CxxWrap.jl#L642 (lines 642, 646 and 647).

@fingolfin
Copy link
Contributor Author

I managed to narrow it down further:

function func_slow(ptr::Ptr{Cvoid})::Any
    return ccall(:julia_gap, Any, (Ptr{Cvoid},), ptr)
end
function test_slow()
  for i = 1:1000000
    x = func_slow(C_NULL)
  end
end
@btime test_slow()

This goes from 2.2ms to 185ms when CxxWrap.jl is loaded. Let me point out once again that julia_gap is a function defined in a loadable module which is dlopen'ed not by us resp. Julia, but by a shared library (libgap,written in C). However, I also tried to Libdl.dlopen it myself and then use const julia_gap_sym = Libdl.dlsym(JI_handle, :julia_gap) in a ccall; but that gave identical performance).

Aaanyway: What I noticed is this: Before loading CxxWrap, I get this:

julia> @code_typed test_slow()
CodeInfo(
1 ─       goto #7 if not true
2%2  = φ (#1 => 1, #6 => %9)::Int64$(Expr(:foreigncall, :(:julia_gap), Any, svec(Ptr{Nothing}), 0, :(:ccall), :($(QuoteNode(Ptr{Nothing} @0x0000000000000000))), :($(QuoteNode(Ptr{Nothing} @0x0000000000000000)))))::Any%4  = (%2 === 1000000)::Bool
└──       goto #4 if not %4
3 ─       goto #5
4%7  = Base.add_int(%2, 1)::Int64
└──       goto #5
5%9  = φ (#4 => %7)::Int64%10 = φ (#3 => true, #4 => false)::Bool%11 = Base.not_int(%10)::Bool
└──       goto #7 if not %11
6 ─       goto #2
7return
) => Nothing

julia> @code_typed func_slow(C_NULL)
CodeInfo(
1%1 = $(Expr(:foreigncall, :(:julia_gap), Any, svec(Ptr{Nothing}), 0, :(:ccall), :(ptr), :(ptr)))::Any
└──      return %1
) => Any

After loading CxxWrap.jl, I get this instead:

julia> @code_typed test_slow()
CodeInfo(
1 ─       goto #7 if not true
2%2  = φ (#1 => 1, #6 => %9)::Int64
│         invoke Main.func_slow(Main.C_NULL::Ptr{Nothing})::Any%4  = (%2 === 1000000)::Bool
└──       goto #4 if not %4
3 ─       goto #5
4%7  = Base.add_int(%2, 1)::Int64
└──       goto #5
5%9  = φ (#4 => %7)::Int64%10 = φ (#3 => true, #4 => false)::Bool%11 = Base.not_int(%10)::Bool
└──       goto #7 if not %11
6 ─       goto #2
7return
) => Nothing

julia> @code_typed func_slow(C_NULL)
CodeInfo(
1%1 = Main.Any::Core.Compiler.Const(Any, false)
│   %2 = $(Expr(:foreigncall, :(:julia_gap), Any, svec(Ptr{Nothing}), 0, :(:ccall), :(ptr), :(ptr)))::Any%3 = Base.convert(%1, %2)::Any
└──      return %3
) => Any

That is, it fails to inline func_slow; and it inserts a call to Base.convert inside of func_slow which was not present before!

And if I drop the ::Any in the definition of func_slow, I get identical performance with and without Any!

I suspect this is because of the CxxWrap.jl method listed here:

julia> methods(convert, (Type{Any}, Any))
# 2 methods for generic function "convert":
[1] convert(to_type::Type{Any}, x::CxxWrap.CxxWrapCore.SmartPointer{DerivedT}) where DerivedT in CxxWrap.CxxWrapCore at /Users/mhorn/.julia/packages/CxxWrap/ZOkSN/src/CxxWrap.jl:272
[2] convert(::Type{Any}, x) in Base at essentials.jl:170

So whenever Julia code contains a call convert(Any, foobar), prior to loading CxxWrap.jl it seems Julia can optimize this away; but after loading CxxWrap, it can not do so anymore..

All this suggests how I might resolve the issue (by dropping that ::Any), but I still don't fully understand what's going on here. Like, why does it seem to matter that the function I ccall is in a loadable module (perhaps it doesn't and this just a red herring; but so far I was not able to reproduce the slowdown by calling a function inside a regular shared library...)

fingolfin added a commit to fingolfin/GAP.jl that referenced this issue Jul 20, 2020
... by dropping the `::Any` return value annotation in `RAW_GAP_TO_JULIA`:
Normally, Julia optimizes this away as a no-op. But CxxWrap.jl installs
a method for converting `CxxWrap.CxxWrapCore.SmartPointer{DerivedT}` to `Any`;
this precludes Julia from optimizing the conversion away.

See also:
- oscar-system#485
- JuliaInterop/CxxWrap.jl#258
fingolfin added a commit to oscar-system/GAP.jl that referenced this issue Jul 20, 2020
... by dropping the `::Any` return value annotation in `RAW_GAP_TO_JULIA`:
Normally, Julia optimizes this away as a no-op. But CxxWrap.jl installs
a method for converting `CxxWrap.CxxWrapCore.SmartPointer{DerivedT}` to `Any`;
this precludes Julia from optimizing the conversion away.

See also:
- #485
- JuliaInterop/CxxWrap.jl#258
@barche
Copy link
Collaborator

barche commented Jul 21, 2020

Does the slowdown also happen if instead of loading CxxWrap, you add a method do convert? e.g. do this before running the test:

julia> struct Foo end
julia> Base.convert(::Type{Any}, x::Foo) = x

@fingolfin
Copy link
Contributor Author

Yes indeed, it does.

@barche
Copy link
Collaborator

barche commented Jul 21, 2020

OK, then I guess it's best to open an issue for this with Julia itself?

@fingolfin
Copy link
Contributor Author

Well, one might argue that convert(Any, x) should always yield x -- that CxxWrap.jl changes this is really surprising to me. However, if one accepts this as "correct", then I guess also what Julia then is doing is correct -- after all, I asked it to convert a value to Any, thinking that this does nothing; but now CxxWrap instructs Julia that, yes, conversion to Any does do something after all... So I see no bug in Julia here -- except perhaps to permit a non-standard conversion to Any ?

It still might be a good idea to get some core Julia people involved, just as a form of sanity check?

barche added a commit that referenced this issue Sep 7, 2020
barche added a commit that referenced this issue Sep 7, 2020
@barche
Copy link
Collaborator

barche commented Sep 7, 2020

Fixed in the master branch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants