Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unsorted method tables #35983

Merged
merged 2 commits into from
Jul 7, 2020
Merged

unsorted method tables #35983

merged 2 commits into from
Jul 7, 2020

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented May 22, 2020

This experiment tries to see what happens if we completely un-sort the method tables. Turns out we can actually still do surprisingly well for performance without those caches (lookup costs seem to range from 2-10x slower across a small sample of queries tested). This lets us reconsider which types of caches and trees we want to use to store this data.

[Ignore the commit history here, unless you really want to see all of my mistakes in handling a non-transitive sort. You might find them informative, if you wonder about the final algorithm I've landed upon here currently. But I imagine they'd be pretty hard to comprehend on their own.]

@vtjnash vtjnash requested a review from JeffBezanson May 22, 2020 02:33
@vtjnash vtjnash force-pushed the jn/gf-lazy-sort branch 3 times, most recently from 73fab73 to ba39948 Compare June 25, 2020 00:54
@vtjnash vtjnash marked this pull request as ready for review June 25, 2020 15:29
@vtjnash
Copy link
Member Author

vtjnash commented Jun 25, 2020

Most recent interesting commit history (before squashing) is ba39948. But I'm pretty happy with this now. Performance is harder to directly measure, since we expect most common cases now to hit the dispatch-tuple-types cache, but (given that) I think this is in a stable state now to review/merge.

@JeffBezanson
Copy link
Member

We should look at the load times of various packages, and some other latencies. I'm building it now and the first thing that jumped out at me is loading LinearAlgebra during bootstrap is significantly faster --- I imagine since it defines a large number of methods with complex signatures. Base itself is a hair slower. Anyway we should give this a thorough benchmarking.

@JeffBezanson
Copy link
Member

This has come a long way and is now pretty close, but I still see some regressions vs. master.
master:

julia> @time using Plots
  8.111273 seconds (8.30 M allocations: 490.805 MiB, 4.19% gc time)

julia> @time display(plot(rand(10)))
 11.280785 seconds (9.40 M allocations: 524.971 MiB, 1.35% gc time)

PR:

julia> @time using Plots
 12.068010 seconds (13.65 M allocations: 788.507 MiB, 3.23% gc time)

julia> @time display(plot(rand(10)))
 12.045214 seconds (10.56 M allocations: 590.858 MiB, 3.34% gc time)

@vtjnash
Copy link
Member Author

vtjnash commented Jun 26, 2020

Any timing without #36436 probably isn't too useful, but yes, Plots includes hits the harder cases quite heavily (far too much compilation due to using Artifacts and too many difficult backedges from a large precompile file), and I haven't worked as much on optimizing those yet. I'm surprised by how slow your computer is though—were you also using FORCE_ASSERTIONS=1? as that also would potentially invalidate attempted measurements. By comparison, ApproxFun load time I see amazingly go from 28s to 12s!

@JeffBezanson
Copy link
Member

My computer is just really old. I also see the 2x speedup in ApproxFun. Makes sense that different cases will have different perf characteristics here.

@vtjnash
Copy link
Member Author

vtjnash commented Jun 26, 2020

Curious also what you see if you apply #35384 too. There's some ordering issues if we were to try to land that directly on master now which could cause some rare correctness issues (that's solved here, by simply not relying on ordering).

This PR (with #36436):

$ time ./julia -e '@time using Plots; @time display(@time plot(1:10))'
  5.386617 seconds (9.16 M allocations: 574.738 MiB, 2.18% gc time)
  3.231135 seconds (7.88 M allocations: 443.955 MiB, 7.36% gc time)
  8.201448 seconds (14.35 M allocations: 804.838 MiB, 3.84% gc time)

real	0m14.062s
user	0m12.790s
sys	0m0.387s

And then adding in #35384 also:

time ./julia -e '@time using Plots; @time display(@time plot(1:10))'
  4.436754 seconds (9.50 M allocations: 592.946 MiB, 2.44% gc time)
  2.962513 seconds (7.24 M allocations: 408.276 MiB, 8.35% gc time)
  6.658311 seconds (13.65 M allocations: 766.289 MiB, 4.79% gc time)

real	0m11.445s
user	0m11.367s
sys	0m0.354s

(note that about 2 seconds there—or almost half of that load time—appears in the Profile inside __init__ methods, aka Pkg.Artifacts, so there's nothing I can do about that)

@vtjnash vtjnash force-pushed the jn/gf-lazy-sort branch 2 times, most recently from 227fd8f to 4c75c70 Compare July 1, 2020 19:04
src/gf.c Outdated
env.match.env = search.env;
env.match.issubty = 1;
env.match.fptr(ml, &env.match); // also matches all ambig and prior, as needed
jl_typemap_intersection_visitor(defs, offs, &env.match);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to figure out some sort of limit on this --- sometimes it returns a large number of items during inference, and we're unlikely to cut it down to 3. For example ==(::Any, ::Any). Maybe a larger arbitrary cutoff? Or maybe count how many methods had dispatchtuple signatures, since we know those are all disjoint, and early-out based on that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most efficient would perhaps be to change lim just to mean "number of intersecting signatures". I was slightly surprised I managed to preserve (and refine) the original meaning here. There's an early-exit in the first O(n^2) loop, which helps considerably for that case. Counting dispatchtuple signatures seems like a simple addition to that.

julia> @btime Base._methods_by_ftype(Tuple{typeof(==), Any, Any}, -1, Base.get_world_counter())
  76.585 μs (194 allocations: 12.88 KiB) # master
  17.504 ms (46063 allocations: 2.56 MiB) # now (258d04e8bfdb8189fd61199bccd7641650fec07b)
  17.386 ms (46063 allocations: 2.56 MiB) # with earlier limit from counting dispatch-tupletypes matches (ec8961c77614b80d21ee51d6353769c9ae57d899)

julia> @btime Base._methods_by_ftype(Tuple{typeof(==), Any, Any}, 4, Base.get_world_counter())
  3.824 μs (8 allocations: 528 bytes) # master
  280.331 μs (858 allocations: 76.42 KiB) # PR
  880.929 ns (9 allocations: 624 bytes) # with limit

vtjnash added 2 commits July 1, 2020 22:16
Rather than stopping at the first subtyping match, this scans the entire
table of methods for intersections for every lookup. It then filters the
results to remove ambiguities and sort for specificity order.

Also exports the has_ambiguity computation from ml-matches, as it's
cheap here to provide (already computed), and required by Core.Compiler
(and also more correct than what the optimizer had been doing, though
only on extreme hypothetical edge cases).
We know that dispatch tuple signatures are always disjoint and
most-specific over their intersection, so they are always going to be
included in the result. We can use that knowledge to abort the search
quickly as we discover that there will be many specific matches.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:latency Compiler latency
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants