unsorted method tables #35983
Conversation
(force-pushed from 73fab73 to ba39948)
The most recent interesting commit history (before squashing) is at ba39948, but I'm pretty happy with this now. Performance is harder to measure directly, since we expect most common cases now to hit the dispatch-tuple-types cache, but (given that) I think this is in a stable state now to review/merge.
We should look at the load times of various packages, and some other latencies. I'm building it now, and the first thing that jumped out at me is that loading LinearAlgebra during bootstrap is significantly faster --- I imagine since it defines a large number of methods with complex signatures. Base itself is a hair slower. Anyway, we should give this a thorough benchmarking.
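For reference, a minimal sketch of the kind of measurement being discussed (an assumed workflow, not a harness from this PR; `Example` stands in for any package). Run each timing in a fresh session, since both `using` and the compile cache are warm after the first call:

```julia
# Sketch only: run in a fresh session per build.
@time using Example                        # first-load latency of a package

# Precompile-file generation cost, via the (internal) entry point:
pkg = Base.identify_package("Example")     # -> Base.PkgId
@time Base.compilecache(pkg)
```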
This has come a long way and is now pretty close, but I still see some regressions vs. master.
PR:
Any timing without #36436 probably isn't too useful, but yes, loading Plots hits the harder cases quite heavily (far too much compilation due to using Artifacts and too many difficult backedges from a large precompile file), and I haven't worked as much on optimizing those yet. I'm surprised by how slow your computer is, though; were you also using
My computer is just really old. I also see the 2x speedup in ApproxFun. Makes sense that different cases will have different perf characteristics here.
Curious also what you see if you apply #35384 too. There are some ordering issues if we were to land that directly on master now, which could cause some rare correctness issues (that's solved here, by simply not relying on ordering). This PR (with #36436):
And then adding in #35384 also:
(note that about 2 seconds there, almost half of that load time, appears in the Profile inside
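For anyone reproducing that kind of observation, a sketch of how such a load profile can be collected (assumed workflow, not prescribed by this PR; `Plots` stands in for whichever package is being measured):

```julia
using Profile

Profile.clear()
# Load the package under the profiler; Base.require is (roughly) the
# internal entry point behind `using`.
@profile Base.require(Base.identify_package("Plots"))
Profile.print(mincount = 50)   # hide frames with only a few samples
```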
(force-pushed from 227fd8f to 4c75c70)
src/gf.c (Outdated)
env.match.env = search.env;
env.match.issubty = 1;
env.match.fptr(ml, &env.match); // also matches all ambig and prior, as needed
jl_typemap_intersection_visitor(defs, offs, &env.match);
It would be good to figure out some sort of limit on this --- sometimes it returns a large number of items during inference, and we're unlikely to cut it down to 3. For example, ==(::Any, ::Any). Maybe a larger arbitrary cutoff? Or maybe count how many methods had dispatch-tuple signatures, since we know those are all disjoint, and early-out based on that?
Most efficient would perhaps be to change lim just to mean "number of intersecting signatures". I was slightly surprised I managed to preserve (and refine) the original meaning here. There's an early exit in the first O(n^2) loop, which helps considerably for that case. Counting dispatch-tuple signatures seems like a simple addition to that.
julia> @btime Base._methods_by_ftype(Tuple{typeof(==), Any, Any}, -1, Base.get_world_counter())
76.585 μs (194 allocations: 12.88 KiB) # master
17.504 ms (46063 allocations: 2.56 MiB) # now (258d04e8bfdb8189fd61199bccd7641650fec07b)
17.386 ms (46063 allocations: 2.56 MiB) # with earlier limit from counting dispatch-tupletypes matches (ec8961c77614b80d21ee51d6353769c9ae57d899)
julia> @btime Base._methods_by_ftype(Tuple{typeof(==), Any, Any}, 4, Base.get_world_counter())
3.824 μs (8 allocations: 528 bytes) # master
280.331 μs (858 allocations: 76.42 KiB) # PR
880.929 ns (9 allocations: 624 bytes) # with limit
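A rough Julia rendition of the counting early-out under discussion (the real logic lives in C in gf.c; this function and its shape are illustrative, not the PR's code):

```julia
# Dispatch-tuple signatures are pairwise disjoint, so every one that
# intersects the query is guaranteed to survive ambiguity filtering.
# Once that count alone exceeds `lim`, we can bail out without finishing
# the scan.
function count_limited_matches(query::Type, methods, lim::Integer)
    matches = Any[]
    ndisjoint = 0
    for m in methods
        typeintersect(query, m.sig) === Union{} && continue
        push!(matches, m)
        if Base.isdispatchtuple(m.sig)
            ndisjoint += 1
            ndisjoint > lim && return nothing  # guaranteed over the limit
        end
    end
    return matches  # ambiguity removal and specificity sorting still follow
end
```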
Rather than stopping at the first subtyping match, this scans the entire method table for intersections on every lookup. It then filters the results to remove ambiguities and sorts them into specificity order. It also exports the has_ambiguity computation from ml-matches, since it's cheap to provide here (already computed) and is required by Core.Compiler (it's also more correct than what the optimizer had been doing, though only in extreme hypothetical edge cases).
We know that dispatch-tuple signatures are always disjoint and most specific over their intersection, so they will always be included in the result. We can use that knowledge to abort the search quickly once we discover that there will be too many specific matches.
This experiment tries to see what happens if we completely un-sort the method tables. It turns out we can still do surprisingly well on performance without those caches (lookup costs range from about 2-10x slower across a small sample of queries tested). This lets us reconsider which kinds of caches and trees we want to use to store this data.
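A much-simplified conceptual sketch of the two lookup strategies (mine, not the PR's code; real dispatch also handles caching, ambiguity, and world ages):

```julia
# Sorted table: methods are kept in specificity order, so lookup can stop
# at the first signature the query is a subtype of.
first_match(query::Type, sorted_methods) =
    findfirst(m -> query <: m.sig, sorted_methods)

# Unsorted table: storage order carries no meaning, so lookup collects
# every intersecting method and recovers an order afterwards. Note that
# Base.morespecific is not a transitive relation, which is exactly the
# subtlety the commit history below wrestles with.
function all_matches(query::Type, methods)
    hits = [m for m in methods if typeintersect(query, m.sig) !== Union{}]
    return sort!(hits; lt = (a, b) -> Base.morespecific(a.sig, b.sig))
end
```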
[Ignore the commit history here, unless you really want to see all of my mistakes in handling a non-transitive sort. You might find them informative if you wonder about the algorithm I've currently landed on, but I imagine they'd be pretty hard to comprehend on their own.]