Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Iterate over smaller set for setdiff[!](a,b) #29048

Merged
merged 2 commits into from
Sep 11, 2018

Conversation

laborg
Copy link
Contributor

@laborg laborg commented Sep 5, 2018

Attempt to fix #25317 by specialising on setdiff!(a::Set, b::Set) and iterating over a instead of b if length(a) < length(b) * 0.5. This 0.5 length threshold is a conservative guess after benchmarking has shown, that the iteration over a in case length(a) == length(b) slows down execution by ~20%.

setdiff() uses setdiff! internally, so both profit from the performance improvement.

Benchmarks

Code:

using Combinatorics
s = Set.([[1],1:450_000,1:550_000,1:1_000_000]);
p = collect(permutations([1,2,3,4],2));
for x in p
    for f in [setdiff, setdiff!]
        l,r = length(s[x[1]]), length(s[x[2]])
        print(lpad(string(f,"(",l,",",r,"): "),28))
        @btime $(f)(y...) evals=1 setup=(y=($copy($s[$x[1]]),$copy($s[$x[2]])))
    end
end

Master:

         setdiff(1,450000):   11.234 ms (5 allocations: 496 bytes)
        setdiff!(1,450000):   11.197 ms (0 allocations: 0 bytes)
         setdiff(1,550000):   12.250 ms (5 allocations: 496 bytes)
        setdiff!(1,550000):   12.740 ms (0 allocations: 0 bytes)
        setdiff(1,1000000):   24.942 ms (5 allocations: 496 bytes)
       setdiff!(1,1000000):   25.276 ms (0 allocations: 0 bytes)
         setdiff(450000,1):   1.742 ms (7 allocations: 9.00 MiB)
        setdiff!(450000,1):   742.000 ns (0 allocations: 0 bytes)
    setdiff(450000,550000):   21.525 ms (7 allocations: 9.00 MiB)
   setdiff!(450000,550000):   18.574 ms (0 allocations: 0 bytes)
   setdiff(450000,1000000):   37.417 ms (7 allocations: 9.00 MiB)
  setdiff!(450000,1000000):   35.589 ms (0 allocations: 0 bytes)
         setdiff(550000,1):   1.719 ms (7 allocations: 9.00 MiB)
        setdiff!(550000,1):   409.000 ns (0 allocations: 0 bytes)
    setdiff(550000,450000):   20.650 ms (7 allocations: 9.00 MiB)
   setdiff!(550000,450000):   18.009 ms (0 allocations: 0 bytes)
   setdiff(550000,1000000):   40.565 ms (7 allocations: 9.00 MiB)
  setdiff!(550000,1000000):   38.338 ms (0 allocations: 0 bytes)
        setdiff(1000000,1):   4.435 ms (7 allocations: 18.00 MiB)
       setdiff!(1000000,1):   748.000 ns (0 allocations: 0 bytes)
   setdiff(1000000,450000):   22.211 ms (7 allocations: 18.00 MiB)
  setdiff!(1000000,450000):   18.457 ms (0 allocations: 0 bytes)
   setdiff(1000000,550000):   24.539 ms (7 allocations: 18.00 MiB)
  setdiff!(1000000,550000):   20.906 ms (0 allocations: 0 bytes)

This PR:

         setdiff(1,450000):   815.000 ns (5 allocations: 496 bytes)
        setdiff!(1,450000):   474.000 ns (0 allocations: 0 bytes)
         setdiff(1,550000):   916.000 ns (5 allocations: 496 bytes)
        setdiff!(1,550000):   414.000 ns (0 allocations: 0 bytes)
        setdiff(1,1000000):   1.813 μs (5 allocations: 496 bytes)
       setdiff!(1,1000000):   715.000 ns (0 allocations: 0 bytes)
         setdiff(450000,1):   1.579 ms (7 allocations: 9.00 MiB)
        setdiff!(450000,1):   341.000 ns (0 allocations: 0 bytes)
    setdiff(450000,550000):   20.185 ms (7 allocations: 9.00 MiB)
   setdiff!(450000,550000):   18.336 ms (0 allocations: 0 bytes)
   setdiff(450000,1000000):   18.241 ms (7 allocations: 9.00 MiB)
  setdiff!(450000,1000000):   16.689 ms (0 allocations: 0 bytes)
         setdiff(550000,1):   1.575 ms (7 allocations: 9.00 MiB)
        setdiff!(550000,1):   391.000 ns (0 allocations: 0 bytes)
    setdiff(550000,450000):   19.741 ms (7 allocations: 9.00 MiB)
   setdiff!(550000,450000):   17.697 ms (0 allocations: 0 bytes)
   setdiff(550000,1000000):   39.603 ms (7 allocations: 9.00 MiB)
  setdiff!(550000,1000000):   38.109 ms (0 allocations: 0 bytes)
        setdiff(1000000,1):   3.316 ms (7 allocations: 18.00 MiB)
       setdiff!(1000000,1):   705.000 ns (0 allocations: 0 bytes)
   setdiff(1000000,450000):   21.766 ms (7 allocations: 18.00 MiB)
  setdiff!(1000000,450000):   18.333 ms (0 allocations: 0 bytes)
   setdiff(1000000,550000):   24.342 ms (7 allocations: 18.00 MiB)
  setdiff!(1000000,550000):   21.027 ms (0 allocations: 0 bytes)

base/set.jl Outdated
# get rid of pathological cases where length(s) <<< length(t)
# the 0.5*length is a very conservative threshold
function setdiff!(s::Set, t::Set)
if length(s) < length(t) * 0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about writing this as 2*length(s) < length(t)? That will be simpler and faster than converting to float (although it doesn't really matter when doing much more expensive operations on collections, but waste not, want not and all that.)

@StefanKarpinski
Copy link
Member

Is the 2x length ration threshold empirically determined or is there some intuition behind it? Either way it's probably worth noting in a comment.

@laborg
Copy link
Contributor Author

laborg commented Sep 10, 2018

I made some statistics with Ints as elements, starting with 1:1 ratio and it appeared that a good threshold for those is around 0.8, but this depends on a couple of things:

  • size of both sets
  • size of the conjuction of both sets
  • cost of hash-lookup (ht_keyindex())
  • cost of removal (_delete())

With those variables its hard to get to a generic threshold and therefore the conservative estimation of 0.5 (~things should not regress performance wise).
But what I saw is that substantial performance improvement is available for large size discrepancies anyway.

Maybe it would help to implement a faster filter_in_one_pass to get rid of the many lookups...

Copy link
Member

@rfourquet rfourquet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
I believe nanosoldier is on holidays, but the benchmarks in the OP are sufficiently convincing for me :)

@fredrikekre
Copy link
Member

@nanosoldier runbenchmarks(ALL, vs=":release-1.0")

@fredrikekre
Copy link
Member

Derp
@nanosoldier runbenchmarks(ALL, vs=":master")

@laborg
Copy link
Contributor Author

laborg commented Sep 11, 2018

I want to find a better solution which also includes the other set functions (symdiff, union, intersect) plagued by this asymmetric behaviour - should I in the mean time close this PR?

@StefanKarpinski
Copy link
Member

It seems better to me to have a partial solution until we can have a more complete one. You can always revert this later if it's easier.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@mbauman
Copy link
Member

mbauman commented Sep 11, 2018

Benchmarks look to be their usual noisy selves. We do have a few setdiff benchmarks, but the arguments used don't appear to have large enough size disparities to show up in the results. I'll tag this as "needs benchmark" even though you provided good benchmarks in the first post just so we don't forget to migrate them over to BaseBenchmarks.jl. Great first PR!

@mbauman mbauman added the potential benchmark Could make a good benchmark in BaseBenchmarks label Sep 11, 2018
@mbauman mbauman merged commit 773540d into JuliaLang:master Sep 11, 2018
KristofferC pushed a commit that referenced this pull request Sep 12, 2018
* Iterate over smaller set for setdiff[!]

* Add comment and change to integer multiplication

(cherry picked from commit 773540d)
@KristofferC KristofferC mentioned this pull request Sep 12, 2018
KristofferC pushed a commit that referenced this pull request Feb 11, 2019
* Iterate over smaller set for setdiff[!]

* Add comment and change to integer multiplication

(cherry picked from commit 773540d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
potential benchmark Could make a good benchmark in BaseBenchmarks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

setdiff!(a,b) slower than setdiff(a,b) when b is large (and viceversa when a is large)
7 participants