Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a generic code-inferrability test #37193

Merged
merged 2 commits into from
Nov 15, 2020
Merged

Add a generic code-inferrability test #37193

merged 2 commits into from
Nov 15, 2020

Conversation

timholy
Copy link
Member

@timholy timholy commented Aug 25, 2020

This PR adds an overall "code quality assurance" test, checking elements of inferrability of Base and the stdlibs. It largely tests just the precompiled MethodInstances, so does not offer comprehensive coverage, but these are connected to invalidations so this offers a form of protection against regression of our recent progress.

There are several elements of the test:

  • it checks whether Base.require is "bullet-proof" against invalidation. Anytime Base.require gets invalidated, it's more than half a second before you can start loading the next package in a sequence, so invalidations here are a big problem from the standpoint of package-loading performance. Currently this is marked @test_broken, because we still have some string-processing code with inference problems that could invalidate these methods, but a goal will be to turn these into @tests.
  • it tests the overall number of "worrisome" MethodInstances (length(badcounts) in the test code), as well as the average number of total dependents (backedges, and the backedges of their backedges, etc) of these methods (meancounts in the test code). Currently this PR sets the bar at 1250 worrisome MethodInstances and 33 average dependents. If/when my current unmerged PRs make it in, these can drop to 1000 and 32, respectively. For reference, Julia 1.5 has 2030 such MethodInstances with an average of 38.5 dependents. Most of what's left to fix appears to be in the area of string processing, so with a bit more effort we may be able to drop these numbers quite dramatically.
  • it checks inferrability of specific functions (mostly the is* functions which should generally return Bool), and whether specific unfortunate variants of certain other methods (==, <, etc) have been inferred. The absence of such MethodInstances indicates that all code that calls them is well-inferred.

It's worth acknowledging that this may put a new burden on contributors: not only do they need to contribute the functionality, but they need to make sure it's not dramatically worsening inference. I think this is worth having. After all, we check whitespace, and whitespace is probably not nearly as important to the overall user experience as having well-inferred, invalidation-resistant code. Nevertheless, to mitigate the extra hurdles I've tried to put some hints near places where tests might fail to help developers find resources to help them.

Closes #36393

@KristofferC
Copy link
Member

KristofferC commented Aug 26, 2020

It's worth acknowledging that this may put a new burden on contributors: not only do they need to contribute the functionality, but they need to make sure it's not dramatically worsening inference.

Because of this, I am not sure this is the best way to do this. And with "this way" I mean, setting an arbitrary cap and then hard erroring CI when that is reached. This would be a bit similar to setting a hard cap on sysimage build time and failing CI when that is reached (assuming that it is completely deterministic). When that point is reached, it is likely a lot of CI will start failing and the whole thing grinds to a halt until someone with the skill goes in and pushes to number of invalidations down again.

Instead, (just like compilation time), I feel like this should be tracked via some time-series functionality where it can be monitored and when some threshold is reached it might start blinking red or something but hard failing CI because someone does a small fix that pushes the number of backedges from 163 to 170 while the PR that was merged yesterday took it from 130 to 160 seems too harsh.

After all, we check whitespace, and whitespace is probably not nearly as important to the overall user experience as having well-inferred, invalidation-resistant code.

While true, the difficulty of fixing the issue also needs to be weighed in which in one case is trivial and in the other very advanced.

@timholy
Copy link
Member Author

timholy commented Aug 26, 2020

Excellent points as usual. Maybe a separate buildbot pass then? I'm clueless about how one goes about implementing this.

@KristofferC
Copy link
Member

The CUDA people have a nice benchmark site based on codespeed that shows timelines for various benchmarks (https://speed.juliagpu.org/timeline/#/?exe=5,3,4&base=3+74&ben=grid&env=1&revs=50&equid=off&quarts=on&extr=on).

It would be nice to hookup the daily Nanosoldier benchmarks to an instance of that and at the same time we could maybe add some compilation time benchmarks (JuliaCI/BaseBenchmarks.jl#252) but also some "static" info like number of backedges etc so that could also be tracked over time.

@timholy
Copy link
Member Author

timholy commented Aug 26, 2020

That makes sense, but I worry about how many people actually look at those numbers. I don't check Julia's performance routinely. OTOH, as long as it's checked before the release perhaps that would be enough.

OTOH, if we had a separate build pass, there's no way you'd miss a change although one could decide to go ahead and merge anyway. It does make sense to separate it out from "standard" tests of course.

@KristofferC
Copy link
Member

KristofferC commented Aug 26, 2020

I don't check Julia's performance routinely. OTOH, as long as it's checked before the release perhaps that would be enough.

Maybe that is because we have no way of nicely visualizing it right now.

OTOH, if we had a separate build pass, there's no way you'd miss a change although one could decide to go ahead and merge anyway. It does make sense to separate it out from "standard" tests of course.

Well, that build pass is going to be green and ignored until it turns red and then it will still mostly be ignored. And when someone (you?) see it, then what do you do? You probably want to try to find the place where bad stuff happened but since you don't really have a pass / fail criteria the best thing is probably at that point to start building a time series to see if there are points where things suddenly got worse. You could try a bisection but if the regression is gradual, it won't necessarily give you something valuable.

@timholy
Copy link
Member Author

timholy commented Aug 26, 2020

Fair enough. I'm not sure how one gets started on that though.

@KristofferC
Copy link
Member

KristofferC commented Aug 28, 2020

Hm, @staticfloat how long do we save the logs from the buildbot? If we add something that prints to the build log, would it be easy to build up a timeseries from that?

For now, perhaps we can:

  • Keep the test error for the "non-fuzzy" things (like making sure Base.require is solid). This is a yes or no thing and a PR that breaks it can fix it.
  • Change the error for the fuzzy tests to warnings or something along those lines.

That means we can still merge the PR and the code gets regularly run so it doesn't bit rot and then we can think a bit more about how exactly the "holistic" measurement should be presented / checked?

@martinholters
Copy link
Member

Could we do a before/after comparison like is done e.g. for code coverage? "This PR increases invalidation susceptibility by 2.3%." or whatever.

@staticfloat
Copy link
Member

Hm, @staticfloat how long do we save the logs from the buildbot? If we add something that prints to the build log, would it be easy to build up a timeseries from that?

The buildbots clean things up after 120 days. You can do this if you want to, but note that buildbot is far from a performant database. Expect to wait like 0.5s for each datapoint you want to read.

You are the second or third person in the last month who has wanted to store timeseries data for packages. I've been thinking about hosting a timeseries database like influxdb or timescaledb for the (authenticated) Julia community. Combined with a small wrapper over HTTP (like this abandoned package), it would allow us to POST timeseries datapoints and then query timeseries back out with HTTP or even fancy things like Grafana. Interested, or overkill?

@timholy
Copy link
Member Author

timholy commented Aug 31, 2020

I do like the idea of a time series, but I think the more important one is to be notified when PRs add a lot of poorly-inferred calls. What I had in mind was best summarized by @martinholters (#37193 (comment)).

@timholy
Copy link
Member Author

timholy commented Aug 31, 2020

xref https://github.com/julia-actions/julia-invalidations, thanks @ianshmean! That covers packages, so we just need something like that for Julia itself (based on potential invalidations like this PR).

@timholy
Copy link
Member Author

timholy commented Nov 2, 2020

Is there any hope of doing something like this for Julia 1.6? We don't seem to have regressed, except there seem to be more ways to invalidate Base.require than I remember previously. At least one chunk of problems come from the preferences work (e.g., #38044, CC @staticfloat).

@KristofferC
Copy link
Member

@timholy could you check if https://github.com/JuliaLang/julia/tree/kc/stability_preferences helps with that?

@KristofferC
Copy link
Member

Is there any hope of doing something like this for Julia 1.6?

I'm personally ok with putting in whatever you think would be helpful. I still worry a bit about a potential future PR "deadlock" when this starts to erroring and new contributors get confused about what they did wrong. But we can always disable the tests if they become disruptive so 🤷

@timholy
Copy link
Member Author

timholy commented Nov 3, 2020

How about we aim to eliminate vulnerabilities in paths leading to Base.require and make that a hard-fail, and use warnings for the rest?

@timholy timholy force-pushed the teh/inference_qa branch 6 times, most recently from 5b69038 to 9d9d7b7 Compare November 9, 2020 11:40
@timholy
Copy link
Member Author

timholy commented Nov 9, 2020

OK, this seems to be working (test failure seems incidental). Ready for review.

I am not terribly attached to the is* tests, so we could delete those depending on how others feel about them. For quite a few of them there are more than 3 methods, so in cases of poor inference their caller is going to have to annotate at the call site anyway. This may still have some value since it can be nice to check what all methods do return, and if they're all Bool (as opposed to, say, Union{Bool,Missing}) then one can feel more confident about asserting that.

@timholy timholy merged commit 110f125 into master Nov 15, 2020
@timholy timholy deleted the teh/inference_qa branch November 15, 2020 20:25
@vchuravy
Copy link
Member

I am seeing tests added in this PR failing in #38437

@KristofferC
Copy link
Member

Did CI run last time 7 days ago? Maybe it had time to regress in that time?

@timholy
Copy link
Member Author

timholy commented Nov 16, 2020

Hmm, I should have re-run CI, although it seems broken right now.

Indeed, something has changed it. Specifically, the test that I think is the most important, the impossibility of invalidating Base.require without committing piracy or specializing non-exported methods of Base, is now failing.

julia> m = methods(Base.require).ms[2]
require(uuidkey::Base.PkgId) in Base at loading.jl:900

julia> mis = remove_unlikely_methodinstances(atrisk_triggers(m, first.(mivulnerabilities_exported)))
2-element Vector{MethodInstance}:
 MethodInstance for convert(::Type{Nothing}, ::Any)
 MethodInstance for convert(::Type{Union{}}, ::Any)

julia> using Cthulhu

julia> ascend(mis[1])
Choose a call for analysis (q to quit):
     convert(::Type{Nothing}, ::Any)
       setindex!(::IdDict{Dict{String, Any}, Nothing}, ::Any, ::Any)
   +     push!(::IdSet{Dict{String, Any}}, ::Dict{String, Any})
   +   setindex!(::IdDict{Any, Nothing}, ::Any, ::Any)
       setindex!(::IdDict{MethodInstance, Nothing}, ::Any, ::Any)
   +     push!(::IdSet{MethodInstance}, ::MethodInstance)
         push!(::IdSet{MethodInstance}, ::Any)
 >         treelist!(::FoldingTrees.Node{Cthulhu.Data{MethodInstance}}, ::IOBuffer, ::Any, ::String, ::IdSet{MethodInstance})

I'm showing the tree in collapsed form, folding everything below a "reasonable" call. It's basically all IdDict/IdSet-related. The offending convert call is

val = convert(V, val)
which has been with us at least since 8b6a8c2 (February), so it's hard to believe that this directly changed. Seems like it might be something else, like perhaps an improvement elsewhere?

There's a fix in #38455.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Julep: attempting to measure (& test?) "absolute" risk of invalidation
5 participants