Test Enzyme and reexport ADTypes.AutoEnzyme #1887

Draft
wants to merge 88 commits into
base: master

devmotion
Member

@devmotion devmotion commented Sep 28, 2022

Note: This does not work yet


I opened this PR to make it easier to debug (and possibly fix) issues with Enzyme.

Currently, the following example does not work (note that the snippet does not require this PR, which solely reexports AutoEnzyme at this point):

using Turing
using Enzyme
using ADTypes
Enzyme.API.runtimeActivity!(true);
Enzyme.API.typeWarning!(false);

@model function model()
    m ~ Normal(0, 1)
    s ~ InverseGamma()
    x ~ Normal(m, s)
end

sample(model() | (; x=0.5), NUTS(; adtype = ADTypes.AutoEnzyme()), 10)

With Enzyme#main my Julia (1.8.1) segfaults. An incomplete (it filled my whole terminal) output: https://gist.github.com/devmotion/1352197f2354c6fecddd7b778ec4bcf7#file-log-txt

The example works (latest releases of Turing, Enzyme, and ADTypes on Julia 1.10.0) but the following warnings show up:

warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors

@coveralls

coveralls commented Nov 13, 2022

Pull Request Test Coverage Report for Build 12656384825

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 775 unchanged lines in 17 files lost coverage.
  • Overall coverage decreased (-50.0%) to 26.906%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/essential/container.jl | 1 | 87.1% |
| src/variational/VariationalInference.jl | 4 | 0.0% |
| src/mcmc/repeat_sampler.jl | 16 | 0.0% |
| src/mcmc/is.jl | 17 | 0.0% |
| src/stdlib/RandomMeasures.jl | 22 | 0.0% |
| ext/TuringDynamicHMCExt.jl | 29 | 0.0% |
| src/mcmc/mh.jl | 42 | 52.99% |
| src/mcmc/abstractmcmc.jl | 42 | 4.35% |
| src/mcmc/emcee.jl | 49 | 0.0% |
| ext/TuringOptimExt.jl | 50 | 0.0% |
Totals Coverage Status
Change from base Build 12434902283: -50.0%
Covered Lines: 420
Relevant Lines: 1561

💛 - Coveralls

@codecov

codecov bot commented Nov 13, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.18%. Comparing base (0c3d3d0) to head (6b2b65f).

❗ There is a different number of reports uploaded between BASE (0c3d3d0) and HEAD (6b2b65f): HEAD has 39 fewer uploads than BASE (55 on BASE, 16 on HEAD).
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1887       +/-   ##
===========================================
- Coverage   85.39%   32.18%   -53.21%     
===========================================
  Files          21       21               
  Lines        1588     1566       -22     
===========================================
- Hits         1356      504      -852     
- Misses        232     1062      +830     


@wsmoses
Collaborator

wsmoses commented Jun 26, 2023

Also, if you want to disable the warnings, you can do so like this (https://github.com/EnzymeAD/Enzyme.jl/blob/c29e6119c7963ddb22f1363726f762455748e193/src/api.jl#L414):

Enzyme.API.typeWarning!(false)

@wsmoses
Collaborator

wsmoses commented Jun 26, 2023

You may also want to set the version to 0.11.2, since your CI is currently running 0.11.0 (⌃ [7da242da] Enzyme v0.11.0).

@wsmoses
Collaborator

wsmoses commented Jun 27, 2023

@devmotion this PR (EnzymeAD/Enzyme.jl#914) should fix the immediate issues you see on CI if you want to try.


using AdvancedPS: AdvancedPS

include("container.jl")

export @model,
    @varname,
    AutoEnzyme,
Member

Let's export this as Turing.Experimental.AutoEnzyme until Enzyme becomes more stable.

Collaborator

What is the threshold for being considered stable here?

Member

@mhauru and @penelopeysm probably have a lot more experience on this.

My heuristic threshold:

  • Enzyme passes all Distributions.jl and Turing.jl tests
  • No known segfaults for Enzyme

for a continuous period of 8 weeks.

@mhauru
Member

mhauru commented Oct 24, 2024

I've merged the latest master and upgraded to Enzyme v0.12. We are still being held back from v0.13 by Bijectors.jl. There are a number of new test failures, because

  1. There seem to be some regressions, tests that used to pass but now don't. I need to investigate.
  2. We had previously only tested Enzyme on the HMC and SGHMC tests. I hadn't realised/had forgotten that we weren't testing Enzyme with the full test suite. I've now added it to test/mcmc/gibbs.jl, test/mcmc/abstractmcmc.jl, and a couple of others as well: everywhere there is a loop over AD backends.
  3. Runtime activity is no longer a global setting in Enzyme. Having to change how we set runtime activity I think is a good opportunity to take stock of its effect, so I've just removed using it for now. Let's try to get to a point where the only test failures are ones where Enzyme says "you may need to use runtime activity", see how many there are, and only then enable it.

Getting Bijectors.jl to support Enzyme v0.13 I think has to be the next step, because otherwise any of the failures we see here might already be fixed on v0.13, and thus minimising and reporting them is pointless.
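
For reference, a minimal sketch of what per-backend runtime activity could look like, assuming Enzyme v0.13 or later (where `set_runtime_activity` acts on a mode object) and an ADTypes release whose `AutoEnzyme` accepts a `mode` keyword. The variable names are illustrative and not taken from this PR:

```julia
using ADTypes: AutoEnzyme
using Enzyme: Enzyme

# Default: no runtime activity requested, matching the tests after this change.
adtype_default = AutoEnzyme()

# Opt in for a single backend object instead of flipping a global switch.
# Assumption: Enzyme >= 0.13, where runtime activity is a property of the mode.
adtype_runtime = AutoEnzyme(; mode=Enzyme.set_runtime_activity(Enzyme.Reverse))
```

Either object could then be passed as `NUTS(; adtype=...)`, so runtime activity would only be switched on for the tests that turn out to need it.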

@wsmoses
Collaborator

wsmoses commented Nov 6, 2024

gentle bump here

@yebai
Member

yebai commented Nov 11, 2024

It would be good to address EnzymeAD/Enzyme.jl#1812, #2307 and TuringLang/Bijectors.jl#341 before merging this PR.

@mhauru
Member

mhauru commented Nov 12, 2024

TuringLang/Bijectors.jl#341 at the very least needs addressing, because it's currently holding us back from running a recent Enzyme version here, and thus we don't know if the test suite would pass on a recent version. TuringLang/Bijectors.jl#341 passes tests on 1.10, but on 1.11 the Enzyme tests fail because of the accursed extension load order issue. The fix for that is currently waiting on this one: TuringLang/Bijectors.jl#346 EDIT and consequently this: TuringLang/Bijectors.jl#349

@wsmoses
Collaborator

wsmoses commented Nov 29, 2024

With the bijectors fix landed, I suppose this is (again) ready to go?

@wsmoses
Collaborator

wsmoses commented Nov 29, 2024

Separately, @yebai, you appear to have removed my permissions to run tests; could that be restored?

@yebai
Member

yebai commented Dec 1, 2024

separately @yebai you appear to have removed my permissions to run tests, if that can be restored

I don't know what happened precisely -- some changes were made to the TuringLang repos permissions to make CI work more robustly.

@wsmoses
Collaborator

wsmoses commented Dec 2, 2024

┌ Warning: Could not use exact versions of packages in manifest, re-resolving
└ @ Pkg.Operations /opt/hostedtoolcache/julia/1.11.1/x86/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:1902
ERROR: Unsatisfiable requirements detected for package DynamicPPL [366bfd00]:
 DynamicPPL [366bfd00] log:
 ├─possible versions are: 0.1.0 - 0.31.0 or uninstalled
 ├─restricted to versions [0.29, 0.30.4 - 0.31] by Turing [fce5fe82], leaving only versions: [0.29.0 - 0.29.2, 0.30.4 - 0.31.0]
 │ └─Turing [fce5fe82] log:
 │   ├─possible versions are: 0.35.3 or uninstalled
 │   └─Turing [fce5fe82] is fixed to version 0.35.3
 ├─restricted by compatibility requirements with Mooncake [da2b9cff] to versions: 0.29.0 - 0.30.5 or uninstalled, leaving only versions: [0.29.0 - 0.29.2, 0.30.4 - 0.30.5]
 │ └─Mooncake [da2b9cff] log:
 │   ├─possible versions are: 0.3.0 - 0.4.53 or uninstalled
 │   └─restricted to versions 0.4.19 - 0.4 by project [23fc8c3f], leaving only versions: 0.4.19 - 0.4.53
 │     └─project [23fc8c3f] log:
 │       ├─possible versions are: 0.0.0 or uninstalled
 │       └─project [23fc8c3f] is fixed to version 0.0.0
 └─restricted by compatibility requirements with Bijectors [76274a88] to versions: 0.31.0 or uninstalled — no versions left
   └─Bijectors [76274a88] log:
     ├─possible versions are: 0.1.0 - 0.15.2 or uninstalled
     ├─restricted to versions 0.14 - 0.15 by Turing [fce5fe82], leaving only versions: 0.14.0 - 0.15.2
     │ └─Turing [fce5fe82] log: see above
     └─restricted by compatibility requirements with Enzyme [7da242da] to versions: [0.1.0 - 0.13.16, 0.15.0 - 0.15.2] or uninstalled, leaving only versions: 0.15.0 - 0.15.2
       └─Enzyme [7da242da] log:
         ├─possible versions are: 0.1.0 - 0.13.18 or uninstalled
         └─restricted to versions 0.13 by project [23fc8c3f], leaving only versions: 0.13.0 - 0.13.18
           └─project [23fc8c3f] log: see above
Stacktrace:

I'd try to help, but I don't have permission to edit things or rerun CI xD

@penelopeysm
Member

penelopeysm commented Dec 2, 2024

It should resolve with Mooncake 0.4.54 as that allows for DPPL=0.31.0. Don't know why CI isn't picking up the new version.

└─Mooncake [da2b9cff] log:
     ├─possible versions are: 0.3.0 - 0.4.53 or uninstalled

0.4.54 should have been available a few hours ago.

@mhauru
Member

mhauru commented Dec 4, 2024

The registry issue is sorted, now merrily running with the latest Enzyme.

@wsmoses
Collaborator

wsmoses commented Dec 4, 2024

Check ADType: Error During Test at /home/runner/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:334
  Got exception outside of a @test
  ArgumentError: Unsupported ADType: ADTypes.AutoEnzyme{Nothing, Nothing}
  Stacktrace:
    [1] Main.ADUtils.ADTypeCheckContext(adbackend::ADTypes.AutoEnzyme{Nothing, Nothing}, child::DynamicPPL.DefaultContext)
      @ Main.ADUtils ~/work/Turing.jl/Turing.jl/test/test_utils/ad_utils.jl:102
    [2] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:336 [inlined]
    [3] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1704 [inlined]
    [4] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:335 [inlined]
    [5] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1793 [inlined]
    [6] top-level scope
      @ ~/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:22
    [7] include(fname::String)
      @ Main ./sysimg.jl:38
    [8] macro expansion
      @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:237 [inlined]
    [9] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:26 [inlined]
   [10] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1704 [inlined]
   [11] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:56 [inlined]
   [12] macro expansion
      @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:237 [inlined]
   [13] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:54 [inlined]
   [14] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1704 [inlined]
   [15] top-level scope
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:34
   [16] include(fname::String)
      @ Main ./sysimg.jl:38
   [17] top-level scope
      @ none:6
   [18] eval
      @ ./boot.jl:430 [inlined]
   [19] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:296
   [20] _start()
      @ Base ./client.jl:531

Looks like something in turing needs to be updated?

@mhauru
Member

mhauru commented Dec 5, 2024

Fixed the above issue that @wsmoses pointed out.

We are seeing a lot of illegal type analysis errors, which I suspect are all instances of EnzymeAD/Enzyme.jl#2169.
Too many to reasonably mark as broken, I think we need to get that fixed first and then have another look.

@wsmoses
Collaborator

wsmoses commented Dec 5, 2024

So this is indicative of a union (which isn't presently fully supported, at least without setting Enzyme.API.strictAliasing!(false) which may permit it).

Something around here https://github.com/TuringLang/DynamicPPL.jl/blob/2252a9b6012da8e2ac56353770a0f848f6874357/src/abstract_varinfo.jl#L791 is sometimes an int and other times a double. I think this will need to be fixed on the turing side.
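
To illustrate the kind of pattern being described, here is a minimal sketch of a function whose result is sometimes an integer and sometimes a float. The functions `unstable` and `stable` are hypothetical and are not the DynamicPPL code linked above; `Base.return_types` is used only to show what Julia infers:

```julia
# Hypothetical example: one branch returns the Float64 input, the other an Int literal.
unstable(x::Float64) = x > 0 ? x : 0

# Same logic, but zero(x) has the same type as x, so both branches agree.
stable(x::Float64) = x > 0 ? x : zero(x)

# Ask the compiler which return type it infers for a Float64 argument.
unstable_type = only(Base.return_types(unstable, (Float64,)))
stable_type = only(Base.return_types(stable, (Float64,)))

println("unstable: ", unstable_type)
println("stable:   ", stable_type)
```

If the actual cause is of this shape, replacing an integer literal with something like `zero(x)` would be one way to remove the union, though that is an assumption about the fix rather than a confirmed diagnosis.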

@mhauru
Member

mhauru commented Dec 6, 2024

If

@model function gdemo_copy()
    s ~ InverseGamma(2, 3)
end

fails, and it does, then I assume most Turing models are affected, since they don't really get simpler than that. We could look into trying to chase down that Union somewhere, but I'm surprised that this is an issue given that we should have type stability at most function boundaries for such a simple model, especially as "deep" in as invlink_with_logpdf (we've made sure of that for performance reasons). It could end up taking quite a lot of time to track down the issue on the Turing side.

Enzyme.API.strictAliasing!(false) doesn't seem to save us, the simplest MWE in the issue I made still fails.

@wsmoses has something in Enzyme gotten stricter so that these illegal type analysis errors come up more often nowadays? Some of the errors are from tests that already passed at an earlier point.

@yebai
Member

yebai commented Dec 16, 2024

So this is indicative of a union (which isn't presently fully supported, at least without setting Enzyme.API.strictAliasing!(false) which may permit it). Something around here https://github.com/TuringLang/DynamicPPL.jl/blob/2252a9b6012da8e2ac56353770a0f848f6874357/src/abstract_varinfo.jl#L791 is sometimes an int and other times a double. I think this will need to be fixed on the turing side.

Ideally, a proper fix should be added to Enzyme instead of requiring packages like Turing.jl / DynamicPPL.jl to work around it. One good reason is that Turing allows arbitrary Julia code inside the @model macro, which will get hit again if users write code that involves union.

@wsmoses
Collaborator

wsmoses commented Jan 4, 2025

So this is indicative of a union (which isn't presently fully supported, at least without setting Enzyme.API.strictAliasing!(false) which may permit it). Something around here https://github.com/TuringLang/DynamicPPL.jl/blob/2252a9b6012da8e2ac56353770a0f848f6874357/src/abstract_varinfo.jl#L791 is sometimes an int and other times a double. I think this will need to be fixed on the turing side.

Ideally, a proper fix should be added to Enzyme instead of requiring packages like Turing.jl / DynamicPPL.jl to work around it. One good reason is that Turing allows arbitrary Julia code inside the @model macro, which will get hit again if users write code that involves union.

Sure, but that's doable after this merges, so we can at least confirm and test that the things currently expected to work do work. Not all AD tools support all code (at least at any given time). Zygote historically did not, and still does not, support mutation (and that's fine here). Enzyme historically (but not presently) didn't like type-unstable code. Mooncake presently fails on this PR/CI (see below).

Can we just mark whatever isn't working now as test_broken, open issues, and at least track where things are at? That seems to be the case with all other ADs here.

2024-12-05T15:34:20.5424999Z   StackOverflowError:
2024-12-05T15:34:20.5425533Z   Stacktrace:
2024-12-05T15:34:20.5437430Z         [1] set_to_zero!!(x::Mooncake.PossiblyUninitTangent{Any})
2024-12-05T15:34:20.5438650Z           @ Mooncake ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:658
2024-12-05T15:34:20.5439433Z         [2] tuple_map
2024-12-05T15:34:20.5440277Z           @ ~/.julia/packages/Mooncake/19jl1/src/utils.jl:46 [inlined]
2024-12-05T15:34:20.5441047Z         [3] set_to_zero!!
2024-12-05T15:34:20.5441902Z           @ ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:657 [inlined]
2024-12-05T15:34:20.5442677Z         [4] set_to_zero!!
2024-12-05T15:34:20.5443513Z           @ ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:663 [inlined]
2024-12-05T15:34:20.5444289Z         [5] tuple_map
2024-12-05T15:34:20.5445094Z           @ ~/.julia/packages/Mooncake/19jl1/src/utils.jl:46 [inlined]
2024-12-05T15:34:20.5445843Z         [6] set_to_zero!!
2024-12-05T15:34:20.5446672Z           @ ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:657 [inlined]
2024-12-05T15:34:20.5448469Z         [7] set_to_zero!!(x::Mooncake.Tangent{@NamedTuple{gdemo_copy::Mooncake.MutableTangent{@NamedTuple{contents::Mooncake.PossiblyUninitTangent{Any}}}}})
2024-12-05T15:34:20.5450076Z           @ Mooncake ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:661
2024-12-05T15:34:20.5451259Z         [8] set_to_zero!!(x::Mooncake.PossiblyUninitTangent{Any})
2024-12-05T15:34:20.5452390Z           @ Mooncake ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:659
2024-12-05T15:34:20.5453468Z   --- the above 7 lines are repeated 26659 more times ---
2024-12-05T15:34:20.5491994Z    [186622] tuple_map
2024-12-05T15:34:20.5492882Z           @ ~/.julia/packages/Mooncake/19jl1/src/utils.jl:46 [inlined]
2024-12-05T15:34:20.5493655Z    [186623] set_to_zero!!
2024-12-05T15:34:20.5494514Z           @ ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:657 [inlined]
2024-12-05T15:34:20.5495177Z    [186624] set_to_zero!!
2024-12-05T15:34:20.5495713Z           @ ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:663 [inlined]
2024-12-05T15:34:20.5496189Z    [186625] tuple_map
2024-12-05T15:34:20.5496930Z           @ ~/.julia/packages/Mooncake/19jl1/src/utils.jl:46 [inlined]
2024-12-05T15:34:20.5497630Z    [186626] set_to_zero!!
2024-12-05T15:34:20.5498141Z           @ ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:657 [inlined]
2024-12-05T15:34:20.5499255Z    [186627] set_to_zero!!(x::Mooncake.Tangent{@NamedTuple{gdemo_copy::Mooncake.MutableTangent{@NamedTuple{contents::Mooncake.PossiblyUninitTangent{Any}}}}})
2024-12-05T15:34:20.5500207Z           @ Mooncake ~/.julia/packages/Mooncake/19jl1/src/tangents.jl:661
2024-12-05T15:34:38.4612463Z dynamic model: Error During Test at /home/runner/work/Turing.jl/Turing.jl/test/mcmc/gibbs.jl:119
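
As a rough sketch of the "mark it as broken and track it" approach suggested above, using only the `Test` standard library. The functions `working_gradient` and `failing_gradient` are hypothetical stand-ins, not Turing or Enzyme calls:

```julia
using Test

# Hypothetical stand-in for a gradient call that works with a given backend.
working_gradient(x) = 2x

# Hypothetical stand-in for a call that currently errors with that backend.
failing_gradient(x) = error("illegal type analysis")

@testset "AD backend status (sketch)" begin
    # Cases that are expected to work are tested normally.
    @test working_gradient(1.0) == 2.0
    # Known failures are recorded as Broken instead of failing the suite.
    # If this ever starts passing, @test_broken reports an error, which
    # prompts switching it back to a plain @test.
    @test_broken failing_gradient(1.0) == 2.0
end
```

The design point is that a broken test does not fail CI, but it does show up in the summary and flags itself once the underlying issue is fixed.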
