Update index.qmd #548

Open · willtebbutt wants to merge 7 commits into master
Conversation

@willtebbutt (Member)

Addresses part of #547.

@gdalle does this read more correctly?

github-actions bot (Contributor) commented Nov 4, 2024

Preview the changes: https://turinglang.org/docs/pr-previews/548
Please avoid using the search feature and navigation bar in PR previews!

@gdalle commented Nov 4, 2024

I think there might still be a misunderstanding here. The compile keyword argument in AutoReverseDiff(; compile) only controls whether the tape is optimized after being recorded. It doesn't control whether a recorded tape is used only once or reused across function calls (which can introduce control-flow problems).
Tape reuse is instead determined by how the ADGradient is created, and in particular by whether an example vector x is passed to it. See these key lines in LogDensityProblemsAD.jl:

https://github.com/tpapp/LogDensityProblemsAD.jl/blob/e3401f21b5a065df0d5de38b37fad0e6650618f3/ext/LogDensityProblemsADADTypesExt.jl#L52-L54

https://github.com/tpapp/LogDensityProblemsAD.jl/blob/e3401f21b5a065df0d5de38b37fad0e6650618f3/ext/LogDensityProblemsADReverseDiffExt.jl#L45-L56

So the real questions are:

  • in which places does Turing "prepare" the ADGradient with an example x?
  • can the user of the library do anything about it, if they know that their log-likelihood has value-dependent control flow?

@willtebbutt (Member, Author)

Ah, I see. Maybe @torfjelde is the right person to ask about this? I've not yet dug into exactly where the ReverseDiff gradient stuff would need to be sorted out in Turing.jl.

@torfjelde (Member)

in which places does Turing "prepare" the ADGradient with an example x

Typically this occurs in the initial step of the sampler, i.e. once (there are exceptions, but in those cases we can't really do much better anyway).

can the user of the library do anything about it, if they know that their loglikelihood has value-dependent control flow?

Unfortunately not for ReverseDiff.jl in compiled mode (they can, however, simply not use compiled mode).
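
For reference, a rough sketch of what opting out of compiled mode can look like from the user's side. This is a hedged sketch, assuming a recent Turing version that re-exports the ADTypes constructors and whose Hamiltonian samplers accept an adtype keyword; `model` and the sampler settings are placeholders:

```julia
using Turing  # assumed to re-export AutoReverseDiff from ADTypes

# `model` stands for any Turing model whose log-likelihood contains
# value-dependent control flow.

# compile=false: no pre-recorded tape is replayed, so value-dependent
# branches are re-evaluated on every gradient call (at some extra cost).
chain = sample(model, NUTS(; adtype=AutoReverseDiff(; compile=false)), 1_000)

# compile=true would record and compile a single tape up front and replay it:
# faster, but it freezes whichever branch was taken when the tape was recorded.
# chain = sample(model, NUTS(; adtype=AutoReverseDiff(; compile=true)), 1_000)
```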

@torfjelde (Member) left a comment

Added some comments 👍

tutorials/docs-10-using-turing-autodiff/index.qmd (outdated)
@gdalle commented Nov 6, 2024

Unfortunately not for ReverseDiff.jl in compiled mode (they can however just not use compiled mode).

@torfjelde I think that's the key confusion between us. My understanding is that, regardless of whether it is compiled or not, the tape only captures execution for one version of the control flow. Thus, the tape always becomes invalid if the control flow changes (e.g. different branch of an if statement). Compilation just speeds up the differentiation but does not affect anything else, especially not correctness. Do you have a different understanding?

Source: https://juliadiff.org/ReverseDiff.jl/dev/api/#The-AbstractTape-API
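
For concreteness, here is a minimal sketch using ReverseDiff directly (independent of Turing and LogDensityProblemsAD) of how a recorded tape, compiled or not, captures a single control-flow path; the function f is made up for illustration:

```julia
using ReverseDiff

# Value-dependent control flow: the branch taken depends on x[1].
f(x) = x[1] > 0 ? sum(abs2, x) : sum(x)

x = [1.0, 2.0]
tape  = ReverseDiff.GradientTape(f, x)  # records the `x[1] > 0` branch
ctape = ReverseDiff.compile(tape)       # compilation only speeds up replaying the tape

ReverseDiff.gradient!(similar(x), ctape, x)            # correct: [2.0, 4.0]
ReverseDiff.gradient!(similar(x), ctape, [-1.0, 2.0])  # replays the recorded branch: [-2.0, 4.0], not [1.0, 1.0]
```

Replaying the uncompiled tape at the second input is just as wrong; only recording a fresh tape (e.g. calling ReverseDiff.gradient(f, x) directly) re-evaluates the branch.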

@devmotion (Member)

The tape is purely internal and neither stored nor reused if compile = Val(false).

Tape reuse is instead determined by how the ADGradient is created, and whether or not an example vector x is passed to it

This is wrong. The tape is only reused if compile = Val(true): https://github.com/tpapp/LogDensityProblemsAD.jl/blob/e3401f21b5a065df0d5de38b37fad0e6650618f3/ext/LogDensityProblemsADReverseDiffExt.jl#L51-L56
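
In other words, a rough paraphrase of the linked extension's behaviour (not the actual source; the function and names below are invented for illustration, with f standing in for the log density):

```julia
using ReverseDiff

function make_gradient(f, x; compile::Bool = false)
    if compile
        # compile = Val(true): record a tape once at `x`, compile it, and
        # replay it on every call -- the recorded control flow is frozen.
        tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))
        return z -> ReverseDiff.gradient!(similar(z), tape, z)
    else
        # compile = Val(false): no tape is stored; each call records a fresh,
        # purely internal tape, so value-dependent branches are re-evaluated.
        return z -> ReverseDiff.gradient(f, z)
    end
end
```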

@gdalle commented Nov 6, 2024

Okay, then IIUC it's a matter of inconsistency between:

  • the "compile" kwarg here and in LogDensityProblemsAD (which determines whether a tape is reused)
  • the compilation as defined by ReverseDiff, which is about speeding up an existing tape (and disconnected from the reuse issue)

@gdalle commented Nov 6, 2024

Note that in ADTypes, the "compile" argument to AutoReverseDiff is defined ambiguously too. So we should perhaps add more details to the struct, something like AutoReverseDiff(tape=true, compile=false)?

@gdalle commented Nov 6, 2024

I'd love for you to chime in here @devmotion @torfjelde: SciML/ADTypes.jl#91

@torfjelde (Member)

Note that in ADTypes, the "compile" argument to AutoReverseDiff is defined ambiguously too. So we should perhaps add more details to the struct, something like AutoReverseDiff(tape=true, compile=false)?

I'm not too opinionated about this, as compilation without caching seems somewhat useless? Are there scenarios where you'd like to do that?

@gdalle commented Nov 6, 2024

I'm not too opinionated about this, as compilation without caching seems somewhat useless? Are there scenarios where you'd like to do that?

Indeed you can't compile a tape you never record in the first place. In any case, I think the ambiguous terminology was fixed by SciML/ADTypes.jl#91. It's just a shame that the word "compile" was chosen instead of "record", given how both are used in ReverseDiff's documentation. But it's a sunk cost now.

@willtebbutt (Member, Author)

Thank you all for raising this problem and helping to fix it -- it's good that we've gotten to the bottom of it! I'm basically happy with this. @gdalle and @torfjelde, could you take a final look and let me know whether you're happy?

@torfjelde (Member)

It's just a shame that the word "compile" was chosen instead of "record", given how both are used in ReverseDiff's documentation. But it's a sunk cost now.

I agree with that 😕

@torfjelde (Member) left a comment

LGTM! Added one final comment that would be nice to shoehorn in here since we're already making alterations, but looks good :)

If the differentiation method is not specified in this way, Turing will default to using whatever the global AD backend is. Currently, this defaults to `ForwardDiff`.
Member

Sorry about this, but can we maybe also add the following paragraph at the very end (see the comment below)

Member

The simplest way to determine which backend to use is to benchmark the model using TuringBenchmarking.jl and run the following:

```julia
using TuringBenchmarking
benchmark_model(model, adbackends=[AutoForwardDiff(), AutoReverseDiff(), ...])
```

where the adbackends kwarg is used to specify which backends you want to benchmark.


When you fully switch to DI you can also use DifferentiationInterfaceTest.benchmark_differentiation for this ;)

@willtebbutt (Member, Author) commented Nov 9, 2024

This is a good point, @torfjelde. However, would it make more sense to just link to TuringBenchmarking.jl, rather than copying and pasting a code snippet which may become outdated? Alternatively, is there a simple way to put example benchmarking code in here which runs whenever the docs are built?
