diff --git a/markdown/00-introduction/00_introduction.md b/markdown/00-introduction/00_introduction.md index bf5fa78e3..1555c365f 100644 --- a/markdown/00-introduction/00_introduction.md +++ b/markdown/00-introduction/00_introduction.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/0-introduction/" title: "Introduction to Turing" permalink: "/:collection/:name/" --- diff --git a/markdown/01-gaussian-mixture-model/01_gaussian-mixture-model.md b/markdown/01-gaussian-mixture-model/01_gaussian-mixture-model.md index c4ca9759d..70575dcbc 100644 --- a/markdown/01-gaussian-mixture-model/01_gaussian-mixture-model.md +++ b/markdown/01-gaussian-mixture-model/01_gaussian-mixture-model.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/1-gaussianmixturemodel/" title: "Unsupervised Learning using Bayesian Mixture Models" permalink: "/:collection/:name/" --- diff --git a/markdown/02-logistic-regression/02_logistic-regression.md b/markdown/02-logistic-regression/02_logistic-regression.md index 319b9ae05..1dffbf3ae 100644 --- a/markdown/02-logistic-regression/02_logistic-regression.md +++ b/markdown/02-logistic-regression/02_logistic-regression.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/2-logisticregression/" title: "Bayesian Logistic Regression" permalink: "/:collection/:name/" --- diff --git a/markdown/03-bayesian-neural-network/03_bayesian-neural-network.md b/markdown/03-bayesian-neural-network/03_bayesian-neural-network.md index 02f2dd9d2..d1a7a1248 100644 --- a/markdown/03-bayesian-neural-network/03_bayesian-neural-network.md +++ b/markdown/03-bayesian-neural-network/03_bayesian-neural-network.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/3-bayesnn/" title: "Bayesian Neural Networks" permalink: "/:collection/:name/" --- diff --git a/markdown/04-hidden-markov-model/04_hidden-markov-model.md b/markdown/04-hidden-markov-model/04_hidden-markov-model.md index 3b611166d..5ddba8f52 100644 --- a/markdown/04-hidden-markov-model/04_hidden-markov-model.md +++ b/markdown/04-hidden-markov-model/04_hidden-markov-model.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/4-bayeshmm/" title: "Bayesian Hidden Markov Models" permalink: "/:collection/:name/" --- diff --git a/markdown/05-linear-regression/05_linear-regression.md b/markdown/05-linear-regression/05_linear-regression.md index e241c8dc5..86c65ac78 100644 --- a/markdown/05-linear-regression/05_linear-regression.md +++ b/markdown/05-linear-regression/05_linear-regression.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/5-linearregression/" title: "Linear Regression" permalink: "/:collection/:name/" --- diff --git a/markdown/06-infinite-mixture-model/06_infinite-mixture-model.md b/markdown/06-infinite-mixture-model/06_infinite-mixture-model.md index 327f1e2a4..8c21051f4 100644 --- a/markdown/06-infinite-mixture-model/06_infinite-mixture-model.md +++ b/markdown/06-infinite-mixture-model/06_infinite-mixture-model.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/6-infinitemixturemodel/" title: "Probabilistic Modelling using the Infinite Mixture Model" permalink: "/:collection/:name/" --- diff --git a/markdown/07-poisson-regression/07_poisson-regression.md b/markdown/07-poisson-regression/07_poisson-regression.md index 62ff64041..db9a20bc9 100644 --- a/markdown/07-poisson-regression/07_poisson-regression.md +++ b/markdown/07-poisson-regression/07_poisson-regression.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/7-poissonregression/" title: "Bayesian Poisson Regression" permalink: "/:collection/:name/" --- diff --git 
a/markdown/08-multionomial-regression/08_multinomial-logistic-regression.md b/markdown/08-multionomial-regression/08_multinomial-logistic-regression.md index b9be8fe85..820d75bb5 100644 --- a/markdown/08-multionomial-regression/08_multinomial-logistic-regression.md +++ b/markdown/08-multionomial-regression/08_multinomial-logistic-regression.md @@ -1,4 +1,5 @@ --- +redirect_from: "tutorials/8-multinomiallogisticregression/" title: "Bayesian Multinomial Logistic Regression" permalink: "/:collection/:name/" --- diff --git a/markdown/09-variational-inference/09_variational-inference.md b/markdown/09-variational-inference/09_variational-inference.md new file mode 100644 index 000000000..2d1fdc24c --- /dev/null +++ b/markdown/09-variational-inference/09_variational-inference.md @@ -0,0 +1,1296 @@ +--- +redirect_from: "tutorials/9-variationalinference/" +title: "Variational inference (VI) in Turing.jl" +permalink: "/:collection/:name/" +--- + + +In this post we'll have a look at what's known as **variational inference (VI)**, a family of _approximate_ Bayesian inference methods, and how to use it in Turing.jl as an alternative to other approaches such as MCMC. In particular, we will focus on one of the more standard VI methods called **Automatic Differentiation Variational Inference (ADVI)**. + +Here we will focus on how to use VI in Turing rather than on the theory underlying VI. If you're interested in understanding the mathematics, you can check out [our write-up](../../docs/for-developers/variational_inference) or any other resource online (there are a lot of great ones). + +Using VI in Turing.jl is very straightforward. If `model` denotes a definition of a `Turing.Model`, performing VI is as simple as +```julia +m = model(data...) # instantiate model on the data +q = vi(m, vi_alg) # perform VI on `m` using the VI method `vi_alg`, which returns a `VariationalPosterior` +``` + + +Thus it's no more work than standard MCMC sampling in Turing. + +To get a bit more into what we can do with `vi`, we'll first have a look at a simple example and then we'll reproduce the [tutorial on Bayesian linear regression](../../tutorials/5-linearregression) using VI instead of MCMC. Finally we'll look at some of the different parameters of `vi` and at how you can, for example, use your own custom variational family. + +## Setup + +```julia +using Random +using Turing +using Turing: Variational + +Random.seed!(42); +``` + + + + +## Simple example: Normal-Gamma conjugate model + +The Normal-(Inverse)Gamma conjugate model is defined by the following generative process + +\begin{align} + s &\sim \mathrm{InverseGamma}(2, 3) \\\\ + m &\sim \mathcal{N}(0, s) \\\\ + x_i &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, s), \quad i = 1, \dots, n +\end{align} + +Recall that *conjugate* refers to the fact that we can obtain a closed-form expression for the posterior. Of course one wouldn't use something like variational inference for a conjugate model, but it's useful as a simple demonstration since we can compare the result to the true posterior.
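For reference, conjugacy here means that the posterior over $(m, \tau)$, where $\tau = 1 / s$ is the precision, is again a Normal-Gamma distribution. Writing $\bar{x}$ for the sample mean, the standard updates (in the notation of Murphy's "Conjugate Bayesian analysis of the Gaussian distribution", which we also rely on further below) are

\begin{align}
  \mu_n &= \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_0 + n}, \qquad \kappa_n = \kappa_0 + n, \\\\
  \alpha_n &= \alpha_0 + \frac{n}{2}, \qquad \beta_n = \beta_0 + \frac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2 (\kappa_0 + n)},
\end{align}

with $\mu_0 = 0$, $\kappa_0 = 1$, $\alpha_0 = 2$ and $\beta_0 = 3$ for the priors above; these are exactly the quantities the ConjugatePriors.jl code later in this section computes for us.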
+ +First we generate some synthetic data, define the `Turing.Model` and instantiate the model on the data: + +```julia +# generate data +x = randn(2000); +``` + + +```julia +@model model(x) = begin + s ~ InverseGamma(2, 3) + m ~ Normal(0.0, sqrt(s)) + for i = 1:length(x) + x[i] ~ Normal(m, sqrt(s)) + end +end; +``` + + +```julia +# Instantiate model +m = model(x); +``` + + + + +Now we'll produce some samples from the posterior using an MCMC method, which in contrast to VI is guaranteed to converge to the *exact* posterior (as the number of samples goes to infinity). + +We'll produce 10 000 samples with 200 steps used for adaptation and a target acceptance rate of 0.65. + +If you don't understand what "adaptation" or "target acceptance rate" refers to, all you really need to know is that `NUTS` is known to be one of the most accurate and efficient samplers (when applicable) while requiring little to no hand-tuning to work well. + + +```julia +samples_nuts = sample(m, NUTS(200, 0.65), 10000); +``` + + + + +Now let's try VI. The most important function you need to know about to do VI in Turing is `vi`: + + +```julia +@doc(Variational.vi) +``` + + +``` +vi(model, alg::VariationalInference) +vi(model, alg::VariationalInference, q::VariationalPosterior) +vi(model, alg::VariationalInference, getq::Function, θ::AbstractArray) +``` + +Constructs the variational posterior from the `model` and performs the optimization following the configuration of the given `VariationalInference` instance. + +# Arguments + + * `model`: `Turing.Model` or `Function` z ↦ log p(x, z) where `x` denotes the observations + * `alg`: the VI algorithm used + * `q`: a `VariationalPosterior` for which it is assumed a specialized implementation of the variational objective used exists. + * `getq`: function taking parameters `θ` as input and returns a `VariationalPosterior` + * `θ`: only required if `getq` is used, in which case it is the initial parameters for the variational posterior + + + + +Additionally, you can pass +- an initial variational posterior `q`, for which we assume there exists an implementation of `update(::typeof(q), θ::AbstractVector)` returning an updated posterior `q` with parameters `θ`. +- a function mapping $\theta \mapsto q_{\theta}$ (denoted above `getq`) together with initial parameters `θ`. This provides more flexibility in the types of variational families that we can use, and can sometimes be slightly more convenient for quick and rough work. + +By default, i.e. when calling `vi(m, advi)`, Turing uses a *mean-field* approximation with a multivariate normal as the base distribution. Mean-field refers to the fact that we assume all the latent variables to be *independent*. This is the "standard" ADVI approach; see [Automatic Differentiation Variational Inference (2016)](https://arxiv.org/abs/1603.00788) for more. In Turing, one can obtain such a mean-field approximation by calling `Variational.meanfield(model)`, for which there exists an internal implementation for `update`: + + +```julia +@doc(Variational.meanfield) +``` + + +``` +meanfield([rng, ]model::Model) +``` + +Creates a mean-field approximation with multivariate normal as underlying distribution. + + + + +Currently the only implementation of `VariationalInference` available is `ADVI`, which is very convenient and applicable as long as your `Model` is differentiable with respect to the *variational parameters*, that is, the parameters of your variational distribution, e.g. mean and variance in the mean-field approximation.
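As a concrete (if redundant) sketch of the second signature above, we could construct the default mean-field approximation ourselves and pass it to `vi` explicitly; this is equivalent to simply calling `vi(m, ADVI(10, 1000))`:

```julia
# Minimal sketch: pass an initial variational posterior explicitly.
q0 = Variational.meanfield(m)    # the default mean-field family for `m`
q  = vi(m, ADVI(10, 1000), q0)   # optimize its parameters with ADVI
```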
+ + +```julia +@doc(Variational.ADVI) +``` + + +```julia +struct ADVI{AD} <: AdvancedVI.VariationalInference{AD} +``` + +Automatic Differentiation Variational Inference (ADVI) with automatic differentiation backend `AD`. + +# Fields + + * `samples_per_step::Int64` + + Number of samples used to estimate the ELBO in each optimization step. + * `max_iters::Int64` + + Maximum number of gradient steps. + + + + +Performing VI on the model `m` using 10 samples for gradient estimation and taking 1000 gradient steps is then as simple as: + + +```julia +# ADVI +advi = ADVI(10, 1000) +q = vi(m, advi); +``` + + + + +Unfortunately, for such a small problem Turing's `NUTS` sampler is now *so* efficient that we don't gain much from using ADVI here. So, so very unfortunate... + +With that being said, this is not the case in general. For very complex models we'll later find that `ADVI` produces very reasonable results in a much shorter time than `NUTS`. + +And one significant advantage of using `vi` is that we can sample from the resulting `q` with ease. In fact, the result of the `vi` call is a `TransformedDistribution` from Bijectors.jl, and it implements the Distributions.jl interface for a `Distribution`: + + +```julia +q isa MultivariateDistribution +``` + +``` +true +``` + + + + + +This means that we can call `rand` to sample from the variational posterior `q` + + +```julia +rand(q) +``` + +``` +2-element Array{Float64,1}: + 1.0786295499476042 + -0.042370317247471936 +``` + + + + + +and `logpdf` to compute the log-probability + + +```julia +logpdf(q, rand(q)) +``` + +``` +4.268921703144813 +``` + + + + + +Let's check the first and second moments of the data to see how our approximation compares to the point-estimates from the data: + + +```julia +var(x), mean(x) +``` + +``` +(1.0225001600719719, -0.027900450605557185) +``` + + + +```julia +(mean(rand(q, 1000); dims = 2)..., ) +``` + +``` +(1.0302721641867068, -0.027767503306100567) +``` + + + + + +That's pretty close! But we're Bayesian so we're not interested in *just* matching the mean. +Let's instead look at the actual density `q`. + +For that we need samples: + + +```julia +samples = rand(q, 10000); +``` + + +```julia +# setup for plotting +using Plots, LaTeXStrings, StatsPlots +``` + + +```julia +p1 = histogram(samples[1, :], bins=100, normed=true, alpha=0.2, color = :blue, label = "") +density!(samples[1, :], label = "s (ADVI)", color = :blue, linewidth = 2) +density!(samples_nuts, :s; label = "s (NUTS)", color = :green, linewidth = 2) +vline!([var(x)], label = "s (data)", color = :black) +vline!([mean(samples[1, :])], color = :blue, label ="") + +p2 = histogram(samples[2, :], bins=100, normed=true, alpha=0.2, color = :blue, label = "") +density!(samples[2, :], label = "m (ADVI)", color = :blue, linewidth = 2) +density!(samples_nuts, :m; label = "m (NUTS)", color = :green, linewidth = 2) +vline!([mean(x)], color = :black, label = "m (data)") +vline!([mean(samples[2, :])], color = :blue, label="") + +plot(p1, p2, layout=(2, 1), size=(900, 500)) +``` + +![](figures/09_variational-inference_18_1.png) + + + +For this particular `Model`, we can in fact obtain the posterior of the latent variables in closed form. This allows us to compare both `NUTS` and `ADVI` to the true posterior $p(s, m \mid \{x_i\}_{i = 1}^n )$. + +*The code below is just work to get the marginals $p(s \mid \{x_i\}_{i = 1}^n)$ and $p(m \mid \{x_i\}_{i = 1}^n)$ from the posterior obtained using ConjugatePriors.jl.
Feel free to skip it.* + + +```julia +# used to compute closed form expression of posterior +using ConjugatePriors + +# closed form computation +# notation mapping has been verified by explicitly computing expressions +# in "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy +μ₀ = 0.0 # => μ +κ₀ = 1.0 # => ν, which scales the precision of the Normal +α₀ = 2.0 # => "shape" +β₀ = 3.0 # => "rate", which is 1 / θ, where θ is "scale" + +# prior +pri = NormalGamma(μ₀, κ₀, α₀, β₀) + +# posterior +post = posterior(pri, Normal, x) + +# marginal distribution of τ = 1 / σ² +# Eq. (90) in "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy +# `scale(post)` = θ +p_τ = Gamma(post.shape, scale(post)) +p_σ²_pdf = z -> pdf(p_τ, 1 / z) # τ => 1 / σ² + +# marginal of μ +# Eq. (91) in "Conjugate Bayesian analysis of the Gaussian distribution" by Murphy +p_μ = TDist(2 * post.shape) + +μₙ = post.mu # μ → μ +κₙ = post.nu # κ → ν +αₙ = post.shape # α → shape +βₙ = post.rate # β → rate + +# numerically more stable but doesn't seem to have effect; issue is probably internal to +# `pdf` which needs to compute ≈ Γ(1000) +p_μ_pdf = z -> exp(logpdf(p_μ, (z - μₙ) * exp(- 0.5 * log(βₙ) + 0.5 * log(αₙ) + 0.5 * log(κₙ)))) + +# posterior plots +p1 = plot(); +histogram!(samples[1, :], bins=100, normed=true, alpha=0.2, color = :blue, label = "") +density!(samples[1, :], label = "s (ADVI)", color = :blue) +density!(samples_nuts, :s; label = "s (NUTS)", color = :green) +vline!([mean(samples[1, :])], linewidth = 1.5, color = :blue, label ="") + +# normalize using Riemann approx. because of (almost certainly) numerical issues +Δ = 0.001 +r = 0.75:0.001:1.50 +norm_const = sum(p_σ²_pdf.(r) .* Δ) +plot!(r, p_σ²_pdf, label = "s (posterior)", color = :red); +vline!([var(x)], label = "s (data)", linewidth = 1.5, color = :black, alpha = 0.7); +xlims!(0.75, 1.35); + +p2 = plot(); +histogram!(samples[2, :], bins=100, normed=true, alpha=0.2, color = :blue, label = "") +density!(samples[2, :], label = "m (ADVI)", color = :blue) +density!(samples_nuts, :m; label = "m (NUTS)", color = :green) +vline!([mean(samples[2, :])], linewidth = 1.5, color = :blue, label="") + + +# normalize using Riemann approx. because of (almost certainly) numerical issues +Δ = 0.0001 +r = -0.1 + mean(x):Δ:0.1 + mean(x) +norm_const = sum(p_μ_pdf.(r) .* Δ) +plot!(r, z -> p_μ_pdf(z) / norm_const, label = "m (posterior)", color = :red); +vline!([mean(x)], label = "m (data)", linewidth = 1.5, color = :black, alpha = 0.7); + +xlims!(-0.25, 0.25); + +p = plot(p1, p2; layout=(2, 1), size=(900, 500)) +``` + +![](figures/09_variational-inference_19_1.png) + + + + +# Bayesian linear regression example using `ADVI` + +This is simply a duplication of the tutorial [5. Linear regression](../regression/02_linear-regression) but now with the addition of an approximate posterior obtained using `ADVI`. + +As we'll see, there is really no additional work required to apply variational inference to a more complex `Model`. + +## Copy-paste from [5. Linear regression](../regression/02_linear-regression) + +This section is basically copy-pasting the code from the [linear regression tutorial](../regression/02_linear-regression). + + +```julia +Random.seed!(1); +``` + + +```julia +# Import RDatasets. +using RDatasets + +# Hide the progress prompt while sampling. +Turing.setprogress!(false); +``` + + +```julia +# Import the "Default" dataset. +data = RDatasets.dataset("datasets", "mtcars"); + +# Show the first six rows of the dataset. 
+first(data, 6) +``` + +``` +6×12 DataFrame. Omitted printing of 6 columns +│ Row │ Model │ MPG │ Cyl │ Disp │ HP │ DRat │ +│ │ String │ Float64 │ Int64 │ Float64 │ Int64 │ Float64 │ +├─────┼───────────────────┼─────────┼───────┼─────────┼───────┼─────────┤ +│ 1 │ Mazda RX4 │ 21.0 │ 6 │ 160.0 │ 110 │ 3.9 │ +│ 2 │ Mazda RX4 Wag │ 21.0 │ 6 │ 160.0 │ 110 │ 3.9 │ +│ 3 │ Datsun 710 │ 22.8 │ 4 │ 108.0 │ 93 │ 3.85 │ +│ 4 │ Hornet 4 Drive │ 21.4 │ 6 │ 258.0 │ 110 │ 3.08 │ +│ 5 │ Hornet Sportabout │ 18.7 │ 8 │ 360.0 │ 175 │ 3.15 │ +│ 6 │ Valiant │ 18.1 │ 6 │ 225.0 │ 105 │ 2.76 │ +``` + + + +```julia +# Function to split samples. +function split_data(df, at = 0.70) + r = size(df,1) + index = Int(round(r * at)) + train = df[1:index, :] + test = df[(index+1):end, :] + return train, test +end + +# A handy helper function to rescale our dataset. +function standardize(x) + return (x .- mean(x, dims=1)) ./ std(x, dims=1), x +end + +# Another helper function to unstandardize our datasets. +function unstandardize(x, orig) + return (x .+ mean(orig, dims=1)) .* std(orig, dims=1) +end +``` + +``` +unstandardize (generic function with 1 method) +``` + + + +```julia +# Remove the model column. +select!(data, Not(:Model)) + +# Standardize our dataset. +(std_data, data_arr) = standardize(Matrix(data)) + +# Split our dataset 70%/30% into training/test sets. +train, test = split_data(std_data, 0.7) + +# Save dataframe versions of our dataset. +train_cut = DataFrame(train, names(data)) +test_cut = DataFrame(test, names(data)) + +# Create our labels. These are the values we are trying to predict. +train_label = train_cut[:, :MPG] +test_label = test_cut[:, :MPG] + +# Get the list of columns to keep. +remove_names = filter(x->!in(x, [:MPG, :Model]), names(data)) + +# Filter the test and train sets. +train = Matrix(train_cut[:,remove_names]); +test = Matrix(test_cut[:,remove_names]); +``` + + +```julia +# Bayesian linear regression. +@model linear_regression(x, y, n_obs, n_vars, ::Type{T}=Vector{Float64}) where {T} = begin + # Set variance prior. + σ₂ ~ truncated(Normal(0,100), 0, Inf) + + # Set intercept prior. + intercept ~ Normal(0, 3) + + # Set the priors on our coefficients. + coefficients ~ MvNormal(zeros(n_vars), 10 * ones(n_vars)) + + # Calculate all the mu terms. + mu = intercept .+ x * coefficients + y ~ MvNormal(mu, σ₂) +end; +``` + + +```julia +n_obs, n_vars = size(train) +m = linear_regression(train, train_label, n_obs, n_vars); +``` + + + + +## Performing VI + +First we define the initial variational distribution, or, equivalently, the family of distributions to consider. We're going to use the same mean-field approximation as Turing will use by default when we call `vi(m, advi)`, which we obtain by calling `Variational.meanfield`. This returns a `TransformedDistribution` with a `TuringDiagMvNormal` as the underlying distribution and the transformation mapping from the reals to the domain of the latent variables. 
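As a quick aside, such a `TransformedDistribution` bundles exactly those two pieces, and we can poke at them directly; the sketch below uses the `dist` and `transform` fields, which are Bijectors.jl internals, so treat it as illustrative rather than a stable API:

```julia
q_tmp = Variational.meanfield(m)  # same call as for `q0` right below
q_tmp.dist        # the underlying (unconstrained) TuringDiagMvNormal
q_tmp.transform   # the bijector mapping from ℝᵈ back to the constrained space of the latent variables
rand(q_tmp)       # samples are returned in the constrained space, e.g. the variance parameter is positive
```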
+ + +```julia +q0 = Variational.meanfield(m) +typeof(q0) +``` + +``` +Bijectors.TransformedDistribution{DistributionsAD.TuringDiagMvNormal{Array{ +Float64,1},Array{Float64,1}},Bijectors.Stacked{Tuple{Bijectors.Inverse{Bije +ctors.TruncatedBijector{0,Float64,Float64},0},Bijectors.Identity{0},Bijecto +rs.Identity{1}},3},Distributions.Multivariate} +``` + + + +```julia +advi = ADVI(10, 10_000) +``` + +``` +AdvancedVI.ADVI{AdvancedVI.ForwardDiffAD{40}}(10, 10000) +``` + + + + + +Turing also provides a couple of different optimizers: +- `TruncatedADAGrad` (default) +- `DecayedADAGrad` +as these are well-suited for problems with high-variance stochastic objectives, which is usually what the ELBO ends up being at different times in our optimization process. + +With that being said, thanks to Requires.jl, if we add a `using Flux` prior to `using Turing` we can also make use of all the optimizers in `Flux`, e.g. `ADAM`, without any additional changes to our code! For example: +```julia +using Flux, Turing +using Turing.Variational + +vi(m, advi; optimizer = Flux.ADAM()) +``` + + +just works. + +For this problem we'll use the `DecayedADAGrad` from Turing: + + +```julia +opt = Variational.DecayedADAGrad(1e-2, 1.1, 0.9) +``` + +``` +AdvancedVI.DecayedADAGrad(0.01, 1.1, 0.9, IdDict{Any,Any}()) +``` + + + +```julia +q = vi(m, advi, q0; optimizer = opt) +typeof(q) +``` + +``` +Bijectors.TransformedDistribution{DistributionsAD.TuringDiagMvNormal{Array{ +Float64,1},Array{Float64,1}},Bijectors.Stacked{Tuple{Bijectors.Inverse{Bije +ctors.TruncatedBijector{0,Float64,Float64},0},Bijectors.Identity{0},Bijecto +rs.Identity{1}},3},Distributions.Multivariate} +``` + + + + + +*Note: as mentioned before, we internally define an `update(q::TransformedDistribution{<:TuringDiagMvNormal}, θ::AbstractVector)` method which takes in the current variational approximation `q` together with new parameters `θ` and returns the new variational approximation. This is required so that we can actually update the `Distribution` object after each optimization step.* + +*Alternatively, we can instead provide the mapping $\theta \mapsto q_{\theta}$ directly together with initial parameters using the signature `vi(m, advi, getq, θ_init)` as mentioned earlier. We'll see an explicit example of this later on!* + +To compute statistics for our approximation we need samples: + + +```julia +z = rand(q, 10_000); +``` + + + + +Now we can, for example, look at the average + + +```julia +avg = vec(mean(z; dims = 2)) +``` + +``` +13-element Array{Float64,1}: + 0.020117192340218665 + 0.0002796669830116502 + 1.0010658079947254 + -0.00027613100477009924 + 0.0019703203129575017 + 0.0011012413001283284 + -0.002683222755889248 + -0.0017262666130412618 + -0.0010564941529113507 + -8.207373094640307e-5 + 5.198685294519674e-5 + 5.285097329590308e-5 + 0.0031336481988493855 +``` + + + + + +The vector has the same ordering as the model, e.g. in this case `σ₂` has index `1`, `intercept` has index `2` and `coefficients` has indices `3:13`.
If you forget or you might want to do something programmatically with the result, you can obtain the `sym → indices` mapping as follows: + + +```julia +_, sym2range = bijector(m, Val(true)); +sym2range +``` + +``` +(intercept = UnitRange{Int64}[2:2], σ₂ = UnitRange{Int64}[1:1], coefficient +s = UnitRange{Int64}[3:13]) +``` + + + +```julia +avg[union(sym2range[:σ₂]...)] +``` + +``` +1-element Array{Float64,1}: + 0.020117192340218665 +``` + + + +```julia +avg[union(sym2range[:intercept]...)] +``` + +``` +1-element Array{Float64,1}: + 0.0002796669830116502 +``` + + + +```julia +avg[union(sym2range[:coefficients]...)] +``` + +``` +11-element Array{Float64,1}: + 1.0010658079947254 + -0.00027613100477009924 + 0.0019703203129575017 + 0.0011012413001283284 + -0.002683222755889248 + -0.0017262666130412618 + -0.0010564941529113507 + -8.207373094640307e-5 + 5.198685294519674e-5 + 5.285097329590308e-5 + 0.0031336481988493855 +``` + + + + + +*Note: as you can see, this is slightly awkward to work with at the moment. We'll soon add a better way of dealing with this.* + +With a bit of work (this will be much easier in the future), we can also visualize the approximate marginals of the different variables, similar to `plot(chain)`: + + +```julia +function plot_variational_marginals(z, sym2range) + ps = [] + + for (i, sym) in enumerate(keys(sym2range)) + indices = union(sym2range[sym]...) # <= array of ranges + if sum(length.(indices)) > 1 + offset = 1 + for r in indices + for j in r + p = density(z[j, :], title = "$(sym)[$offset]", titlefontsize = 10, label = "") + push!(ps, p) + + offset += 1 + end + end + else + p = density(z[first(indices), :], title = "$(sym)", titlefontsize = 10, label = "") + push!(ps, p) + end + end + + return plot(ps..., layout = (length(ps), 1), size = (500, 1500)) +end +``` + +``` +plot_variational_marginals (generic function with 1 method) +``` + + + +```julia +plot_variational_marginals(z, sym2range) +``` + +![](figures/09_variational-inference_39_1.png) + + + +And let's compare this to using the `NUTS` sampler: + + +```julia +chain = sample(m, NUTS(0.65), 10_000); +``` + + +```julia +plot(chain) +``` + +![](figures/09_variational-inference_41_1.png) + +```julia +vi_mean = vec(mean(z; dims = 2))[[union(sym2range[:coefficients]...)..., union(sym2range[:intercept]...)..., union(sym2range[:σ₂]...)...]] +``` + +``` +13-element Array{Float64,1}: + 1.0010658079947254 + -0.00027613100477009924 + 0.0019703203129575017 + 0.0011012413001283284 + -0.002683222755889248 + -0.0017262666130412618 + -0.0010564941529113507 + -8.207373094640307e-5 + 5.198685294519674e-5 + 5.285097329590308e-5 + 0.0031336481988493855 + 0.0002796669830116502 + 0.020117192340218665 +``` + + + +```julia +mean(chain).nt.mean +``` + +``` +13-element Array{Float64,1}: + 1.0000000399377265 + -3.418828635857032e-8 + 2.4761590570486984e-9 + 1.8477962723990843e-8 + -4.0801944678703214e-8 + 3.6899324841451007e-9 + 6.991847816704769e-10 + 8.432730884578574e-9 + 1.72177428312298e-8 + -3.3881239715188653e-8 + 4.439930725536783e-8 + -6.963225118208367e-10 + 1.9275537805998484e-6 +``` + + + + + +One thing we can look at is simply the squared error between the means: + + +```julia +sum(abs2, mean(chain).nt.mean .- vi_mean) +``` + +``` +0.0004321363903591653 +``` + + + + + +That looks pretty good! But let's see how the predictive distributions looks for the two. 
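Before moving on, here is a small convenience sketch for the `sym → indices` bookkeeping used above (and again in the prediction helpers of the next section); the `mean_by_sym` function is a hypothetical helper of ours, not part of Turing:

```julia
# Hypothetical helper (not part of Turing): collect the posterior mean of each
# variable into a NamedTuple keyed by the variable's symbol.
function mean_by_sym(z, sym2range)
    vals = map(r -> vec(mean(z[union(r...), :]; dims = 2)), values(sym2range))
    return NamedTuple{keys(sym2range)}(vals)
end

means = mean_by_sym(z, sym2range)
means.intercept   # equivalent to avg[union(sym2range[:intercept]...)]
```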
+ +## Prediction + +Similarily to the linear regression tutorial, we're going to compare to multivariate ordinary linear regression using the `GLM` package: + + +```julia +# Import the GLM package. +using GLM + +# Perform multivariate OLS. +ols = lm(@formula(MPG ~ Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb), train_cut) + +# Store our predictions in the original dataframe. +train_cut.OLSPrediction = unstandardize(GLM.predict(ols), data.MPG); +test_cut.OLSPrediction = unstandardize(GLM.predict(ols, test_cut), data.MPG); +``` + + +```julia +# Make a prediction given an input vector. +function prediction_chain(chain, x) + p = get_params(chain) + α = mean(p.intercept) + β = collect(mean.(p.coefficients)) + return α .+ x * β +end +``` + +``` +prediction_chain (generic function with 1 method) +``` + + + +```julia +# Make a prediction using samples from the variational posterior given an input vector. +function prediction(samples::AbstractVector, sym2ranges, x) + α = mean(samples[union(sym2ranges[:intercept]...)]) + β = vec(mean(samples[union(sym2ranges[:coefficients]...)]; dims = 2)) + return α .+ x * β +end + +function prediction(samples::AbstractMatrix, sym2ranges, x) + α = mean(samples[union(sym2ranges[:intercept]...), :]) + β = vec(mean(samples[union(sym2ranges[:coefficients]...), :]; dims = 2)) + return α .+ x * β +end +``` + +``` +prediction (generic function with 2 methods) +``` + + + +```julia +# Unstandardize the dependent variable. +train_cut.MPG = unstandardize(train_cut.MPG, data.MPG); +test_cut.MPG = unstandardize(test_cut.MPG, data.MPG); +``` + + +```julia +# Show the first side rows of the modified dataframe. +first(test_cut, 6) +``` + +``` +6×12 DataFrame. Omitted printing of 6 columns +│ Row │ MPG │ Cyl │ Disp │ HP │ DRat │ WT │ +│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ +├─────┼─────────┼──────────┼───────────┼───────────┼───────────┼──────────┤ +│ 1 │ 116.195 │ 1.01488 │ 0.591245 │ 0.0483133 │ -0.835198 │ 0.222544 │ +│ 2 │ 114.295 │ 1.01488 │ 0.962396 │ 1.4339 │ 0.249566 │ 0.636461 │ +│ 3 │ 120.195 │ 1.01488 │ 1.36582 │ 0.412942 │ -0.966118 │ 0.641571 │ +│ 4 │ 128.295 │ -1.22486 │ -1.22417 │ -1.17684 │ 0.904164 │ -1.31048 │ +│ 5 │ 126.995 │ -1.22486 │ -0.890939 │ -0.812211 │ 1.55876 │ -1.10097 │ +│ 6 │ 131.395 │ -1.22486 │ -1.09427 │ -0.491337 │ 0.324377 │ -1.74177 │ +``` + + + +```julia +z = rand(q, 10_000); +``` + + +```julia +# Calculate the predictions for the training and testing sets using the samples `z` from variational posterior +train_cut.VIPredictions = unstandardize(prediction(z, sym2range, train), data.MPG); +test_cut.VIPredictions = unstandardize(prediction(z, sym2range, test), data.MPG); + +train_cut.BayesPredictions = unstandardize(prediction_chain(chain, train), data.MPG); +test_cut.BayesPredictions = unstandardize(prediction_chain(chain, test), data.MPG); +``` + + +```julia +vi_loss1 = mean((train_cut.VIPredictions - train_cut.MPG).^2) +bayes_loss1 = mean((train_cut.BayesPredictions - train_cut.MPG).^2) +ols_loss1 = mean((train_cut.OLSPrediction - train_cut.MPG).^2) + +vi_loss2 = mean((test_cut.VIPredictions - test_cut.MPG).^2) +bayes_loss2 = mean((test_cut.BayesPredictions - test_cut.MPG).^2) +ols_loss2 = mean((test_cut.OLSPrediction - test_cut.MPG).^2) + +println("Training set: + VI loss: $vi_loss1 + Bayes loss: $bayes_loss1 + OLS loss: $ols_loss1 +Test set: + VI loss: $vi_loss2 + Bayes loss: $bayes_loss2 + OLS loss: $ols_loss2") +``` + +``` +Training set: + VI loss: 0.0007001199110321286 + Bayes loss: 
1.3743352742545003e-14 + OLS loss: 3.070926124893026 +Test set: + VI loss: 0.0014915402827578768 + Bayes loss: 9.410239119029718e-14 + OLS loss: 27.09481307076062 +``` + + + + + + +Interestingly the squared difference between true- and mean-prediction on the test-set is actually *better* for the mean-field variational posterior than for the "true" posterior obtained by MCMC sampling using `NUTS`. But, as Bayesians, we know that the mean doesn't tell the entire story. One quick check is to look at the mean predictions ± standard deviation of the two different approaches: + + +```julia +z = rand(q, 1000); +preds = hcat([unstandardize(prediction(z[:, i], sym2range, test), data.MPG) for i = 1:size(z, 2)]...); + +scatter(1:size(test, 1), mean(preds; dims = 2), yerr=std(preds; dims = 2), label="prediction (mean ± std)", size = (900, 500), markersize = 8) +scatter!(1:size(test, 1), unstandardize(test_label, data.MPG), label="true") +xaxis!(1:size(test, 1)) +ylims!(95, 140) +title!("Mean-field ADVI (Normal)") +``` + +![](figures/09_variational-inference_53_1.png) + +```julia +preds = hcat([unstandardize(prediction_chain(chain[i], test), data.MPG) for i = 1:5:size(chain, 1)]...); + +scatter(1:size(test, 1), mean(preds; dims = 2), yerr=std(preds; dims = 2), label="prediction (mean ± std)", size = (900, 500), markersize = 8) +scatter!(1:size(test, 1), unstandardize(test_label, data.MPG), label="true") +xaxis!(1:size(test, 1)) +ylims!(95, 140) +title!("MCMC (NUTS)") +``` + +![](figures/09_variational-inference_54_1.png) + + + +Indeed we see that the MCMC approach generally provides better uncertainty estimates than the mean-field ADVI approach! Good. So all the work we've done to make MCMC fast isn't for nothing. + +## Alternative: provide parameter-to-distribution instead of `q` with`update` implemented + +As mentioned earlier, it's also possible to just provide the mapping $\theta \mapsto q_{\theta}$ rather than the variational family / initial variational posterior `q`, i.e. use the interface `vi(m, advi, getq, θ_init)` where `getq` is the mapping $\theta \mapsto q_{\theta}$ + +In this section we're going to construct a mean-field approximation to the model by hand using a composition of`Shift` and `Scale` from Bijectors.jl togheter with a standard multivariate Gaussian as the base distribution. + + +```julia +using Bijectors +``` + + +```julia +using Bijectors: Scale, Shift +``` + + +```julia +d = length(q) +base_dist = Turing.DistributionsAD.TuringDiagMvNormal(zeros(d), ones(d)) +``` + +``` +DistributionsAD.TuringDiagMvNormal{Array{Float64,1},Array{Float64,1}}( +m: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] +σ: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] +) +``` + + + + + +`bijector(model::Turing.Model)` is defined by Turing, and will return a `bijector` which takes you from the space of the latent variables to the real space. In this particular case, this is a mapping `((0, ∞) × ℝ × ℝ¹⁰) → ℝ¹²`. 
We're interested in using a normal distribution as a base-distribution and transform samples to the latent space, thus we need the inverse mapping from the reals to the latent space: + + +```julia +to_constrained = inv(bijector(m)); +``` + + +```julia +function getq(θ) + d = length(θ) ÷ 2 + A = @inbounds θ[1:d] + b = @inbounds θ[d + 1: 2 * d] + + b = to_constrained ∘ Shift(b; dim = Val(1)) ∘ Scale(exp.(A); dim = Val(1)) + + return transformed(base_dist, b) +end +``` + +``` +getq (generic function with 1 method) +``` + + + +```julia +q_mf_normal = vi(m, advi, getq, randn(2 * d)); +``` + + +```julia +p1 = plot_variational_marginals(rand(q_mf_normal, 10_000), sym2range) # MvDiagNormal + Affine transformation + to_constrained +p2 = plot_variational_marginals(rand(q, 10_000), sym2range) # Turing.meanfield(m) + +plot(p1, p2, layout = (1, 2), size = (800, 2000)) +``` + +![](figures/09_variational-inference_61_1.png) + + +As expected, the fits look pretty much identical. + +But using this interface it becomes trivial to go beyond the mean-field assumption we made for the variational posterior, as we'll see in the next section. + +### Relaxing the mean-field assumption + +Here we'll instead consider the variational family to be a full non-diagonal multivariate Gaussian. As in the previous section we'll implement this by transforming a standard multivariate Gaussian using `Scale` and `Shift`, but now `Scale` will instead be using a lower-triangular matrix (representing the Cholesky of the covariance matrix of a multivariate normal) in constrast to the diagonal matrix we used in for the mean-field approximate posterior. + + +```julia +using LinearAlgebra +``` + + +```julia +# Using `ComponentArrays.jl` together with `UnPack.jl` makes our lives much easier. +using ComponentArrays, UnPack +``` + + +```julia +proto_arr = ComponentArray( + L = zeros(d, d), + b = zeros(d) +) +proto_axes = proto_arr |> getaxes +num_params = length(proto_arr) + +function getq(θ) + L, b = begin + @unpack L, b = ComponentArray(θ, proto_axes) + LowerTriangular(L), b + end + # For this to represent a covariance matrix we need to ensure that the diagonal is positive. + # We can enforce this by zeroing out the diagonal and then adding back the diagonal exponentiated. 
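    # In other words, A = L - D + exp(D) keeps the strictly lower-triangular part of L
    # unchanged and replaces the diagonal with exp.(diag(L)), which is strictly positive.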
+ D = Diagonal(diag(L)) + A = L - D + exp(D) # exp for Diagonal is the same as exponentiating only the diagonal entries + + b = to_constrained ∘ Shift(b; dim = Val(1)) ∘ Scale(A; dim = Val(1)) + + return transformed(base_dist, b) +end +``` + +``` +getq (generic function with 1 method) +``` + + + +```julia +advi = ADVI(10, 20_000) +``` + +``` +AdvancedVI.ADVI{AdvancedVI.ForwardDiffAD{40}}(10, 20000) +``` + + + +```julia +q_full_normal = vi(m, advi, getq, randn(num_params); optimizer = Variational.DecayedADAGrad(1e-2)); +``` + + + + +Let's have a look at the learned covariance matrix: + + +```julia +A = q_full_normal.transform.ts[1].a +``` + +``` +13×13 LinearAlgebra.LowerTriangular{Float64,Array{Float64,2}}: + 0.149061 ⋅ … ⋅ ⋅ ⋅ + -0.00838251 0.0250901 ⋅ ⋅ ⋅ + -3.58984e-5 -0.00630153 ⋅ ⋅ ⋅ + 0.00137296 -0.00359297 ⋅ ⋅ ⋅ + -0.00172894 0.00475108 ⋅ ⋅ ⋅ + -0.0027409 0.029463 … ⋅ ⋅ ⋅ + -0.00595888 -0.00512494 ⋅ ⋅ ⋅ + 0.000255923 -0.00644225 ⋅ ⋅ ⋅ + -0.000385188 -0.00519724 ⋅ ⋅ ⋅ + 0.000447806 -0.00146935 ⋅ ⋅ ⋅ + 0.0104856 -0.0052661 … 0.0216549 ⋅ ⋅ + -0.0169155 0.0415363 -0.0114656 0.0180672 ⋅ + 0.0111835 -0.0243803 0.00163862 -0.00261288 0.0180538 +``` + + + +```julia +heatmap(cov(A * A')) +``` + +![](figures/09_variational-inference_68_1.png) + +```julia +zs = rand(q_full_normal, 10_000); +``` + + +```julia +p1 = plot_variational_marginals(rand(q_mf_normal, 10_000), sym2range) +p2 = plot_variational_marginals(rand(q_full_normal, 10_000), sym2range) + +plot(p1, p2, layout = (1, 2), size = (800, 2000)) +``` + +![](figures/09_variational-inference_70_1.png) + + + + +So it seems like the "full" ADVI approach, i.e. no mean-field assumption, obtain the same modes as the mean-field approach but with greater uncertainty for some of the `coefficients`. This + + +```julia +# Unfortunately, it seems like this has quite a high variance which is likely to be due to numerical instability, +# so we consider a larger number of samples. If we get a couple of outliers due to numerical issues, +# these kind affect the mean prediction greatly. 
+z = rand(q_full_normal, 10_000); +``` + + +```julia +train_cut.VIFullPredictions = unstandardize(prediction(z, sym2range, train), data.MPG); +test_cut.VIFullPredictions = unstandardize(prediction(z, sym2range, test), data.MPG); +``` + + +```julia +vi_loss1 = mean((train_cut.VIPredictions - train_cut.MPG).^2) +vifull_loss1 = mean((train_cut.VIFullPredictions - train_cut.MPG).^2) +bayes_loss1 = mean((train_cut.BayesPredictions - train_cut.MPG).^2) +ols_loss1 = mean((train_cut.OLSPrediction - train_cut.MPG).^2) + +vi_loss2 = mean((test_cut.VIPredictions - test_cut.MPG).^2) +vifull_loss2 = mean((test_cut.VIFullPredictions - test_cut.MPG).^2) +bayes_loss2 = mean((test_cut.BayesPredictions - test_cut.MPG).^2) +ols_loss2 = mean((test_cut.OLSPrediction - test_cut.MPG).^2) + +println("Training set: + VI loss: $vi_loss1 + Bayes loss: $bayes_loss1 + OLS loss: $ols_loss1 +Test set: + VI loss: $vi_loss2 + Bayes loss: $bayes_loss2 + OLS loss: $ols_loss2") +``` + +``` +Training set: + VI loss: 0.0007001199110321286 + Bayes loss: 1.3743352742545003e-14 + OLS loss: 3.070926124893026 +Test set: + VI loss: 0.0014915402827578768 + Bayes loss: 9.410239119029718e-14 + OLS loss: 27.09481307076062 +``` + + + +```julia +z = rand(q_mf_normal, 1000); +preds = hcat([unstandardize(prediction(z[:, i], sym2range, test), data.MPG) for i = 1:size(z, 2)]...); + +p1 = scatter(1:size(test, 1), mean(preds; dims = 2), yerr=std(preds; dims = 2), label="prediction (mean ± std)", size = (900, 500), markersize = 8) +scatter!(1:size(test, 1), unstandardize(test_label, data.MPG), label="true") +xaxis!(1:size(test, 1)) +ylims!(95, 140) +title!("Mean-field ADVI (Normal)") +``` + +![](figures/09_variational-inference_74_1.png) + +```julia +z = rand(q_full_normal, 1000); +preds = hcat([unstandardize(prediction(z[:, i], sym2range, test), data.MPG) for i = 1:size(z, 2)]...); + +p2 = scatter(1:size(test, 1), mean(preds; dims = 2), yerr=std(preds; dims = 2), label="prediction (mean ± std)", size = (900, 500), markersize = 8) +scatter!(1:size(test, 1), unstandardize(test_label, data.MPG), label="true") +xaxis!(1:size(test, 1)) +ylims!(95, 140) +title!("Full ADVI (Normal)") +``` + +![](figures/09_variational-inference_75_1.png) + +```julia +preds = hcat([unstandardize(prediction_chain(chain[i], test), data.MPG) for i = 1:5:size(chain, 1)]...); + +p3 = scatter(1:size(test, 1), mean(preds; dims = 2), yerr=std(preds; dims = 2), label="prediction (mean ± std)", size = (900, 500), markersize = 8) +scatter!(1:size(test, 1), unstandardize(test_label, data.MPG), label="true") +xaxis!(1:size(test, 1)) +ylims!(95, 140) +title!("MCMC (NUTS)") +``` + +![](figures/09_variational-inference_76_1.png) + +```julia +plot(p1, p2, p3, layout = (1, 3), size = (900, 250), label="") +``` + +![](figures/09_variational-inference_77_1.png) + + + +Here we actually see that indeed both the full ADVI and the MCMC approaches does a much better job of quantifying the uncertainty of predictions for never-before-seen samples, with full ADVI seemingly *overestimating* the variance slightly compared to MCMC. + +So now you know how to do perform VI on your Turing.jl model! Great isn't it? + + +## Appendix + This tutorial is part of the TuringTutorials repository, found at: . 
+ +To locally run this tutorial, do the following commands: +```julia, eval = false +using TuringTutorials +TuringTutorials.weave_file("09-variational-inference", "09_variational-inference.jmd") +``` + +Computer Information: +``` +Julia Version 1.5.3 +Commit 788b2c77c1 (2020-11-09 13:37 UTC) +Platform Info: + OS: Linux (x86_64-pc-linux-gnu) + CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz + WORD_SIZE: 64 + LIBM: libopenlibm + LLVM: libLLVM-9.0.1 (ORCJIT, skylake) + +``` + +Package Information: + +``` +Status `~/Projects/public/TuringTutorials/tutorials/09-variational-inference/Project.toml` + [76274a88] Bijectors v0.8.14 + [b0b7db55] ComponentArrays v0.8.21 + [1624bea9] ConjugatePriors v0.4.0 + [5789e2e9] FileIO v1.5.0 + [38e38edf] GLM v1.4.0 + [b964fa9f] LaTeXStrings v1.2.1 + [91a5bcdd] Plots v1.10.6 + [d330b81b] PyPlot v2.9.0 + [ce6b1742] RDatasets v0.6.10 + [f3b207a7] StatsPlots v0.14.19 + [fce5fe82] Turing v0.14.12 + [3a884ed6] UnPack v1.0.2 + [9a3f8284] Random + +``` diff --git a/markdown/09-variational-inference/figures/09_variational-inference_18_1.png b/markdown/09-variational-inference/figures/09_variational-inference_18_1.png new file mode 100644 index 000000000..3fd80fe92 Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_18_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_19_1.png b/markdown/09-variational-inference/figures/09_variational-inference_19_1.png new file mode 100644 index 000000000..af9aa1d10 Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_19_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_39_1.png b/markdown/09-variational-inference/figures/09_variational-inference_39_1.png new file mode 100644 index 000000000..3b06eb63e Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_39_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_41_1.png b/markdown/09-variational-inference/figures/09_variational-inference_41_1.png new file mode 100644 index 000000000..d5363cdd4 Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_41_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_53_1.png b/markdown/09-variational-inference/figures/09_variational-inference_53_1.png new file mode 100644 index 000000000..94c258262 Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_53_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_54_1.png b/markdown/09-variational-inference/figures/09_variational-inference_54_1.png new file mode 100644 index 000000000..7bde8052b Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_54_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_61_1.png b/markdown/09-variational-inference/figures/09_variational-inference_61_1.png new file mode 100644 index 000000000..39e4728bc Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_61_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_68_1.png b/markdown/09-variational-inference/figures/09_variational-inference_68_1.png new file mode 100644 index 000000000..b8b31dae8 Binary files /dev/null and 
b/markdown/09-variational-inference/figures/09_variational-inference_68_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_70_1.png b/markdown/09-variational-inference/figures/09_variational-inference_70_1.png new file mode 100644 index 000000000..b00d55e0e Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_70_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_74_1.png b/markdown/09-variational-inference/figures/09_variational-inference_74_1.png new file mode 100644 index 000000000..67e2410cf Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_74_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_75_1.png b/markdown/09-variational-inference/figures/09_variational-inference_75_1.png new file mode 100644 index 000000000..ccafc6c1e Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_75_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_76_1.png b/markdown/09-variational-inference/figures/09_variational-inference_76_1.png new file mode 100644 index 000000000..7bde8052b Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_76_1.png differ diff --git a/markdown/09-variational-inference/figures/09_variational-inference_77_1.png b/markdown/09-variational-inference/figures/09_variational-inference_77_1.png new file mode 100644 index 000000000..e492d665e Binary files /dev/null and b/markdown/09-variational-inference/figures/09_variational-inference_77_1.png differ diff --git a/markdown/10-bayesian-differential-equations/10_bayesian-differential-equations.md b/markdown/10-bayesian-differential-equations/10_bayesian-differential-equations.md new file mode 100644 index 000000000..3259e9f0e --- /dev/null +++ b/markdown/10-bayesian-differential-equations/10_bayesian-differential-equations.md @@ -0,0 +1,740 @@ +--- +redirect_from: "tutorials/10-bayesiandiffeq/" +title: "Bayesian Estimation of Differential Equations" +permalink: "/:collection/:name/" +--- + + +Most of the scientific community deals with the basic problem of trying to mathematically model the reality around them and this often involves dynamical systems. The general trend to model these complex dynamical systems is through the use of differential equations. Differential equation models often have non-measurable parameters. The popular “forward-problem” of simulation consists of solving the differential equations for a given set of parameters, the “inverse problem” to simulation, known as parameter estimation, is the process of utilizing data to determine these model parameters. Bayesian inference provides a robust approach to parameter estimation with quantified uncertainty. + + +```julia +using Turing, Distributions, DifferentialEquations + +# Import MCMCChain, Plots, and StatsPlots for visualizations and diagnostics. +using MCMCChains, Plots, StatsPlots + +# Set a seed for reproducibility. +using Random +Random.seed!(14); +``` + + + + +## The Lotka-Volterra Model + +The Lotka–Volterra equations, also known as the predator–prey equations, are a pair of first-order nonlinear differential equations, frequently used to describe the dynamics of biological systems in which two species interact, one as a predator and the other as prey. 
The populations change through time according to the pair of equations: + +$$\frac{dx}{dt} = (\alpha - \beta y)x$$ + +$$\frac{dy}{dt} = (\delta x - \gamma)y$$ + + +```julia +function lotka_volterra(du,u,p,t) + x, y = u + α, β, γ, δ = p + du[1] = (α - β*y)x # dx/dt (prey) + du[2] = (δ*x - γ)y # dy/dt (predator) +end +p = [1.5, 1.0, 3.0, 1.0] +u0 = [1.0,1.0] +prob1 = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) +sol = solve(prob1,Tsit5()) +plot(sol) +``` + +![](figures/10_bayesian-differential-equations_2_1.png) + + + +We'll generate the data to use for the parameter estimation from simulation. +With the `saveat` [argument](https://docs.sciml.ai/latest/basics/common_solver_opts/) we specify that the solution is saved at intervals of `0.1` time units. To make the data look more realistic, we add random noise using the function `randn`. + + +```julia +sol1 = solve(prob1,Tsit5(),saveat=0.1) +odedata = Array(sol1) + 0.8 * randn(size(Array(sol1))) +plot(sol1, alpha = 0.3, legend = false); scatter!(sol1.t, odedata') +``` + +![](figures/10_bayesian-differential-equations_3_1.png) + + + +## Direct Handling of Bayesian Estimation with Turing + +Previously, functions in Turing and DifferentialEquations were not inter-composable, so Bayesian inference of differential equations needed to be handled by another package called [DiffEqBayes.jl](https://github.com/SciML/DiffEqBayes.jl) (note that DiffEqBayes also works with CmdStan.jl, Turing.jl, DynamicHMC.jl and ApproxBayes.jl - see the [DiffEqBayes docs](https://docs.sciml.ai/latest/analysis/parameter_estimation/#Bayesian-Methods-1) for more info). + +From now on, however, Turing and DifferentialEquations are completely composable and we can write the differential equation inside a Turing `@model` and it will just work. Therefore, we can rewrite the Lotka-Volterra parameter estimation problem with a Turing `@model` interface as below: + + +```julia +Turing.setadbackend(:forwarddiff) + +@model function fitlv(data, prob1) + σ ~ InverseGamma(2, 3) # ~ is the tilde character + α ~ truncated(Normal(1.5,0.5),0.5,2.5) + β ~ truncated(Normal(1.2,0.5),0,2) + γ ~ truncated(Normal(3.0,0.5),1,4) + δ ~ truncated(Normal(1.0,0.5),0,2) + + p = [α,β,γ,δ] + prob = remake(prob1, p=p) + predicted = solve(prob,Tsit5(),saveat=0.1) + + for i = 1:length(predicted) + data[:,i] ~ MvNormal(predicted[i], σ) + end +end + +model = fitlv(odedata, prob1) + +# This next command runs 3 independent chains without using multithreading.
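# `mapreduce` draws the three chains one after another, and `chainscat` concatenates them
# along the chain dimension, so the result behaves like a single `Chains` object with 3 chains.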
+chain = mapreduce(c -> sample(model, NUTS(.65),1000), chainscat, 1:3) +``` + +``` +Chains MCMC chain (1000×17×3 Array{Float64,3}): + +Iterations = 1:1000 +Thinning interval = 1 +Chains = 1, 2, 3 +Samples per chain = 1000 +parameters = α, β, γ, δ, σ +internals = acceptance_rate, hamiltonian_energy, hamiltonian_energy +_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, +nom_step_size, numerical_error, step_size, tree_depth + +Summary Statistics + parameters mean std naive_se mcse ess rhat + Symbol Float64 Float64 Float64 Float64 Float64 Float64 + + α 1.2120 0.4926 0.0090 0.0911 6.0574 11.8201 + β 1.0672 0.1639 0.0030 0.0098 288.9165 1.0360 + γ 2.8014 0.2742 0.0050 0.0208 46.5485 1.0915 + δ 0.9538 0.0904 0.0017 0.0056 124.1565 1.0487 + σ 1.2712 0.6531 0.0119 0.1205 6.0766 9.8804 + +Quantiles + parameters 2.5% 25.0% 50.0% 75.0% 97.5% + Symbol Float64 Float64 Float64 Float64 Float64 + + α 0.5013 0.5243 1.5228 1.5737 1.6590 + β 0.6738 1.0262 1.0815 1.1337 1.4402 + γ 2.0953 2.6893 2.8377 2.9643 3.2437 + δ 0.8076 0.8993 0.9435 0.9901 1.1873 + σ 0.7482 0.7989 0.8394 2.1115 2.3534 +``` + + + + + +The estimated parameters are close to the desired parameter values. We can also check that the chains have converged in the plot. + + +```julia +plot(chain) +``` + +![](figures/10_bayesian-differential-equations_5_1.png) + + + +### Data retrodiction +In Bayesian analysis it is often useful to retrodict the data, i.e. generate simulated data using samples from the posterior distribution, and compare to the original data (see for instance section 3.3.2 - model checking of McElreath's book "Statistical Rethinking"). Here, we solve again the ODE using the output in `chain`, for 300 randomly picked posterior samples. We plot this ensemble of solutions to check if the solution resembles the data. + + +```julia +pl = scatter(sol1.t, odedata'); +``` + + +```julia +chain_array = Array(chain) +for k in 1:300 + resol = solve(remake(prob1,p=chain_array[rand(1:1500), 1:4]),Tsit5(),saveat=0.1) + plot!(resol, alpha=0.1, color = "#BBBBBB", legend = false) +end +# display(pl) +plot!(sol1, w=1, legend = false) +``` + +![](figures/10_bayesian-differential-equations_7_1.png) + + + +In the plot above, the 300 retrodicted time courses from the posterior are plotted in gray, and the original data are the blue and red dots, and the solution that was used to generate the data are the green and purple lines. We can see that, even though we added quite a bit of noise to the data (see dot plot above), the posterior distribution reproduces quite accurately the "true" ODE solution. + +## Lokta Volterra with missing predator data + +Thanks to the known structure of the problem, encoded by the Lokta-Volterra ODEs, one can also fit a model with incomplete data - even without any data for one of the two variables. For instance, let's suppose you have observations for the prey only, but none for the predator. 
We test this case by fitting the model only to the $$y$$ variable of the system, without providing any data for $$x$$: + + +```julia +@model function fitlv2(data, prob1) # data should be a Vector + σ ~ InverseGamma(2, 3) # ~ is the tilde character + α ~ truncated(Normal(1.5,0.5),0.5,2.5) + β ~ truncated(Normal(1.2,0.5),0,2) + γ ~ truncated(Normal(3.0,0.5),1,4) + δ ~ truncated(Normal(1.0,0.5),0,2) + + p = [α,β,γ,δ] + prob = remake(prob1, p=p) + predicted = solve(prob,Tsit5(),saveat=0.1) + + for i = 1:length(predicted) + data[i] ~ Normal(predicted[i][2], σ) # predicted[i][2] is the data for y - a scalar, so we use Normal instead of MvNormal + end +end + +model2 = fitlv2(odedata[2,:], prob1) +``` + +``` +DynamicPPL.Model{Main.##WeaveSandBox#253.var"#5#6",(:data, :prob1),(),(),Tu +ple{Array{Float64,1},SciMLBase.ODEProblem{Array{Float64,1},Tuple{Float64,Fl +oat64},true,Array{Float64,1},SciMLBase.ODEFunction{true,typeof(Main.##Weave +SandBox#253.lotka_volterra),LinearAlgebra.UniformScaling{Bool},Nothing,Noth +ing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing +,Nothing,typeof(SciMLBase.DEFAULT_OBSERVED),Nothing},Base.Iterators.Pairs{U +nion{},Union{},Tuple{},NamedTuple{(),Tuple{}}},SciMLBase.StandardODEProblem +}},Tuple{}}(:fitlv2, Main.##WeaveSandBox#253.var"#5#6"(), (data = [2.200730 +590544725, 0.8584002186440604, 0.3130803892338444, 0.8065538543184622, -0.3 +4719524379658445, 0.2827563462601055, 0.4633732909134419, 0.938813994609707 +2, -0.029638888419957155, -0.10766570796447744 … 4.484466907306791, 2.276 +637854709268, 3.034635398109261, 1.6534146147281914, 2.3126757947633125, 3. +430419239300897, 1.481768351221498, 1.7989355388635417, 1.343881963121325, +0.25843622408034905], prob1 = ODEProblem with uType Array{Float64,1} and tT +ype Float64. In-place: true +timespan: (0.0, 10.0) +u0: [1.0, 1.0]), NamedTuple()) +``` + + + + + +Here we use the multithreading functionality [available](https://turing.ml/dev/docs/using-turing/guide#multithreaded-sampling) in Turing.jl to sample 3 independent chains + + +```julia +Threads.nthreads() +``` + +``` +6 +``` + + + +```julia +# This next command runs 3 independent chains with multithreading. 
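# Note: this only runs in parallel if Julia was started with multiple threads,
# e.g. via `julia --threads 4` or the JULIA_NUM_THREADS environment variable.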
+chain2 = sample(model2, NUTS(.45), MCMCThreads(), 5000, 3, progress=false) +``` + +``` +Chains MCMC chain (5000×17×3 Array{Float64,3}): + +Iterations = 1:5000 +Thinning interval = 1 +Chains = 1, 2, 3 +Samples per chain = 5000 +parameters = α, β, γ, δ, σ +internals = acceptance_rate, hamiltonian_energy, hamiltonian_energy +_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, +nom_step_size, numerical_error, step_size, tree_depth + +Summary Statistics + parameters mean std naive_se mcse ess rhat + Symbol Float64 Float64 Float64 Float64 Float64 Float64 + + α 1.5612 0.1706 0.0014 0.0082 174.9037 1.0022 + β 1.1232 0.1357 0.0011 0.0063 201.8733 1.0014 + γ 2.9663 0.2786 0.0023 0.0137 167.4011 1.0034 + δ 0.9418 0.2206 0.0018 0.0105 174.6266 1.0012 + σ 0.8129 0.0637 0.0005 0.0039 81.8891 1.0334 + +Quantiles + parameters 2.5% 25.0% 50.0% 75.0% 97.5% + Symbol Float64 Float64 Float64 Float64 Float64 + + α 1.2771 1.4407 1.5336 1.6756 1.9369 + β 0.8886 1.0345 1.1018 1.2067 1.4164 + γ 2.4397 2.7648 2.9761 3.1589 3.4934 + δ 0.5388 0.7779 0.9475 1.0809 1.3972 + σ 0.6917 0.7684 0.8159 0.8532 0.9457 +``` + + + +```julia +pl = scatter(sol1.t, odedata'); +chain_array2 = Array(chain2) +for k in 1:300 + resol = solve(remake(prob1,p=chain_array2[rand(1:12000), 1:4]),Tsit5(),saveat=0.1) + # Note that due to a bug in AxisArray, the variables from the chain will be returned always in + # the order it is stored in the array, not by the specified order in the call - :α, :β, :γ, :δ + plot!(resol, alpha=0.1, color = "#BBBBBB", legend = false) +end +#display(pl) +plot!(sol1, w=1, legend = false) +``` + +![](figures/10_bayesian-differential-equations_11_1.png) + + + +Note that here, the data values of $$x$$ (blue dots) were not given to the model! Yet, the model could predict the values of $$x$$ relatively accurately, albeit with a wider distribution of solutions, reflecting the greater uncertainty in the prediction of the $$x$$ values. + +## Inference of Delay Differential Equations + +Here we show an example of inference with another type of differential equation: a Delay Differential Equation (DDE). A DDE is an DE system where derivatives are function of values at an earlier point in time. This is useful to model a delayed effect, like incubation time of a virus for instance. + +For this, we will define a [`DDEProblem`](https://diffeq.sciml.ai/stable/tutorials/dde_example/), from the package DifferentialEquations.jl. + +Here is a delayed version of the lokta voltera system: + +$$\frac{dx}{dt} = \alpha x(t-\tau) - \beta y(t) x(t)$$ + +$$\frac{dy}{dt} = - \gamma y(t) + \delta x(t) y(t) $$ + +Where $$x(t-\tau)$$ is the variable $$x$$ at an earlier time point. We specify the delayed variable with a function `h(p, t)`, as described in the [DDE example](https://diffeq.sciml.ai/stable/tutorials/dde_example/). + + +```julia +function delay_lotka_volterra(du, u, h, p, t) + x, y = u + α, β, γ, δ = p + du[1] = α * h(p, t-1; idxs=1) - β * x * y + du[2] = -γ * y + δ * x * y + return +end + +p = (1.5,1.0,3.0,1.0) +u0 = [1.0; 1.0] +tspan = (0.0,10.0) +h(p, t; idxs::Int) = 1.0 +prob1 = DDEProblem(delay_lotka_volterra,u0,h,tspan,p) +``` + +``` +DDEProblem with uType Array{Float64,1} and tType Float64. 
In-place: true +timespan: (0.0, 10.0) +u0: [1.0, 1.0] +``` + + + +```julia +sol = solve(prob1,saveat=0.1) +ddedata = Array(sol) +ddedata = ddedata + 0.5 * randn(size(ddedata)) +``` + +``` +2×101 Array{Float64,2}: + 1.18609 0.0634866 1.19388 … 2.21241 3.28584 3.16553 2.22741 + 0.808641 0.944246 1.46414 1.73124 1.19192 1.28923 1.14635 +``` + + + + + +Plot the data: + +```julia +scatter(sol.t, ddedata'); plot!(sol) +``` + +![](figures/10_bayesian-differential-equations_14_1.png) + + + +Now we define and run the Turing model. + +```julia +Turing.setadbackend(:forwarddiff) +@model function fitlv(data, prob1) + + σ ~ InverseGamma(2, 3) + α ~ Truncated(Normal(1.5,0.5),0.5,2.5) + β ~ Truncated(Normal(1.2,0.5),0,2) + γ ~ Truncated(Normal(3.0,0.5),1,4) + δ ~ Truncated(Normal(1.0,0.5),0,2) + + p = [α,β,γ,δ] + + #prob = DDEProblem(delay_lotka_volterra,u0,_h,tspan,p) + prob = remake(prob1, p=p) + predicted = solve(prob,saveat=0.1) + for i = 1:length(predicted) + data[:,i] ~ MvNormal(predicted[i], σ) + end +end; +model = fitlv(ddedata, prob1) +``` + +``` +DynamicPPL.Model{Main.##WeaveSandBox#253.var"#8#9",(:data, :prob1),(),(),Tu +ple{Array{Float64,2},SciMLBase.DDEProblem{Array{Float64,1},Tuple{Float64,Fl +oat64},Tuple{},Tuple{},true,NTuple{4,Float64},SciMLBase.DDEFunction{true,ty +peof(Main.##WeaveSandBox#253.delay_lotka_volterra),LinearAlgebra.UniformSca +ling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing, +Nothing,Nothing,Nothing,Nothing},typeof(Main.##WeaveSandBox#253.h),Base.Ite +rators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},SciMLBase.Stan +dardDDEProblem}},Tuple{}}(:fitlv, Main.##WeaveSandBox#253.var"#8#9"(), (dat +a = [1.1860906554746231 0.06348658995769019 … 3.1655347918676964 2.22741117 +6264933; 0.8086410615247461 0.9442456964892731 … 1.2892250042736726 1.14634 +85009181613], prob1 = DDEProblem with uType Array{Float64,1} and tType Floa +t64. In-place: true +timespan: (0.0, 10.0) +u0: [1.0, 1.0]), NamedTuple()) +``` + + + + + +Then we draw samples using multithreading; this time, we draw 3 independent chains in parallel using `MCMCThreads`. + +```julia +chain = sample(model, NUTS(.65), MCMCThreads(), 300, 3, progress=true) +plot(chain) +``` + +![](figures/10_bayesian-differential-equations_16_1.png) + + + +Finally, we select a 100 sets of parameters from the first chain and plot solutions. 
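One way to make sure these draws come strictly from the first chain is to index the `Chains` object before flattening it. This is a minimal sketch; like the loop below, it assumes the first four parameter columns are α, β, γ, δ (see the note on ordering in that loop):

```julia
# MCMCChains objects index as [iterations, parameters, chains], so this keeps chain 1 only.
first_chain  = chain[:, :, 1]
chain_array1 = Array(first_chain)   # one row per draw from the first chain
```

Rows of `chain_array1` can then be drawn at random inside the plotting loop below in place of rows of the flattened all-chain array.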
+
+
+```julia
+chain
+```
+
+```
+Chains MCMC chain (300×17×3 Array{Float64,3}):
+
+Iterations = 1:300
+Thinning interval = 1
+Chains = 1, 2, 3
+Samples per chain = 300
+parameters = α, β, γ, δ, σ
+internals = acceptance_rate, hamiltonian_energy, hamiltonian_energy
+_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps,
+nom_step_size, numerical_error, step_size, tree_depth
+
+Summary Statistics
+  parameters     mean      std   naive_se     mcse        ess     rhat
+      Symbol  Float64  Float64    Float64  Float64    Float64  Float64
+
+           α   1.5142   0.0583     0.0019   0.0030   349.4243   1.0016
+           β   0.9921   0.0459     0.0015   0.0027   392.5261   1.0024
+           γ   2.9372   0.1188     0.0040   0.0055   383.4066   1.0013
+           δ   0.9823   0.0420     0.0014   0.0022   356.1193   1.0018
+           σ   0.4869   0.0241     0.0008   0.0007   651.6937   0.9998
+
+Quantiles
+  parameters     2.5%    25.0%    50.0%    75.0%    97.5%
+      Symbol  Float64  Float64  Float64  Float64  Float64
+
+           α   1.4002   1.4730   1.5134   1.5536   1.6272
+           β   0.9060   0.9585   0.9920   1.0235   1.0853
+           γ   2.7137   2.8563   2.9329   3.0122   3.1963
+           δ   0.9055   0.9539   0.9814   1.0093   1.0731
+           σ   0.4422   0.4689   0.4859   0.5043   0.5360
+```
+
+
+
+```julia
+pl = scatter(sol.t, ddedata')
+chain_array = Array(chain)
+for k in 1:100
+    resol = solve(remake(prob1,p=chain_array[rand(1:450),1:4]),Tsit5(),saveat=0.1)
+    # Note that due to a bug in AxisArray, the variables from the chain will be returned always in
+    # the order it is stored in the array, not by the specified order in the call - :α, :β, :γ, :δ
+
+    plot!(resol, alpha=0.1, color = "#BBBBBB", legend = false)
+end
+#display(pl)
+plot!(sol)
+```
+
+![](figures/10_bayesian-differential-equations_18_1.png)
+
+
+
+Here again, the dots are the data fed to the model, the continuous colored lines are the "true" solution, and the gray lines are solutions drawn from the posterior. The fit is pretty good even though the data was quite noisy to start.
+
+## Scaling to Large Models: Adjoint Sensitivities
+
+DifferentialEquations.jl's efficiency for large stiff models has been shown in multiple [benchmarks](https://github.com/SciML/DiffEqBenchmarks.jl). To learn more about how to optimize solving performance for stiff problems you can take a look at the [docs](https://docs.sciml.ai/latest/tutorials/advanced_ode_example/).
+
+[Sensitivity analysis](https://docs.sciml.ai/latest/analysis/sensitivity/), or automatic differentiation (AD) of the solver, is provided by the DiffEq suite. The model sensitivities are the derivatives of the solution $$u(t)$$ with respect to the parameters. Specifically, the local sensitivity of the solution to a parameter is defined by how much the solution would change with changes in that parameter. Sensitivity analysis provides a cheap way to calculate the gradient of the solution which can be used in parameter estimation and other optimization tasks.
+
+
+The AD ecosystem in Julia allows you to switch between forward mode, reverse mode, source to source and other choices of AD and have it work with any Julia code. For a user to make use of this within [SciML](https://sciml.ai), [high level interactions in `solve`](https://docs.sciml.ai/latest/analysis/sensitivity/#High-Level-Interface:-sensealg-1) automatically plug into those AD systems to allow for choosing advanced sensitivity analysis (derivative calculation) [methods](https://docs.sciml.ai/latest/analysis/sensitivity/#Sensitivity-Algorithms-1).
+
+More theoretical details on these methods can be found at: https://docs.sciml.ai/latest/extras/sensitivity_math/.
+
+While these sensitivity analysis methods may seem complicated (and they are!), using them is dead simple.
Here is a version of the Lotka-Volterra model with adjoints enabled. + +All we had to do is switch the AD backend to one of the adjoint-compatible backends (ReverseDiff, Tracker, or Zygote) and boom the system takes over and we're using adjoint methods! Notice that on this model adjoints are slower. This is because adjoints have a higher overhead on small parameter models and we suggest only using these methods for models with around 100 parameters or more. For more details, see https://arxiv.org/abs/1812.01892. + + +```julia +using Zygote, DiffEqSensitivity +Turing.setadbackend(:zygote) +prob1 = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) +``` + +``` +ODEProblem with uType Array{Float64,1} and tType Float64. In-place: true +timespan: (0.0, 10.0) +u0: [1.0, 1.0] +``` + + + +```julia +@model function fitlv(data, prob) + σ ~ InverseGamma(2, 3) + α ~ truncated(Normal(1.5,0.5),0.5,2.5) + β ~ truncated(Normal(1.2,0.5),0,2) + γ ~ truncated(Normal(3.0,0.5),1,4) + δ ~ truncated(Normal(1.0,0.5),0,2) + p = [α,β,γ,δ] + prob = remake(prob, p=p) + + predicted = solve(prob,saveat=0.1) + for i = 1:length(predicted) + data[:,i] ~ MvNormal(predicted[i], σ) + end +end; +model = fitlv(odedata, prob1) +chain = sample(model, NUTS(.65),1000) +``` + +``` +Chains MCMC chain (1000×17×1 Array{Float64,3}): + +Iterations = 1:1000 +Thinning interval = 1 +Chains = 1 +Samples per chain = 1000 +parameters = α, β, γ, δ, σ +internals = acceptance_rate, hamiltonian_energy, hamiltonian_energy +_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, +nom_step_size, numerical_error, step_size, tree_depth + +Summary Statistics + parameters mean std naive_se mcse ess rhat + Symbol Float64 Float64 Float64 Float64 Float64 Float64 + + α 1.5546 0.0541 0.0017 0.0032 211.5538 1.0024 + β 1.0906 0.0533 0.0017 0.0026 264.2906 0.9991 + γ 2.8837 0.1472 0.0047 0.0091 207.6229 1.0023 + δ 0.9397 0.0525 0.0017 0.0031 198.9996 1.0024 + σ 0.8163 0.0412 0.0013 0.0021 458.2821 0.9991 + +Quantiles + parameters 2.5% 25.0% 50.0% 75.0% 97.5% + Symbol Float64 Float64 Float64 Float64 Float64 + + α 1.4516 1.5194 1.5551 1.5900 1.6548 + β 0.9938 1.0520 1.0900 1.1240 1.2007 + γ 2.6248 2.7813 2.8758 2.9740 3.1869 + δ 0.8502 0.9037 0.9368 0.9719 1.0482 + σ 0.7411 0.7865 0.8154 0.8419 0.9009 +``` + + + + + +Now we can exercise control of the sensitivity analysis method that is used by using the `sensealg` keyword argument. 
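As a quick orientation, the adjoint configuration we pass below comes from the DiffEqSensitivity package and can be constructed on its own (a minimal sketch):

```julia
using DiffEqSensitivity

# Continuous adjoint that interpolates the forward solution during the reverse pass;
# ReverseDiffVJP(true) computes the vector-Jacobian products with a compiled
# ReverseDiff tape, which is appropriate here because the ODE function has no branching.
sensealg = InterpolatingAdjoint(autojacvec = ReverseDiffVJP(true))
```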
Let's choose the `InterpolatingAdjoint` from the available AD [methods](https://docs.sciml.ai/latest/analysis/sensitivity/#Sensitivity-Algorithms-1) and enable a compiled ReverseDiff vector-Jacobian product:
+
+
+```julia
+@model function fitlv(data, prob)
+    σ ~ InverseGamma(2, 3)
+    α ~ truncated(Normal(1.5,0.5),0.5,2.5)
+    β ~ truncated(Normal(1.2,0.5),0,2)
+    γ ~ truncated(Normal(3.0,0.5),1,4)
+    δ ~ truncated(Normal(1.0,0.5),0,2)
+    p = [α,β,γ,δ]
+    prob = remake(prob, p=p)
+    predicted = solve(prob,saveat=0.1,sensealg=InterpolatingAdjoint(autojacvec=ReverseDiffVJP(true)))
+    for i = 1:length(predicted)
+        data[:,i] ~ MvNormal(predicted[i], σ)
+    end
+end;
+model = fitlv(odedata, prob1)
+@time chain = sample(model, NUTS(.65),1000)
+```
+
+```
+1458.424425 seconds (5.32 G allocations: 280.687 GiB, 3.79% gc time)
+Chains MCMC chain (1000×17×1 Array{Float64,3}):
+
+Iterations = 1:1000
+Thinning interval = 1
+Chains = 1
+Samples per chain = 1000
+parameters = α, β, γ, δ, σ
+internals = acceptance_rate, hamiltonian_energy, hamiltonian_energy
+_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps,
+nom_step_size, numerical_error, step_size, tree_depth
+
+Summary Statistics
+  parameters     mean      std   naive_se     mcse        ess     rhat
+      Symbol  Float64  Float64    Float64  Float64    Float64  Float64
+
+           α   1.5577   0.0527     0.0017   0.0030   198.5876   0.9991
+           β   1.0928   0.0527     0.0017   0.0031   233.5858   0.9991
+           γ   2.8761   0.1391     0.0044   0.0080   207.3585   0.9995
+           δ   0.9362   0.0489     0.0015   0.0029   208.3950   0.9990
+           σ   0.8120   0.0399     0.0013   0.0011   659.3252   1.0020
+
+Quantiles
+  parameters     2.5%    25.0%    50.0%    75.0%    97.5%
+      Symbol  Float64  Float64  Float64  Float64  Float64
+
+           α   1.4573   1.5234   1.5540   1.5896   1.6707
+           β   1.0011   1.0560   1.0913   1.1255   1.2084
+           γ   2.6005   2.7871   2.8753   2.9665   3.1595
+           δ   0.8402   0.9044   0.9370   0.9677   1.0361
+           σ   0.7367   0.7843   0.8096   0.8364   0.8995
+```
+
+
+
+
+For more examples of adjoint usage on large parameter models, consult the [DiffEqFlux documentation](https://diffeqflux.sciml.ai/dev/).
+
+## Inference of a Stochastic Differential Equation
+A Stochastic Differential Equation ([SDE](https://diffeq.sciml.ai/stable/tutorials/sde_example/)) is a differential equation that has a stochastic (noise) term in the expression of the derivatives. Here we fit a stochastic version of the Lotka-Volterra system.
+
+We use a quasi-likelihood approach in which all trajectories of a solution are compared instead of a reduction such as the mean; this increases the robustness of the fit and makes the likelihood more identifiable. We use SOSRI to solve the equation. The NUTS sampler is a bit sensitive to the stochasticity of the likelihood, since the gradient changes with every evaluation, so we use NUTS with a target acceptance rate of `0.25`.
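
Concretely, the stochastic system fitted below is the Lotka-Volterra model with multiplicative noise on each state, matching the `multiplicative_noise!` function defined in the next code block:

$$dx = (\alpha - \beta y)x\,dt + \phi_1 x\,dW_1$$

$$dy = (\delta x - \gamma)y\,dt + \phi_2 y\,dW_2$$

where $$W_1$$ and $$W_2$$ are independent Wiener processes and the noise magnitudes $$\phi_1$$ and $$\phi_2$$ are inferred along with the other parameters.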
+
+
+```julia
+u0 = [1.0,1.0]
+tspan = (0.0,10.0)
+function multiplicative_noise!(du,u,p,t)
+    x,y = u
+    du[1] = p[5]*x
+    du[2] = p[6]*y
+end
+p = [1.5,1.0,3.0,1.0,0.1,0.1]
+
+function lotka_volterra!(du,u,p,t)
+    x,y = u
+    α,β,γ,δ = p
+    du[1] = dx = α*x - β*x*y
+    du[2] = dy = δ*x*y - γ*y
+end
+
+
+prob_sde = SDEProblem(lotka_volterra!,multiplicative_noise!,u0,tspan,p)
+
+ensembleprob = EnsembleProblem(prob_sde)
+@time data = solve(ensembleprob,SOSRI(),saveat=0.1,trajectories=1000)
+plot(EnsembleSummary(data))
+```
+
+```
+6.448378 seconds (39.60 M allocations: 1.342 GiB, 4.35% gc time)
+```
+
+
+![](figures/10_bayesian-differential-equations_22_1.png)
+
+```julia
+Turing.setadbackend(:forwarddiff)
+@model function fitlv(data, prob)
+    σ ~ InverseGamma(2,3)
+    α ~ truncated(Normal(1.3,0.5),0.5,2.5)
+    β ~ truncated(Normal(1.2,0.25),0.5,2)
+    γ ~ truncated(Normal(3.2,0.25),2.2,4.0)
+    δ ~ truncated(Normal(1.2,0.25),0.5,2.0)
+    ϕ1 ~ truncated(Normal(0.12,0.3),0.05,0.25)
+    ϕ2 ~ truncated(Normal(0.12,0.3),0.05,0.25)
+    p = [α,β,γ,δ,ϕ1,ϕ2]
+    prob = remake(prob, p=p)
+    predicted = solve(prob,SOSRI(),saveat=0.1)
+
+    # If the solver fails, give this sample zero posterior probability so it is rejected.
+    if predicted.retcode != :Success
+        Turing.acclogp!(_varinfo, -Inf)
+    end
+    for j in 1:length(data)
+        for i = 1:length(predicted)
+            data[j][i] ~ MvNormal(predicted[i],σ)
+        end
+    end
+end;
+```
+
+
+
+
+We use the NUTS sampler with a low target acceptance rate and explicit initial parameter values, since estimating the parameters of an SDE with HMC is challenging. The probabilistic nature of the SDE solution makes the likelihood function noisy, which poses a challenge for NUTS because the gradient changes with every evaluation. SGHMC might be better suited here.
+
+
+```julia
+model = fitlv(data, prob_sde)
+chain = sample(model, NUTS(0.25), 5000, init_theta = [1.5,1.3,1.2,2.7,1.2,0.12,0.12])
+plot(chain)
+```
+
+![](figures/10_bayesian-differential-equations_24_1.png)
+
+
+## Appendix
+ This tutorial is part of the TuringTutorials repository, found at: https://github.com/TuringLang/TuringTutorials.
+ +To locally run this tutorial, do the following commands: +```julia, eval = false +using TuringTutorials +TuringTutorials.weave_file("10-bayesian-differential-equations", "10_bayesian-differential-equations.jmd") +``` + +Computer Information: +``` +Julia Version 1.5.3 +Commit 788b2c77c1 (2020-11-09 13:37 UTC) +Platform Info: + OS: Linux (x86_64-pc-linux-gnu) + CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz + WORD_SIZE: 64 + LIBM: libopenlibm + LLVM: libLLVM-9.0.1 (ORCJIT, skylake) +Environment: + JULIA_NUM_THREADS = 6 + +``` + +Package Information: + +``` +Status `~/Projects/public/TuringTutorials/tutorials/10-bayesian-differential-equations/Project.toml` + [a93c6f00] DataFrames v0.22.5 + [2b5f629d] DiffEqBase v6.57.7 + [ebbdde9d] DiffEqBayes v2.23.0 + [41bf760c] DiffEqSensitivity v6.42.0 + [0c46a032] DifferentialEquations v6.16.0 + [31c24e10] Distributions v0.23.12 + [c7f686f2] MCMCChains v4.7.0 + [91a5bcdd] Plots v1.10.6 + [f3b207a7] StatsPlots v0.14.19 + [fce5fe82] Turing v0.15.11 + [9a3f8284] Random + +``` diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_10_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_10_1.png new file mode 100644 index 000000000..110091e64 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_10_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_11_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_11_1.png new file mode 100644 index 000000000..79fd0b8b1 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_11_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_12_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_12_1.png new file mode 100644 index 000000000..231a072a7 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_12_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_14_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_14_1.png new file mode 100644 index 000000000..d738e90ce Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_14_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_15_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_15_1.png new file mode 100644 index 000000000..d492808a9 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_15_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_16_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_16_1.png new file mode 100644 index 000000000..123be3d94 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_16_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_17_1.png 
b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_17_1.png new file mode 100644 index 000000000..d27298236 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_17_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_18_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_18_1.png new file mode 100644 index 000000000..d614c0bc9 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_18_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_19_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_19_1.png new file mode 100644 index 000000000..16871c3d6 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_19_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_22_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_22_1.png new file mode 100644 index 000000000..6778eec5e Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_22_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_23_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_23_1.png new file mode 100644 index 000000000..2fb9c5fdc Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_23_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_24_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_24_1.png new file mode 100644 index 000000000..88891a1d6 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_24_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_2_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_2_1.png new file mode 100644 index 000000000..60a8e53b3 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_2_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_3_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_3_1.png new file mode 100644 index 000000000..b8fdd8c08 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_3_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_4_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_4_1.png new file mode 100644 index 000000000..b8fdd8c08 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_4_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_5_1.png 
b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_5_1.png new file mode 100644 index 000000000..8912ebd8a Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_5_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_6_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_6_1.png new file mode 100644 index 000000000..81ca7eb89 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_6_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_7_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_7_1.png new file mode 100644 index 000000000..94e304009 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_7_1.png differ diff --git a/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_8_1.png b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_8_1.png new file mode 100644 index 000000000..ef8c47717 Binary files /dev/null and b/markdown/10-bayesian-differential-equations/figures/10_bayesian-differential-equations_8_1.png differ diff --git a/tutorials/00-introduction/00_introduction.jmd b/tutorials/00-introduction/00_introduction.jmd index d2e5598e7..4a5d2a5b5 100644 --- a/tutorials/00-introduction/00_introduction.jmd +++ b/tutorials/00-introduction/00_introduction.jmd @@ -1,6 +1,7 @@ --- title: Introduction to Turing permalink: /:collection/:name/ +redirect_from: tutorials/0-introduction/ --- ## Introduction diff --git a/tutorials/01-gaussian-mixture-model/01_gaussian-mixture-model.jmd b/tutorials/01-gaussian-mixture-model/01_gaussian-mixture-model.jmd index 2f75d531c..2fb834bfd 100644 --- a/tutorials/01-gaussian-mixture-model/01_gaussian-mixture-model.jmd +++ b/tutorials/01-gaussian-mixture-model/01_gaussian-mixture-model.jmd @@ -1,6 +1,7 @@ --- title: Unsupervised Learning using Bayesian Mixture Models permalink: /:collection/:name/ +redirect_from: tutorials/1-gaussianmixturemodel/ --- The following tutorial illustrates the use *Turing* for clustering data using a Bayesian mixture model. The aim of this task is to infer a latent grouping (hidden structure) from unlabelled data. diff --git a/tutorials/02-logistic-regression/02_logistic-regression.jmd b/tutorials/02-logistic-regression/02_logistic-regression.jmd index 226aa3b52..b309bf064 100644 --- a/tutorials/02-logistic-regression/02_logistic-regression.jmd +++ b/tutorials/02-logistic-regression/02_logistic-regression.jmd @@ -1,6 +1,7 @@ --- title: Bayesian Logistic Regression permalink: /:collection/:name/ +redirect_from: tutorials/2-logisticregression/ --- [Bayesian logistic regression](https://en.wikipedia.org/wiki/Logistic_regression#Bayesian) is the Bayesian counterpart to a common tool in machine learning, logistic regression. The goal of logistic regression is to predict a one or a zero for a given training item. An example might be predicting whether someone is sick or ill given their symptoms and personal information. 
diff --git a/tutorials/03-bayesian-neural-network/03_bayesian-neural-network.jmd b/tutorials/03-bayesian-neural-network/03_bayesian-neural-network.jmd index 1f051bb96..9bca3cdc8 100644 --- a/tutorials/03-bayesian-neural-network/03_bayesian-neural-network.jmd +++ b/tutorials/03-bayesian-neural-network/03_bayesian-neural-network.jmd @@ -1,6 +1,7 @@ --- title: Bayesian Neural Networks permalink: /:collection/:name/ +redirect_from: tutorials/3-bayesnn/ --- In this tutorial, we demonstrate how one can implement a Bayesian Neural Network using a combination of Turing and [Flux](https://github.com/FluxML/Flux.jl), a suite of tools machine learning. We will use Flux to specify the neural network's layers and Turing to implement the probabalistic inference, with the goal of implementing a classification algorithm. diff --git a/tutorials/04-hidden-markov-model/04_hidden-markov-model.jmd b/tutorials/04-hidden-markov-model/04_hidden-markov-model.jmd index 0c308bd46..f4717164b 100644 --- a/tutorials/04-hidden-markov-model/04_hidden-markov-model.jmd +++ b/tutorials/04-hidden-markov-model/04_hidden-markov-model.jmd @@ -1,6 +1,7 @@ --- title: Bayesian Hidden Markov Models permalink: /:collection/:name/ +redirect_from: tutorials/4-bayeshmm/ --- This tutorial illustrates training Bayesian [Hidden Markov Models](https://en.wikipedia.org/wiki/Hidden_Markov_model) (HMM) using Turing. The main goals are learning the transition matrix, emission parameter, and hidden states. For a more rigorous academic overview on Hidden Markov Models, see [An introduction to Hidden Markov Models and Bayesian Networks](http://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf) (Ghahramani, 2001). diff --git a/tutorials/05-linear-regression/05_linear-regression.jmd b/tutorials/05-linear-regression/05_linear-regression.jmd index 73b2a7ad7..d3c4b3da4 100644 --- a/tutorials/05-linear-regression/05_linear-regression.jmd +++ b/tutorials/05-linear-regression/05_linear-regression.jmd @@ -1,6 +1,7 @@ --- title: Linear Regression permalink: /:collection/:name/ +redirect_from: tutorials/5-linearregression/ --- Turing is powerful when applied to complex hierarchical models, but it can also be put to task at common statistical procedures, like [linear regression](https://en.wikipedia.org/wiki/Linear_regression). This tutorial covers how to implement a linear regression model in Turing. diff --git a/tutorials/06-infinite-mixture-model/06_infinite-mixture-model.jmd b/tutorials/06-infinite-mixture-model/06_infinite-mixture-model.jmd index bf0bc0113..5447542d3 100644 --- a/tutorials/06-infinite-mixture-model/06_infinite-mixture-model.jmd +++ b/tutorials/06-infinite-mixture-model/06_infinite-mixture-model.jmd @@ -1,6 +1,7 @@ --- title: Probabilistic Modelling using the Infinite Mixture Model permalink: /:collection/:name/ +redirect_from: tutorials/6-infinitemixturemodel/ --- In many applications it is desirable to allow the model to adjust its complexity to the amount the data. Consider for example the task of assigning objects into clusters or groups. This task often involves the specification of the number of groups. However, often times it is not known beforehand how many groups exist. Moreover, in some applictions, e.g. modelling topics in text documents or grouping species, the number of examples per group is heavy tailed. This makes it impossible to predefine the number of groups and requiring the model to form new groups when data points from previously unseen groups are observed. 
diff --git a/tutorials/07-poisson-regression/07_poisson-regression.jmd b/tutorials/07-poisson-regression/07_poisson-regression.jmd index c11307ff8..7a373d973 100644 --- a/tutorials/07-poisson-regression/07_poisson-regression.jmd +++ b/tutorials/07-poisson-regression/07_poisson-regression.jmd @@ -1,6 +1,7 @@ --- title: Bayesian Poisson Regression permalink: /:collection/:name/ +redirect_from: tutorials/7-poissonregression/ --- This notebook is ported from the [example notebook](https://docs.pymc.io/notebooks/GLM-poisson-regression.html) of PyMC3 on Poisson Regression. diff --git a/tutorials/08-multionomial-regression/08_multinomial-logistic-regression.jmd b/tutorials/08-multionomial-regression/08_multinomial-logistic-regression.jmd index 9c7713d33..fe52647a5 100644 --- a/tutorials/08-multionomial-regression/08_multinomial-logistic-regression.jmd +++ b/tutorials/08-multionomial-regression/08_multinomial-logistic-regression.jmd @@ -1,6 +1,7 @@ --- title: Bayesian Multinomial Logistic Regression permalink: /:collection/:name/ +redirect_from: tutorials/8-multinomiallogisticregression/ --- [Multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) is an extension of logistic regression. Logistic regression is used to model problems in which there are exactly two possible discrete outcomes. Multinomial logistic regression is used to model problems in which there are two or more possible discrete outcomes. diff --git a/tutorials/09-variational-inference/09_variational-inference.jmd b/tutorials/09-variational-inference/09_variational-inference.jmd index 624adba51..7da86300d 100644 --- a/tutorials/09-variational-inference/09_variational-inference.jmd +++ b/tutorials/09-variational-inference/09_variational-inference.jmd @@ -1,6 +1,7 @@ --- title: Variational inference (VI) in Turing.jl permalink: /:collection/:name/ +redirect_from: tutorials/9-variationalinference/ --- In this post we'll have a look at what's know as **variational inference (VI)**, a family of _approximate_ Bayesian inference methods, and how to use it in Turing.jl as an alternative to other approaches such as MCMC. In particular, we will focus on one of the more standard VI methods called **Automatic Differentation Variational Inference (ADVI)**. diff --git a/tutorials/10-bayesian-differential-equations/10_bayesian-differential-equations.jmd b/tutorials/10-bayesian-differential-equations/10_bayesian-differential-equations.jmd index ecfcc3d20..fef24c161 100644 --- a/tutorials/10-bayesian-differential-equations/10_bayesian-differential-equations.jmd +++ b/tutorials/10-bayesian-differential-equations/10_bayesian-differential-equations.jmd @@ -1,20 +1,21 @@ --- title: Bayesian Estimation of Differential Equations permalink: /:collection/:name/ +redirect_from: tutorials/10-bayesiandiffeq/ --- Most of the scientific community deals with the basic problem of trying to mathematically model the reality around them and this often involves dynamical systems. The general trend to model these complex dynamical systems is through the use of differential equations. Differential equation models often have non-measurable parameters. The popular “forward-problem” of simulation consists of solving the differential equations for a given set of parameters, the “inverse problem” to simulation, known as parameter estimation, is the process of utilizing data to determine these model parameters. Bayesian inference provides a robust approach to parameter estimation with quantified uncertainty. 
```julia -using Turing, Distributions, DataFrames, DifferentialEquations, DiffEqSensitivity +using Turing, Distributions, DifferentialEquations # Import MCMCChain, Plots, and StatsPlots for visualizations and diagnostics. using MCMCChains, Plots, StatsPlots # Set a seed for reproducibility. using Random -Random.seed!(12); +Random.seed!(14); ``` ## The Lotka-Volterra Model @@ -25,79 +26,244 @@ $$\frac{dx}{dt} = (\alpha - \beta y)x$$ $$\frac{dy}{dt} = (\delta x - \gamma)y$$ + ```julia function lotka_volterra(du,u,p,t) x, y = u - α, β, δ, γ = p - du[1] = dx = (α - β*y)x - du[2] = dy = (δ*x - γ)y + α, β, γ, δ = p + du[1] = (α - β*y)x # dx = + du[2] = (δ*x - γ)y # dy = end p = [1.5, 1.0, 3.0, 1.0] u0 = [1.0,1.0] -prob = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) -sol = solve(prob,Tsit5()) +prob1 = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) +sol = solve(prob1,Tsit5()) plot(sol) ``` We'll generate the data to use for the parameter estimation from simulation. -With the `saveat` [argument](https://docs.sciml.ai/latest/basics/common_solver_opts/) we specify that the solution is stored only at `0.1` time units. +With the `saveat` [argument](https://docs.sciml.ai/latest/basics/common_solver_opts/) we specify that the solution is stored only at `0.1` time units. To make the data look more realistic, we add random noise using the function `randn`. + ```julia -odedata = Array(solve(prob,Tsit5(),saveat=0.1)) +sol1 = solve(prob1,Tsit5(),saveat=0.1) +odedata = Array(sol1) + 0.8 * randn(size(Array(sol1))) +plot(sol1, alpha = 0.3, legend = false); scatter!(sol1.t, odedata') ``` -## Fitting Lotka-Volterra with DiffEqBayes +## Direct Handling of Bayesian Estimation with Turing -[DiffEqBayes.jl](https://github.com/SciML/DiffEqBayes.jl) is a high level package that set of extension functionality for estimating the parameters of differential equations using Bayesian methods. It allows the choice of using CmdStan.jl, Turing.jl, DynamicHMC.jl and ApproxBayes.jl to perform a Bayesian estimation of a differential equation problem specified via the DifferentialEquations.jl interface. You can read the [docs](https://docs.sciml.ai/latest/analysis/parameter_estimation/#Bayesian-Methods-1) for an understanding of the available functionality. +Previously, functions in Turing and DifferentialEquations were not inter-composable, so Bayesian inference of differential equations needed to be handled by another package called [DiffEqBayes.jl](https://github.com/SciML/DiffEqBayes.jl) (note that DiffEqBayes works also with CmdStan.jl, Turing.jl, DynamicHMC.jl and ApproxBayes.jl - see the [DiffEqBayes docs](https://docs.sciml.ai/latest/analysis/parameter_estimation/#Bayesian-Methods-1) for more info). + +From now on however, Turing and DifferentialEquations are completely composable and we can write of the differential equation inside a Turing `@model` and it will just work. 
Therefore, we can rewrite the Lotka Volterra parameter estimation problem with a Turing `@model` interface as below: ```julia -using DiffEqBayes -t = 0:0.1:10.0 -priors = [truncated(Normal(1.5,0.5),0.5,2.5),truncated(Normal(1.2,0.5),0,2),truncated(Normal(3.0,0.5),1,4),truncated(Normal(1.0,0.5),0,2)] -bayesian_result_turing = turing_inference(prob,Tsit5(),t,odedata,priors,num_samples=10_000) +Turing.setadbackend(:forwarddiff) + +@model function fitlv(data, prob1) + σ ~ InverseGamma(2, 3) # ~ is the tilde character + α ~ truncated(Normal(1.5,0.5),0.5,2.5) + β ~ truncated(Normal(1.2,0.5),0,2) + γ ~ truncated(Normal(3.0,0.5),1,4) + δ ~ truncated(Normal(1.0,0.5),0,2) + + p = [α,β,γ,δ] + prob = remake(prob1, p=p) + predicted = solve(prob,Tsit5(),saveat=0.1) + + for i = 1:length(predicted) + data[:,i] ~ MvNormal(predicted[i], σ) + end +end + +model = fitlv(odedata, prob1) + +# This next command runs 3 independent chains without using multithreading. +chain = mapreduce(c -> sample(model, NUTS(.65),1000), chainscat, 1:3) ``` -The estimated parameters are clearly very close to the desired parameter values. We can also check that the chains have converged in the plot. +The estimated parameters are close to the desired parameter values. We can also check that the chains have converged in the plot. + ```julia -plot(bayesian_result_turing) +plot(chain) ``` -## Direct Handling of Bayesian Estimation with Turing +### Data retrodiction +In Bayesian analysis it is often useful to retrodict the data, i.e. generate simulated data using samples from the posterior distribution, and compare to the original data (see for instance section 3.3.2 - model checking of McElreath's book "Statistical Rethinking"). Here, we solve again the ODE using the output in `chain`, for 300 randomly picked posterior samples. We plot this ensemble of solutions to check if the solution resembles the data. -You could want to do some sort of reduction with the differential equation's solution or use it in some other way as well. In those cases DiffEqBayes might not be useful. Turing and DifferentialEquations are completely composable and you can write of the differential equation inside a Turing `@model` and it will just work. -We can rewrite the Lotka Volterra parameter estimation problem with a Turing `@model` interface as below +```julia +pl = scatter(sol1.t, odedata'); +``` ```julia -Turing.setadbackend(:forwarddiff) +chain_array = Array(chain) +for k in 1:300 + resol = solve(remake(prob1,p=chain_array[rand(1:1500), 1:4]),Tsit5(),saveat=0.1) + plot!(resol, alpha=0.1, color = "#BBBBBB", legend = false) +end +# display(pl) +plot!(sol1, w=1, legend = false) +``` -@model function fitlv(data) - σ ~ InverseGamma(2, 3) +In the plot above, the 300 retrodicted time courses from the posterior are plotted in gray, and the original data are the blue and red dots, and the solution that was used to generate the data are the green and purple lines. We can see that, even though we added quite a bit of noise to the data (see dot plot above), the posterior distribution reproduces quite accurately the "true" ODE solution. + +## Lokta Volterra with missing predator data + +Thanks to the known structure of the problem, encoded by the Lokta-Volterra ODEs, one can also fit a model with incomplete data - even without any data for one of the two variables. For instance, let's suppose you have observations for the prey only, but none for the predator. 
We test this case by fitting the model only to the $$y$$ variable of the system, without providing any data for $$x$$: + + +```julia +@model function fitlv2(data, prob1) # data should be a Vector + σ ~ InverseGamma(2, 3) # ~ is the tilde character α ~ truncated(Normal(1.5,0.5),0.5,2.5) β ~ truncated(Normal(1.2,0.5),0,2) γ ~ truncated(Normal(3.0,0.5),1,4) δ ~ truncated(Normal(1.0,0.5),0,2) p = [α,β,γ,δ] - prob = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) + prob = remake(prob1, p=p) predicted = solve(prob,Tsit5(),saveat=0.1) for i = 1:length(predicted) - data[:,i] ~ MvNormal(predicted[i], σ) + data[i] ~ Normal(predicted[i][2], σ) # predicted[i][2] is the data for y - a scalar, so we use Normal instead of MvNormal end end -model = fitlv(odedata) -chain = sample(model, NUTS(.65),10000) +model2 = fitlv2(odedata[2,:], prob1) ``` +Here we use the multithreading functionality [available](https://turing.ml/dev/docs/using-turing/guide#multithreaded-sampling) in Turing.jl to sample 3 independent chains + + +```julia +Threads.nthreads() +``` + +```julia +# This next command runs 3 independent chains with multithreading. +chain2 = sample(model2, NUTS(.45), MCMCThreads(), 5000, 3, progress=false) +``` + +```julia +pl = scatter(sol1.t, odedata'); +chain_array2 = Array(chain2) +for k in 1:300 + resol = solve(remake(prob1,p=chain_array2[rand(1:12000), 1:4]),Tsit5(),saveat=0.1) + # Note that due to a bug in AxisArray, the variables from the chain will be returned always in + # the order it is stored in the array, not by the specified order in the call - :α, :β, :γ, :δ + plot!(resol, alpha=0.1, color = "#BBBBBB", legend = false) +end +#display(pl) +plot!(sol1, w=1, legend = false) +``` + +Note that here, the data values of $$x$$ (blue dots) were not given to the model! Yet, the model could predict the values of $$x$$ relatively accurately, albeit with a wider distribution of solutions, reflecting the greater uncertainty in the prediction of the $$x$$ values. + +## Inference of Delay Differential Equations + +Here we show an example of inference with another type of differential equation: a Delay Differential Equation (DDE). A DDE is an DE system where derivatives are function of values at an earlier point in time. This is useful to model a delayed effect, like incubation time of a virus for instance. + +For this, we will define a [`DDEProblem`](https://diffeq.sciml.ai/stable/tutorials/dde_example/), from the package DifferentialEquations.jl. + +Here is a delayed version of the lokta voltera system: + +$$\frac{dx}{dt} = \alpha x(t-\tau) - \beta y(t) x(t)$$ + +$$\frac{dy}{dt} = - \gamma y(t) + \delta x(t) y(t) $$ + +Where $$x(t-\tau)$$ is the variable $$x$$ at an earlier time point. We specify the delayed variable with a function `h(p, t)`, as described in the [DDE example](https://diffeq.sciml.ai/stable/tutorials/dde_example/). + + +```julia +function delay_lotka_volterra(du, u, h, p, t) + x, y = u + α, β, γ, δ = p + du[1] = α * h(p, t-1; idxs=1) - β * x * y + du[2] = -γ * y + δ * x * y + return +end + +p = (1.5,1.0,3.0,1.0) +u0 = [1.0; 1.0] +tspan = (0.0,10.0) +h(p, t; idxs::Int) = 1.0 +prob1 = DDEProblem(delay_lotka_volterra,u0,h,tspan,p) +``` + +```julia +sol = solve(prob1,saveat=0.1) +ddedata = Array(sol) +ddedata = ddedata + 0.5 * randn(size(ddedata)) +``` + +Plot the data: + +```julia +scatter(sol.t, ddedata'); plot!(sol) +``` + +Now we define and run the Turing model. 
+ +```julia +Turing.setadbackend(:forwarddiff) +@model function fitlv(data, prob1) + + σ ~ InverseGamma(2, 3) + α ~ Truncated(Normal(1.5,0.5),0.5,2.5) + β ~ Truncated(Normal(1.2,0.5),0,2) + γ ~ Truncated(Normal(3.0,0.5),1,4) + δ ~ Truncated(Normal(1.0,0.5),0,2) + + p = [α,β,γ,δ] + + #prob = DDEProblem(delay_lotka_volterra,u0,_h,tspan,p) + prob = remake(prob1, p=p) + predicted = solve(prob,saveat=0.1) + for i = 1:length(predicted) + data[:,i] ~ MvNormal(predicted[i], σ) + end +end; +model = fitlv(ddedata, prob1) +``` + +Then we draw samples using multithreading; this time, we draw 3 independent chains in parallel using `MCMCThreads`. + +```julia +chain = sample(model, NUTS(.65), MCMCThreads(), 300, 3, progress=true) +plot(chain) +``` + +Finally, we select a 100 sets of parameters from the first chain and plot solutions. + + +```julia +chain +``` + +```julia +pl = scatter(sol.t, ddedata') +chain_array = Array(chain) +for k in 1:100 + resol = solve(remake(prob1,p=chain_array[rand(1:450),1:4]),Tsit5(),saveat=0.1) + # Note that due to a bug in AxisArray, the variables from the chain will be returned always in + # the order it is stored in the array, not by the specified order in the call - :α, :β, :γ, :δ + + plot!(resol, alpha=0.1, color = "#BBBBBB", legend = false) +end +#display(pl) +plot!(sol) +``` + +Here again, the dots is the data fed to the model, the continuous colored line is the "true" solution, and the gray lines are solutions from the posterior. The fit is pretty good even though the data was quite noisy to start. + ## Scaling to Large Models: Adjoint Sensitivities DifferentialEquations.jl's efficiency for large stiff models has been shown in multiple [benchmarks](https://github.com/SciML/DiffEqBenchmarks.jl). To learn more about how to optimize solving performance for stiff problems you can take a look at the [docs](https://docs.sciml.ai/latest/tutorials/advanced_ode_example/). -[Sensitivity analysis](https://docs.sciml.ai/latest/analysis/sensitivity/), or automatic differentiation (AD) of the solver, is provided by the DiffEq suite. The model sensitivities are the derivatives of the solution $u(t)$ with respect to the parameters. Specifically, the local sensitivity of the solution to a parameter is defined by how much the solution would change by changes in the parameter. Sensitivity analysis provides a cheap way to calculate the gradient of the solution which can be used in parameter estimation and other optimization tasks. +[Sensitivity analysis](https://docs.sciml.ai/latest/analysis/sensitivity/), or automatic differentiation (AD) of the solver, is provided by the DiffEq suite. The model sensitivities are the derivatives of the solution $$u(t)$$ with respect to the parameters. Specifically, the local sensitivity of the solution to a parameter is defined by how much the solution would change by changes in the parameter. Sensitivity analysis provides a cheap way to calculate the gradient of the solution which can be used in parameter estimation and other optimization tasks. The AD ecosystem in Julia allows you to switch between forward mode, reverse mode, source to source and other choices of AD and have it work with any Julia code. 
For a user to make use of this within [SciML](https://sciml.ai), [high level interactions in `solve`](https://docs.sciml.ai/latest/analysis/sensitivity/#High-Level-Interface:-sensealg-1) automatically plug into those AD systems to allow for choosing advanced sensitivity analysis (derivative calculation) [methods](https://docs.sciml.ai/latest/analysis/sensitivity/#Sensitivity-Algorithms-1). @@ -109,129 +275,117 @@ While these sensitivity analysis methods may seem complicated (and they are!), u All we had to do is switch the AD backend to one of the adjoint-compatible backends (ReverseDiff, Tracker, or Zygote) and boom the system takes over and we're using adjoint methods! Notice that on this model adjoints are slower. This is because adjoints have a higher overhead on small parameter models and we suggest only using these methods for models with around 100 parameters or more. For more details, see https://arxiv.org/abs/1812.01892. -```julia; eval = false -using Zygote +```julia +using Zygote, DiffEqSensitivity Turing.setadbackend(:zygote) +prob1 = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) +``` -@model function fitlv(data) +```julia +@model function fitlv(data, prob) σ ~ InverseGamma(2, 3) α ~ truncated(Normal(1.5,0.5),0.5,2.5) β ~ truncated(Normal(1.2,0.5),0,2) γ ~ truncated(Normal(3.0,0.5),1,4) δ ~ truncated(Normal(1.0,0.5),0,2) p = [α,β,γ,δ] - prob = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) + prob = remake(prob, p=p) + predicted = solve(prob,saveat=0.1) for i = 1:length(predicted) data[:,i] ~ MvNormal(predicted[i], σ) end end; -model = fitlv(odedata) +model = fitlv(odedata, prob1) chain = sample(model, NUTS(.65),1000) ``` Now we can exercise control of the sensitivity analysis method that is used by using the `sensealg` keyword argument. Let's choose the `InterpolatingAdjoint` from the available AD [methods](https://docs.sciml.ai/latest/analysis/sensitivity/#Sensitivity-Algorithms-1) and enable a compiled ReverseDiff vector-Jacobian product: + ```julia -@model function fitlv(data) +@model function fitlv(data, prob) σ ~ InverseGamma(2, 3) α ~ truncated(Normal(1.5,0.5),0.5,2.5) β ~ truncated(Normal(1.2,0.5),0,2) γ ~ truncated(Normal(3.0,0.5),1,4) δ ~ truncated(Normal(1.0,0.5),0,2) p = [α,β,γ,δ] - prob = ODEProblem(lotka_volterra,u0,(0.0,10.0),p) + prob = remake(prob, p=p) predicted = solve(prob,saveat=0.1,sensealg=InterpolatingAdjoint(autojacvec=ReverseDiffVJP(true))) for i = 1:length(predicted) data[:,i] ~ MvNormal(predicted[i], σ) end end; -model = fitlv(odedata) +model = fitlv(odedata, prob1) @time chain = sample(model, NUTS(.65),1000) ``` -For more examples of adjoint usage on large parameter models, consult the [DiffEqFlux documentation](https://diffeqflux.sciml.ai/dev/) - -## Including Process Noise: Estimation of Stochastic Differential Equations - -This can be easily extended to Stochastic Differential Equations as well. - -Let's create the Lotka Volterra equation with some noise and try out estimating it with the same framework we have set up before. +For more examples of adjoint usage on large parameter models, consult the [DiffEqFlux documentation](https://diffeqflux.sciml.ai/dev/). -Our equations now become: +## Inference of a Stochastic Differential Equation +A Stochastic Differential Equation ([SDE](https://diffeq.sciml.ai/stable/tutorials/sde_example/)) is a differential equation that has a stochastic (noise) term in the expression of the derivatives. Here we fit a Stochastic version of the Lokta-Volterra system. 
-$$dx = (\alpha - \beta y)xdt + \phi_1 xdW_1$$ +We use a quasi-likelihood approach in which all trajectories of a solution are compared instead of a reduction such as mean, this increases the robustness of fitting and makes the likelihood more identifiable. We use SOSRI to solve the equation. The NUTS sampler is a bit sensitive to the stochastic optimization since the gradient is then changing with every calculation, so we use NUTS with a target acceptance rate of `0.25`. -$$dy = (\delta x - \gamma)ydt + \phi_2 ydW_2$$ ```julia -function lotka_volterra_noise(du,u,p,t) - du[1] = p[5]*u[1] - du[2] = p[6]*u[2] +u0 = [1.0,1.0] +tspan = (0.0,10.0) +function multiplicative_noise!(du,u,p,t) + x,y = u + du[1] = p[5]*x + du[2] = p[6]*y end -p = [1.5, 1.0, 3.0, 1.0, 0.3, 0.3] -prob = SDEProblem(lotka_volterra,lotka_volterra_noise,u0,(0.0,10.0),p) -``` +p = [1.5,1.0,3.0,1.0,0.1,0.1] -Solving it repeatedly confirms the randomness of the solution - -```julia -sol = solve(prob,saveat=0.01) -p1 = plot(sol) -sol = solve(prob,saveat=0.01) -p2 = plot(sol) -sol = solve(prob,saveat=0.01) -p3 = plot(sol) -plot(p1,p2,p3) -``` - -With the `MonteCarloSummary` it is easy to summarize the results from multiple runs through the `EnsembleProblem` interface, here we run the problem for 1000 `trajectories` and visualize the summary: - -```julia -sol = solve(EnsembleProblem(prob),SRIW1(),saveat=0.1,trajectories=500) -summ = MonteCarloSummary(sol) -plot(summ) -``` - -Get data from the means to fit: - -```julia -using DiffEqBase.EnsembleAnalysis -averagedata = Array(timeseries_steps_mean(sol)) -``` +function lotka_volterra!(du,u,p,t) + x,y = u + α,β,γ,δ = p + du[1] = dx = α*x - β*x*y + du[2] = dy = δ*x*y - γ*y +end -Now fit the means with Turing. -We will utilize multithreading with the [`EnsembleProblem`](https://docs.sciml.ai/stable/tutorials/sde_example/#Ensemble-Simulations-1) interface to speed up the SDE parameter estimation. 
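# `SDEProblem` below pairs the deterministic drift (`lotka_volterra!`) with the
# diffusion function (`multiplicative_noise!`); because the diffusion writes one noise
# term per state, this defines diagonal multiplicative noise.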
+prob_sde = SDEProblem(lotka_volterra!,multiplicative_noise!,u0,tspan,p) -```julia -Threads.nthreads() +ensembleprob = EnsembleProblem(prob_sde) +@time data = solve(ensembleprob,SOSRI(),saveat=0.1,trajectories=1000) +plot(EnsembleSummary(data)) ``` ```julia Turing.setadbackend(:forwarddiff) - -@model function fitlv(data) - σ ~ InverseGamma(2, 3) - α ~ truncated(Normal(1.5,0.5),0.5,2.5) - β ~ truncated(Normal(1.2,0.5),0,2) - γ ~ truncated(Normal(3.0,0.5),1,4) - δ ~ truncated(Normal(1.0,0.5),0,2) - ϕ1 ~ truncated(Normal(1.2,0.5),0.1,1) - ϕ2 ~ truncated(Normal(1.2,0.5),0.1,1) - +@model function fitlv(data, prob) + σ ~ InverseGamma(2,3) + α ~ truncated(Normal(1.3,0.5),0.5,2.5) + β ~ truncated(Normal(1.2,0.25),0.5,2) + γ ~ truncated(Normal(3.2,0.25),2.2,4.0) + δ ~ truncated(Normal(1.2,0.25),0.5,2.0) + ϕ1 ~ truncated(Normal(0.12,0.3),0.05,0.25) + ϕ2 ~ truncated(Normal(0.12,0.3),0.05,0.25) p = [α,β,γ,δ,ϕ1,ϕ2] - prob = SDEProblem(lotka_volterra,lotka_volterra_noise,u0,(0.0,10.0),p) - ensemble_predicted = solve(EnsembleProblem(prob),SRIW1(),saveat=0.1,trajectories=500) - predicted_means = timeseries_steps_mean(ensemble_predicted) + prob = remake(prob, p=p) + predicted = solve(prob,SOSRI(),saveat=0.1) - for i = 1:length(predicted_means) - data[:,i] ~ MvNormal(predicted_means[i], σ) + if predicted.retcode != :Success + Turing.acclogp!(_varinfo, -Inf) + end + for j in 1:length(data) + for i = 1:length(predicted) + data[j][i] ~ MvNormal(predicted[i],σ) + end end end; +``` + +We use NUTS sampler with a low acceptance ratio and initial parameters since estimating the parameters of SDE with HMC poses a challenge. Probabilistic nature of the SDE solution makes the likelihood function noisy which poses a challenge for NUTS since the gradient is then changing with every calculation. SGHMC might be better suited to be used here. -model = fitlv(averagedata) -chain = sample(model, NUTS(.65),500) + +```julia +model = fitlv(data, prob_sde) +chain = sample(model, NUTS(0.25), 5000, init_theta = [1.5,1.3,1.2,2.7,1.2,0.12,0.12]) +plot(chain) ``` ```julia, echo=false, skip="notebook"