Add Laplace approximation as algorithm #16
Conversation
@jgabry can you look through this?

Now this just mirrors the Rstan way of doing this. The prototype was relatively easy to implement. If someone can approve that this is vaguely the right way to go (@jgabry, @avehtari, or anyone), then I'll talk to people about interfaces and how this should go in. Prototype code is there and can be run.
`laplace_draws` - The number of draws to take from the posterior approximation. By default, this is zero, and no Laplace approximation is done

`laplace_diag_shift` - A value to add to the diagonal of the Hessian approximation to fix small singularities (defaulting to zero)
`laplace_diag_shift` -> `laplace_diag_jitter` or `laplace_diag_add` or `laplace_add_diag`?
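For context, a minimal sketch of what such a diagonal shift does before factorizing the negative Hessian (assuming NumPy; `shifted_cholesky` and its arguments are hypothetical names for illustration, not the actual CmdStan implementation):

```python
import numpy as np

def shifted_cholesky(neg_hessian, diag_shift=0.0):
    """Factor the negative Hessian, adding diag_shift to the diagonal to handle
    near-singularity (e.g. from finite-difference noise)."""
    A = neg_hessian + diag_shift * np.eye(neg_hessian.shape[0])
    try:
        return np.linalg.cholesky(A)  # lower-triangular L with L @ L.T == A
    except np.linalg.LinAlgError:
        raise ValueError("negative Hessian still not positive definite; "
                         "try a larger diag_shift")
```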
The new output would look like:

```
# stan_version_major = 2
...
# refresh = 100 (Default)
lp__,b.1,b.2
```
Can you add `lg__`, i.e., the log density with respect to the approximation? Then it would be easier to use the Pareto k diagnostic and do importance resampling.
I can evaluate both the true (edit: unnormalized) log density and the log density of the normal approximation. Do you want one or both?
Both `lp__` and `lg__`, as in CmdStan ADVI, and in RStan ADVI and Laplace.
Another design would be to print the Hessian on the unconstrained space and let users handle the sampling and the parameter transformation. The issue here is there is no good way for users to do these parameter transformations outside of certain interfaces (at least Rstan, maybe PyStan).

Another design would be to print a Hessian on the constrained space and let users handle the sampling. In this case users would also be expected to handle the constraints, and I don't know how that would work practically (rejection sampling maybe?).
Reading this I realized that it is not mentioned before what the draws are. I assume that the proposed approach is to draw from a multivariate normal and then transform to the constrained space? If so, it would be good to explicitly write what the values are for the "Draws from the Laplace approximation" in the new output.
Will add more.

If `uopt` is the unconstrained optimum, we're doing draws in pseudo-code like:

```
multivariate_normal(mean = uopt, cov = -inverse(hessian(uopt)))
```
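In concrete terms, that pseudo-code corresponds to something like the following (a minimal NumPy sketch, not the actual implementation; `log_prob_hessian` is a hypothetical stand-in for the Hessian of the log density at `uopt`):

```python
import numpy as np

def laplace_draws(uopt, log_prob_hessian, n_draws, rng=np.random.default_rng(0)):
    """Draw from the normal approximation centered at the unconstrained optimum."""
    H = log_prob_hessian(uopt)   # Hessian of the log density at the mode
    cov = np.linalg.inv(-H)      # -H should be positive definite at a maximum
    return rng.multivariate_normal(mean=uopt, cov=cov, size=n_draws)
```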
And then transform to the constrained space?
# Prior art
[prior-art]: #prior-art

Rstan does a version of this already.
This leaves it unclear if the RStan version is different from the design here.
I should check that it's actually the same. @bgoodri is this actually the same?
For now, I reviewed only the design. After it's fixed, I can review the code, too.
It sounds the same. The code is basically from https://github.com/stan-dev/rstan/blob/develop/rstan/rstan/R/stanmodel-class.R#L446 onwards. The main thing is that it is better to finite-diff the autodiffed gradient than to finite-diff the log_prob function twice.
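For illustration, a sketch of that idea (assuming NumPy; `grad` is a hypothetical stand-in for an autodiff gradient of the log density, which is not shown here):

```python
import numpy as np

def hessian_from_gradients(grad, x, eps=1e-5):
    """Central finite differences of a gradient function: one order of finite
    differencing instead of two, so the result is much less noisy."""
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        H[:, i] = (grad(x + step) - grad(x - step)) / (2.0 * eps)
    return 0.5 * (H + H.T)  # symmetrize away residual finite-difference error
```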
In terms of the right way to go, autodiff is better (both speed and precision) where supported, but it doesn't exist for some of our higher-order solvers like ODEs.
@avehtari this should be available to test here:

```
git clone --recursive --branch=feature/design-doc-16-optimize-hessian https://github.com/stan-dev/cmdstan.git cmdstan-hessian
```

Currently I only implemented stuff for the lbfgs optimizer. Could you give this a go and check me for statistical correctness in terms of what I'm spitting out?

There are some software design issues:

1. Need to share calculations between all the different optimizers (lbfgs, bfgs, and Newton)
2. If we get errors on log_p evaluation, what to do
3. If we get errors on generated quantities evaluation, what to do
4. Do we print the Laplace draws in the output file? If we do, we have a csv with two different formats of outputs. The optimization output has fewer columns than the sampling output, and they also are different things.

I'd prefer to print a warning and skip the sample for 2 and 3 instead of blowing up. In that case if you request N posterior draws you might only get M < N actual draws. Blowing up is bad because then you have to redo the whole optimization (and things might still blow up). I think if we do rejection sampling to make sure we get N posterior draws we'd be in danger of people writing models that never finish.

Edit: I guess options for 4 are:
a. Leave the different outputs in the same file

@bgoodri I'm currently just too lazy to write and test the hessian_auto equivalent that uses vars. If someone wants to yell at me I guess I'd get unlazy and go do it.

@bob-carpenter I think he was just saying finite diff reverse mode to get the Hessian. That makes a lot of sense, but I just wanted to lean on what was available in stan-dev/math instead of writing my own.
You are going to have more indeterminate Hessians if you use finite differences twice.
Sigh, I guess I'll eat my vegetables. `stan/math/rev/functor/finite_diff_hessian_auto.hpp` or something incoming, but we should still be able to talk through the other issues before I get this done.
```
@@ -13,7 +13,16 @@ When computing a MAP estimate, the Hessian of the log density can be used to con

I have a Stan model that is too slow to practically sample all the time. Because optimization seem to give reasonable results, it would be nice to have the normal approximation to the posterior to give some sense of the uncertainty in the problem as well.

An approximate posterior covariance comes from computing the inverse of the Hessian of the negative log density.
It is standard to compute a normal approximation to the posterior covariance comes from the inverse of the Hessian of the negative log density.
```
remove "comes" (I would remove it myself, but I don't have edit rights. It would be useful to have edit rights for design-docs repo)
```
constrained_sample = constrain(unconstrained_sample)
```

We can output unnormalized log densities of the actual model and the approximate model to compute importance sampling diagnostics and estimates.
We can output unnormalized log densities in the unconstrained space of the true posterior and the approximate posterior
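With `lp__` (true unnormalized log density) and `lg__` (log density of the approximation) recorded per draw, importance resampling could look roughly like this (a sketch, assuming NumPy; plain normalized weights are used here, whereas PSIS would smooth the weights and report a Pareto k diagnostic):

```python
import numpy as np

def importance_resample(draws, lp, lg, n_out, rng=np.random.default_rng(0)):
    """Resample approximate draws with weights proportional to p(theta)/q(theta)."""
    log_w = lp - lg
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    w /= w.sum()
    idx = rng.choice(len(draws), size=n_out, replace=True, p=w)
    return draws[idx]
```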
Change the output for the optimization also when there are no approximate draws? If I remember correctly, ADVI has the modal value as the first row and, if the number of draws is > 0, the rest of the rows are draws. Optimizing could follow that.
I believe that the discussion is missing a critical point -- the current optimizer behavior is not appropriate for computing posterior Laplace approximations.

Remember that the current optimizer doesn't include Jacobian corrections, so instead of computing a maximum of the posterior density it's actually computing the maximum of a penalized likelihood function. The Hessian at this penalized maximum likelihood could be used for defining confidence intervals (although because of the penalty functions they wouldn't have clear coverage properties), but the samples wouldn't have any meaning in this context.

In order to compute a Laplace approximation the optimizer would have to run with the Jacobian corrections turned on so that the maximum posterior density and the Hessian of the right posterior density are computed. This is sufficiently different behavior that it would at the very least warrant its own method rather than trying to force it into an inappropriate penalized maximum likelihood context.
Can you say this in terms of unconstrained and constrained space? We want the mode in the unconstrained space. Do we get that or something else (I've seen the explanation before, but keep forgetting enough to be uncertain)?
When using importance sampling they have meaning as draws from a proposal distribution. It can be that this is not the optimal proposal distribution, but they have meaning, and we have diagnostics for when that proposal distribution is bad. Of course we would like to have the best proposal distribution.
> maximum of the posterior density it's actually computing the maximum of a penalized likelihood function.
>
> Can you say this in terms of unconstrained and constrained space? We want the mode in the unconstrained space. Do we get that or something else (I've seen the explanation before, but keep forgetting enough to be uncertain)?

The current code computes the mode on the constrained space, not the unconstrained space.

> but the samples wouldn't have any meaning in this context.
>
> When using importance sampling they have meaning of being draws from a proposal distribution. It can be that this is not the optimal proposal distribution, but they have meaning and diagnostic when that proposal distribution is bad. Of course we would like to have the best proposal distribution.

Sure, but is that worth changing the output of optimizing and confusing users as to its meaning? I think that if samples are included then many users will use them as if they were exact samples or samples from the Laplace approximation and not weight things carefully enough.

To be clear, I have no problem with adding this functionality; I just think that it would be much clearer to have a new route with this functionality that wraps the internal optimizer and adds all of this post processing. That way the current optimizer can stay as it is and the new route could turn on the Jacobian corrections as needed.
To put it another way, we do the optimization on the unconstrained space, but turn off the Jacobian adjustment for the change of variables. That's easy enough to turn on from the model class---it's just a template parameter. So maybe we need another flag on the service interface for optimization for +/- Jacobian. I'm also with @betanalpha that we don't want to change current behavior, especially by automatically returning something much larger.
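To illustrate the distinction on a toy case (a sketch, not Stan's model class; a single positive-constrained parameter with the log transform, and `log_p_constrained` as a hypothetical constrained-space log density):

```python
import numpy as np

def log_density_unconstrained(u, log_p_constrained, jacobian=True):
    """Log density on the unconstrained scale u = log(sigma) for sigma > 0.
    With jacobian=True its maximum is the unconstrained-space posterior mode
    (the target for the Laplace approximation); with jacobian=False its
    maximum corresponds to the penalized MLE / constrained-space mode."""
    sigma = np.exp(u)
    lp = log_p_constrained(sigma)
    if jacobian:
        lp += u  # log |d sigma / d u| = u for the exp transform
    return lp
```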
I wasn't thinking about that. Good catch. I guess these are just different functions with and without the adjustments.
Sounds like this would be better suited as a new algorithm? That way there's no point estimate/sample output confusion either -- the Laplace algorithm only samples from a Laplace approximation to the posterior.
So we'd have (1) MLE without Jacobian correction, with std errors from the Hessian on the unconstrained scale translated back to constrained, and (2) Laplace approximation using Jacobian correction, with a set of draws returned. It'd be nice to retrieve the parameters of the normal for the Laplace, as it'd let you generate more draws in the future. But that's very much a secondary consideration compared to getting the draws out in our standard interface.
I think we'd only do this, however we end up doing it interface-wise. At least, this is what I meant to do, so that is my goal.
Thanks Mike, now I remember we have discussed this same issue several times before! It seems I just keep forgetting it because the mode and Hessian in unconstrained space are so much more natural for me. It would be great if we can now also have the mode and Hessian in the unconstrained space.
This popped up over here: https://github.com/stan-dev/pystan/issues/694. I still need it for my own purposes too (currently just using a version of cmdstan with this hacked in). Would a new algorithm be appropriate? So algorithms would be sampling, optimization, variational, laplace? If so, I'll rewrite the design doc with that in mind and we can iterate on it.
Makes sense to me.
@avehtari @betanalpha @bgoodri this is ready for comments again. I rewrote it as a new algorithm since the first thing was just wrong.

I think the differences @betanalpha pointed out between what we want to do here and optimization are enough that I kinda don't want to put them together, even if the gears are basically the same. I dunno. Opinions welcome. It's kinda awkward that it'll share all the same parameters as optimization but be a different algorithm, but I guess that's just interface overhead.

I think @bgoodri is right and this should be written with finite differences of gradients instead of just finite differences. I was working with this code hacked together on another project and this was important (and not easily diagnosed). It makes me want to do the full second-order autodiff, but we can't with the higher-order algorithms.
Speaking of Hessians, we should also change the testing framework to use finite diffs of gradients rather than double finite diffs for Hessians. I think it could improve the tolerances dramatically.
Why not then output NA, -Inf, or Inf based on what the log density evaluation returns?
This would be useful if people want more draws without needing to run the optimizer and Hessian computation again, but it's different behavior and would need to be considered together with the variational output.
Errors raise exceptions, so there's not a return value. We can also have NaN propagate through (there's no NA in Stan like in R, but presumably we're talking about not-a-number).
I don't think there's a way to write these to csv files. Maybe we already have a convention for it though?
It's a drawback. Getting draws is somewhat expensive since we evaluate the model log density, so there's that.
Lookin' for some feedback. Otherwise I'll just do it. The actual new code for this won't be much, but it'll be a lot of interface bits.
I just want to clarify that this design doc does not address the computational validation needed for an algorithm to be considered for inclusion in the Stan Algorithm library.
@betanalpha --- what would you like to see in the way of validation for something like this?
The current policy is at https://github.com/stan-dev/stan/wiki/Proposing-Algorithms-for-Inclusion-Into-Stan.
I had thought this was more like a fix for the previous algorithm, but I also now realize the design document does not mention diagnostics. The Laplace algorithm itself is not novel, and there is extensive literature demonstrating when it works and when it fails. The PSIS paper https://arxiv.org/abs/1507.02646 describes the diagnostics we can use in the same way as we have diagnostics for MCMC. We can compute Monte Carlo standard errors and have diagnostics for when the Laplace approximation fails and when MCSEs are not reliable.
It's more complicated than that. The Laplace algorithm has never been in Stan. We started with dynamic HMC for estimating posterior expectation values and penalized maximum likelihood for computing point estimates. ADVI was shoehorned in later somewhat awkwardly as another method for (poorly) estimating posterior expectation values.

We've never had a MAP algorithm or Laplace or anything else that claims to do posterior expectation value estimation. The RStan developers added the Hessian hack on top of the penalized maximum likelihood algorithm, but that was a unilateral decision of the RStan developer and never part of an official algorithm discussion. Plus, as previously discussed, it was fundamentally flawed as it was based on the wrong optimum.

There have been numerous MCMC algorithms developed in Stan that were never exposed because they were too fragile for typical user models (random walk, ensemble, etc.) but would work fine for the same models on which the Laplace approximation is reasonably accurate. The precedent, other than the messy political situation of ADVI, has been to not include any algorithms that aren't reasonably robust over typical user models. Diagnostics are necessary but also not sufficient; adding fragile algorithms that work for small classes of models only confuses the Stan user experience. This is discussed further in https://github.com/stan-dev/stan/wiki/Proposing-Algorithms-for-Inclusion-Into-Stan.

I agree that there is plenty of literature on the Laplace algorithm, but I disagree that there is solid theory backing up the robustness of the method in the preasymptotic, mostly non-log-concave regime where typical user models live. Without that relevant theory we'll need substantial empirical evidence for consideration of inclusion.
Hi, Michael. This is slightly off topic, but, since you mentioned it, do you have a suggested improvement to what we currently do in Rstan to get approximate posterior simulations by drawing from the normal distribution centered at the mode and with curvature based on the second derivative matrix of the log density?
If you have some better ideas, that would be great, and we could try them out.
Andrew
This is a proposal for adding a Laplace approximation algorithm to Stan.

Rendered output is here

Edit: I just totally changed the text here. A lot has changed, since the algorithm I initially put up was just wrong.