-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Amortized point estimators #121
Comments
Great proposals! I read the paper. But I am not sure if the method can generate posterior samples, which may be helpful to detect multi-modal distributions. |
Yes, the method will not deal with multimodality out of the box. I think its main use case will be as a fast and frugal heuristic to obtain estimates very quickly. |
Idea: can we train the summary nets based on point inference nets and then
reuse the trained summary nets in combination with full posterior inference
nets?
Stefan Radev ***@***.***> schrieb am Di., 13. Feb. 2024,
12:43:
… Yes, the method will not deal with multimodality out of the box. I think
its main use case will be as a fast and frugal heuristic to obtain
estimates very quickly.
—
Reply to this email directly, view it on GitHub
<#121 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCW2AH5AXV6TW2ZYGIOC4DYTLHMNAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBQGMZDIOJQG4>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Multi-modality on the marginals should be detectable through the posterior
quantiles. The main "compromise" is that there is no easy way to get
information on posterior joint dependence generally. But as Stefan said,
this is more of an "exploratory" step or an alternative for when the full
NF framework is proving difficult to fit.
I like Paul's idea. Possible issue: Summaries that are good for estimating
individual parameters might not be relevant/suitable for extracting
information on posterior correlations (additional summaries might be needed
when doing full posterior inference).
…On Tue, Feb 13, 2024 at 3:00 PM Paul-Christian Bürkner < ***@***.***> wrote:
Idea: can we train the summary nets based on point inference nets and then
reuse the trained summary nets in combination with full posterior
inference
nets?
Stefan Radev ***@***.***> schrieb am Di., 13. Feb. 2024,
12:43:
> Yes, the method will not deal with multimodality out of the box. I think
> its main use case will be as a fast and frugal heuristic to obtain
> estimates very quickly.
>
> —
> Reply to this email directly, view it on GitHub
> <
#121 (comment)>,
> or unsubscribe
> <
https://github.com/notifications/unsubscribe-auth/ADCW2AH5AXV6TW2ZYGIOC4DYTLHMNAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBQGMZDIOJQG4>
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: ***@***.***>
>
—
Reply to this email directly, view it on GitHub
<#121 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABEU3LL2TL35DSD6ZYOV63YTLQMHAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBQGM4DONRXGA>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
I like the idea too! We could use the point summary network or the point estimates themselves as summaries, with an auxiliary, smaller summary stream for capturing additional information relevant for the joint. |
I think using the point estimates as summary statistics will be similar to
when one uses expert summary statistics, which we know can work well (e.g.,
the Jiang et al. Statistica Sinica paper) but probably what you have at the
moment is even better. To start off with, from what I know of BayesFlow, I
think we should keep the statistics network as is, and add an option for
the user to choose "point estimation" instead of "full inference" --
technically all is required is to swap out the invertible NN with a much
simpler dense neural network with input dimension equal to the summary
statistic dimension and output dimension equal to the parameter dimension.
There could even be a sensible default for the architecture of this NN. The
user also needs to be able to specify a loss function when choosing "point
estimation", but we could just start off with squared error loss at first
just to get it running.
…On Wed, Feb 14, 2024 at 12:27 AM Stefan Radev ***@***.***> wrote:
I like the idea! We could use the point summary network or the point
estimates themselves as summaries, with an auxiliary, smaller summary
stream for capturing additional information relevant for the joint.
—
Reply to this email directly, view it on GitHub
<#121 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABEU3OVKIAMWU7RDJTACD3YTNS3JAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBRGUYTGNZUGM>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
yes I fully agree. my idea was going in a different direction though but it
would be more like research which we can do once the implementation is done
Andrew Zammit Mangion ***@***.***> schrieb am Mi., 14. Feb.
2024, 16:10:
… I think using the point estimates as summary statistics will be similar to
when one uses expert summary statistics, which we know can work well
(e.g.,
the Jiang et al. Statistica Sinica paper) but probably what you have at
the
moment is even better. To start off with, from what I know of BayesFlow, I
think we should keep the statistics network as is, and add an option for
the user to choose "point estimation" instead of "full inference" --
technically all is required is to swap out the invertible NN with a much
simpler dense neural network with input dimension equal to the summary
statistic dimension and output dimension equal to the parameter dimension.
There could even be a sensible default for the architecture of this NN.
The
user also needs to be able to specify a loss function when choosing "point
estimation", but we could just start off with squared error loss at first
just to get it running.
On Wed, Feb 14, 2024 at 12:27 AM Stefan Radev ***@***.***>
wrote:
> I like the idea! We could use the point summary network or the point
> estimates themselves as summaries, with an auxiliary, smaller summary
> stream for capturing additional information relevant for the joint.
>
> —
> Reply to this email directly, view it on GitHub
> <
#121 (comment)>,
> or unsubscribe
> <
https://github.com/notifications/unsubscribe-auth/AABEU3OVKIAMWU7RDJTACD3YTNS3JAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBRGUYTGNZUGM>
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: ***@***.***>
>
—
Reply to this email directly, view it on GitHub
<#121 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCW2AAU4KQIMG6AP4GYYPTYTRIOFAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBTGE2DEMRWGU>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
and I did not mean using the point estimates as summary statistics. I agree
that would not be very useful in general.
Paul Buerkner ***@***.***> schrieb am Mi., 14. Feb. 2024,
16:51:
… yes I fully agree. my idea was going in a different direction though but
it would be more like research which we can do once the implementation is
done
Andrew Zammit Mangion ***@***.***> schrieb am Mi., 14. Feb.
2024, 16:10:
> I think using the point estimates as summary statistics will be similar
> to
> when one uses expert summary statistics, which we know can work well
> (e.g.,
> the Jiang et al. Statistica Sinica paper) but probably what you have at
> the
> moment is even better. To start off with, from what I know of BayesFlow,
> I
> think we should keep the statistics network as is, and add an option for
> the user to choose "point estimation" instead of "full inference" --
> technically all is required is to swap out the invertible NN with a much
> simpler dense neural network with input dimension equal to the summary
> statistic dimension and output dimension equal to the parameter
> dimension.
> There could even be a sensible default for the architecture of this NN.
> The
> user also needs to be able to specify a loss function when choosing
> "point
> estimation", but we could just start off with squared error loss at first
> just to get it running.
>
> On Wed, Feb 14, 2024 at 12:27 AM Stefan Radev ***@***.***>
> wrote:
>
> > I like the idea! We could use the point summary network or the point
> > estimates themselves as summaries, with an auxiliary, smaller summary
> > stream for capturing additional information relevant for the joint.
> >
> > —
> > Reply to this email directly, view it on GitHub
> > <
> #121 (comment)>,
>
> > or unsubscribe
> > <
> https://github.com/notifications/unsubscribe-auth/AABEU3OVKIAMWU7RDJTACD3YTNS3JAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBRGUYTGNZUGM>
>
> > .
> > You are receiving this because you are subscribed to this
> thread.Message
> > ID: ***@***.***>
> >
>
> —
> Reply to this email directly, view it on GitHub
> <#121 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ADCW2AAU4KQIMG6AP4GYYPTYTRIOFAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBTGE2DEMRWGU>
> .
> You are receiving this because you commented.Message ID:
> ***@***.***>
>
|
I also think we need to do it in two steps:
Thus, only inference will potentially be faster, not training. I suspect that attemtps to train summary statistics directly via point estimates might result in poor summaries, because it will be harder for the summary network to capture the uncertainty. |
A luxury method to train point estimates would enhance the point estimator network with additional outputs for the (co)variance of the point estimate and the fraction of the full posterior explained by the MAP mode. These outputs could be trained by destillation from the SBI posterior: (1) Find the MAP mode in the NF posterior output => target value of the MAP point estimate. (2) Fit a Gaussian to the MAP mode (=Laplace approximation) to get the (co)variance and fraction of probability mass explained by the MAP mode. The destillation approach has the additional benefit that the point estimator would return the true MAP estimates (subject to the accuracy of the normalizing flow) instead of the ground truth parameterizations of the training runs. |
Summarizing the ideas so far:
Crucially, all three are enabled, once the functionality is there. Does anyone want to implement the blueprints Marvin suggested in the first post? Otherwise, I would self-assign in a week or two. |
Looks good, I think 1. is where we should start as that will help establish
the core benefits, and then move from there. Matt and I will be happy to
provide comments and check changes/provide suggestions as they come
through; can also contribute to code but probably someone who knows the
internals should drive it.
…On Thu, Feb 15, 2024 at 12:24 AM Stefan Radev ***@***.***> wrote:
Summarizing the ideas so far:
1. *Train point estimators instead of generative networks* - simply
attaches a configurable MLP to the summary network and uses some norm (L1,
L2, L-Infinity, https://www.tensorflow.org/api_docs/python/tf/norm) as
a loss.
- *Advantage*: fast and pleasant to iterate over for larger problems.
- *Disadvantage*: no full Bayesian inference (by design).
1. *Train point estimators and use pre-trained summary network in
tandem with a full generative network* - uses a frozen summary
network, possibly in combination with another, smaller and trainable
summary network, to train a full workflow.
- *Advantage*: the first summary network is trained fast and already
captures some important information. May be equivalent to good hand-crafted
summary statistics.
- *Disadvantage*: summary statistics may not be good, speed gains
diminished due to two-step approach.
1. *Train the default approach and distil it using a point estimator*
- trains the full workflow (i.e., summary and generative net) and attaches
a configurable MLP to the pretrained summary network.
- *Advantage*: the summary statistics will already be good for
learning the full posterior (assuming everything has converged just fine),
so they will also be good for the point estimator. Point estimator can then
be trained hyperefficiently.
- *Disadvantage*: speed gains diminished due to two-step approach, the
utility of the point estimates is unclear if full workflow is already good.
Crucially, all three are enabled, once the functionality is there. Does
anyone want to implement the blueprints Marvin suggested in the first post?
Otherwise, I would self-assign in a week or two.
—
Reply to this email directly, view it on GitHub
<#121 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABEU3MBQXPGPA7ZOIRJAVTYTS3HJAVCNFSM6AAAAABC5CBZ3CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBTG43DANRQHA>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Hi all, I have implemented and tested a first working version of an amortized point estimator, which can be found in the An example use case is:
Let me know what you think. |
Great, thanks! Just adding two pointers to Stefan's post:
|
The workflow looks good to me. How would you generalize the interface to different loss function for different point estimates? The norm_ord is perhaps a bit restrictive an interface for that(?) |
The current |
Great job! Does the current workflow support the uncertainty quantification for estimated parameters? Do we need apply bootstrap for UQ just like what the reference paper did? |
You would currently need bootstrap or Monte Carlo dropout for UQ, unless a custom |
Can I ask the link of this notebook? I can look into and learn it. |
|
Hi all, the current interface supports loss functions of the form
cannot abide this form. To support losses that are not simply functions of the difference The implementation replaces the Let me know what you think! |
Sorry for the delay -- this is really nice, thanks @stefanradev93 for the implementation and @vpratz @han-ol for the easy-to-follow notebook, very cool! I think this is already great as is, from a user point of view I agree with @han-ol that it would be helpful to have a generic loss interface and possibly also some "template" losses one could use. One possibility is to provide some template losses (e.g., The last thing which as a user will be really handy is a function like Anyway, really nice stuff, very happy to see it implemented in BayesFlow! |
Agreed, great job everyone!
Based on your expertise and experience with amortized point estimators, would you mind compiling a list of template losses that might cover the vast majority of users' needs? |
I would say some "experience" not "expert", but I'm fairly confident that the vast majority of cases will be covered by one of the following losses:
Other useful losses include:
There are other losses one might consider, e.g., Stein's loss for covariance matrices which might be useful (one would need to go through the Cholesky factor to ensure positive definiteness of Fan, J., & Yao, Q. (1998). Efficient estimation of conditional variance functions in stochastic regression. Biometrika, 85(3), 645-660. |
Since this issue contains past discussions and involved people are watching it: How do we go about implementing amortized point estimators in the new BayesFlow 2.0 release? |
Naming wise, I would suggest, we will implement a |
While the most common usage of this would be to estimate points, it could also be worthwhile to (A) choose a more general name or (B) split it out in different classes. From a technical point of view, this issue relaxes the definition of I couldn't think of a nice general name that is not already taken (Approximator). So what do you think about having a
? I would expect that when we eventually get to the diagnostics API, it will help if the different approximator outputs correspond to well specified classes. (I have a implementation sketch of a calibration diagnostics for the (*Point estimate each observed dataset and resimulate a few samples corresponding to its point estimate, then apply the point estimator to get some UQ for the points. This captures stochasticity in the likelihood that can lead to point estimation variance.) |
I like the breakdown into the different types of approximators. I think the
word "approximator" is also good, as one can often treat point estimates as
parameters of a highly constrained approximate posterior distribution. Note
that there is nothing stopping one obtaining bootstrap intervals for
intervals as well, although these are less used in practice.
…On Wed, Oct 2, 2024 at 3:31 AM Hans Olischläger ***@***.***> wrote:
While the most common usage of this would be to estimate points, it could
also be worthwhile to (A) *choose a more general name* or (B) *split it
out in different classes*. From a technical point of view, this issue
relaxes the definition of Approximator for all situations where we don't
represent a whole distribution with a generative network, but not all of
these cases are strictly point estimators.
I couldn't think of a nice general name that is not already taken
(Approximator).
So what do you think about having a
- Approximator(could also be called DistributionApproximator for
specificity, or we keep it short as is),
- PointApproximator (feed forward net with 1 output node),
- IntervalApproximator (feed forward net with 2 output nodes, ), and
- QuantileApproximator (feed forward net with any number of output
nodes, quantile levels are selected by loss fun or with convenient
constructor)
?
I would expect that when we eventually get to the diagnostics API, it will
help if the different approximator outputs correspond to well specified
classes. (I have a implementation sketch of a calibration diagnostics for
the QuantileApproximator for example.)
Also parametric bootstrap* could be a method applicable to
PointApproximator but not to IntervalApproximator, etc.
(*Point estimate each observed dataset and resimulate a few samples
corresponding to its point estimate, then apply the point estimator to get
some UQ for the points. This captures stochasticity in the likelihood that
can lead to point estimation variance.)
—
Reply to this email directly, view it on GitHub
<#121 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABEU3JF6LNWR7UPQAUFH6LZZLL7VAVCNFSM6AAAAABPEATCM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGOBWGU3TOMJZGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
In the dev version, we call the main approximator |
We could, in principle, also have a @andrewzm right, interesting point regarding bootstrap on intervals. I guess it still needs custom code to deal with the differing shapes and select which parameters to plug in the simulator, so for that to be handled without the user writing some configuration code for a hook in a parametric_bootstrap function, we would need well specified Stepping back: I don't know whether this |
@han-ol Assuming we're talking about a parametric bootstrap setting, I think what would be useful to the user is getting a feel for the (frequentist) validity of these quantile intervals around a notional parameter value related to the available data, which could be the posterior mean or median. The bootstrapped intervals could be used to quantify the expected frequentist coverage probability, by seeing the proportion of times the "notional" value falls into the bootstrapped intervals. This is similar to the procedure Hermans et al. describes in "A Trust Crisis In Simulation-Based Inference? Your Posterior Approximations Can Be Unfaithful" (https://arxiv.org/pdf/2110.06581) in Section 2.2 but for a notional value of theta (e.g., the posterior mean) instead of averaging over the whole parameter space. From a software point you're right in that things are bit different, but not much -- with point estimators the notional value will be the estimate itself, while in the interval case it would need to be specified (but it could also be parameter value in the middle of the interval as a default.. crude but probably OK as a default). |
The paper "Likelihood-free parameter estimation with neural Bayes estimators" (Sainsbury-Dale, Zammit-Mangion, & Huser, 2023) enables neural amortized point estimation, which is generally faster than fully Bayesian neural inference with conditional generative models (current standard in BayesFlow).
Implementing amortized point estimators in BayesFlow would help with fast prototyping and sanity checks (and surely other tasks, too!).
Some BayesFlow pointers/proposals:
AmortizedPointEstimator
inbayesflow.amortizers
bayesflow.losses
(depending on the target quantity)bayesflow.inference_networks
which takes the summary network output (e.g., DeepSet outputs) and regresses to the point estimatebayesflow.summary_networks
and probably work out-of-the-boxSome links:
NeuralEstimators
package (R): https://github.com/msainsburydale/NeuralEstimatorsNeuralEstimators.jl
package (Julia): https://github.com/msainsburydale/NeuralEstimators.jlThe text was updated successfully, but these errors were encountered: