
Bayesian copula estimation example notebook #257

Merged: 22 commits into pymc-devs:main on Dec 20, 2023

Conversation

drbenvincent
Contributor

@drbenvincent drbenvincent commented Dec 17, 2021

This pull request adds a new example notebook for Bayesian copula estimation. Work done by myself and @ericmjl.

  • Note that I had to use cores=1 to avoid an EOFError
  • Proposing this goes in the How To section



review-notebook-app bot commented Dec 18, 2021


OriolAbril commented on 2021-12-18T11:42:16Z
----------------------------------------------------------------

I would also use the filename as the target for referencing the notebook, so it's not necessary to open the notebook and look at the source to know which target to use.

I also looked at the raw text and saw you are using an HTML figure to format the image and caption. You should use https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#markdown-figures instead. This minimizes the HTML needed, which lets Sphinx recognize the image, copy it to the _static folder so it is available on the website (or pull it in when generating PDF docs, for example), and also allows using MyST (and thus Markdown) in the caption, for example to reference terms in the glossary or other notebooks.
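For reference, a MyST figure directive along those lines (the image path and target name here are hypothetical) looks roughly like:

````markdown
```{figure} pymc_arviz_logos.png
:name: copula-logos

Caption text, which can use MyST roles such as glossary or notebook references.
```
````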


drbenvincent commented on 2021-12-19T12:00:32Z
----------------------------------------------------------------

Thanks. Have done these things now.

I also simplified the logos. Rather than 2 separate files in a markdown table, I just have a single image with both logos now and no markdown table. Should avoid any strange table rendering issues.

OriolAbril commented on 2021-12-19T16:45:55Z
----------------------------------------------------------------

The :class: myclass is an example to show that extra attributes can be used; we don't need to include it here.


review-notebook-app bot commented Dec 18, 2021


OriolAbril commented on 2021-12-18T11:42:17Z
----------------------------------------------------------------

I need to add a huge warning to the Jupyter style guide about this, but never use HTML links to refer to other notebooks or other documentation pages. They are much more prone to breaking and, by design, can't respect the version of the documentation being read; they will always point to latest. The binning notebook uses awkward_binning as its target (see https://github.com/pymc-devs/pymc-examples/blob/main/examples/case_studies/binning.ipynb), so you can use

{ref}`here <awkward_binning>`

to generate a link with text here pointing to the binning notebook


drbenvincent commented on 2021-12-19T12:03:43Z
----------------------------------------------------------------

This should hopefully be fixed now.

OriolAbril commented on 2021-12-19T16:48:34Z
----------------------------------------------------------------

The ref won't work like this. ReviewNB messed up the formatting, but you need three things: the reference type (ref), the text to be shown (here), and the target the link should point to (awkward_binning). See https://docs.readthedocs.io/en/stable/guides/cross-referencing-with-sphinx.html#the-ref-role for more details.
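Putting those three parts together, the role would read something like:

```markdown
A similar approach is used {ref}`here <awkward_binning>`.
```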


review-notebook-app bot commented Dec 18, 2021


OriolAbril commented on 2021-12-18T11:42:18Z
----------------------------------------------------------------

Line #6.    np.random.seed(seed=42)

this should use rng = np.random.default_rng(42) and then pass the rng to scipy.stats functions when needed. There are two options for this: the base class takes a seed argument and the rvs method takes a random_state argument, both of which accept a Generator. References: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous and https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.rvs.html#scipy.stats.rv_continuous.rvs

From numpy docs: https://numpy.org/doc/stable/reference/random/legacy.html#legacy

> This class should only be used if it is essential to have randoms that are identical to what would have been produced by previous versions of NumPy.

which I don't think is the case here.
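A sketch of both options (illustrative values, not the notebook's code):

```python
import numpy as np
from scipy import stats

# A Generator instance replaces the legacy np.random.seed global state
rng = np.random.default_rng(42)

# Option 1: pass the Generator to rvs via random_state
a = stats.norm(loc=0.0, scale=1.0).rvs(size=1000, random_state=rng)

# Option 2: attach a Generator to the frozen distribution itself
dist = stats.norm(loc=0.0, scale=1.0)
dist.random_state = np.random.default_rng(42)
b = dist.rvs(size=1000)

# Both options draw from a Generator seeded with 42, so the draws should match
```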


drbenvincent commented on 2021-12-19T12:11:24Z
----------------------------------------------------------------

Thanks, I didn't know SciPy distributions took an rng, but now I do :) Fixed


review-notebook-app bot commented Dec 18, 2021


OriolAbril commented on 2021-12-18T11:42:18Z
----------------------------------------------------------------

Is something in the computation of data variable non-deterministic? would we want to perform the computation preserving the chain and draw dimensions and averaging only at the end?


drbenvincent commented on 2021-12-19T12:17:38Z
----------------------------------------------------------------

I'm not sure I follow this question. Is this about lines 14, 15, 16? That computation could be done beforehand, outside of the model, as a_mu, a_sigma, b_sigma are basically observed variables. But I'm not sure that's what you are getting at.

ericmjl commented on 2021-12-19T15:18:32Z
----------------------------------------------------------------

Some notes from talking with Oriol:

  1. The entire block of deterministics can be moved out of the model context.
  2. We'll probably want to explain why we're using point estimates.
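As a sketch of point 1 (with made-up observed arrays; the notebook's actual variables may differ), the point-estimate hyperparameters are plain functions of the data and can be computed before entering the model context:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical observed marginals, stand-ins for the notebook's data
a_obs = rng.normal(loc=1.0, scale=2.0, size=500)
b_obs = rng.normal(loc=-0.5, scale=1.5, size=500)

# Plain deterministic functions of the observed data: these can be computed
# up front instead of as Deterministic nodes inside the model context
a_mu = a_obs.mean()
a_sigma = a_obs.std()
b_sigma = b_obs.std()
```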


review-notebook-app bot commented Dec 18, 2021


OriolAbril commented on 2021-12-18T11:42:19Z
----------------------------------------------------------------

how does this differ from the cov variable in the posterior?


drbenvincent commented on 2021-12-19T12:42:52Z
----------------------------------------------------------------

As far as I can tell, we could use copula_idata.posterior.cov rather than calculate ppc samples.

Doing this results in copula_idata.posterior.cov.stack(sample=("chain", "draw")).data.shape being (2, 2, 4000). So we then need to iterate over this last sample dimension rather than doing dists = [multivariate_normal([0, 0], cov) for cov in ppc["cov"]].

But it's not immediately obvious to me how to iterate over the sample dimension of an xarray object. Any suggestions?

ericmjl commented on 2021-12-19T15:22:57Z
----------------------------------------------------------------

Possibly a transpose. (Just leaving a note from calling Oriol.)

ericmjl commented on 2021-12-20T00:37:51Z
----------------------------------------------------------------

@drbenvincent I think a transpose is what we can do. For an array that is of shape (2, 2, 4000), I think a transpose using np.transpose(cov, axes=(2, 0, 1)) should do the trick. I think that syntax will take the position 2 dimension and move it to position 0.
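A quick NumPy check of that suggestion, using identity matrices as a stand-in for the (2, 2, 4000) posterior stack:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Stand-in for the stacked posterior covariance: shape (2, 2, 4000)
cov_stacked = np.stack([np.eye(2)] * 4000, axis=-1)

# Move the sample axis (position 2) to the front: (4000, 2, 2)
cov_samples = np.transpose(cov_stacked, axes=(2, 0, 1))

# Iterating over the first axis now yields one (2, 2) covariance per sample
dists = [multivariate_normal([0, 0], cov) for cov in cov_samples[:10]]
```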

OriolAbril commented on 2021-12-20T00:48:39Z
----------------------------------------------------------------

I did some xarray+ArviZ black magic and updated all the calculations to use the full posterior, which does seem to give the same result. Feel free to use all or parts of that. If doing the transposition here, however, do it with xarray and take the values after that. You can use copula_idata.posterior.cov.stack(sample=("chain", "draw")).transpose("sample", ...) (the ... being an actual Ellipsis) to ensure the sample dimension is first and the others keep the same order.
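A sketch of that xarray route, with a toy DataArray standing in for copula_idata.posterior.cov (the dimension names here are assumptions):

```python
import numpy as np
import xarray as xr

# Toy posterior: 4 chains x 1000 draws of a (2, 2) covariance (identity here)
cov = xr.DataArray(
    np.broadcast_to(np.eye(2), (4, 1000, 2, 2)).copy(),
    dims=("chain", "draw", "cov_dim_0", "cov_dim_1"),
)

# Stack chain+draw into one sample dim, then move it first; the Ellipsis
# keeps the remaining dims in their original order
stacked = cov.stack(sample=("chain", "draw")).transpose("sample", ...)

for cov_sample in stacked.values[:3]:
    pass  # e.g. multivariate_normal([0, 0], cov_sample)
```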

@OriolAbril (Member) left a comment

I don't know why the readthedocs build is failing, will try to look into it at some point.





Contributor Author

Might need input from Eric here to double check


Contributor Author

drbenvincent commented Dec 19, 2021

Thanks @OriolAbril. I've dealt with all of those points except for two questions.

FYI, this also needs some review (when @ericmjl is able) before we merge. Could you tag him as a reviewer? I don't seem to be able to request reviewers.

@drbenvincent
Contributor Author

I don't understand why the graphviz images aren't displaying.





@OriolAbril
Member

> I don't understand why the graphviz images aren't displaying.

Don't trust the ReviewNB preview; always check the readthedocs preview to see whether things look good. ReviewNB messes up quite often, and even if it did work, what readers will actually see is readthedocs, so that is what we need to double-check.



@drbenvincent
Contributor Author

Thanks @OriolAbril. That last commit cleans everything up, and uses your recommended way of getting the posterior data of cov out in a convenient format for plotting and comparison to the data. I've also added you to the acknowledgements 🙂


review-notebook-app bot commented Dec 27, 2021


OriolAbril commented on 2021-12-27T21:56:31Z
----------------------------------------------------------------

The tags kwarg is not working. ablog errors out when attempting to parse the initial comma, I think removing the comma will fix the issue.


drbenvincent commented on 2021-12-28T10:43:31Z
----------------------------------------------------------------

Extra comma now removed.


review-notebook-app bot commented Dec 27, 2021


OriolAbril commented on 2021-12-27T21:56:31Z
----------------------------------------------------------------

This is using point estimates only, for computations that are not commutative with the expectation. I did try using the whole posterior here and then averaging to get data in the right shape, and it did seem to yield similar results, but AFAIK this is basically a coincidence.

In my opinion there are two options here:

  • use the whole posterior then average
  • use point estimates with a note on why the whole posterior is not used

Note: I have several ideas that hopefully will crystallize into a small package extending xarray for linear algebra and statistical computation. I have a note to come back here once that is available, to simplify the code while using the whole posterior.
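The reason the two options can differ: nonlinear transforms don't commute with averaging. A toy NumPy illustration (numbers unrelated to the notebook):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "posterior" samples of a positive parameter
sigma = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

# Transform then average vs average then transform
e_of_f = np.exp(sigma).mean()   # E[exp(sigma)]
f_of_e = np.exp(sigma.mean())   # exp(E[sigma])

# exp is convex, so by Jensen's inequality E[exp(sigma)] > exp(E[sigma])
print(e_of_f, f_of_e)
```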


ericmjl commented on 2021-12-28T00:58:25Z
----------------------------------------------------------------

I think the spelling of transform is incorrect. Should be transform_data.

drbenvincent commented on 2021-12-28T10:43:10Z
----------------------------------------------------------------

Typo fixed.

I will add some text (shortly) to mention why we are using point estimates.

drbenvincent commented on 2021-12-28T12:05:15Z
----------------------------------------------------------------

In fact, we've already got a note about point estimates under the heading "PyMC models for copula and marginal estimation"

ericmjl commented on 2021-12-28T19:38:33Z
----------------------------------------------------------------

Copying Oriol's point from the discussion to keep the discussion a bit easier to follow:

The note is about simultaneous estimation versus fitting marginal and covariance models independently, but doesn't address marginalizing before or after the transformation between observed and multivariate spaces. Marginalizing after transforming is one of the things I tried and it did give the same result, but the two approaches giving the same result isn't straightforward (to me at least), as the operations don't commute with the expectation.

I think I see Oriol's point: if the expectation of a distribution is E(dist), then exp(E(dist)) is not necessarily equal to E(exp(dist)).

However, we're doing an exp(logcdf(E(dist))) here, not exp(E(dist)). Oriol, I'll readily admit to being ignorant here, but does that change anything?

ericmjl commented on 2022-01-17T16:21:16Z
----------------------------------------------------------------

Copying Oriol's point from the GitHub PR tracker to keep the discussion easier to follow:

The thing is that even if the inputs to the logcdf are the observations, and thus not random variables, its parameters are, so I think it's something like E(probit(exp(logcdf(x|dist)))) vs probit(exp(logcdf(x|E(dist)))), and it doesn't seem like these must be the same.

Upon thinking about your point, Oriol, I think it is easiest for now to leave a note on why we're not using the whole posterior. I think I get your point, and it makes sense, though I'm also aware that we should close things out before they drag on too long, and it's easier to add a comment than to try to change the logic of the code.

Ben, please be on the lookout for a few more textual changes to push!
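To make the E(probit(exp(logcdf(x|dist)))) vs probit(exp(logcdf(x|E(dist)))) comparison concrete, here is a toy version with an exponential marginal; the observation, posterior samples, and parameter names are all made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = 1.0                                      # a fixed observation
scale = rng.lognormal(0.0, 0.5, size=5000)   # toy posterior of the scale parameter

# probit(cdf(x | E[scale])): the point-estimate route
point = stats.norm.ppf(stats.expon.cdf(x, scale=scale.mean()))

# E[probit(cdf(x | scale))]: the full-posterior route
full = stats.norm.ppf(stats.expon.cdf(x, scale=scale)).mean()

# The two generally differ: scale -> probit(cdf(x|scale)) is nonlinear,
# so it does not commute with the expectation over the posterior
print(point, full)
```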


review-notebook-app bot commented Dec 27, 2021


OriolAbril commented on 2021-12-27T21:56:32Z
----------------------------------------------------------------

This title is a level 1 title and should be a level 2 title to preserve the right hierarchy. It should probably be updated too, now that sample_posterior_predictive is not being used; something about comparing inference and truths is probably more fitting.


drbenvincent commented on 2021-12-28T10:42:38Z
----------------------------------------------------------------

Fixed




@drbenvincent
Contributor Author

Housekeeping done. Ready to try to get this over the finish line...

@drbenvincent
Contributor Author

Took me a while to fix some issues (waiting for docs to build, etc.), but the notebook looks good now :)


review-notebook-app bot commented Nov 30, 2023


twiecki commented on 2023-11-30T00:14:54Z
----------------------------------------------------------------

*a value


@OriolAbril (Member) left a comment

Requesting one change to make sure we don't break any cross-references; after that, this is ready to merge.

examples/case_studies/binning.myst.md (outdated review thread, resolved)
@drbenvincent
Contributor Author

Thanks for the comments @ricardoV94. Not sure how it happened, but the comments were attached to an old version where many of those issues had already been resolved. But I did change random_seed=SEED to random_seed=rng.

@twiecki twiecki merged commit fbc76b7 into pymc-devs:main Dec 20, 2023
2 checks passed
@drbenvincent drbenvincent deleted the copula branch December 20, 2023 14:02
5 participants