Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect mixture probabilities returned when mixtures include hurdle families #1552

Closed
jsocolar opened this issue Oct 25, 2023 · 7 comments
Closed
Labels
Milestone

Comments

@jsocolar
Copy link
Contributor

While tinkering with #1551, I realized that mixture() misbehaves as follows:

If I do

mix <- mixture(gaussian, poisson)

I error with

Error: Cannot mix families with real and integer support.

This makes sense.

But if I do:

mix <- mixture(gaussian, zero_one_inflated_beta)

brms doesn't complain. Moreover, if I do

set.seed(2)
dat <- data.frame(
  y = c(0, 1, runif(10))
)
mix <- mixture(gaussian, zero_one_inflated_beta)
fit <- brm(y ~ 1, dat, family = mix)
pp_mixture(fit)[1:3,,]

I get (in addition to a bunch of divergences) the final output

> pp_mixture(fit)[1:3,,]
, , P(K = 1 | Y)

   Estimate  Est.Error      Q2.5     Q97.5
1 0.9318899 0.10426227 0.6236484 0.9997673
2 0.9506989 0.07524888 0.7287995 0.9998412
3 0.9974165 0.02822344 0.9884541 1.0000000

, , P(K = 2 | Y)

     Estimate  Est.Error          Q2.5     Q97.5
1 0.068110140 0.10426227  2.326594e-04 0.3763516
2 0.049301144 0.07524888  1.587643e-04 0.2712005
3 0.002583492 0.02822344 1.282092e-162 0.0115459

This is incorrect; rows 1 and 2 are structurally guaranteed to arise from the zero-one-inflated-beta component which puts probability mass rather than probability density over zero and one.

I think the solution is to prohibit hurdle families as arguments to mixture().

@paul-buerkner
Copy link
Owner

I see the problem. However, I would not want to prevent these models from being excluded from mixture models because of this. Perhaps a warning could be sufficient.

@paul-buerkner paul-buerkner added this to the 2.21.0 milestone Oct 25, 2023
@jsocolar
Copy link
Contributor Author

jsocolar commented Oct 25, 2023

I defer to you (obviously), but want to make sure that it's clear that this isn't a problem in pp_mixture() but rather a problem with the generated stancode.

set.seed(2)
dat <- data.frame(
  y = c(rep(0, 100), runif(10) rnorm(5, .5, .1))
)
mix <- mixture(gaussian, zero_inflated_beta)
make_stancode(y ~ 1, dat, family = mix)

Yields stan code including

    for (n in 1:N) {
      array[2] real ps;
      ps[1] = log(theta1) + normal_lpdf(Y[n] | mu1[n], sigma1);
      ps[2] = log(theta2) + zero_inflated_beta_lpdf(Y[n] | mu2[n], phi2, zi2);
      target += log_sum_exp(ps);
    }

But it's not correct to log_sum_exp through a mixture of probability densities and probability masses. The correct code here would be

    for (n in 1:N) {
      array[2] real ps;
      if(y[n] == 0){
        target += log(theta2) + log(zi2);
      } else {
        ps[1] = log(theta1) + normal_lpdf(Y[n] | mu1[n], sigma1);
        ps[2] = log(theta2) + log1m(zi2) + beta_lpdf(Y[n] | mu2[n], phi2);
        target += log_sum_exp(ps);
      }
    }

Here's a simple example that shows the consequences for parameter recovery:

library(brms)
set.seed(2)
dat <- data.frame(
  y = c(rep(0, 100), runif(20), rnorm(20, .5, .1))
)
mix <- mixture(gaussian, zero_inflated_beta)
prior <- c(
  prior(normal(.5, .05), Intercept, dpar = mu1),
  prior(normal(0, .2), Intercept, dpar = mu2),
  prior(lognormal(1,1), sigma1, lb = 0.01)
)
fit <- brm(y ~ 1, dat, family = mix, 
           prior = prior,
           backend = "cmdstanr", 
           cores = 4)
summary(fit)

I included the prior lb on sigma1 because this model desperately wants to estimate sigma1 on the boundary, which leads to very poor sampling when the boundary is zero.

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept     0.00      0.00    -0.00     0.00 1.00     4359     2987
mu2_Intercept     0.06      0.12    -0.17     0.30 1.00     3651     2576

Family Specific Parameters: 
       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     0.01      0.00     0.01     0.01 1.00     2869     1761
phi2       3.47      0.68     2.28     4.88 1.00     3765     2475
zi2        0.02      0.02     0.00     0.09 1.00     2789     1638
theta1     0.71      0.04     0.63     0.78 1.00     3981     2731
theta2     0.29      0.04     0.22     0.37 1.00     3981     2731

Even with the lb on sigma1 (not to mention the zero-avoiding prior), this model estimates mu1 as zero; i.e. in the ten-sigma region of its marginal prior. But the normal part of the likelihood should never even see any of the zeros in the data--in the correct Stan code, they would all go straight to the zero-inflation term via the control flow.

I think that pp_mixture() is in fact returning the correct mixture weights given the model that was fitted; i.e. treating the probability of a structural zero via zi as a probability density rather than a probability mass.

@jsocolar
Copy link
Contributor Author

Note that the Stan code for mixtures including zero_inflated_poisson, zero_inflated_binomial etc is unproblematic, because these families don't mix probability masses and probability densities, and mixture()'s requirement not to mix integer and continuous families will prevent them from being mixed with probability densities later on.

@paul-buerkner
Copy link
Owner

Ah, that makes sense. So the problem are the continuoues zi/hu families, such as zero_inflated_lognormal?

@jsocolar
Copy link
Contributor Author

Yup!

@paul-buerkner
Copy link
Owner

Got it. Will change accordingly.

@jsocolar jsocolar mentioned this issue Oct 28, 2023
4 tasks
paul-buerkner added a commit that referenced this issue Nov 8, 2023
@paul-buerkner
Copy link
Owner

Fixed :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants