incorrect mixture probabilities returned when mixtures include hurdle families #1552

jsocolar · 2023-10-25T03:41:47Z

While tinkering with #1551, I realized that mixture() misbehaves as follows:

If I do

mix <- mixture(gaussian, poisson)

I error with

Error: Cannot mix families with real and integer support.

This makes sense.

But if I do:

mix <- mixture(gaussian, zero_one_inflated_beta)

brms doesn't complain. Moreover, if I do

set.seed(2)
dat <- data.frame(
  y = c(0, 1, runif(10))
)
mix <- mixture(gaussian, zero_one_inflated_beta)
fit <- brm(y ~ 1, dat, family = mix)
pp_mixture(fit)[1:3,,]

I get (in addition to a bunch of divergences) the final output

> pp_mixture(fit)[1:3,,]
, , P(K = 1 | Y)

   Estimate  Est.Error      Q2.5     Q97.5
1 0.9318899 0.10426227 0.6236484 0.9997673
2 0.9506989 0.07524888 0.7287995 0.9998412
3 0.9974165 0.02822344 0.9884541 1.0000000

, , P(K = 2 | Y)

     Estimate  Est.Error          Q2.5     Q97.5
1 0.068110140 0.10426227  2.326594e-04 0.3763516
2 0.049301144 0.07524888  1.587643e-04 0.2712005
3 0.002583492 0.02822344 1.282092e-162 0.0115459

This is incorrect; rows 1 and 2 are structurally guaranteed to arise from the zero-one-inflated-beta component which puts probability mass rather than probability density over zero and one.

I think the solution is to prohibit hurdle families as arguments to mixture().

The text was updated successfully, but these errors were encountered:

paul-buerkner · 2023-10-25T07:44:56Z

I see the problem. However, I would not want to prevent these models from being excluded from mixture models because of this. Perhaps a warning could be sufficient.

jsocolar · 2023-10-25T13:25:26Z

I defer to you (obviously), but want to make sure that it's clear that this isn't a problem in pp_mixture() but rather a problem with the generated stancode.

set.seed(2)
dat <- data.frame(
  y = c(rep(0, 100), runif(10) rnorm(5, .5, .1))
)
mix <- mixture(gaussian, zero_inflated_beta)
make_stancode(y ~ 1, dat, family = mix)

Yields stan code including

    for (n in 1:N) {
      array[2] real ps;
      ps[1] = log(theta1) + normal_lpdf(Y[n] | mu1[n], sigma1);
      ps[2] = log(theta2) + zero_inflated_beta_lpdf(Y[n] | mu2[n], phi2, zi2);
      target += log_sum_exp(ps);
    }

But it's not correct to log_sum_exp through a mixture of probability densities and probability masses. The correct code here would be

    for (n in 1:N) {
      array[2] real ps;
      if(y[n] == 0){
        target += log(theta2) + log(zi2);
      } else {
        ps[1] = log(theta1) + normal_lpdf(Y[n] | mu1[n], sigma1);
        ps[2] = log(theta2) + log1m(zi2) + beta_lpdf(Y[n] | mu2[n], phi2);
        target += log_sum_exp(ps);
      }
    }

Here's a simple example that shows the consequences for parameter recovery:

library(brms)
set.seed(2)
dat <- data.frame(
  y = c(rep(0, 100), runif(20), rnorm(20, .5, .1))
)
mix <- mixture(gaussian, zero_inflated_beta)
prior <- c(
  prior(normal(.5, .05), Intercept, dpar = mu1),
  prior(normal(0, .2), Intercept, dpar = mu2),
  prior(lognormal(1,1), sigma1, lb = 0.01)
)
fit <- brm(y ~ 1, dat, family = mix, 
           prior = prior,
           backend = "cmdstanr", 
           cores = 4)
summary(fit)

I included the prior lb on sigma1 because this model desperately wants to estimate sigma1 on the boundary, which leads to very poor sampling when the boundary is zero.

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept     0.00      0.00    -0.00     0.00 1.00     4359     2987
mu2_Intercept     0.06      0.12    -0.17     0.30 1.00     3651     2576

Family Specific Parameters: 
       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     0.01      0.00     0.01     0.01 1.00     2869     1761
phi2       3.47      0.68     2.28     4.88 1.00     3765     2475
zi2        0.02      0.02     0.00     0.09 1.00     2789     1638
theta1     0.71      0.04     0.63     0.78 1.00     3981     2731
theta2     0.29      0.04     0.22     0.37 1.00     3981     2731

Even with the lb on sigma1 (not to mention the zero-avoiding prior), this model estimates mu1 as zero; i.e. in the ten-sigma region of its marginal prior. But the normal part of the likelihood should never even see any of the zeros in the data--in the correct Stan code, they would all go straight to the zero-inflation term via the control flow.

I think that pp_mixture() is in fact returning the correct mixture weights given the model that was fitted; i.e. treating the probability of a structural zero via zi as a probability density rather than a probability mass.

jsocolar · 2023-10-25T13:35:28Z

Note that the Stan code for mixtures including zero_inflated_poisson, zero_inflated_binomial etc is unproblematic, because these families don't mix probability masses and probability densities, and mixture()'s requirement not to mix integer and continuous families will prevent them from being mixed with probability densities later on.

paul-buerkner · 2023-10-25T13:38:24Z

Ah, that makes sense. So the problem are the continuoues zi/hu families, such as zero_inflated_lognormal?

jsocolar · 2023-10-25T13:39:26Z

Yup!

paul-buerkner · 2023-10-25T13:41:56Z

Got it. Will change accordingly.

paul-buerkner · 2023-11-08T10:47:32Z

Fixed :-)

paul-buerkner added this to the 2.21.0 milestone Oct 25, 2023

jsocolar mentioned this issue Oct 28, 2023

zi mixture probs (draft) #1555

Closed

4 tasks

paul-buerkner added the bug label Oct 30, 2023

paul-buerkner added a commit that referenced this issue Nov 8, 2023

fix issue #1552

226a95a

paul-buerkner closed this as completed Nov 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incorrect mixture probabilities returned when mixtures include hurdle families #1552

incorrect mixture probabilities returned when mixtures include hurdle families #1552

jsocolar commented Oct 25, 2023

paul-buerkner commented Oct 25, 2023

jsocolar commented Oct 25, 2023 •

edited

Loading

jsocolar commented Oct 25, 2023

paul-buerkner commented Oct 25, 2023

jsocolar commented Oct 25, 2023

paul-buerkner commented Oct 25, 2023

paul-buerkner commented Nov 8, 2023

incorrect mixture probabilities returned when mixtures include hurdle families #1552

incorrect mixture probabilities returned when mixtures include hurdle families #1552

Comments

jsocolar commented Oct 25, 2023

paul-buerkner commented Oct 25, 2023

jsocolar commented Oct 25, 2023 • edited Loading

jsocolar commented Oct 25, 2023

paul-buerkner commented Oct 25, 2023

jsocolar commented Oct 25, 2023

paul-buerkner commented Oct 25, 2023

paul-buerkner commented Nov 8, 2023

jsocolar commented Oct 25, 2023 •

edited

Loading