Support topmodels #1159

Open · wants to merge 1 commit into main
Conversation

vincentarelbundock (Owner) commented Jul 8, 2024

Install:

install.packages("topmodels", repos = "https://R-Forge.R-project.org")
remotes::install_github("vincentarelbundock/marginaleffects@topmodels")

Examples seem to work:

library("topmodels")
library("marginaleffects")

data("FIFA2018", package = "distributions3")
m <- glm(goals ~ difference, data = FIFA2018, family = poisson)
p <- promodel(m)

# Informative error
avg_slopes(p, type = "junk")
> Error in sanitize_type(model = model, type = type, calling_function = "slopes"): Assertion on 'type' failed: Must be element of set {'mean','variance','quantile','probability','density','loglikelihood','kurtosis','skewness'}, but is 'junk'.
# Mean
avg_slopes(p)
> 
>        Term    Contrast Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
>  difference mean(dY/dX)    0.536      0.143 3.74   <0.001 12.4 0.255  0.817
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  mean
# Probability
avg_slopes(p, type = "probability")
> 
>        Term    Contrast Estimate Std. Error     z Pr(>|z|)    S  2.5 %  97.5 %
>  difference mean(dY/dX)   -0.141     0.0333 -4.23   <0.001 15.4 -0.206 -0.0754
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  probability
# Probability at=3
avg_slopes(p, type = "probability", at = 3)
> 
>        Term    Contrast Estimate Std. Error     z Pr(>|z|)   S  2.5 %  97.5 %
>  difference mean(dY/dX)  -0.0605     0.0218 -2.78  0.00545 7.5 -0.103 -0.0178
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  probability
# Effect of a change of 2 units of `difference` 
# on the probability that teams score 3 goals or less
avg_comparisons(p,
    variables = list(difference = 2),
    type = "probability",
    at = 3)
> 
>        Term Contrast Estimate Std. Error     z Pr(>|z|)   S 2.5 %  97.5 %
>  difference mean(+2)   -0.291      0.122 -2.39   0.0168 5.9 -0.53 -0.0526
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  probability
# Predictions also work
predictions(p)
> 
>  Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
>     1.768      0.182  9.73   <0.001  71.9 1.412   2.12
>     0.866      0.120  7.19   <0.001  40.5 0.630   1.10
>     1.030      0.108  9.51   <0.001  68.8 0.817   1.24
>     1.486      0.120 12.36   <0.001 114.1 1.250   1.72
>     1.435      0.113 12.68   <0.001 120.0 1.214   1.66
> --- 118 rows omitted. See ?avg_predictions and ?print.marginaleffects --- 
>     1.324      0.103 12.84   <0.001 122.9 1.122   1.53
>     1.332      0.104 12.86   <0.001 123.2 1.129   1.53
>     1.149      0.102 11.27   <0.001  95.5 0.949   1.35
>     1.604      0.142 11.30   <0.001  96.0 1.326   1.88
>     0.954      0.114  8.38   <0.001  54.1 0.731   1.18
> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, goals, difference 
> Type:  mean

vincentarelbundock (Owner)
@zeileis I have not tried too many things yet, but at first glance what we discussed seems to work well. See above for installation and examples.

zeileis commented Jul 8, 2024

Vincent, really cool, thanks!

I have not tried many things yet, but I noticed the following. While avg_slopes() gives me the same output for the original Poisson glm object m and for the promodel object p, I get somewhat different uncertainty estimates from predictions().

Also, the type argument is a bit more flexible in procast() than I have currently documented in the main help file. We allow: "quantile", "mean", "variance", "probability", "cdf", "density", "pdf", "pmf", "loglikelihood", "log_pdf", "distribution", "parameters", "kurtosis", "skewness".

Thus, instead of "probability" (for compatibility with the p functions) you can also use "cdf" (for compatibility with distributions3). Similarly for "density" vs. "pdf" and "pmf", and "loglikelihood" vs. "log_pdf". I haven't yet decided which to push as the "preferred" set of type labels. (Thoughts and opinions are welcome...)
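A quick way to check the aliasing directly (a sketch; it assumes, per the list above, that each alias pair maps to the same underlying computation in procast(), with only attributes such as column names differing):

```r
library("topmodels")

data("FIFA2018", package = "distributions3")
m <- glm(goals ~ difference, data = FIFA2018, family = poisson)

# "probability" and "cdf" should be aliases; column names may differ,
# so ignore attributes when comparing
all.equal(
    procast(m, type = "probability", at = 3),
    procast(m, type = "cdf", at = 3),
    check.attributes = FALSE)
```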

More tomorrow!

vincentarelbundock (Owner)

TODO:

  • Agree on a final list of type values to allow
  • Do SEs match in predictions() when comparing topmodels and raw model objects?
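For the second item, a minimal comparison sketch (it assumes the install instructions above; since the default type differs between the two objects, both calls are put on the mean/response scale explicitly):

```r
library("topmodels")
library("marginaleffects")

data("FIFA2018", package = "distributions3")
m <- glm(goals ~ difference, data = FIFA2018, family = poisson)
p <- promodel(m)

# compare standard errors on the response/mean scale, row by row
head(cbind(
    glm      = predictions(m, type = "response")$std.error,
    promodel = predictions(p, type = "mean")$std.error))
```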

vincentarelbundock (Owner) commented Jul 9, 2024

I haven't yet decided which to push as the "preferred" set of type labels. (Thoughts and opinions are welcome...)

In general, I like verbosity, but in this particular case it might be best to be consistent with distributions3. Sure, there is a parallel between the first letters of "pnorm" and "probability", but that parallel is not very obvious, so it is better to be consistent with an existing convention than to invent new type labels altogether. That is a weak preference, though. What matters most is to stick with one convention and avoid duplication.

A lot of users have asked me to implement prediction intervals in marginaleffects. So far, that has only been possible for Bayesian models, or via conformal prediction with the inferences() function.

Is this something that topmodels can help me do reliably and flexibly for many model types? I am thinking of something like the following, which appears to yield adequate 90% coverage in an ultra-simple normal model:

library("topmodels")
library("marginaleffects")

sim <- function(...) {
    # simulate from a simple normal model: y = x + noise
    N <- 500
    dat <- data.frame(x = rnorm(N))
    dat$y <- dat$x + rnorm(N)
    # split into training and test halves
    idx <- sample(1:N, N / 2)
    train <- dat[idx, ]
    test <- dat[-idx, ]
    m <- glm(y ~ x, data = train)
    # the 5% and 95% predictive quantiles bracket a 90% interval
    p <- do.call(cbind, list(
        procast(m, newdata = test, type = "mean"),
        procast(m, newdata = test, type = "quantile", at = .05),
        procast(m, newdata = test, type = "quantile", at = .95)
    ))
    p <- setNames(p, c("estimate", "conf.low", "conf.high"))
    p$truth <- test$y
    # share of test observations falling inside the interval
    coverage <- mean(p$truth < p$conf.high & p$truth > p$conf.low)
    return(coverage)
}

mean(sapply(seq_len(100), sim))
> [1] 0.89572
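The same setup can be pushed a bit further to check calibration at several nominal levels, not just 90% (a sketch under the same assumptions as the simulation above):

```r
# coverage of central prediction intervals at several nominal levels
sim_level <- function(level, nsim = 100) {
    mean(sapply(seq_len(nsim), function(i) {
        N <- 500
        dat <- data.frame(x = rnorm(N))
        dat$y <- dat$x + rnorm(N)
        idx <- sample(1:N, N / 2)
        train <- dat[idx, ]
        test <- dat[-idx, ]
        m <- glm(y ~ x, data = train)
        # central interval: e.g. level = .9 uses the .05 and .95 quantiles
        lo <- procast(m, newdata = test, type = "quantile", at = (1 - level) / 2)
        hi <- procast(m, newdata = test, type = "quantile", at = 1 - (1 - level) / 2)
        mean(test$y > lo[[1]] & test$y < hi[[1]])
    }))
}
sapply(c(.5, .8, .9, .95), sim_level)
```

If the empirical coverages track the nominal levels across the grid, that would suggest the quantile-based intervals are usable as a general prediction-interval backend.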
