speed improvement for step_lencode_glm() #232

Open
EmilHvitfeldt opened this issue Oct 30, 2024 · 0 comments
Labels
feature a feature request or enhancement

Comments

@EmilHvitfeldt
Member

I think I found some evidence that we can improve the speed of step_lencode_glm() significantly.

The following shows a rough benchmark. A few things to note:

  • the two methods produce the same result up to 10^-15
  • the ordering of the values is not the same, but that doesn't matter as we left_join it on
  • this only works for the numeric outcome, but it would be easy enough to extend to the other supported modes
  • the old method scales linearly in time with the number of levels of x; the new method takes about the same time regardless of the number of levels
library(embed)
n_obs <- 500000

data <- tibble(
  outcome = rnorm(n_obs),
  x = factor(sample(seq_len(100), n_obs, TRUE))
)

# old: current step_lencode_glm() implementation
tictoc::tic("old")
res <- recipe(outcome ~ x, data = data) |>
  step_lencode_glm(x, outcome = vars(outcome)) |>
  prep()
tictoc::toc()
#> old: 8.327 sec elapsed


# new: per-level mean of the outcome
tictoc::tic("new")
tmp <- data |>
  summarise(value = mean(outcome), .by = x)
tictoc::toc()
#> new: 0.007 sec elapsed
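
The "same result" and "ordering doesn't matter" points can be checked on a small example. The following is only a sketch in base R. It assumes the current encoding is equivalent to a no-intercept Gaussian GLM with one coefficient per level of x. That is an assumption made for illustration, not something read out of embed's source. The names dat, glm_enc, mean_enc and both are made up for this example.

```r
# Sketch: compare per-level GLM coefficients with plain group means.
# Assumption: the encoding behaves like a no-intercept Gaussian GLM with
# one coefficient per level of x (illustrative, not embed's actual code).
set.seed(232)
n_obs <- 10000
dat <- data.frame(
  outcome = rnorm(n_obs),
  x = factor(sample(seq_len(20), n_obs, TRUE))
)

# "old"-style estimate: one coefficient per level, no intercept
fit <- glm(outcome ~ 0 + x, data = dat, family = gaussian)
glm_enc <- data.frame(
  x = sub("^x", "", names(coef(fit))),  # strip the "x" prefix glm adds
  glm_value = unname(coef(fit))
)

# "new"-style estimate: group means, with rows deliberately shuffled
means <- tapply(dat$outcome, dat$x, mean)
mean_enc <- data.frame(x = names(means), mean_value = as.vector(means))
mean_enc <- mean_enc[sample(nrow(mean_enc)), ]

# Join on the level label, so the row order of either table is irrelevant
both <- merge(glm_enc, mean_enc, by = "x")
max_abs_diff <- max(abs(both$glm_value - both$mean_value))
max_abs_diff
```

If the assumption holds, max_abs_diff should be at floating-point noise level, since a Gaussian model with one free coefficient per level is fit exactly by the group means.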