Speed up duplicate detection in as_epi_df() #560

Open · brookslogan opened this issue Oct 31, 2024 · 0 comments

brookslogan (Contributor) commented Oct 31, 2024

When writing some code for archive-to-archive slides, `as_epi_df()` was taking most of the time. I can/should probably avoid that with `new_epi_df()` or an `as_epi_df.data.table()` method, but it would still be nice to speed this up for cases where we/users want the convenience/safety of `as_epi_df()`.

Most of the time in `as_epi_df()` appears to be spent in duplicate detection:

[profiling screenshot: 2024-10-31-072042_535x39_scrot]

Here's some limited testing of duplicate-check approaches; it looks like we can speed the check up by >50x, at least for "medium"-sized inputs.

``` r
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter

# Current approach: group by the epikey-time columns and filter to groups
# with more than one row.
dup_check1 <- function(x, other_keys) {
  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

# Base R: anyDuplicated() on just the key columns.
dup_check2 <- function(x, other_keys) {
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

# Sort by the key columns, then compare each row to the one before it.
dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(arranged_epikeytimes[-1L, ], arranged_epikeytimes[-nrow(arranged_epikeytimes), ]))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)

bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character())
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl, character… 295.55ms 299.13ms      3.34        NA     13.4
#> 2 dup_check2(test_tbl, character… 168.25ms 170.59ms      5.85        NA     21.5
#> 3 dup_check3(test_tbl, character…   4.09ms   4.56ms    194.          NA     22.0
```

Created on 2024-10-31 with reprex v2.1.1

`vctrs::vec_equal()` should keep this pretty general, though I don't know how it compares speed-wise to less general approaches.
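
Another vctrs-based option that might be worth benchmarking is `vctrs::vec_duplicate_any()`, which treats the data frame of key columns as a vector of rows and avoids the explicit sort. A minimal sketch (not benchmarked here; the name `dup_check4` is just for illustration):

``` r
# Sketch: let vctrs detect duplicate rows directly, without sorting first.
dup_check4 <- function(x, other_keys) {
  epikeytime_names <- c("geo_value", "time_value", other_keys)
  # vec_duplicate_any() treats a data frame as a vector of rows, so this
  # checks whether any epikey-time combination appears more than once.
  vctrs::vec_duplicate_any(x[epikeytime_names])
}

dup_check4(test_tbl, character())
```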

I'm not immediately PR-ing this because it probably needs a bit more correctness and performance testing on different sizes.
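
To illustrate the kind of testing I have in mind, here's a rough sketch (the size grid, the `make_test_tbl()` helper, and the duplicate-injection scheme are placeholders, not part of any existing test suite) that checks the fast version agrees with the current check and benchmarks both across input sizes; it pulls in tidyr just for `expand_grid()`:

``` r
# Sketch of a correctness + performance comparison across input sizes.
# Assumes dup_check1() and dup_check3() from above are defined.
set.seed(42)

make_test_tbl <- function(n_geos, n_times, inject_dup = FALSE) {
  tbl <- tidyr::expand_grid(
    geo_value = sprintf("geo%04d", seq_len(n_geos)),
    time_value = as.Date("2020-01-01") + seq_len(n_times) - 1L
  )
  tbl$value <- rnorm(nrow(tbl))
  if (inject_dup) {
    # Duplicate one arbitrary row to create exactly one epikey-time collision.
    tbl <- dplyr::bind_rows(tbl, tbl[1L, ])
  }
  tbl
}

results <- bench::press(
  n_geos = c(10L, 100L, 1000L),
  inject_dup = c(FALSE, TRUE),
  {
    x <- make_test_tbl(n_geos, n_times = 365L, inject_dup = inject_dup)
    # Correctness: the fast check should agree with the current check.
    stopifnot(dup_check3(x, character()) == dup_check1(x, character()))
    bench::mark(
      dup_check1(x, character()),
      dup_check3(x, character())
    )
  }
)
```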
