Speed up duplicate detection in as_epi_df() #560

Open · brookslogan opened this issue Oct 31, 2024 · 0 comments

brookslogan (Contributor) commented Oct 31, 2024

When writing some code for archive-to-archive slides, `as_epi_df()` was taking most of the time. I can/should probably avoid that with `new_epi_df()` or an `as_epi_df.data.table()` method, but it would still be nice to speed this up for cases where we/users want the convenience/safety of `as_epi_df()`.

Most of the time in `as_epi_df()` appears to be spent in duplicate detection:

[profiling screenshot: 2024-10-31-072042_535x39_scrot]

Here's some limited testing of duplicate-check approaches; it looks like we can speed the check up by >50x, at least for "medium"-sized inputs.

``` r
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter

# Current approach: group by the epikey-time columns and filter to groups
# with more than one row.
dup_check1 <- function(x, other_keys) {
  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

# Base R: anyDuplicated() on just the key columns.
dup_check2 <- function(x, other_keys) {
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

# Sort by the key columns, then compare each row to the one before it.
dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(arranged_epikeytimes[-1L, ], arranged_epikeytimes[-nrow(arranged_epikeytimes), ]))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)

bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character())
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl, character… 295.55ms 299.13ms      3.34        NA     13.4
#> 2 dup_check2(test_tbl, character… 168.25ms 170.59ms      5.85        NA     21.5
#> 3 dup_check3(test_tbl, character…   4.09ms   4.56ms    194.          NA     22.0
```

Created on 2024-10-31 with reprex v2.1.1

`vctrs::vec_equal()` should keep this pretty general, though I don't know how it compares speed-wise to less general approaches.
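
Another vctrs-based option that might be worth benchmarking is `vctrs::vec_duplicate_any()`, which treats the data frame of key columns as a vector of rows and avoids the explicit sort. A minimal sketch (not benchmarked here; the name `dup_check4` is just for illustration):

``` r
# Sketch: let vctrs detect duplicate rows directly, without sorting first.
dup_check4 <- function(x, other_keys) {
  epikeytime_names <- c("geo_value", "time_value", other_keys)
  # vec_duplicate_any() treats a data frame as a vector of rows, so this
  # checks whether any epikey-time combination appears more than once.
  vctrs::vec_duplicate_any(x[epikeytime_names])
}

dup_check4(test_tbl, character())
```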

I'm not immediately PR-ing this because it probably needs a bit more correctness and performance testing on different sizes.
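
To illustrate the kind of testing I have in mind, here's a rough sketch (the size grid, the `make_test_tbl()` helper, and the duplicate-injection scheme are placeholders, not part of any existing test suite) that checks the fast version agrees with the current check and benchmarks both across input sizes; it pulls in tidyr just for `expand_grid()`:

``` r
# Sketch of a correctness + performance comparison across input sizes.
# Assumes dup_check1() and dup_check3() from above are defined.
set.seed(42)

make_test_tbl <- function(n_geos, n_times, inject_dup = FALSE) {
  tbl <- tidyr::expand_grid(
    geo_value = sprintf("geo%04d", seq_len(n_geos)),
    time_value = as.Date("2020-01-01") + seq_len(n_times) - 1L
  )
  tbl$value <- rnorm(nrow(tbl))
  if (inject_dup) {
    # Duplicate one arbitrary row to create exactly one epikey-time collision.
    tbl <- dplyr::bind_rows(tbl, tbl[1L, ])
  }
  tbl
}

results <- bench::press(
  n_geos = c(10L, 100L, 1000L),
  inject_dup = c(FALSE, TRUE),
  {
    x <- make_test_tbl(n_geos, n_times = 365L, inject_dup = inject_dup)
    # Correctness: the fast check should agree with the current check.
    stopifnot(dup_check3(x, character()) == dup_check1(x, character()))
    bench::mark(
      dup_check1(x, character()),
      dup_check3(x, character())
    )
  }
)
```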
