When writing some code for archive-to-archive slides, `as_epi_df` was taking most of the time. I can/should probably avoid that with `new_epi_df` or an `as_epi_df.data.table` method, but it'd probably still be nice to speed this up in case we/users want the convenience/security of `as_epi_df`.

Most of the time in `as_epi_df` appears to be spent in duplicate detection.

Here's some limited testing of duplicate-check approaches; it looks like we can speed duplicate checks up by >50x, at least for "medium"-sized inputs:
``` r
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter

# Current approach: group by the epikeytime columns and look for any group
# with more than one row.
dup_check1 <- function(x, other_keys) {
  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

# Base R: anyDuplicated() on just the key columns.
dup_check2 <- function(x, other_keys) {
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

# Sort on the key columns, then compare each row to its predecessor;
# after sorting, any duplicates must be adjacent.
dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(
      arranged_epikeytimes[-1L, ],
      arranged_epikeytimes[-nrow(arranged_epikeytimes), ]
    ))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)
bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character())
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl, character… 295.55ms 299.13ms      3.34        NA     13.4
#> 2 dup_check2(test_tbl, character… 168.25ms 170.59ms      5.85        NA     21.5
#> 3 dup_check3(test_tbl, character…   4.09ms   4.56ms    194.          NA     22.0
```
Created on 2024-10-31 with reprex v2.1.1
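For the record on why `dup_check3` wins: after `arrange()`ing on the epikeytime columns, any duplicate rows must end up adjacent, so a single vectorized `vec_equal()` of the rows against their predecessors is enough. A quick sanity check on a toy tibble (made up for illustration, not package data):

``` r
# Toy input with one duplicated (geo_value, time_value) pair, placed
# non-adjacently so the check has to rely on the sort:
toy <- tibble(
  geo_value  = c("ak", "ak", "al", "ak"),
  time_value = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01")),
  value      = 1:4
)
dup_check1(toy, character())         # TRUE
dup_check2(toy, character())         # TRUE
dup_check3(toy, character())         # TRUE
dup_check3(toy[1:3, ], character())  # FALSE: those rows' epikeytimes are distinct
```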
`vctrs::vec_equal` should keep this pretty general, though I don't know how it compares to less general approaches speed-wise.
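For instance, `other_keys` columns should just come along for the ride; a sketch with a hypothetical `age_group` key column (again made up for illustration):

``` r
# Two rows that share (geo_value, time_value) but differ in an
# additional key column:
toy2 <- tibble(
  geo_value  = c("ak", "ak"),
  time_value = as.Date(c("2020-01-01", "2020-01-01")),
  age_group  = c("0-17", "18+"),
  value      = 1:2
)
dup_check3(toy2, character())  # TRUE: duplicated on (geo_value, time_value) alone
dup_check3(toy2, "age_group")  # FALSE: age_group disambiguates the two rows
```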
I'm not immediately PR-ing this because it probably needs a bit more correctness and performance testing on different input sizes.
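Something like `bench::press()` could cover the size dimension; a sketch (the sizes and the resampling scheme here are arbitrary choices, not settled test cases):

``` r
bench::press(
  n = c(1e3, 1e4, 1e5),
  {
    # Resample rows (with replacement, so duplicates are possible) to get an
    # input of the requested size:
    x <- slice_sample(test_tbl, n = n, replace = TRUE)
    bench::mark(
      dup_check2(x, character()),
      dup_check3(x, character())
    )
  }
)
```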