convert_to_numeric() in a dataset of 500,000+ rows took 2.5 minutes #163

avallecam · 2024-07-25T18:21:28Z

Running cleanepi::convert_to_numeric() on a dataset of 500,000+ rows took about 2.5 minutes (150 seconds).

Is this expected? If so, it may be worth refactoring at an appropriate time to use data.table or dtplyr.

library(rio)
library(cleanepi)
library(tidyverse)
library(tictoc)

covid <- rio::import(
  "https://raw.githubusercontent.com/Joskerus/Enlaces-provisionales/main/data_limpieza.zip",
  which = "datos_covid_LA.RDS"
) %>% 
  cleanepi::standardize_column_names()

tictoc::tic()
covid %>% 
  dplyr::select(numero_de_hospitalizaciones_recientes) %>% 
  cleanepi::convert_to_numeric(
    target_columns = "numero_de_hospitalizaciones_recientes",
    lang = "es")
#> # A tibble: 502,010 × 1
#>    numero_de_hospitalizaciones_recientes
#>                                    <dbl>
#>  1                                     0
#>  2                                     0
#>  3                                     0
#>  4                                     0
#>  5                                     0
#>  6                                     0
#>  7                                     0
#>  8                                     0
#>  9                                    NA
#> 10                                     0
#> # ℹ 502,000 more rows
#> # ℹ Use `print(n = ...)` to see more rows
tictoc::toc()
#> 150.42 sec elapsed
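
A possible interim workaround, sketched here as an assumption and not something the maintainers have proposed: a column like this one holds only a handful of distinct values, so the slow conversion could be run once per distinct value and mapped back to the full vector. `convert_unique()` and `slow_converter()` below are hypothetical names for illustration; neither is part of cleanepi or numberize.

```r
# Hypothetical helper (not part of cleanepi or numberize): run a slow,
# vectorised converter once per distinct value, then map the results
# back onto the full-length vector.
convert_unique <- function(x, converter, ...) {
  keys <- unique(x)                 # distinct values, NA included
  converted <- converter(keys, ...) # the expensive call, on few elements
  converted[match(x, keys)]         # expand back to the original length
}

# Stand-in converter used only to keep this example self-contained.
slow_converter <- function(x) {
  lookup <- c(cero = 0, uno = 1, dos = 2)
  unname(lookup[x])
}

x <- rep(c("cero", "uno", NA, "dos"), times = 1000)
result <- convert_unique(x, slow_converter)
```

In practice the stand-in would be replaced by the real converter, for example `numberize::numberize` with `lang = "es"` passed through `...`. This only helps when the number of distinct values is small relative to the number of rows.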

cc: @Joskerus @lgbermeo


Bisaloo · 2024-07-28T09:19:11Z

Could you give epiverse-trace/numberize#14 a go please? If the performance is still not sufficient, I have a couple of other ideas.
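
For anyone wanting to try this, a minimal sketch of installing the pull request build, assuming the remotes package is available and the pull request ref is still reachable on GitHub:

```r
# Assumes the remotes package is installed; install it first if needed:
# install.packages("remotes")

# The "#14" suffix asks remotes to install the code from pull request 14
# of the epiverse-trace/numberize repository.
remotes::install_github("epiverse-trace/numberize#14")
```

Restart the R session afterwards so that cleanepi picks up the newly installed numberize, then rerun the tictoc timing from the original report to compare.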

Bisaloo mentioned this issue Jul 28, 2024

Improve performance epiverse-trace/numberize#14

Closed
