spatial_clustering_cv retains geometry in folds causing fit_resamples to fail #159

Open
brianmsm opened this issue Jul 19, 2024 · 2 comments

@brianmsm

The problem

When using spatial_clustering_cv to create spatial resamples, the geometry column is retained within the folds. This causes fit_resamples to fail with an error indicating that not all columns of y are known outcome types. It's unclear whether spatial_clustering_cv should drop the spatial information in the folds or if fit_resamples should exclude the geometry information. There might be something I'm missing.

Reproducible example

# Load packages
library(dplyr, warn.conflicts = FALSE)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(spatialsample)
library(workflows)
library(parsnip)
library(tune)

# Example data
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

# Making spatial clusters
nc_folds <- spatial_clustering_cv(nc, v = 5)

# Workflow for linear regression
lr_recipe <- workflow() %>%
  add_variables(outcomes = BIR74,
                predictors = AREA) %>%
  add_model(linear_reg(engine = "lm"))

# Fit across resamples: fails
(spatial_lr <- fit_resamples(lr_recipe, nc_folds))
#> → A | error:   Not all columns of `y` are known outcome types. These columns have unknown types: 'geometry'.
#> There were issues with some computations   A: x1
#> There were issues with some computations   A: x5
#> 
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
#> # Resampling results
#> # 5-fold spatial cross-validation 
#> # A tibble: 5 × 4
#>   splits          id    .metrics .notes          
#>   <list>          <chr> <list>   <list>          
#> 1 <split [77/23]> Fold1 <NULL>   <tibble [1 × 3]>
#> 2 <split [75/25]> Fold2 <NULL>   <tibble [1 × 3]>
#> 3 <split [79/21]> Fold3 <NULL>   <tibble [1 × 3]>
#> 4 <split [84/16]> Fold4 <NULL>   <tibble [1 × 3]>
#> 5 <split [85/15]> Fold5 <NULL>   <tibble [1 × 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Error(s) x5: Not all columns of `y` are known outcome types. These columns hav...
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

# Collect metrics: fails
collect_metrics(spatial_lr)
#> Error in `estimate_tune_results()`:
#> ! All models failed. Run `show_notes(.Last.tune.result)` for more information.



# Try again after dropping the geometry from each split with st_drop_geometry:
orig_class <- class(nc_folds)

nc_folds <- nc_folds %>% 
  mutate(splits = purrr::map(splits, ~ {
    .x$data <- st_drop_geometry(.x$data)
    .x
  }))

class(nc_folds) <- orig_class

# Fit across resamples: works
(spatial_lr <- fit_resamples(lr_recipe, nc_folds))
#> # Resampling results
#> # -fold spatial cross-validation 
#> # A tibble: 5 × 4
#>   splits          id    .metrics         .notes          
#>   <list>          <chr> <list>           <list>          
#> 1 <split [77/23]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [75/25]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
#> 3 <split [79/21]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
#> 4 <split [84/16]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
#> 5 <split [85/15]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

# Collect metrics: works
collect_metrics(spatial_lr)
#> # A tibble: 2 × 6
#>   .metric .estimator     mean     n  std_err .config             
#>   <chr>   <chr>         <dbl> <int>    <dbl> <chr>               
#> 1 rmse    standard   3542.        5 634.     Preprocessor1_Model1
#> 2 rsq     standard      0.178     5   0.0616 Preprocessor1_Model1

Created on 2024-07-19 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14 ucrt)
#>  os       Windows 11 x64 (build 22635)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Spanish_Peru.utf8
#>  ctype    Spanish_Peru.utf8
#>  tz       America/Lima
#>  date     2024-07-19
#>  pandoc   3.1.12.3 @ c:\\Program Files\\Positron\\bin\\pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version    date (UTC) lib source
#>  class           7.3-22     2023-05-03 [2] CRAN (R 4.4.1)
#>  classInt        0.4-10     2023-09-05 [1] CRAN (R 4.4.0)
#>  cli             3.6.3.9000 2024-06-28 [1] Github (r-lib/cli@d9febb5)
#>  codetools       0.2-20     2024-03-31 [2] CRAN (R 4.4.1)
#>  colorspace      2.1-0      2023-01-23 [1] CRAN (R 4.4.0)
#>  data.table      1.15.4     2024-03-30 [1] CRAN (R 4.4.0)
#>  DBI             1.2.3      2024-06-02 [1] CRAN (R 4.4.0)
#>  dials           1.2.1      2024-02-22 [1] CRAN (R 4.4.1)
#>  DiceDesign      1.10       2023-12-07 [1] CRAN (R 4.4.1)
#>  digest          0.6.36     2024-06-23 [1] CRAN (R 4.4.1)
#>  dplyr         * 1.1.4      2023-11-17 [1] CRAN (R 4.4.0)
#>  e1071           1.7-14     2023-12-06 [1] CRAN (R 4.4.0)
#>  evaluate        0.24.0     2024-06-10 [1] CRAN (R 4.4.0)
#>  fansi           1.0.6      2023-12-08 [1] CRAN (R 4.4.0)
#>  fastmap         1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
#>  foreach         1.5.2      2022-02-02 [1] CRAN (R 4.4.0)
#>  fs              1.6.4      2024-04-25 [1] CRAN (R 4.4.0)
#>  furrr           0.3.1      2022-08-15 [1] CRAN (R 4.4.0)
#>  future          1.33.2     2024-03-26 [1] CRAN (R 4.4.0)
#>  future.apply    1.11.2     2024-03-28 [1] CRAN (R 4.4.0)
#>  generics        0.1.3      2022-07-05 [1] CRAN (R 4.4.0)
#>  ggplot2         3.5.1      2024-04-23 [1] CRAN (R 4.4.1)
#>  globals         0.16.3     2024-03-08 [1] CRAN (R 4.4.0)
#>  glue            1.7.0      2024-01-09 [1] CRAN (R 4.4.0)
#>  gower           1.0.1      2022-12-22 [1] CRAN (R 4.4.0)
#>  GPfit           1.0-8      2019-02-08 [1] CRAN (R 4.4.1)
#>  gtable          0.3.5      2024-04-22 [1] CRAN (R 4.4.0)
#>  hardhat         1.4.0      2024-06-02 [1] CRAN (R 4.4.1)
#>  htmltools       0.5.8.1    2024-04-04 [1] CRAN (R 4.4.0)
#>  ipred           0.9-14     2023-03-09 [1] CRAN (R 4.4.1)
#>  iterators       1.0.14     2022-02-05 [1] CRAN (R 4.4.0)
#>  KernSmooth      2.23-24    2024-05-17 [2] CRAN (R 4.4.1)
#>  knitr           1.48       2024-07-07 [1] CRAN (R 4.4.1)
#>  lattice         0.22-6     2024-03-20 [2] CRAN (R 4.4.1)
#>  lava            1.8.0      2024-03-05 [1] CRAN (R 4.4.1)
#>  lhs             1.2.0      2024-06-30 [1] CRAN (R 4.4.1)
#>  lifecycle       1.0.4      2023-11-07 [1] CRAN (R 4.4.0)
#>  listenv         0.9.1      2024-01-29 [1] CRAN (R 4.4.0)
#>  lubridate       1.9.3      2023-09-27 [1] CRAN (R 4.4.0)
#>  magrittr        2.0.3      2022-03-30 [1] CRAN (R 4.4.0)
#>  MASS            7.3-60.2   2024-04-26 [2] CRAN (R 4.4.1)
#>  Matrix          1.7-0      2024-04-26 [2] CRAN (R 4.4.1)
#>  munsell         0.5.1      2024-04-01 [1] CRAN (R 4.4.0)
#>  nnet            7.3-19     2023-05-03 [2] CRAN (R 4.4.1)
#>  parallelly      1.37.1     2024-02-29 [1] CRAN (R 4.4.0)
#>  parsnip       * 1.2.1      2024-03-22 [1] CRAN (R 4.4.1)
#>  pillar          1.9.0      2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.4.0)
#>  prodlim         2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
#>  proxy           0.4-27     2022-06-09 [1] CRAN (R 4.4.0)
#>  purrr           1.0.2      2023-08-10 [1] CRAN (R 4.4.0)
#>  R6              2.5.1      2021-08-19 [1] CRAN (R 4.4.0)
#>  Rcpp            1.0.12     2024-01-09 [1] CRAN (R 4.4.0)
#>  recipes         1.1.0      2024-07-04 [1] CRAN (R 4.4.1)
#>  reprex          2.1.1      2024-07-06 [1] CRAN (R 4.4.1)
#>  rlang           1.1.4.9000 2024-06-28 [1] Github (r-lib/rlang@cebbabf)
#>  rmarkdown       2.27       2024-05-17 [1] CRAN (R 4.4.0)
#>  rpart           4.1.23     2023-12-05 [2] CRAN (R 4.4.1)
#>  rsample         1.2.1      2024-03-25 [1] CRAN (R 4.4.1)
#>  s2              1.1.6      2023-12-19 [1] CRAN (R 4.4.0)
#>  scales          1.3.0      2023-11-28 [1] CRAN (R 4.4.0)
#>  sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.4.0)
#>  sf            * 1.0-16     2024-03-24 [1] CRAN (R 4.4.0)
#>  spatialsample * 0.5.1      2023-11-08 [1] CRAN (R 4.4.1)
#>  survival        3.6-4      2024-04-24 [2] CRAN (R 4.4.1)
#>  tibble          3.2.1      2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyr           1.3.1      2024-01-24 [1] CRAN (R 4.4.0)
#>  tidyselect      1.2.1      2024-03-11 [1] CRAN (R 4.4.0)
#>  timechange      0.3.0      2024-01-18 [1] CRAN (R 4.4.0)
#>  timeDate        4032.109   2023-12-14 [1] CRAN (R 4.4.0)
#>  tune          * 1.2.1      2024-04-18 [1] CRAN (R 4.4.1)
#>  units           0.8-5      2023-11-28 [1] CRAN (R 4.4.0)
#>  utf8            1.2.4      2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs           0.6.5      2023-12-01 [1] CRAN (R 4.4.0)
#>  withr           3.0.0      2024-01-16 [1] CRAN (R 4.4.0)
#>  wk              0.9.2      2024-07-09 [1] CRAN (R 4.4.1)
#>  workflows     * 1.1.4      2024-02-19 [1] CRAN (R 4.4.1)
#>  xfun            0.45       2024-06-16 [1] CRAN (R 4.4.1)
#>  yaml            2.3.9      2024-07-05 [1] CRAN (R 4.4.1)
#>  yardstick       1.3.1      2024-03-21 [1] CRAN (R 4.4.1)
#> 
#>  [1] C:/Users/brian/AppData/Local/R/win-library/4.4
#>  [2] C:/Program Files/R/R-4.4.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@mikemahoney218
Member

Try using add_formula() instead of add_variables() as a workaround.

(Sorry for the brief reply -- I'm traveling at the moment so can't run stuff, but wanted to make sure I could try to help you get unstuck. This is definitely a bug somewhere)

@brianmsm
Author

Interesting. It does work if I use add_formula() instead.

# Load packages
library(dplyr, warn.conflicts = FALSE)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(spatialsample)
library(workflows)
library(parsnip)
library(tune)

# Example data
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

# Making spatial clusters
nc_folds <- spatial_clustering_cv(nc, v = 5)

# Workflow for linear regression
lr_recipe <- workflow() %>%
  add_formula(BIR74 ~ AREA) %>%
  add_model(linear_reg(engine = "lm"))

# Fit across resamples: works
(spatial_lr <- fit_resamples(lr_recipe, nc_folds))
#> # Resampling results
#> # 5-fold spatial cross-validation 
#> # A tibble: 5 × 4
#>   splits          id    .metrics         .notes          
#>   <list>          <chr> <list>           <list>          
#> 1 <split [79/21]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [75/25]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
#> 3 <split [77/23]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
#> 4 <split [85/15]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
#> 5 <split [84/16]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

# Collect metrics: works
collect_metrics(spatial_lr)
#> # A tibble: 2 × 6
#>   .metric .estimator     mean     n  std_err .config             
#>   <chr>   <chr>         <dbl> <int>    <dbl> <chr>               
#> 1 rmse    standard   3542.        5 634.     Preprocessor1_Model1
#> 2 rsq     standard      0.178     5   0.0616 Preprocessor1_Model1

Created on 2024-07-22 with reprex v2.1.1
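
A possible explanation for the difference between add_variables() and add_formula(), offered as a guess and not as a confirmed diagnosis: sf geometry columns are "sticky", so selecting a single column from an sf object still returns the geometry alongside it. If the outcome is selected from the fold data this way, it would arrive as two columns (BIR74 and geometry), which matches the error message above. A formula only pulls the variables it names. The sketch below only inspects the data in the first fold; it does not call into workflows or tune, and the object names (train, sticky, dropped, mf) are made up for illustration.

```r
library(sf)
library(spatialsample)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_folds <- spatial_clustering_cv(nc, v = 5)

# The analysis set of a fold is still an sf object
train <- rsample::analysis(nc_folds$splits[[1]])
inherits(train, "sf")

# Selecting one column from an sf object keeps the geometry column too,
# so this has two columns: BIR74 and geometry
sticky <- names(train["BIR74"])
sticky

# After dropping the geometry, the same selection has one column: BIR74
dropped <- names(st_drop_geometry(train)["BIR74"])
dropped

# A model frame built from a formula only contains the named variables
mf <- model.frame(BIR74 ~ AREA, data = train)
names(mf)
```

If this is indeed the cause, the fix could plausibly live on either side: the variables preprocessor could drop the geometry before selecting columns, or the splits could hand over plain data frames. Which of the two is appropriate is for the maintainers to decide.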
