Unable to unnest the splits column #157

Closed
anjelinejeline opened this issue Feb 2, 2024 · 5 comments

@anjelinejeline

Hi, I would like to unnest the rsplit objects in the splits column, but I am not able to do it.
This is my code, followed by the error it produces:

set.seed(123)
cluster_folds=spatial_clustering_cv(out_drivers_sf_norm, v = 10)

class(cluster_folds)
autoplot(cluster_folds)

cluster_folds |> as.data.frame() |> tidyr::unnest(c(splits))
Error in `list_sizes()`:
! `x[[1]]` must be a vector, not a <spatial_clustering_split/spatial_rsplit/rsplit> object.
Backtrace:
 1. tidyr::unnest(as.data.frame(cluster_folds), c(splits))
 2. tidyr:::unnest.data.frame(as.data.frame(cluster_folds), c(splits))
 3. tidyr::unchop(...)
 4. tidyr:::df_unchop(...)
 5. vctrs::list_sizes(col)
@mikemahoney218
Member

Does this go away if you call library(sf) at the top of your script? Sorry, away from a computer so I can't test this myself, but that should work.

@EmilHvitfeldt
Member

Hello @anjelinejeline 👋

What are you expecting to get back when applying unnest() here? I don't know the rsample packages as well as @mikemahoney218 does, but I don't think unnesting the splits column is something these packages support.

@anjelinejeline
Author

@mikemahoney218 no, unfortunately it does not go away.
BTW @EmilHvitfeldt, I am trying to unlist the column with the fold data.
I am also struggling to create spatial clusters of equal size. I need equal-sized folds to use the predict function of a spatial regression, because it is not possible to predict on a dataset of a different size. Can you help me with that too? Is there a function in this package I could use?

@mikemahoney218
Member

Sorry - Emil was more careful than I was and understood the actual problem better :)

So the key issue here is that there's not really a column that contains "the fold data" as you might expect. If you're interested, I wrote a blog post a while back about the internals of the objects in rsample and spatialsample, but the key thing is that the splits column doesn't actually contain the data assigned to each fold, but rather the row indices of the assessment set for each split of your data. So "unnesting" here doesn't make a ton of sense, because you don't want to unnest those indices; you want (I think!) a record of what row belongs to what assessment set.
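To make that concrete, here is a minimal sketch of what one element of the splits column looks like. It uses rsample's vfold_cv() on the built-in mtcars data as a stand-in (an assumption made purely for illustration; it is not spatial data), but the same accessors should apply to the splits that spatialsample creates:

```r
library(rsample)

set.seed(123)
folds <- vfold_cv(mtcars, v = 4)

# Each element of the splits column is an rsplit object, not a data frame,
# which is why tidyr::unnest() refuses to work on it
first_split <- folds$splits[[1]]
class(first_split)

# Row indices (into mtcars) of the assessment set for this split
held_out_rows <- as.integer(first_split, data = "assessment")

# assessment() uses those indices to build the held-out rows on demand
held_out <- assessment(first_split)
nrow(held_out)
```

The point is that the rows only materialize when you ask for them through analysis() or assessment().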

So the easiest way to get that, assuming I understand what you're looking for, is to get each assessment set separately, give it an identifier, and then combine those into a single table.

For example, say we've got some rset object that looks like this:

set.seed(123)
library(spatialsample)
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))

cluster_folds=spatial_clustering_cv(nc, v = 10)

autoplot(cluster_folds)

We could use the following code to pull out what row belongs to what fold (and obviously, drop the ggplot2 code if you just want the output data frame):

lapply(
  seq_len(nrow(cluster_folds)),
  function(fold) {
    get_rsplit(cluster_folds, fold) |> 
      assessment() |> 
      dplyr::mutate(fold = fold)
  }
) |> 
  do.call(what = rbind) |> 
  ggplot2::ggplot(ggplot2::aes(fill = factor(fold))) + 
  ggplot2::geom_sf()

Created on 2024-02-02 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21)
#>  os       macOS Ventura 13.3.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-02-02
#>  pandoc   3.1.11 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  class           7.3-22  2023-05-03 [1] CRAN (R 4.3.0)
#>  classInt        0.4-9   2023-02-28 [1] CRAN (R 4.3.0)
#>  cli             3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  codetools       0.2-19  2023-02-01 [1] CRAN (R 4.3.0)
#>  colorspace      2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
#>  curl            5.0.2   2023-08-14 [1] CRAN (R 4.3.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.3.0)
#>  digest          0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
#>  dplyr           1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
#>  e1071           1.7-13  2023-02-01 [1] CRAN (R 4.3.0)
#>  evaluate        0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi           1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
#>  farver          2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
#>  fastmap         1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  fs              1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
#>  furrr           0.3.1   2022-08-15 [1] CRAN (R 4.3.0)
#>  future          1.33.0  2023-07-01 [1] CRAN (R 4.3.0)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>  ggplot2         3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
#>  globals         0.16.2  2022-11-21 [1] CRAN (R 4.3.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  gtable          0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
#>  highr           0.10    2022-12-22 [1] CRAN (R 4.3.0)
#>  htmltools       0.5.6   2023-08-10 [1] CRAN (R 4.3.0)
#>  KernSmooth      2.23-22 2023-07-10 [1] CRAN (R 4.3.0)
#>  knitr           1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>  listenv         0.9.0   2022-12-16 [1] CRAN (R 4.3.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
#>  parallelly      1.36.0  2023-05-26 [1] CRAN (R 4.3.0)
#>  pillar          1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  proxy           0.4-27  2022-06-09 [1] CRAN (R 4.3.0)
#>  purrr           1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache         0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils         2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  Rcpp            1.0.11  2023-07-06 [1] CRAN (R 4.3.0)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang           1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown       2.24    2023-08-14 [1] CRAN (R 4.3.0)
#>  rsample         1.1.1   2022-12-07 [1] CRAN (R 4.3.0)
#>  rstudioapi      0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
#>  s2              1.1.4   2023-05-17 [1] CRAN (R 4.3.0)
#>  scales          1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  sf              1.0-14  2023-07-11 [1] CRAN (R 4.3.0)
#>  spatialsample * 0.5.1   2023-11-08 [1] CRAN (R 4.3.1)
#>  styler          1.10.1  2023-06-05 [1] CRAN (R 4.3.0)
#>  tibble          3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr           1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
#>  units           0.8-3   2023-08-10 [1] CRAN (R 4.3.0)
#>  utf8            1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
#>  vctrs           0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>  wk              0.7.3   2023-05-06 [1] CRAN (R 4.3.0)
#>  xfun            0.40    2023-08-09 [1] CRAN (R 4.3.0)
#>  xml2            1.3.5   2023-07-06 [1] CRAN (R 4.3.0)
#>  yaml            2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Let me know if that isn't what you're trying to accomplish, but I think this is how you get what you're looking for.
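If you only need a lookup table of which row belongs to which fold, without the plot or the geometries, something along these lines might work. This is a sketch rather than a tested recipe: fold_lookup is a made-up name, and it assumes as.integer() on a split accepts data = "assessment", as it does for rsample splits.

```r
library(spatialsample)

nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))

set.seed(123)
cluster_folds <- spatial_clustering_cv(nc, v = 10)

# For each fold, record the row numbers of its assessment set,
# then stack the per-fold tables into one data frame
fold_lookup <- lapply(
  seq_len(nrow(cluster_folds)),
  function(fold) {
    split <- get_rsplit(cluster_folds, fold)
    data.frame(
      row = as.integer(split, data = "assessment"),
      fold = fold
    )
  }
) |>
  do.call(what = rbind)

head(fold_lookup)
```

Because no buffer is used here, each row of nc should appear in exactly one assessment set, so fold_lookup should have one line per row of the original data.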

As for "create spatial clusters with equal size":

This isn't something we currently support in spatialsample directly. Would you be able to link the package (or paper, or so on) that you're using that has this restriction? What happens if the number of data points is a prime number, and so can't be divided evenly into folds?

What you could do is pass a custom function to the cluster_function argument. That custom function can use whatever logic you want, in order to enforce that all folds are of equal sizes. Hopefully the function documentation (especially the Details section) is helpful in describing what that function needs to accept and return -- but let me know if it isn't and if I can help clarify anything.
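As a rough illustration of the idea, here is one possible custom function. It assumes the interface described in the Details section: the function receives a dist object (dists), the number of folds (v), and `...`, and returns one cluster label per row. The name equal_size_clusters and the heuristic itself (order points along the first principal coordinate, then cut that ordering into equal chunks) are my own assumptions, not something spatialsample provides, and the resulting folds are only "spatial" along that one axis.

```r
# Hypothetical helper: assign points to v folds whose sizes differ by at most 1
equal_size_clusters <- function(dists, v, ...) {
  # Project the points onto the first principal coordinate of the distances
  coords <- stats::cmdscale(dists, k = 1)[, 1]
  n <- length(coords)

  # Rank points along that axis, breaking ties by position
  ranks <- rank(coords, ties.method = "first")

  # Cut the ranking into v consecutive chunks of (near-)equal size
  as.integer(ceiling(ranks * v / n))
}

# Quick check on random points, independent of any spatial package
set.seed(1)
pts <- matrix(runif(200), ncol = 2)
clusters <- equal_size_clusters(stats::dist(pts), v = 10)
table(clusters)
```

You would then pass it along as spatial_clustering_cv(nc, v = 10, cluster_function = equal_size_clusters). When the number of rows isn't divisible by v (the prime number case), the fold sizes differ by one, so exactly equal folds still aren't guaranteed.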

@mikemahoney218
Member

I'm going to go ahead and close this out -- please feel free to open a new issue if we didn't wind up fixing the core problem here!
