
Have issue running spatial_clustering_cv as it requires Fortran #158

Closed
cthwe opened this issue Mar 6, 2024 · 5 comments
Labels
reprex needs a minimal reproducible example

Comments


cthwe commented Mar 6, 2024

I'm having a problem running spatial_clustering_cv on my dataset. The dataset contains an ID, latitude, and longitude in the EPSG:4326 CRS. I converted it to an sf object with this code:
cluster <- sf::st_as_sf(extracted, coords=c("lon_4326", "lat_4326"), crs = 4326)

However, I get this error when I run spatial_clustering_cv():
df_cluster <- spatial_clustering_cv(cluster, v=5, cluster_function="kmeans")

Error in do_one(nmeth) :
long vectors (argument 1) are not supported in .Fortran

Also, is there a way to extract list of ID for each cluster?


mikemahoney218 commented Mar 6, 2024

Hi @cthwe ! Is there any chance you'd be able to provide a reprex for this issue? Unfortunately, given the information you've provided here, I don't have a way to reproduce this bug and start figuring out what's going on here.

My only guess is that this is a bug in kmeans() itself when given a large data set -- my napkin math suggests that if you've got more than 46,000ish rows in your data set, R wouldn't be able to hand the distance matrix for your data over to Fortran. If I'm right and this is a decent sized data set, you might need to use a different cluster function or a different CV method.
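For anyone hitting this later, here is a rough check of that napkin math. It assumes the limiting factor is `.Fortran()`'s refusal of "long vectors", i.e. arguments with more than 2^31 - 1 elements:

```r
# Rough check of the napkin math above (assumption: the cap in play is
# .Fortran()'s limit of 2^31 - 1 elements per argument). A full n x n
# distance matrix crosses that cap once n exceeds floor(sqrt(2^31 - 1)):
n_max <- floor(sqrt(2^31 - 1))
n_max
#> [1] 46340
```

So data sets past roughly 46,000 rows would trip the limit, which matches the guess above.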

> Also, is there a way to extract list of ID for each cluster?

Not built-in, but check out the second code chunk here:
#157 (comment)
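For future readers, a generic sketch of the same idea using rsample's standard accessors. The `folds` and `id` names here are assumptions for illustration, not taken from the linked comment:

```r
library(spatialsample)
library(rsample)

# Assumed setup: 'folds' is the rset returned by spatial_clustering_cv(),
# and 'id' is the ID column in the original sf object.
ids_per_cluster <- lapply(folds$splits, function(split) {
  # assessment() returns the held-out rows for this resample --
  # with clustering CV, that's one spatial cluster per fold
  rsample::assessment(split)$id
})
names(ids_per_cluster) <- folds$id
```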


cthwe commented Mar 6, 2024

Hi @mikemahoney218, I reduced the dataset to 1,000 rows and the analysis ran without issue, so you were right that the sample size is the problem. My full dataset has about 64,000 rows. Is there a way to make the code work for a dataset this large? I really need to run it on the whole thing.

Thanks for the comment link. That seems to be what I was looking for.

@mikemahoney218

I'm having no luck reproducing this error on my computer, because even with 64GB of RAM the function winds up running out of memory. This function is going to be extremely resource-intensive for data this size. The bug you're hitting is in base R itself, so working around it here would require some creative workarounds. Are you able to use a different approach, like spatial_block_cv(), which shouldn't have the same issues?
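A minimal sketch of that alternative, reusing the `cluster` sf object from the first comment (the `v` value is just carried over from the original call):

```r
library(spatialsample)

set.seed(123)  # fold assignment is random, so set a seed for reproducibility
block_folds <- spatial_block_cv(cluster, v = 5)

# Unlike kmeans-based clustering CV, block CV never builds an n x n
# distance matrix, so it should scale well past the .Fortran() limit.
```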

@mikemahoney218 mikemahoney218 added the reprex needs a minimal reproducible example label Mar 15, 2024

cthwe commented Mar 18, 2024

spatial_block_cv should work for my analysis. Thanks for the help.

@mikemahoney218

Alright, glad to hear it. I'm going to close this issue because I can't currently reproduce it (without just crashing R) and the actual bug appears not to be in spatialsample -- but if anyone reading this in the future has further questions (or fixes), please feel free to open an issue or PR!
