I have just finished running a Kaggle challenge for my machine learning class, and I was surprised to find that the training data is sorted by the variable used for strata:
# Here is code to reproduce
library(tibble)   # tibble()
library(rsample)  # initial_split(), training(), testing()
set.seed(921)
d <- tibble(x = runif(100), y = sample(c("y", "n"), 100, replace = TRUE))
d_split <- initial_split(d, strata = y)
d_tr <- training(d_split)
d_tr$y
d_ts <- testing(d_split)
d_ts$y
and we find
> d_tr$y
[1] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[13] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[25] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[37] "n" "n" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[49] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[61] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[73] "y" "y"
although the test set is not (luckily for me!):
> d_ts$y
[1] "n" "n" "n" "n" "y" "y" "n" "y" "y" "y" "y" "y"
[13] "y" "y" "y" "n" "n" "n" "n" "n" "y" "y" "n" "y"
[25] "n" "n"
I'm not sure if this is intentional, but it would be better if the default returned the rows in random order, with sorting available as an optional input parameter.
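In the meantime, one possible workaround is to permute the rows of the training set after the split. Below is a minimal sketch in base R. To keep it self-contained it uses a stand-in data frame with the same class counts as the output above (38 "n", 36 "y") instead of the real `training(d_split)` result, and the name `d_tr_shuffled` is only illustrative.

```r
set.seed(921)

# Stand-in for training(d_split): 74 rows, sorted by y as in the output above
d_tr <- data.frame(
  x = runif(74),
  y = rep(c("n", "y"), times = c(38, 36))
)

# sample(n) returns a random permutation of 1..n; indexing by it shuffles the rows
d_tr_shuffled <- d_tr[sample(nrow(d_tr)), ]

# Class counts are unchanged; only the row order differs
table(d_tr_shuffled$y)
```

The same indexing line should work unchanged on the tibble returned by `training(d_split)`.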