Docs: Side-by-side data.table vs. base vs. dplyr #6620
Comments
I like this idea. Something like this existed somewhere on the internets but no idea who made it or if they kept it up. Having it tied to the package would be useful I think. If we show output from each panel, I think this would go a long way. |
do I recall correctly that was @grantmcdermott ? |
Kind of. I believe you're thinking of https://stata2r.github.io/ |
This might be what you were looking for. https://atrebas.github.io/post/2019-03-03-datatable-dplyr/ |
Grant and friends have a Stata <-> R page: https://stata2r.github.io/data_table/ Atrebas has a data.table <-> dplyr page (linked above). I emailed Atrebas to ask if they were OK with me ripping off the idea (they were fine with it). Then, my student and I added equivalent base R commands. Showing output might make the page unreadable (because too long), but we can definitely consider it. |
Snap, @vincentarelbundock ;-) |
Sounds good to me! And yes, it was the atrebas page I was thinking of |
As far as output, seems like having some representation of it could be useful. Either that or good descriptions/comments around the code. |
quick comments on the first few sections of the notebook:
|
Thanks everyone for the feedback. What I take from this is that there seems to be interest for a more polished version of this. I will complete the page, polish it up, integrate @MichaelChirico's feedback on the early parts, and come back with a more specific proposal. Thanks! |
Minor comments:
|
Another source of comparisons could be the |
all this work sounds great and please consider applying for a travel award to talk about it at a conf https://rdatatable-community.github.io/The-Raft/posts/2023-11-01-travel_grant_announcement-community_team/ |
Also I would recommend changing gather/spread to pivot_longer/pivot_wider, which are more recent and feature-ful.
melt into single value column
> tidyr::pivot_longer(iris, c(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width), names_to=c("part","dim"), names_sep="[.]")
# A tibble: 600 × 4
Species part dim value
<fct> <chr> <chr> <dbl>
1 setosa Petal Length 1.4
2 setosa Sepal Length 5.1
3 setosa Petal Width 0.2
4 setosa Sepal Width 3.5
5 setosa Petal Length 1.4
6 setosa Sepal Length 4.9
7 setosa Petal Width 0.2
8 setosa Sepal Width 3
9 setosa Petal Length 1.3
10 setosa Sepal Length 4.7
# ℹ 590 more rows
# ℹ Use `print(n = ...)` to see more rows
> melt(data.table(iris), measure.vars=measure(part, dim, sep="."))
Species part dim value
<fctr> <char> <char> <num>
1: setosa Sepal Length 5.1
2: setosa Sepal Length 4.9
3: setosa Sepal Length 4.7
4: setosa Sepal Length 4.6
5: setosa Sepal Length 5.0
---
596: virginica Petal Width 2.3
597: virginica Petal Width 1.9
598: virginica Petal Width 2.0
599: virginica Petal Width 2.3
600: virginica Petal Width 1.8
melt into multiple value columns
> tidyr::pivot_longer(iris, c(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width), names_to=c(".value","dim"), names_sep="[.]")
# A tibble: 300 × 4
Species dim Petal Sepal
<fct> <chr> <dbl> <dbl>
1 setosa Length 1.4 5.1
2 setosa Width 0.2 3.5
3 setosa Length 1.4 4.9
4 setosa Width 0.2 3
5 setosa Length 1.3 4.7
6 setosa Width 0.2 3.2
7 setosa Length 1.5 4.6
8 setosa Width 0.2 3.1
9 setosa Length 1.4 5
10 setosa Width 0.2 3.6
# ℹ 290 more rows
# ℹ Use `print(n = ...)` to see more rows
> melt(data.table(iris), measure.vars=measure(value.name, dim, sep="."))
Species dim Sepal Petal
<fctr> <char> <num> <num>
1: setosa Length 5.1 1.4
2: setosa Length 4.9 1.4
3: setosa Length 4.7 1.3
4: setosa Length 4.6 1.5
5: setosa Length 5.0 1.4
---
296: virginica Width 3.0 2.3
297: virginica Width 2.5 1.9
298: virginica Width 3.0 2.0
299: virginica Width 3.4 2.3
300: virginica Width 3.0 1.8
dcast one aggregation function
> iris_long=melt(data.table(iris), measure.vars=measure(value.name, dim, sep="."))
> dcast(iris_long, Species ~ dim, mean, value.var=c("Sepal","Petal"))
Key: <Species>
Species Sepal_Length Sepal_Width Petal_Length Petal_Width
<fctr> <num> <num> <num> <num>
1: setosa 5.006 3.428 1.462 0.246
2: versicolor 5.936 2.770 4.260 1.326
3: virginica 6.588 2.974 5.552 2.026
> tidyr::pivot_wider(iris_long, id_cols=Species, names_from=dim, values_from=c(Sepal,Petal), values_fn=mean)
# A tibble: 3 × 5
Species Sepal_Length Sepal_Width Petal_Length Petal_Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246
2 versicolor 5.94 2.77 4.26 1.33
3 virginica 6.59 2.97 5.55 2.03
dcast list of aggregation functions
> dcast(iris_long, Species ~ dim, list(mean, sd), value.var=c("Sepal","Petal"))
Key: <Species>
Species Sepal_mean_Length Sepal_mean_Width Petal_mean_Length
<fctr> <num> <num> <num>
1: setosa 5.006 3.428 1.462
2: versicolor 5.936 2.770 4.260
3: virginica 6.588 2.974 5.552
Petal_mean_Width Sepal_sd_Length Sepal_sd_Width Petal_sd_Length
<num> <num> <num> <num>
1: 0.246 0.3524897 0.3790644 0.1736640
2: 1.326 0.5161711 0.3137983 0.4699110
3: 2.026 0.6358796 0.3224966 0.5518947
Petal_sd_Width
<num>
1: 0.1053856
2: 0.1977527
3: 0.2746501
> tidyr::pivot_wider(iris_long, id_cols=Species, names_from=dim, values_from=c(Sepal,Petal), values_fn=list(mean,sd))
Error in `tidyr::pivot_wider()`:
! All elements of `values_fn` must be named.
Run `rlang::last_trace()` to see where the error occurred. |
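A note on the error above: as far as I understand the tidyr documentation, `values_fn` takes either a single function or a named list with one function per `values_from` column, so it does not express "several aggregations per column" the way `dcast` does. One possible workaround, offered as a minimal sketch and not as the recommended idiom for the proposed page, is to aggregate with dplyr first and pivot afterwards. The object names `iris_long` and `wide` are only illustrative.

```r
library(data.table)
library(dplyr)
library(tidyr)

# Same long table as in the dcast examples above, converted to a plain
# data.frame so that the dplyr verbs operate on a standard input.
iris_long <- as.data.frame(
  melt(data.table(iris), measure.vars = measure(value.name, dim, sep = "."))
)

# Step 1: compute mean and sd of Sepal and Petal per Species and dim.
# Step 2: spread dim into the column names.
wide <- iris_long %>%
  group_by(Species, dim) %>%
  summarise(
    across(c(Sepal, Petal), list(mean = mean, sd = sd)),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = dim,
    values_from = c(Sepal_mean, Sepal_sd, Petal_mean, Petal_sd)
  )

print(wide)
```

This should give one row per species with columns named like `Sepal_mean_Length`, comparable to the `dcast` result, at the cost of an explicit aggregation step.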
I thought of the atrebas post too, but also of this one: https://www.infoworld.com/article/2260179/the-ultimate-r-datatable-cheat-sheet.html
Regarding the last @grantmcdermott comment:
Agree +1
Agree, but just for
Disagree (actually, I prefer surrounded with backticks, but I |
Maybe also add examples from collapse? https://sebkrantz.github.io/collapse/ |
Many data analysts who wish to learn data.table are already familiar with base R or dplyr. For those people, an easy route can be to read a bunch of side-by-side examples in different idioms.
I propose to add a new page to the data.table documentation with many tables with side-by-side comparisons. My student and I have been working on such a page. If there is interest (and feedback) from the community, we can finish this page and make a Pull Request.
The notebook below is very rough and preliminary, but it gives you a flavor of what this could be.
Thoughts? Reactions? Suggestions?
https://arelbundock.com/dt_df_tb.html
Tagging @TysonStanley and @kbodwin but interested in everyone's feedback!
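To make the idea concrete, here is a minimal sketch of what a single side-by-side entry might look like: the same grouped mean written in base R, data.table, and dplyr. It is an illustrative example only and is not taken from the linked notebook; the result names `base_res`, `dt_res`, and `tb_res` are hypothetical.

```r
library(data.table)
library(dplyr)

# base R: formula interface of aggregate()
base_res <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# data.table: compute in j, group with by
dt_res <- as.data.table(iris)[, .(Sepal.Length = mean(Sepal.Length)), by = Species]

# dplyr: group_by() followed by summarise()
tb_res <- iris %>%
  group_by(Species) %>%
  summarise(Sepal.Length = mean(Sepal.Length))

print(base_res)
print(dt_res)
print(tb_res)
```

All three should return one row per species with the same means; only the return class differs (data.frame, data.table, tibble).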