Docs: Side-by-side data.table vs. base vs dplyr #6620

Open
vincentarelbundock opened this issue Nov 17, 2024 · 16 comments

Comments

@vincentarelbundock

vincentarelbundock commented Nov 17, 2024

Many data analysts who wish to learn data.table are already familiar with base R or dplyr. For those people, an easy route can be to read a bunch of side-by-side examples in different idioms.

I propose to add a new page to the data.table documentation with many tables with side-by-side comparisons. My student and I have been working on such a page. If there is interest (and feedback) from the community, we can finish this page and make a Pull Request.

The notebook below is very rough and preliminary, but it gives you a flavor of what this could be.

Thoughts? Reactions? Suggestions?

https://arelbundock.com/dt_df_tb.html

Tagging @TysonStanley and @kbodwin but interested in everyone's feedback!

@TysonStanley
Member

I like this idea. Something like this existed somewhere on the internets but no idea who made it or if they kept it up. Having it tied to the package would be useful I think. If we show output from each panel, I think this would go a long way.

@MichaelChirico
Member

do I recall correctly that was @grantmcdermott ?

@grantmcdermott
Contributor

do I recall correctly that was @grantmcdermott ?

Kind of. I believe you're thinking of https://stata2r.github.io/

@grantmcdermott
Contributor

This might be what you were looking for. https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

@vincentarelbundock
Author

vincentarelbundock commented Nov 17, 2024

Grant and friends have a Stata <-> R guide: https://stata2r.github.io/data_table/

Atrebas has a data.table <-> dplyr here: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

I emailed Atrebas to ask whether they were OK with me ripping off the idea (they were fine with it). Then, my student and I added equivalent base R commands.

Showing output might make the page unreadable (because too long), but we can definitely consider it.

@grantmcdermott
Contributor

Snap, @vincentarelbundock ;-)

@TysonStanley
Member

Sounds good to me! And yes, it was the atrebas page I was thinking of

@TysonStanley
Member

As far as output, seems like having some representation of it could be useful. Either that or good descriptions/comments around the code.

@MichaelChirico
Member

quick comments on the first few sections of the notebook:

  • I suspect "canonical" {dplyr} will suggest slice() even for simple row filters, e.g. TB |> slice(3:5). That easily applies in the middle of any pipeline (vs. |> _[3:5, ]) and translates to any backend (esp. database)
  • I would use pipe everywhere for {dplyr}, it's pretty unusual to see VERB(DF, ...) instead of DF |> VERB(...).
  • In {base}, I strongly suggest using subset() to avoid all the $. It also matches {dplyr} & {data.table} behavior around "exclude NA logical results by default", e.g. DF |> subset(V1 == 1 & V4 == "A")
  • In general we should strive for the comparisons to be as "correct" as possible. The point about whether NA is included in "filter" operations is one example; a really subtle one is complete.cases(), which does not respect is.na() methods (see r-lib/bit64#122, "complete.cases fails in the presence of NA_integer64_"). In some cases, footnotes will be the best way to avoid clutter.
  • Wrong section? (what's random here) DT[frankv(-V1, ties.method = "dense") < 2]
  • namespace-qualifying here is mildly inconsistent: dplyr::between
  • typo: DFV$V4
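
To make the NA point in the list above concrete, here is a minimal sketch. The toy table and its values are made up for illustration (only the column names V1 and V4 follow the notebook), and it assumes R >= 4.1 for the native pipe and an installed data.table.

```r
library(data.table)

# Hypothetical toy data; only the column names V1 and V4 follow the notebook
DF = data.frame(V1 = c(1, 1, NA, 2), V4 = c("A", "B", "A", "A"))
DT = as.data.table(DF)

# Bracket indexing with $: an NA in the logical index yields an all-NA row
res_bracket = DF[DF$V1 == 1 & DF$V4 == "A", ]

# subset(): rows where the condition evaluates to NA are dropped
res_subset = DF |> subset(V1 == 1 & V4 == "A")

# data.table: an NA in a logical i is likewise treated as FALSE
res_dt = DT[V1 == 1 & V4 == "A"]

print(nrow(res_bracket))
print(nrow(res_subset))
print(nrow(res_dt))
```

The bracket version returns one more row than the other two because the NA in V1 propagates into the logical index, which is the kind of subtle difference a footnote in the comparison page could flag.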

@vincentarelbundock
Author

Thanks everyone for the feedback. What I take from this is that there seems to be interest in a more polished version of this.

I will complete the page, polish it up, integrate @MichaelChirico's feedback on the early parts, and come back with a more specific proposal.

Thanks!

@grantmcdermott
Contributor

Minor comments:

  • I'd recommend |> instead of %>% for dplyr code, in accordance with the updated style guide that the tidyverse team is using. Also = instead of <- since that's the canonical data.table documentation style (as well as just being the better assignment operator amirite, @vincentarelbundock ? ;-) )
  • Maybe use let() instead of ':='() for multiple assignment?

Personally, I'm a big fan of these "coming from x to y" comparison guides. I'm certainly happy for folks to take (a subset of) our Stata2R examples and include them here if that's useful. Another precedent in this space is https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-R-package-data.table
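
For readers unfamiliar with the two spellings in the let() suggestion above, a minimal sketch follows. The table and the new column names V3 and V4 are hypothetical, and it assumes a data.table version recent enough to provide let() as an alias for the functional form of := (1.15.0 or later, to the best of my knowledge).

```r
library(data.table)

# Hypothetical toy table; copy() so each form modifies its own object
DT1 = data.table(V1 = 1:3, V2 = c(10, 20, 30))
DT2 = copy(DT1)

# Functional form of := for adding several columns by reference
DT1[, `:=`(V3 = V1 * 2L, V4 = V2 / 10)]

# The same assignment written with let()
DT2[, let(V3 = V1 * 2L, V4 = V2 / 10)]

# Both forms should produce the same table
print(all.equal(DT1, DT2))
```

Which spelling the comparison page should standardize on is a style decision for the maintainers; the sketch only shows that the two are interchangeable for this case.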

@rikivillalba
Contributor

Another source of comparisons could be the dtplyr translation vignette

@tdhock
Member

tdhock commented Nov 19, 2024

all this work sounds great and please consider applying for a travel award to talk about it at a conf https://rdatatable-community.github.io/The-Raft/posts/2023-11-01-travel_grant_announcement-community_team/

@tdhock
Member

tdhock commented Nov 19, 2024

Also I would recommend changing gather/spread to pivot_longer/pivot_wider which are more recent and feature-ful.

melt into single value column

> tidyr::pivot_longer(iris, c(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width), names_to=c("part","dim"), names_sep="[.]")
# A tibble: 600 × 4
   Species part  dim    value
   <fct>   <chr> <chr>  <dbl>
 1 setosa  Petal Length   1.4
 2 setosa  Sepal Length   5.1
 3 setosa  Petal Width    0.2
 4 setosa  Sepal Width    3.5
 5 setosa  Petal Length   1.4
 6 setosa  Sepal Length   4.9
 7 setosa  Petal Width    0.2
 8 setosa  Sepal Width    3  
 9 setosa  Petal Length   1.3
10 setosa  Sepal Length   4.7
# ℹ 590 more rows
# ℹ Use `print(n = ...)` to see more rows
> melt(data.table(iris), measure.vars=measure(part, dim, sep="."))
       Species   part    dim value
        <fctr> <char> <char> <num>
  1:    setosa  Sepal Length   5.1
  2:    setosa  Sepal Length   4.9
  3:    setosa  Sepal Length   4.7
  4:    setosa  Sepal Length   4.6
  5:    setosa  Sepal Length   5.0
 ---                              
596: virginica  Petal  Width   2.3
597: virginica  Petal  Width   1.9
598: virginica  Petal  Width   2.0
599: virginica  Petal  Width   2.3
600: virginica  Petal  Width   1.8

melt into multiple value columns

> tidyr::pivot_longer(iris, c(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width), names_to=c(".value","dim"), names_sep="[.]")
# A tibble: 300 × 4
   Species dim    Petal Sepal
   <fct>   <chr>  <dbl> <dbl>
 1 setosa  Length   1.4   5.1
 2 setosa  Width    0.2   3.5
 3 setosa  Length   1.4   4.9
 4 setosa  Width    0.2   3  
 5 setosa  Length   1.3   4.7
 6 setosa  Width    0.2   3.2
 7 setosa  Length   1.5   4.6
 8 setosa  Width    0.2   3.1
 9 setosa  Length   1.4   5  
10 setosa  Width    0.2   3.6
# ℹ 290 more rows
# ℹ Use `print(n = ...)` to see more rows
> melt(data.table(iris), measure.vars=measure(value.name, dim, sep="."))
       Species    dim Sepal Petal
        <fctr> <char> <num> <num>
  1:    setosa Length   5.1   1.4
  2:    setosa Length   4.9   1.4
  3:    setosa Length   4.7   1.3
  4:    setosa Length   4.6   1.5
  5:    setosa Length   5.0   1.4
 ---                             
296: virginica  Width   3.0   2.3
297: virginica  Width   2.5   1.9
298: virginica  Width   3.0   2.0
299: virginica  Width   3.4   2.3
300: virginica  Width   3.0   1.8

dcast one aggregation function

> iris_long=melt(data.table(iris), measure.vars=measure(value.name, dim, sep="."))
> dcast(iris_long, Species ~ dim, mean, value.var=c("Sepal","Petal"))
Key: <Species>
      Species Sepal_Length Sepal_Width Petal_Length Petal_Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa        5.006       3.428        1.462       0.246
2: versicolor        5.936       2.770        4.260       1.326
3:  virginica        6.588       2.974        5.552       2.026
> tidyr::pivot_wider(iris_long, id_cols=Species, names_from=dim, values_from=c(Sepal,Petal), values_fn=mean)
# A tibble: 3 × 5
  Species    Sepal_Length Sepal_Width Petal_Length Petal_Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03 

dcast list of aggregation functions

> dcast(iris_long, Species ~ dim, list(mean, sd), value.var=c("Sepal","Petal"))
Key: <Species>
      Species Sepal_mean_Length Sepal_mean_Width Petal_mean_Length
       <fctr>             <num>            <num>             <num>
1:     setosa             5.006            3.428             1.462
2: versicolor             5.936            2.770             4.260
3:  virginica             6.588            2.974             5.552
   Petal_mean_Width Sepal_sd_Length Sepal_sd_Width Petal_sd_Length
              <num>           <num>          <num>           <num>
1:            0.246       0.3524897      0.3790644       0.1736640
2:            1.326       0.5161711      0.3137983       0.4699110
3:            2.026       0.6358796      0.3224966       0.5518947
   Petal_sd_Width
            <num>
1:      0.1053856
2:      0.1977527
3:      0.2746501
> tidyr::pivot_wider(iris_long, id_cols=Species, names_from=dim, values_from=c(Sepal,Petal), values_fn=list(mean,sd))
Error in `tidyr::pivot_wider()`:
! All elements of `values_fn` must be named.
Run `rlang::last_trace()` to see where the error occurred.

@iagogv3
Contributor

iagogv3 commented Nov 20, 2024

I thought of the atrebas post too, but also of this one: https://www.infoworld.com/article/2260179/the-ultimate-r-datatable-cheat-sheet.html

Regarding @grantmcdermott's last comment:

I'd recommend |> instead of %>% for dplyr code, in accordance with the updated style guide that the tidyverse team is using

Agree +1

Also = instead of <- since that's the canonical data.table documentation style

Agree, but just for data.table columns (it is not the assignment style recommended for base R)

Maybe use let() instead of ':=' () for multiple assignment?

Disagree (actually, I prefer it surrounded with backticks, but ':=' still seems more natural than let). -1

@ontam

ontam commented Dec 8, 2024

Many data analysts who wish to learn data.table are already familiar with base R or dplyr. For those people, an easy route can be to read a bunch of side-by-side examples in different idioms.

I propose to add a new page to the data.table documentation with many tables with side-by-side comparisons. My student and I have been working on such a page. If there is interest (and feedback) from the community, we can finish this page and make a Pull Request.

The notebook below is very rough and preliminary, but it gives you a flavor of what this could be.

Thoughts? Reactions? Suggestions?

https://arelbundock.com/dt_df_tb.html

Tagging @TysonStanley and @kbodwin but interested in everyone's feedback!

Maybe also add examples from collapse? https://sebkrantz.github.io/collapse/
