Docs: Side-by-side `data.table` vs. `base` vs `dplyr` #6620

vincentarelbundock · 2024-11-17T20:37:05Z

Many data analysts who wish to learn data.table are already familiar with base R or dplyr. For those people, an easy route can be to read a bunch of side-by-side examples in different idioms.

I propose to add a new page to the data.table documentation with many tables with side-by-side comparisons. My student and I have been working on such a page. If there is interest (and feedback) from the community, we can finish this page and make a Pull Request.

The notebook below is very rough and preliminary, but it gives you a flavor of what this could be.

Thoughts? Reactions? Suggestions?

https://arelbundock.com/dt_df_tb.html

Tagging @TysonStanley and @kbodwin but interested in everyone's feedback!

TysonStanley · 2024-11-17T20:49:45Z

I like this idea. Something like this existed somewhere on the internets but no idea who made it or if they kept it up. Having it tied to the package would be useful I think. If we show output from each panel, I think this would go a long way.

MichaelChirico · 2024-11-17T20:54:49Z

do I recall correctly that was @grantmcdermott ?

grantmcdermott · 2024-11-17T20:57:50Z

> do I recall correctly that was @grantmcdermott ?

Kind of. I believe you're thinking of https://stata2r.github.io/

grantmcdermott · 2024-11-17T20:58:48Z

This might be what you were looking for. https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

vincentarelbundock · 2024-11-17T20:59:10Z

Grant and friends have a Stata <-> R guide: https://stata2r.github.io/data_table/

Atrebas has a data.table <-> dplyr here: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

I emailed Atrebas to ask if they were OK with me ripping off the idea (they were fine with it). Then, my student and I added equivalent base R commands.

Showing output might make the page unreadable (because too long), but we can definitely consider it.

grantmcdermott · 2024-11-17T20:59:43Z

Snap, @vincentarelbundock ;-)

TysonStanley · 2024-11-17T21:00:12Z

Sounds good to me! And yes, it was the atrebas page I was thinking of

TysonStanley · 2024-11-17T21:00:54Z

As far as output, seems like having some representation of it could be useful. Either that or good descriptions/comments around the code.

MichaelChirico · 2024-11-18T16:39:53Z

Quick comments on the first few sections of the notebook:

- I suspect "canonical" {dplyr} will suggest `slice()` even for simple row filters, e.g. `TB |> slice(3:5)`. That easily applies in the middle of any pipeline (vs. `|> _[3:5, ]`) and translates to any backend (esp. database).
- I would use the pipe everywhere for {dplyr}; it's pretty unusual to see `VERB(DF, ...)` instead of `DF |> VERB(...)`.
- In {base}, I strongly suggest using `subset()` to avoid all the `$`. It also matches {dplyr} & {data.table} behavior around "exclude NA logical results by default", e.g. `DF |> subset(V1 == 1 & V4 == "A")`.
- In general we should strive for the comparisons to be as "correct" as possible. Whether NA is included in "filter" operations is one such point; a really subtle one is `complete.cases()`, which does not respect `is.na()` methods (see r-lib/bit64#122, "complete.cases fails in the presence of NA_integer64_"). In some cases just using footnotes will be the best way to avoid clutter.
- Wrong section? (What's random here?) `DT[frankv(-V1, ties.method = "dense") < 2]`
- Namespace-qualifying here is mildly inconsistent: `dplyr::between`
- Typo: `DFV$V4`
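
To make the NA point above concrete, here is a minimal base R sketch. The data frame `DF` and its values are made up for illustration (only the column names `V1` and `V4` come from the comment), and the native pipe assumes R >= 4.1. It contrasts `[` indexing with `subset()` when the filter condition evaluates to NA.

```r
# Hypothetical toy data; only the column names V1 and V4 come from the thread.
DF <- data.frame(
  V1 = c(1, NA, 2, 1),
  V4 = c("A", "A", "B", NA)
)

# Logical indexing with `[` returns an all-NA row for every NA in the condition.
by_bracket <- DF[DF$V1 == 1 & DF$V4 == "A", ]

# subset() treats NA in the condition as FALSE and drops those rows.
by_subset <- DF |> subset(V1 == 1 & V4 == "A")

nrow(by_bracket)
nrow(by_subset)
```

The two calls return different row counts on this data, which is the kind of subtle discrepancy a side-by-side page would need to flag.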

vincentarelbundock · 2024-11-18T18:48:53Z

Thanks everyone for the feedback. What I take from this is that there seems to be interest for a more polished version of this.

I will complete the page, polish it up, integrate @MichaelChirico's feedback on the early parts, and come back with a more specific proposal.

Thanks!

grantmcdermott · 2024-11-18T19:17:20Z

Minor comments:

- I'd recommend `|>` instead of `%>%` for dplyr code, in accordance with the updated style guide that the tidyverse team is using. Also `=` instead of `<-` since that's the canonical data.table documentation style (as well as just being the better assignment operator amirite, @vincentarelbundock ? ;-) )
- Maybe use `let()` instead of `` `:=`() `` for multiple assignment?

Personally, I'm a big fan of these "coming from x to y" comparison guides. I'm certainly happy for folks to take (a subset of) our Stata2R examples and include them here if that's useful. Another precedent in this space is https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-R-package-data.table
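
For readers unfamiliar with the two spellings of multiple assignment mentioned above, here is a minimal sketch. The table `DT` and its columns are hypothetical, and the example assumes a data.table release recent enough to export `let()`.

```r
library(data.table)

# Hypothetical toy table for illustration.
DT = data.table(V1 = 1:3)

# Functional form of `:=`, adding several columns by reference.
DT[, `:=`(V2 = V1 * 2, V3 = V1 + 1L)]

# let() as an alias for the same operation (assumes a recent data.table).
DT[, let(V4 = V1 * 2, V5 = V1 + 1L)]

# Both spellings should produce identical columns.
identical(DT$V2, DT$V4)
identical(DT$V3, DT$V5)
```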

rikivillalba · 2024-11-18T19:32:13Z

Another source of comparisons could be the dtplyr translation vignette.

tdhock · 2024-11-19T18:37:07Z

all this work sounds great and please consider applying for a travel award to talk about it at a conf https://rdatatable-community.github.io/The-Raft/posts/2023-11-01-travel_grant_announcement-community_team/

tdhock · 2024-11-19T19:15:45Z

Also I would recommend changing gather/spread to pivot_longer/pivot_wider which are more recent and feature-ful.

melt into single value column

> tidyr::pivot_longer(iris, c(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width), names_to=c("part","dim"), names_sep="[.]")
# A tibble: 600 × 4
   Species part  dim    value
   <fct>   <chr> <chr>  <dbl>
 1 setosa  Petal Length   1.4
 2 setosa  Sepal Length   5.1
 3 setosa  Petal Width    0.2
 4 setosa  Sepal Width    3.5
 5 setosa  Petal Length   1.4
 6 setosa  Sepal Length   4.9
 7 setosa  Petal Width    0.2
 8 setosa  Sepal Width    3  
 9 setosa  Petal Length   1.3
10 setosa  Sepal Length   4.7
# ℹ 590 more rows
# ℹ Use `print(n = ...)` to see more rows
> melt(data.table(iris), measure.vars=measure(part, dim, sep="."))
       Species   part    dim value
        <fctr> <char> <char> <num>
  1:    setosa  Sepal Length   5.1
  2:    setosa  Sepal Length   4.9
  3:    setosa  Sepal Length   4.7
  4:    setosa  Sepal Length   4.6
  5:    setosa  Sepal Length   5.0
 ---                              
596: virginica  Petal  Width   2.3
597: virginica  Petal  Width   1.9
598: virginica  Petal  Width   2.0
599: virginica  Petal  Width   2.3
600: virginica  Petal  Width   1.8

melt into multiple value columns

> tidyr::pivot_longer(iris, c(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width), names_to=c(".value","dim"), names_sep="[.]")
# A tibble: 300 × 4
   Species dim    Petal Sepal
   <fct>   <chr>  <dbl> <dbl>
 1 setosa  Length   1.4   5.1
 2 setosa  Width    0.2   3.5
 3 setosa  Length   1.4   4.9
 4 setosa  Width    0.2   3  
 5 setosa  Length   1.3   4.7
 6 setosa  Width    0.2   3.2
 7 setosa  Length   1.5   4.6
 8 setosa  Width    0.2   3.1
 9 setosa  Length   1.4   5  
10 setosa  Width    0.2   3.6
# ℹ 290 more rows
# ℹ Use `print(n = ...)` to see more rows
> melt(data.table(iris), measure.vars=measure(value.name, dim, sep="."))
       Species    dim Sepal Petal
        <fctr> <char> <num> <num>
  1:    setosa Length   5.1   1.4
  2:    setosa Length   4.9   1.4
  3:    setosa Length   4.7   1.3
  4:    setosa Length   4.6   1.5
  5:    setosa Length   5.0   1.4
 ---                             
296: virginica  Width   3.0   2.3
297: virginica  Width   2.5   1.9
298: virginica  Width   3.0   2.0
299: virginica  Width   3.4   2.3
300: virginica  Width   3.0   1.8

dcast one aggregation function

> iris_long=melt(data.table(iris), measure.vars=measure(value.name, dim, sep="."))
> dcast(iris_long, Species ~ dim, mean, value.var=c("Sepal","Petal"))
Key: <Species>
      Species Sepal_Length Sepal_Width Petal_Length Petal_Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa        5.006       3.428        1.462       0.246
2: versicolor        5.936       2.770        4.260       1.326
3:  virginica        6.588       2.974        5.552       2.026
> tidyr::pivot_wider(iris_long, id_cols=Species, names_from=dim, values_from=c(Sepal,Petal), values_fn=mean)
# A tibble: 3 × 5
  Species    Sepal_Length Sepal_Width Petal_Length Petal_Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03

dcast list of aggregation functions

> dcast(iris_long, Species ~ dim, list(mean, sd), value.var=c("Sepal","Petal"))
Key: <Species>
      Species Sepal_mean_Length Sepal_mean_Width Petal_mean_Length
       <fctr>             <num>            <num>             <num>
1:     setosa             5.006            3.428             1.462
2: versicolor             5.936            2.770             4.260
3:  virginica             6.588            2.974             5.552
   Petal_mean_Width Sepal_sd_Length Sepal_sd_Width Petal_sd_Length
              <num>           <num>          <num>           <num>
1:            0.246       0.3524897      0.3790644       0.1736640
2:            1.326       0.5161711      0.3137983       0.4699110
3:            2.026       0.6358796      0.3224966       0.5518947
   Petal_sd_Width
            <num>
1:      0.1053856
2:      0.1977527
3:      0.2746501
> tidyr::pivot_wider(iris_long, id_cols=Species, names_from=dim, values_from=c(Sepal,Petal), values_fn=list(mean,sd))
Error in `tidyr::pivot_wider()`:
! All elements of `values_fn` must be named.
Run `rlang::last_trace()` to see where the error occurred.
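
The error above reflects that `values_fn` in `pivot_wider()` takes either a single function or a list named by `values_from` column, not a list of several summaries. One possible tidyverse route to a comparable result, sketched here as an assumption rather than the canonical equivalent, is to summarise first and pivot afterwards. The object name `iris_stats` is made up for this example; it assumes reasonably recent dplyr and tidyr releases.

```r
library(data.table)
library(dplyr)
library(tidyr)

# Same long table as in the dcast examples above.
iris_long = melt(data.table(iris), measure.vars = measure(value.name, dim, sep = "."))

# Compute mean and sd per Species and dim, then spread dim into columns.
iris_stats = iris_long |>
  as_tibble() |>
  group_by(Species, dim) |>
  summarise(
    across(c(Sepal, Petal), list(mean = mean, sd = sd)),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = dim,
    values_from = c(Sepal_mean, Sepal_sd, Petal_mean, Petal_sd)
  )

names(iris_stats)
```

This takes two steps where `dcast()` takes one, which is arguably the point of the comparison.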

iagogv3 · 2024-11-20T16:42:25Z

I thought of the atrebas post too, but also of this one: https://www.infoworld.com/article/2260179/the-ultimate-r-datatable-cheat-sheet.html

Regarding @grantmcdermott's last comment:

> I'd recommend |> instead of %>% for dplyr code, in accordance with the updated style guide that the tidyverse team is using

Agree +1

> Also = instead of <- since that's the canonical data.table documentation style

Agree, but only for data.table columns (`=` is not the recommended assignment operator in base R style)

> Maybe use let() instead of `:=`() for multiple assignment?

Disagree (actually, I prefer `:=` surrounded with backticks, but even ':=' seems more natural than let) -1

ontam · 2024-12-08T19:18:14Z

> Many data analysts who wish to learn data.table are already familiar with base R or dplyr. For those people, an easy route can be to read a bunch of side-by-side examples in different idioms.
>
> I propose to add a new page to the data.table documentation with many tables with side-by-side comparisons. My student and I have been working on such a page. If there is interest (and feedback) from the community, we can finish this page and make a Pull Request.
>
> The notebook below is very rough and preliminary, but it gives you a flavor of what this could be.
>
> Thoughts? Reactions? Suggestions?
>
> https://arelbundock.com/dt_df_tb.html
>
> Tagging @TysonStanley and @kbodwin but interested in everyone's feedback!

Maybe also add examples from collapse? https://sebkrantz.github.io/collapse/
