Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc fix? Clarify behavior of unequal joins/on binary operators #6623

Open
katrinabrock opened this issue Nov 21, 2024 · 4 comments
Open

Doc fix? Clarify behavior of unequal joins/on binary operators #6623

katrinabrock opened this issue Nov 21, 2024 · 4 comments

Comments

@katrinabrock
Copy link

Summary

I think there may be a documentation gap specifically with on and binary operators. (Or maybe there's a vignette somewhere that I'm missing?)

Details

Specifically, it wasn't clear to me from the documentation:

  1. Only column names are acceptable on either side of the binary operator, not arbitrary expressions.
  2. The left side of the binary operator must be a column from the outer datatable, the right side must come from i.
  3. This will result in the two columns being combined into one with the name from the left side and the values from the right side. (This is shown in the examples on the help page and SO, but I didn't see it explained in words in any official place.)

All of these diverge from how SQL ON works so IMO worth pointing out to the user.

What Currently Exists

In attempt to understand the proper usage of on, I read the on section of the data table help and skimmed the linked Secondary indices and auto indexing looking for more explanation of non-equi joins.

After I resolved the problem, I did see non-equal joins are covered here: https://rdatatable.gitlab.io/data.table/articles/datatable-joins.html#non-equi-join However, the example doesn't make the three points above terribly clear as it has columns with the same name on either side of the join. Although, they are certainly shown implicitly.

Example

Here is a slightly simplified version of my use-case. Trying to join on y and foo such that foo - 2 < y < foo.

DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9, cj=1)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2), cj=1)

# Goal - this works, but want to avoid cross join
DT[X, on = 'cj', allow.cartesian = TRUE][foo >= y & foo - 2 <= y]

# Attempt - this completely failed, unclear why from docs and error
DT[X, on = .(foo >= y, foo - 2 <= y)]

# Success
DT[
  # Add temp columns for boundaries
  ,c(.SD, .(y.min = y, y.max = y + 2))
][
  # Actual join
  X, on = .(y.min <= foo, y.max >= foo)
][,
  # Add back foo
  c(.SD, .(foo = y.min)),
][
  # Remove temp columns
  ,-c('y.max', 'y.min', 'i.cj'), with = FALSE
]

Suggested Remedy

  1. Expand the Non-equi joins section of the joins vignette to include an example with different column names.
  2. Expand the Non-equi joins section of the joins vignette to explicitly clarify the three points above. (Maybe include a link to the SO question.)
  3. Link from the data.table( help page bullet point under on that describes non-eq joins to the Non-equi Joins section of the Joins vignette
@katrinabrock katrinabrock changed the title Doc fix? Clarify behavior of unequal joins/binary operators Doc fix? Clarify behavior of unequal joins/on binary operators Nov 21, 2024
@tdhock
Copy link
Member

tdhock commented Nov 28, 2024

possibly related to non-equi join naming #6298 (comment)

@katrinabrock
Copy link
Author

Is the behavior I'm describing expected? (It seems so from the existing docs, but seeing it possibly related to bugs, maybe not.) If so, would you accept a PR for the doc fixes I've described?

@tdhock
Copy link
Member

tdhock commented Dec 10, 2024

I currently see these docs on ?data.table

      on: Indicate which columns in 'x' should be joined with which
          columns in 'i' along with the type of binary operator to join
          with (see non-equi joins below on this). When specified, this
          overrides the keys set on 'x' and 'i'. When '.NATURAL'
          keyword provided then _natural join_ is made (join on common
          columns). There are multiple ways of specifying the 'on'
          argument:

            * As an unnamed character vector, e.g., 'X[Y, on=c("a",
              "b")]', used when columns 'a' and 'b' are common to both
              'X' and 'Y'.

            * _Foreign key joins_: As a _named_ character vector when
              the join columns have different names in 'X' and 'Y'.
              For example, 'X[Y, on=c(x1="y1", x2="y2")]' joins 'X' and
              'Y' by matching columns 'x1' and 'x2' in 'X' with columns
              'y1' and 'y2' in 'Y', respectively.

              From v1.9.8, you can also express foreign key joins using
              the binary operator '==', e.g. 'X[Y, on=c("x1==y1",
              "x2==y2")]'.

              NB: shorthand like 'X[Y, on=c("a", V2="b")]' is also
              possible if, e.g., column '"a"' is common between the two
              tables.

            * For convenience during interactive scenarios, it is also
              possible to use '.()' syntax as 'X[Y, on=.(a, b)]'.

            * From v1.9.8, (non-equi) joins using binary operators '>=,
              >, <=, <' are also possible, e.g., 'X[Y, on=c("x>=a",
              "y<=b")]', or for interactive use as 'X[Y, on=.(x>=a,
              y<=b)]'.

          See examples as well as
          'vignette("datatable-secondary-indices-and-auto-indexing")'.

Only column names are acceptable on either side of the binary operator, not arbitrary expressions.

I understand that your concern is whether expressions can be used with binary operators as in the last bullet point of the docs above. Would the following addition answer your question?

              "y<=b")]', or for interactive use as 'X[Y, on=.(x>=a,
              y<=b)]', where 'a' and 'b' are column names (not expressions).

The left side of the binary operator must be a column from the outer datatable, the right side must come from i.

how about the addition below?

              "y<=b")]', or for interactive use as 'X[Y, on=.(x>=a,
              y<=b)]', where 'a' is a column name of 'X' and 'b' is a column name of 'Y'.

Please file a PR, it would be appreciated.

This will result in the two columns being combined into one with the name from the left side and the values from the right side.

I believe that is the same as the non-equi join naming issue, right? #6298 (comment) please check doc changes suggested here https://github.com/Rdatatable/data.table/pull/3093/files#diff-810e8fb9987626f409e07bd5832078a942e5f89ea222c228a6b923b7dab41817 and tell me if that addresses your issue? See also what is written in this section of the new join vignette https://rdatatable.gitlab.io/data.table/articles/datatable-joins.html#operations-after-joining which I believe documents how the joined columns are named. (maybe we should add a link from ?data.table man page to join vignette?)

@katrinabrock
Copy link
Author

, where 'a' is a column name of 'X' and 'b' is a column name of 'Y'.

Yes, that would clarify!

Yes, the "old behavior" mentioned in the linked PR is exactly what I found confusing. I think this is indeed the "non-equi join naming issue". Even if this behavior will soon be phased out, I think it would still be nice to explain the current behavior clearly in user-facing documentation.

I think the vignette still does not quite fully demonstrate or explain this behavior. Putting together "By default all columns are taking their source from the the x table," from the Operations after Joining section and "it returns all rows from the i table" from the Non equi joins section (and that the date in the joined table is 2024-02-06 not 2024-02-05), it might be possible to infer that column names are taken from x (left) and values are taken from i (right). I think it would make make it much more clear to add an example to one of these section of an non equi join with different column names on each side.

Reading that section did help me simplify my solution:

DT[
  ,c(.SD, .(y.max = y + 2))
][
  X,
  on = .(y <= foo, y.max >= foo),
  j = .(
    x,
    y = x.y,
    v,
    cj,
    i.x,
    i.v,
    foo = y.max
  )
]

But I think I still would really tricky to figure out the y = x.y, foo = y.max from the help page + vignette without an explanation of the naming thing (e.g. from the stackoverflow answer I linked).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants