only iterate over unique subset of icols #6630

ben-schwen · 2024-11-30T16:10:31Z

          On first read I think "this can be doing a lot of duplicate work since it's basically `for (icol in icols) icol==icols`", but actually it only applies to the subset `icols[icols != xcols & i_merge_type %in% c("integer", "double", "complex")]`.

Originally posted by @MichaelChirico in #6603 (comment)

minimal example

x = data.table(a=1L)
y = data.table(c=1L, d=1)
y[x, on=.(c == a, d == a)]

In bmerge we iterate over icols with `for (a in seq_along(icols))`, hence in this case we explore a=1 twice, while once would be enough.

Corresponding bmerge line:

data.table/R/bmerge.R

Line 37 in 546259d

if (nrow(i)) for (a in seq_along(icols)) {
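
To make the duplicate visit concrete, here is a minimal base R sketch of the loop shape. The index values are assumptions chosen to mirror the minimal example (column `a` of `i` referenced twice, paired with columns `c` and `d` of `x`); they are illustrative stand-ins, not the actual values or logic inside bmerge.

```r
# Assumed values for on=.(c == a, d == a): `a` is column 1 of i,
# while `c` and `d` are columns 1 and 2 of x. Illustrative only.
icols = c(1L, 1L)
xcols = c(1L, 2L)

visited = integer()
for (a in seq_along(icols)) {
  # each iteration looks at icols[a] together with its partner xcols[a]
  visited = c(visited, icols[a])
}

n_visits = length(visited)            # number of loop iterations
n_distinct = length(unique(visited))  # number of distinct i columns touched
```

Under these assumed values the loop runs once per `on` pair, so the same column of `i` is touched on every iteration even though only one distinct column is involved.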


TusharNaugain · 2024-12-05T13:08:02Z

Hey there,

I noticed that in the current implementation, we iterate over icols which might cause redundant processing. To optimize this, we can create a unique subset of icols that excludes elements from xcols and matches the required i_merge_type. This way, each element is processed only once.

Here's a code snippet to achieve this

const icols = [/* your array of columns */];
const xcols = [/* your array of columns to exclude */];
const i_merge_type = [/* your merge type */];
const uniqueIcols = icols.filter(icol => !xcols.includes(icol) && i_merge_type.includes(icol));
for (const a in uniqueIcols) {

}
By using filter, we ensure that the iteration is done over a unique subset, thus avoiding duplicate work. This should help streamline the processing in bmerge.

Abhishek2634 · 2024-12-19T07:59:39Z

Hi @ben-schwen, can we use this line: `if (nrow(i)) for (a in seq_along(unique(icols)))`?
If that's okay, can I raise a PR for this?

MichaelChirico · 2024-12-20T02:50:54Z

I don't think it will be so easy -- icols and xcols are paired vectors, and that loop assumes we can pull icols[a] and xcols[a]. Something like cbind(icols, xcols)[!duplicated(icols), ] also won't work since it might drop some xcols that need to be checked.
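
A small base R sketch of why both shortcuts lose information, again using assumed index values that mirror the minimal example (these are hypothetical stand-ins, not values taken from bmerge itself):

```r
# Assumed paired vectors for on=.(c == a, d == a). Illustrative only.
icols = c(1L, 1L)  # column `a` of i, referenced twice
xcols = c(1L, 2L)  # columns `c` and `d` of x

# Attempt 1: loop over seq_along(unique(icols)).
# The loop index then stops before reaching every position of xcols.
short_idx = seq_along(unique(icols))
skipped_positions = setdiff(seq_along(xcols), short_idx)

# Attempt 2: deduplicate the pairs by icols alone.
# drop = FALSE keeps the result a matrix even when one row survives.
pairs = cbind(icols, xcols)
kept = pairs[!duplicated(icols), , drop = FALSE]
missing_xcols = setdiff(xcols, kept[, "xcols"])
```

In both attempts the second pair never gets examined: `skipped_positions` and `missing_xcols` are non-empty, which is the pairing problem described above. Any deduplication would need to keep every distinct `xcols` partner while avoiding repeated work on the same `icols` entry.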

ben-schwen added the performance label Nov 30, 2024
