Commit

various tweaks to joins rmd file
KyleHaynes committed Dec 22, 2024
1 parent e4b0bbb commit 9f83871
Showing 1 changed file with 19 additions and 19 deletions.
38 changes: 19 additions & 19 deletions vignettes/datatable-joins.Rmd
@@ -126,7 +126,7 @@ The next diagram shows a description for each basic argument. In the following s
x[i, on, nomatch]
| | | |
| | | \__ If NULL only returns rows linked in x and i tables
- | | \____ a character vector o list defining match logict
+ | | \____ a character vector o list defining match logic
| \_____ primary data.table, list or data.frame
\____ secondary data.table
```
@@ -304,7 +304,7 @@ ProductReceived[Products,

Despite both tables have the same information, they present some relevant differences:

- - They present different order for their columns
+ - They present different order for their columns.
- They have some name differences on their columns names:
- The `id` column of first table has the same information as the `product_id` in the second table.
- The `i.id` column of first table has the same information as the `id` in the second table.
@@ -391,7 +391,7 @@ Here some important considerations:

- **Row level**
- All rows from in the `i` table were kept as we never received any banana but row is still part of the results.
- - The row related to `product_id = 6` is no part of the results any more as it is not present in the `Products` table.
+ - The row related to `product_id = 6` is not part of the results any more as it is not present in the `Products` table.


#### 3.5.1. Joining after chain operations
@@ -510,7 +510,7 @@ Use this method if you need to combine columns from 2 tables based on one or mor

As we saw in the previous section, any of the prior operations can keep the missing `product_id = 6` and the **soda** (`product_id = 4`) as part of the results.

- To save this problem, we can use the `merge` function even thought it is lower than using the native `data.table`'s joining syntax.
+ To save this problem, we can use the `merge` function even though it is lower than using the native `data.table`'s joining syntax.

```{r}
merge(x = Products,
@@ -526,22 +526,22 @@ merge(x = Products,

A non-equi join is a type of join where the condition for matching rows is not based on equality, but on other comparison operators like <, >, <=, or >=. This allows for **more flexible joining criteria**. In `data.table`, non-equi joins are particularly useful for operations like:

- - Finding the nearest match
- - Comparing ranges of values between tables
+ - Finding the nearest match.
+ - Comparing ranges of values between tables.

It's a great alternative if after applying a right of inner join:

- - You want to decrease the number of returned rows based on comparing numeric columns of different table.
+ - You want to decrease the number of returned rows based on comparing numeric columns of a different table.
- You don't need to keep the columns from table `x`*(secondary data.table)* in the final table.

- To illustrate how this work, let's center over attention on how are the sales and receives for product 2.
+ To illustrate how this works, let's focus on the sales and receives for product 2.

```{r}
ProductSalesProd2 = ProductSales[product_id == 2L]
ProductReceivedProd2 = ProductReceived[product_id == 2L]
```

- If want to know, for example, if can find any receive that took place before a sales date, we can apply the next code.
+ If want to know, for example, you can find any receive that took place before a sales date, we can apply the following.

```{r}
ProductReceivedProd2[ProductSalesProd2,
@@ -552,16 +552,16 @@ ProductReceivedProd2[ProductSalesProd2,

What does happen if we just apply the same logic on the list passed to `on`?

- - As this opperation it's still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.
+ - As this operation is still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.

- - The date related `ProductReceivedProd2` was omited from this new table.
+ - The date related `ProductReceivedProd2` was omitted from this new table.

```{r}
ProductReceivedProd2[ProductSalesProd2,
on = list(product_id, date < date)]
```

- Now, after applying the join, we can limit the results only show the cases that meet all joining criteria.
+ Now, after applying the join, we can limit the results only showing the cases that meet all joining criteria.

```{r}
ProductReceivedProd2[ProductSalesProd2,
@@ -574,7 +574,7 @@ ProductReceivedProd2[ProductSalesProd2,

Rolling joins are particularly useful in time-series data analysis. They allow you to **match rows based on the nearest value** in a sorted column, typically a date or time column.

- This is valuable when you need to align data from different sources **that may not have exactly matching timestamps**, or when you want to carry forward the most recent value.
+ This is valuable when you need to align data from different sources **that may not have exact matching timestamps**, or when you want to carry forward the most recent value.

For example, in financial data, you might use a rolling join to assign the most recent stock price to each transaction, even if the price updates and transactions don't occur at the exact same times.

@@ -594,7 +594,7 @@ ProductPriceHistory = data.table(
ProductPriceHistory
```

- Now, we can perform a right join giving a different prices for each product based on the sale date.
+ Now, we can perform a right join giving a different price for each product based on the sale date.

```{r}
ProductPriceHistory[ProductSales,
@@ -617,9 +617,9 @@ ProductPriceHistory[ProductSales,

### 7.1. Subsets as joins

- As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. Actually, that process is faster than passing a Boolean expression to the `i` argument.
+ As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. This process is faster than passing a Boolean expression to the `i` argument.

- To filter the `x` table at speed we don't to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.
+ To filter the `x` table at speed we don't need to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.

For example, to filter dates where the market received 100 units of bananas (`product_id = 1`) or popcorn (`product_id = 3`) we can use the following:

@@ -628,7 +628,7 @@ ProductReceived[list(c(1L, 3L), 100L),
on = c("product_id", "count")]
```

- As at the end, we are filtering based on a join operation the code returned a **row that was not present in original table**. To avoid that behavior, it is recommended to always to add the argument `nomatch = NULL`.
+ As at the end, we are filtering based on a join operation the code returned a **row that was not present in original table**. To avoid that behavior, it is recommended to always add the argument `nomatch = NULL`.

```{r}
ProductReceived[list(c(1L, 3L), 100L),
@@ -644,7 +644,7 @@ ProductReceived[!list(c(1L, 3L), 100L),
on = c("product_id", "count")]
```

- If you just want to filter a value for a single **character column**, you can omit calling the `list()` function pass the value to been filtered in the `i` argument.
+ If you just want to filter a value for a single **character column**, you can omit calling the `list()` function and pass the value to be filtered in the `i` argument.

```{r}
Products[c("banana","popcorn"),
@@ -674,7 +674,7 @@ copy(Products)[ProductPriceHistory,

In this operation:

- - The function `copy` prevent that `:=` changes by reference the `Products` table.s
+ - The function copy creates a ***deep*** copy of the `Products` table, preventing modifications made by `:=` from changing the original table by reference.
- We join `Products` with `ProductPriceHistory` based on `id` and `product_id`.
- We update the `price` column with the latest price from `ProductPriceHistory`.
- We add a new `last_updated` column to track when the price was last changed.