From 9f83871db90f9bfa1d559b0e4ae78cb1763bf9b4 Mon Sep 17 00:00:00 2001
From: Kyle Haynes <5267027+KyleHaynes@users.noreply.github.com>
Date: Mon, 23 Dec 2024 08:05:53 +1000
Subject: [PATCH] various tweaks to joins rmd file

---
 vignettes/datatable-joins.Rmd | 38 +++++++++++++++++------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/vignettes/datatable-joins.Rmd b/vignettes/datatable-joins.Rmd
index a35f78bb2..e86e6df84 100644
--- a/vignettes/datatable-joins.Rmd
+++ b/vignettes/datatable-joins.Rmd
@@ -126,7 +126,7 @@ The next diagram shows a description for each basic argument. In the following s
 x[i, on, nomatch]
 | |  |   |
 | |  |   \__ If NULL only returns rows linked in x and i tables
-| |  \____ a character vector o list defining match logict
+| |  \____ a character vector o list defining match logic
 | \_____ primary data.table, list or data.frame
 \____ secondary data.table
 ```
@@ -304,7 +304,7 @@ ProductReceived[Products,
 
 Despite both tables have the same information, they present some relevant differences:
 
-- They present different order for their columns
+- They present different order for their columns.
 - They have some name differences on their columns names:
   - The `id` column of first table has the same information as the `product_id` in the second table.
   - The `i.id` column of first table has the same information as the `id` in the second table.
@@ -391,7 +391,7 @@ Here some important considerations:
 
 - **Row level**
   - All rows from in the `i` table were kept as we never received any banana but row is still part of the results.
-  - The row related to `product_id = 6` is no part of the results any more as it is not present in the `Products` table.
+  - The row related to `product_id = 6` is not part of the results any more as it is not present in the `Products` table.
 
 #### 3.5.1. Joining after chain operations
 
@@ -510,7 +510,7 @@ Use this method if you need to combine columns from 2 tables based on one or mor
 
 As we saw in the previous section, any of the prior operations can keep the missing `product_id = 6` and the **soda** (`product_id = 4`) as part of the results.
 
-To save this problem, we can use the `merge` function even thought it is lower than using the native `data.table`'s joining syntax.
+To save this problem, we can use the `merge` function even though it is lower than using the native `data.table`'s joining syntax.
 
 ```{r}
 merge(x = Products,
@@ -526,22 +526,22 @@ merge(x = Products,
 
 A non-equi join is a type of join where the condition for matching rows is not based on equality, but on other comparison operators like <, >, <=, or >=. This allows for **more flexible joining criteria**. In `data.table`, non-equi joins are particularly useful for operations like:
 
-- Finding the nearest match
-- Comparing ranges of values between tables
+- Finding the nearest match.
+- Comparing ranges of values between tables.
 
 It's a great alternative if after applying a right of inner join:
 
-- You want to decrease the number of returned rows based on comparing numeric columns of different table.
+- You want to decrease the number of returned rows based on comparing numeric columns of a different table.
 - You don't need to keep the columns from table `x`*(secondary data.table)* in the final table.
 
-To illustrate how this work, let's center over attention on how are the sales and receives for product 2.
+To illustrate how this works, let's focus on the sales and receives for product 2.
 
 ```{r}
 ProductSalesProd2 = ProductSales[product_id == 2L]
 ProductReceivedProd2 = ProductReceived[product_id == 2L]
 ```
 
-If want to know, for example, if can find any receive that took place before a sales date, we can apply the next code.
+If want to know, for example, you can find any receive that took place before a sales date, we can apply the following.
 
 ```{r}
 ProductReceivedProd2[ProductSalesProd2,
@@ -552,16 +552,16 @@ ProductReceivedProd2[ProductSalesProd2,
 
 What does happen if we just apply the same logic on the list passed to `on`?
 
-- As this opperation it's still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.
+- As this operation is still a right join, it returns all rows from the `i` table, but only shows the values for `id` and `count` when the rules are met.
 
-- The date related `ProductReceivedProd2` was omited from this new table.
+- The date related `ProductReceivedProd2` was omitted from this new table.
 
 ```{r}
 ProductReceivedProd2[ProductSalesProd2,
                      on = list(product_id, date < date)]
 ```
 
-Now, after applying the join, we can limit the results only show the cases that meet all joining criteria.
+Now, after applying the join, we can limit the results only showing the cases that meet all joining criteria.
 
 ```{r}
 ProductReceivedProd2[ProductSalesProd2,
@@ -574,7 +574,7 @@ ProductReceivedProd2[ProductSalesProd2,
 
 Rolling joins are particularly useful in time-series data analysis. They allow you to **match rows based on the nearest value** in a sorted column, typically a date or time column.
 
-This is valuable when you need to align data from different sources **that may not have exactly matching timestamps**, or when you want to carry forward the most recent value.
+This is valuable when you need to align data from different sources **that may not have exact matching timestamps**, or when you want to carry forward the most recent value.
 
 For example, in financial data, you might use a rolling join to assign the most recent stock price to each transaction, even if the price updates and transactions don't occur at the exact same times.
 
@@ -594,7 +594,7 @@ ProductPriceHistory = data.table(
 ProductPriceHistory
 ```
 
-Now, we can perform a right join giving a different prices for each product based on the sale date.
+Now, we can perform a right join giving a different price for each product based on the sale date.
 
 ```{r}
 ProductPriceHistory[ProductSales,
@@ -617,9 +617,9 @@ ProductPriceHistory[ProductSales,
 
 ### 7.1. Subsets as joins
 
-As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. Actually, that process is faster than passing a Boolean expression to the `i` argument.
+As we just saw in the prior section the `x` table gets filtered by the values available in the `i` table. This process is faster than passing a Boolean expression to the `i` argument.
 
-To filter the `x` table at speed we don't to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.
+To filter the `x` table at speed we don't need to pass a complete `data.table`, we can pass a `list()` of vectors with the values that we want to keep or omit from the original table.
 
 For example, to filter dates where the market received 100 units of bananas (`product_id = 1`) or popcorn (`product_id = 3`) we can use the following:
 
@@ -628,7 +628,7 @@ ProductReceived[list(c(1L, 3L), 100L),
                 on = c("product_id", "count")]
 ```
 
-As at the end, we are filtering based on a join operation the code returned a **row that was not present in original table**. To avoid that behavior, it is recommended to always to add the argument `nomatch = NULL`.
+As at the end, we are filtering based on a join operation the code returned a **row that was not present in original table**. To avoid that behavior, it is recommended to always add the argument `nomatch = NULL`.
 
 ```{r}
 ProductReceived[list(c(1L, 3L), 100L),
@@ -644,7 +644,7 @@ ProductReceived[!list(c(1L, 3L), 100L),
                 on = c("product_id", "count")]
 ```
 
-If you just want to filter a value for a single **character column**, you can omit calling the `list()` function pass the value to been filtered in the `i` argument.
+If you just want to filter a value for a single **character column**, you can omit calling the `list()` function and pass the value to be filtered in the `i` argument.
 
 ```{r}
 Products[c("banana","popcorn"),
@@ -674,7 +674,7 @@ copy(Products)[ProductPriceHistory,
 
 In this operation:
 
-- The function `copy` prevent that `:=` changes by reference the `Products` table.s
+- The function copy creates a ***deep*** copy of the `Products` table, preventing modifications made by `:=` from changing the original table by reference.
 - We join `Products` with `ProductPriceHistory` based on `id` and `product_id`.
 - We update the `price` column with the latest price from `ProductPriceHistory`.
 - We add a new `last_updated` column to track when the price was last changed.