
Vignettes: Improve linking, indentation and respect 120 character limit #155

Merged
merged 5 commits on Oct 4, 2024
65 changes: 39 additions & 26 deletions vignettes/SCDB.Rmd
@@ -37,7 +37,8 @@ example_data <-
name = "example_data",
overwrite = TRUE)
```
The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data.
The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of
time-versioned data.

In practice, this is done by labeling each record in the data with three additional fields:

@@ -47,59 +48,69 @@ In practice, this is done by labeling each record in the data with three additional fields:

This strategy of time versioning is often called "type 2" history [@Kimball2013].

Note that identical records may be removed and introduced more than once; for example, in a table of names and addresses, a person may change their address (or name) back to a previous value.
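To illustrate the idea (a purely hypothetical sketch; the column names below are placeholders rather than SCDB's
actual fields), a type 2 table could look like this:
```{r type2_sketch, eval = FALSE}
# Hypothetical illustration of type 2 history; the column names `from_ts` and
# `until_ts` are placeholders. A record is deactivated by closing its validity
# window and may later be re-introduced with a new window.
tibble::tribble(
  ~name,   ~address,     ~from_ts,     ~until_ts,
  "Alice", "Main St. 1", "2020-01-01", "2021-06-01",
  "Alice", "Oak Ave. 2", "2021-06-01", "2022-02-01",
  "Alice", "Main St. 1", "2022-02-01", NA  # moved back; the record is re-introduced
)
```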

The SCDB package provides the function `update_snapshot` to handle the insertion and deactivation of records using this strategy.
The SCDB package provides the function `update_snapshot` to handle the insertion and deactivation of records using
this strategy.
It further includes several functions to improve the quality of life when working with database data.

A simple example of usage is shown below.<br/>
For this example, we use a temporary, on-disk SQLite database.
Note that `get_connection` tries to establish a connection using `DBI::dbConnect` with as few additional arguments as possible.
Note that `get_connection()` tries to establish a connection using `DBI::dbConnect()` with as few additional arguments
as possible.
Different drivers may require authentication credentials, which can be read from a configuration file.^[
In the context of the SCDB package, this is most notably `RPostgres::Postgres()`, which may read from a `.pgpass` file.
See also the [PostgreSQL documentation](https://www.postgresql.org/docs/current/libpq-pgpass.html).
]
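For instance, a connection to a PostgreSQL backend might be established as follows (a sketch only; the driver and
connection arguments shown here are assumptions that depend on your setup):
```{r postgres_sketch, eval = FALSE}
# Hypothetical sketch: connect to a PostgreSQL database, letting the password
# be resolved from a .pgpass file. Host and database name are placeholders.
conn <- get_connection(
  drv = RPostgres::Postgres(),
  host = "localhost",
  dbname = "my_database"
)
```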

Our example data is `datasets::mtcars` reduced to only two columns: row names converted to a column `car`, and `hp`.
```{r example_data, eval = requireNamespace("RSQLite", quietly = TRUE)}
if (!exists("conn")) conn <- get_connection()

```{r example_data, eval = FALSE}
conn <- get_connection()
```
```{r example_data_hidden, eval = requireNamespace("RSQLite", quietly = TRUE)}
example_data <- dplyr::tbl(conn, DBI::Id(table = "example_data"))
example_data
```

Imagine on Day 1, in this case January 1st, 2020, our currently available data is the first three records of the example_data.
Imagine on Day 1, in this case January 1st, 2020, our currently available data is the first three records of
the `example_data`.
We then store this data in a table `mtcars`:
```{r example_1, eval = requireNamespace("RSQLite", quietly = TRUE)}
data <- head(example_data, 3)

update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-01 11:00:00"))
update_snapshot(
.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-01 11:00:00")
)
```

We can then access our data using the `get_table` function, and include information on the data validity period using `include_slice_info = TRUE`:
We can then access our data using the `get_table()` function, and include information on the data validity period
using `include_slice_info = TRUE`:
```{r example_1_results, eval = requireNamespace("RSQLite", quietly = TRUE)}
get_table(conn, "mtcars")

get_table(conn, "mtcars", include_slice_info = TRUE)
```
Note that where e.g. `dplyr::tbl` requires a more exact specification of the table identity (`tbl(conn, DBI::Id(table = "mtcars"))`), `get_table` will parse any character input to a `DBI::Id` object using `SCDB::id`.
Note that where e.g. `dplyr::tbl()` requires a more exact specification of the table identity
(`tbl(conn, DBI::Id(table = "mtcars"))`), `get_table()` will parse any character input to a `DBI::Id` object
using `id()`.
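As a sketch of the difference in how the table identity is specified (note that the two functions may not return
identical objects):
```{r table_identity_sketch, eval = FALSE}
# Both calls reference the same underlying table; only the way the table
# identity is specified differs.
dplyr::tbl(conn, DBI::Id(table = "mtcars"))   # explicit DBI::Id specification
get_table(conn, "mtcars")                     # character input parsed via id()
```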

The following day, the current data is now the first five rows of our example data.
We then store this data in the database using update_snapshot:
We then store this data in the database using `update_snapshot()`:

```{r example_2, eval = requireNamespace("RSQLite", quietly = TRUE)}
# Let's say that the next day, our data set is now the first 5 of our example data
data <- head(example_data, 5)

update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-02 12:00:00"))
update_snapshot(
.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-02 12:00:00")
)
```
We can again use the `get_table` function to see the latest available data, including time-keeping with `include_slice_info = TRUE`:
We can again use the `get_table()` function to see the latest available data, including time-keeping with
`include_slice_info = TRUE`:
```{r example_2_results_a, eval = requireNamespace("RSQLite", quietly = TRUE)}
get_table(conn, "mtcars")

@@ -117,13 +128,15 @@ On day 3, we imagine that we have the same 5 records, but one of them is altered
data <- head(example_data, 5) |>
dplyr::mutate(hp = ifelse(car == "Mazda RX4", hp / 2, hp))

update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-03 10:00:00"))
update_snapshot(
.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-03 10:00:00")
)
```

We can again access our data using the `get_table` function and see the currently
We can again access our data using the `get_table()` function and see the currently
available data (with the changed hp value for Mazda RX4):
```{r example_3_results_a, eval = requireNamespace("RSQLite", quietly = TRUE)}
get_table(conn, "mtcars")
4 changes: 2 additions & 2 deletions vignettes/benchmarks.Rmd
@@ -34,7 +34,7 @@ This data forms the basis for three "snapshots" used in the benchmarks:
The benchmark function uses three consecutive calls to `update_snapshot()` to create the table with the first
snapshot and then update it to the second and third snapshots. Finally, the table is deleted.
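A minimal sketch of what such a benchmark routine could look like is shown below (the `snapshot_*` objects and the
table name are placeholders, not the actual benchmark code):
```{r benchmark_routine_sketch, eval = FALSE}
# Hypothetical sketch of the benchmark routine: three consecutive calls to
# update_snapshot() followed by clean-up. The `snapshot_*` data and the table
# name "benchmark_table" are placeholders.
benchmark_update_snapshot <- function(conn) {
  update_snapshot(snapshot_1, conn, "benchmark_table", timestamp = Sys.time())
  update_snapshot(snapshot_2, conn, "benchmark_table", timestamp = Sys.time())
  update_snapshot(snapshot_3, conn, "benchmark_table", timestamp = Sys.time())
  DBI::dbRemoveTable(conn, "benchmark_table")  # restore the clean state
}
```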

The performance of this benchmark function is timed with the `microbenchmark` package using 10 replicates.
The performance of this benchmark function is timed with the `{{microbenchmark}}` package using 10 replicates.
All benchmarks are run on the same machine.
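The timing itself could then be performed along these lines (again a sketch, assuming a routine like the one above):
```{r microbenchmark_sketch, eval = FALSE}
# Time the benchmark routine with 10 replicates (sketch).
microbenchmark::microbenchmark(
  benchmark_update_snapshot(conn),
  times = 10
)
```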

The results of the benchmark are shown graphically below (mean and standard deviation), where we compare the current
@@ -126,7 +126,7 @@ data. The datasets are similar to the first benchmark, but the number of repeats
increasing data size. The benchmarks are run from a "clean" state, where the target_table does not exist. The benchmark
measures both the time to create the table and to remove it again afterwards (to restore the clean state).

The performance of this benchmark function is timed with the `microbenchmark` package using 5 replicates.
The performance of this benchmark function is timed with the `{{microbenchmark}}` package using 5 replicates.
All benchmarks are run on the same machine.

The results of the benchmark are shown graphically below (mean and standard deviation) and with linear scaling (dotted
47 changes: 32 additions & 15 deletions vignettes/slowly-changing-dimension.Rmd
@@ -54,7 +54,8 @@ forecasts2 <- filter(forecasts_full, ForecastDate == "2023-09-29") %>%
forecasts2a <- filter(forecasts_full, ForecastDate == "2023-09-29")
```

A slowly changing dimension is a concept in data warehousing that refers to data which may change over time, but at an irregular schedule.
A slowly changing dimension is a concept in data warehousing that refers to data which may change over time, but at an
irregular schedule.

## Type 1 and Type 2 history

@@ -64,10 +65,12 @@ For example, consider the following table of forecasts for a number of cities:
forecasts
```

The following day, the forecasts will have changed, and — barring the occasional data hoarder — the existing data is no longer relevant.
The following day, the forecasts will have changed, and — barring the occasional data hoarder — the existing data is no
longer relevant.

In this example, most (if not all) of the values of the `Forecast` column will change with each regular update.
Put differently, the table is a *snapshot*^[A *snapshot* is a static view of (part of) a database at a specific point in time] of forecasts at the last time of update.
Put differently, the table is a *snapshot*^[A *snapshot* is a static view of (part of) a database at a
specific point in time] of forecasts at the last time of update.

The following day, the forecasts naturally change:
```{r forecasts2a, eval = requireNamespace("tidyverse")}
@@ -77,14 +80,17 @@ forecasts2
We could choose to update the forecasts table so that it would always contain the current data.
This is what is referred to as Type 1 methodology [@Kimball2013].

Databases are thankfully a rather efficient way of storing and accessing data, so instead of discarding the values from the previous day, we append the new data to those of the previous day.
Databases are thankfully a rather efficient way of storing and accessing data, so instead of discarding the values from
the previous day, we append the new data to those of the previous day.
Also, in order to keep our data organized, we add a column with the date of the forecast, aptly named `ForecastDate`.

The full table of forecasts for the two days is shown below, and we are slowly building a full history of forecasts:
The full table of forecasts for the two days is shown below, and we are slowly building a full history of
forecasts:
```{r forecasts_full, eval = requireNamespace("tidyverse")}
forecasts_full
```
Managing historical data by inserting new data in this manner is often referred to as Type 2 methodology or Type 2 history.
Managing historical data by inserting new data in this manner is often referred to as Type 2 methodology or Type 2
history.
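As an illustrative sketch (the column names and the first forecast date are assumed from the examples above, not taken
from the vignette's setup code), the appended table could be constructed like this:
```{r forecasts_append_sketch, eval = FALSE}
# Hypothetical sketch: append the two daily snapshots, labelling each row with
# the date of the forecast. Column names and dates are assumed for illustration.
forecasts_full <- dplyr::bind_rows(
  dplyr::mutate(forecasts,  ForecastDate = as.Date("2023-09-28")),
  dplyr::mutate(forecasts2, ForecastDate = as.Date("2023-09-29"))
)
```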

Our table now provides much more information for the user through filtering:
```{r forecasts_full_examples, eval = requireNamespace("tidyverse")}
@@ -105,7 +111,8 @@ forecasts_full %>%

Now, we note that the forecast for Houston has not changed between the two days.

In order to keep our data as minimized as possible, we modify the table again, now expanding `ForecastDate` into `ForecastFrom` and `ForecastUntil`.
In order to keep our data as minimized as possible, we modify the table again, now expanding `ForecastDate` into
`ForecastFrom` and `ForecastUntil`.

Our table of forecasts now looks like this:
```{r forecasts_scd, eval = requireNamespace("tidyverse")}
@@ -114,10 +121,12 @@ forecasts_scd
For now, the `ForecastUntil` value is set to `NA`, as it is not known when these rows will "expire" (if ever).
This also makes it easy to identify currently valid data.
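For example, a minimal filter for the currently valid rows could look like this (a sketch, assuming the column names
shown above):
```{r current_rows_sketch, eval = FALSE}
# Sketch: currently valid rows are those whose validity window is still open.
forecasts_scd %>%
  dplyr::filter(is.na(ForecastUntil))
```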

Adding a new column to save a single row of data naturally seems a bit overkill, but as the number of rows in the data set increases indefinitely, this solution scales much better.
Adding a new column to save a single row of data naturally seems a bit overkill, but as the number of rows in the data
set increases indefinitely, this solution scales much better.

## A "timeline of timelines"
Let's now introduce additional information and see how managing slowly changing dimensions enables us to easily navigate large amounts of data over long periods of time.
Let's now introduce additional information and see how managing slowly changing dimensions enables us to easily
navigate large amounts of data over long periods of time.

```{r adresses_setup, include = FALSE, eval = requireNamespace("tidyverse")}
addresses <- tibble::tibble(
@@ -176,9 +185,13 @@ addresses2 <- addresses %>%
distinct()
```

Imagine a town of several thousand citizens, with a town hall maintaining a civil registry of names and addresses of every citizen, updated daily with any changes submitted by the citizens, each of whom has an individual identification number.^[If this concept seems very familiar, you may have heard of [the Danish central civil registry](https://en.wikipedia.org/wiki/Personal_identification_number_(Denmark))]
Imagine a town of several thousand citizens, with a town hall maintaining a civil registry of names and addresses of
every citizen, updated daily with any changes submitted by the citizens, each of whom has an individual
identification number.^[If this concept seems very familiar, you may have heard of
[the Danish central civil registry](https://en.wikipedia.org/wiki/Personal_identification_number_(Denmark))]

The data is largely static, as a very small fraction of citizens move on any given day, but it is of interest to keep data relatively up-to-date.
The data is largely static, as a very small fraction of citizens move on any given day, but it is of interest to keep
data relatively up-to-date.
This is where managing a slowly changing dimension becomes very powerful, compared to full incremental backups.

One day, Alice Doe meets [Robert "Bobby" Tables](https://xkcd.com/327/), and they move in together:
@@ -187,7 +200,8 @@ addresses
addresses
```

The first thing to notice is that the registry is not updated in real time, as citizens may have been late in registering a change of address.
The first thing to notice is that the registry is not updated in real time, as citizens may have been late in
registering a change of address.
This can be seen when comparing the values of `MovedIn` and `ValidFrom` for row 4.

When using Type 2 history, this feature is correctly replicated when reconstructing historical data:
@@ -200,7 +214,8 @@ addresses %>%
ValidUntil >= !!slice_timestamp | is.na(ValidUntil)) %>%
select(!c("ValidFrom", "ValidUntil"))
```
In other words, even though Alice's address was subsequently updated in the registry, we can still see that she *was* registered as living in Donut Plains at this time.
In other words, even though Alice's address was subsequently updated in the registry, we can still see that she *was*
registered as living in Donut Plains at this time.
This modeling of "timelines of timelines" is also called bitemporal modeling.

By now, things are going well between Alice and Robert; they get married, with Alice taking Robert's surname.
@@ -212,7 +227,8 @@ filter(addresses2,
select(ID, GivenName, Surname, MovedIn, MovedOut, ValidFrom, ValidUntil)
```

This is now also reflected in the data: the `MovedIn` date is persistent across the date of the name change; only the `Surname` changes:
This is now also reflected in the data: the `MovedIn` date is persistent across the date of the name change; only the
`Surname` changes:
```{r addresses3, eval = requireNamespace("tidyverse")}
slice_timestamp <- "2022-03-04"

@@ -235,7 +251,8 @@ addresses2 %>%

## Summary

By now, it is hopefully clear how managing a slowly changing dimension allows you to access data at any point in (tracked) time while potentially avoiding a lot of data redundancy.
By now, it is hopefully clear how managing a slowly changing dimension allows you to access data at any point in
(tracked) time while potentially avoiding a lot of data redundancy.

You are now ready to get started with the `SCDB` package!
