add vignette

thomaszwagerman · Oct 16, 2024 · 656ecba
1 parent 3b46bcd
commit 656ecba

Showing 7 changed files with 153 additions and 66 deletions.
diff --git a/.gitignore b/.gitignore
@@ -5,3 +5,4 @@
 .DS_Store
 .quarto
 *.excalidraw
+inst/doc
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -14,8 +14,11 @@ Imports:
     dplyr,
     waldo
 Suggests: 
+    knitr,
+    rmarkdown,
     testthat (>= 3.0.0)
 Config/testthat/edition: 3
 Depends: 
     R (>= 2.10)
 LazyData: true
+VignetteBuilder: knitr
diff --git a/README.Rmd b/README.Rmd
@@ -20,16 +20,18 @@ knitr::opts_chunk$set(
 [![Codecov test coverage](https://codecov.io/gh/thomaszwagerman/butterfly/branch/main/graph/badge.svg)](https://app.codecov.io/gh/thomaszwagerman/butterfly?branch=main)
 <!-- badges: end -->
 
-The goal of butterfly is to aid in the QA/QC of continually updating/overwritten time-series data where we expect new values over time, but where we want to ensure previous data remains unchanged. 
+The goal of butterfly is to aid in the quality assurance of continually updating and overwritten time-series data, where we expect new values over time, but want to ensure previous data remains unchanged. 
 
 ```{r butterfly_diagram, echo=FALSE, out.width="100%", fig.cap=""}
 knitr::include_graphics("man/figures/README-butterfly_diagram.png")
 ```
 
-Data previously recorded or calculated might change due equipment recalibration, discovery of human error in model code or a change in methodology.
-This could have unintended consequences, as changes to previous input data may also alter future predictions in forecasting models.
 
-The butterfly package aims to flag changes to previous data to prevent data changes going unnoticed.
+Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible.
+
+Unnoticed changes in previous data could have unintended consequences, such as invalidating DOIs, or altering future predictions if used as input in forecasting models.
+
+This package provides functionality that can be used as part of a data pipeline, to check and flag changes to previous data to prevent changes going unnoticed.
 
 ## Installation
 
@@ -58,11 +60,9 @@ butterflycount$february
 butterflycount$march
 ```
 
-We can use `butterfly::loupe()` to check if our previous values have changed.
+We can use `butterfly::loupe()` to examine in detail whether previous values have changed.
 
 ```{r butterfly_example}
-# Let's use butterfly::loupe() to check if our previous values have changed
-# And if so, where this change occurred
 butterfly::loupe(
   butterflycount$february,
   butterflycount$january,
@@ -76,7 +76,7 @@ butterfly::loupe(
 )
 ```
 
-`butterfly::loupe()` uses `dplyr::semi_join()` to match the timesteps of your current dataframe, to the timesteps already present in the previous dataframe. `waldo::compare()` is then used to compare these and return the differences.
+`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old objects using a common unique identifier, which in a timeseries will be the timestep. `waldo::compare()` is then used to compare these and provide a detailed report of the differences.
 
 `butterfly` follows the `waldo` philosophy of erring on the side of providing too much information, rather than too little. It will give a detailed feedback message on the status between two objects.
 
@@ -116,14 +116,3 @@ There are other R packages and functions which handle object comparison, which m
 * [diffdf](https://github.com/gowerc/diffdf)
 
 Other functions include `all.equal()` or [dplyr](https://github.com/tidyverse/dplyr)'s `setdiff()`
-
-## Rationale
-There are a lot of other data comparison and QA/QC packages out there, why butterfly?
-
-This package was originally developed to deal with [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)'s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question. 
-
-Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can (for example for [09/21 - 12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth), and [07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685)) force a recalculation, meaning previously published data differs from the final product.
-
-When publishing ERA5-derived datasets, and minting it with a DOI, it is possible to continuously append without invalidating that DOI. However, recalculation would overwrite previously published data, thereby forcing a new publication and DOI to be minted. We use the functionality in this package to detect changes, stop data transfer and notify the user.
-
-This package has intentionally been generalised to accommodate other, but similar, use cases. Other examples could include a correction in instrument calibration, compromised data transfer or unnoticed changes in the parameterisation of a model.
diff --git a/README.md b/README.md
@@ -10,20 +10,24 @@
 coverage](https://codecov.io/gh/thomaszwagerman/butterfly/branch/main/graph/badge.svg)](https://app.codecov.io/gh/thomaszwagerman/butterfly?branch=main)
 <!-- badges: end -->
 
-The goal of butterfly is to aid in the QA/QC of continually
-updating/overwritten time-series data where we expect new values over
-time, but where we want to ensure previous data remains unchanged.
+The goal of butterfly is to aid in the quality assurance of continually
+updating and overwritten time-series data, where we expect new values
+over time, but want to ensure previous data remains unchanged.
 
 <img src="man/figures/README-butterfly_diagram.png" width="100%" />
 
-Data previously recorded or calculated might change due equipment
-recalibration, discovery of human error in model code or a change in
-methodology. This could have unintended consequences, as changes to
-previous input data may also alter future predictions in forecasting
-models.
+Data previously recorded could change for a number of reasons, such as
+discovery of an error in model code, a change in methodology or
+instrument recalibration. Monitoring data sources for these changes is
+not always possible.
 
-The butterfly package aims to flag changes to previous data to prevent
-data changes going unnoticed.
+Unnoticed changes in previous data could have unintended consequences,
+such as invalidating DOIs, or altering future predictions if used as
+input in forecasting models.
+
+This package provides functionality that can be used as part of a data
+pipeline, to check and flag changes to previous data to prevent changes
+going unnoticed.
 
 ## Installation
 
@@ -68,12 +72,10 @@ butterflycount$march
 #> 5 2023-11-01    18
 ```
 
-We can use `butterfly::loupe()` to check if our previous values have
-changed.
+We can use `butterfly::loupe()` to examine in detail whether previous
+values have changed.
 
 ``` r
-# Let's use butterfly::loupe() to check if our previous values have changed
-# And if so, where this change occurred
 butterfly::loupe(
   butterflycount$february,
   butterflycount$january,
@@ -106,10 +108,10 @@ butterfly::loupe(
 #> `new$count`: 17 22 55 11
 ```
 
-`butterfly::loupe()` uses `dplyr::semi_join()` to match the timesteps of
-your current dataframe, to the timesteps already present in the previous
-dataframe. `waldo::compare()` is then used to compare these and return
-the differences.
+`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old
+objects using a common unique identifier, which in a timeseries will be
+the timestep. `waldo::compare()` is then used to compare these and
+provide a detailed report of the differences.
 
 `butterfly` follows the `waldo` philosophy of erring on the side of
 providing too much information, rather than too little. It will give a
@@ -198,32 +200,3 @@ which may suit your specific needs better:
 
 Other functions include `all.equal()` or
 [dplyr](https://github.com/tidyverse/dplyr)’s `setdiff()`
-
-## Rationale
-
-There are a lot of other data comparison and QA/QC packages out there,
-why butterfly?
-
-This package was originally developed to deal with
-[ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)’s
-initial release data, ERA5T. ERA5T data for a month is overwritten with
-the final ERA5 data two months after the month in question.
-
-Usually ERA5 and ERA5T are identical, but occasionally an issue with
-input data can (for example for [09/21 -
-12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth),
-and
-[07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685))
-force a recalculation, meaning previously published data differs from
-the final product.
-
-When publishing ERA5-derived datasets, and minting it with a DOI, it is
-possible to continuously append without invalidating that DOI. However,
-recalculation would overwrite previously published data, thereby forcing
-a new publication and DOI to be minted. We use the functionality in this
-package to detect changes, stop data transfer and notify the user.
-
-This package has intentionally been generalised to accommodate other,
-but similar, use cases. Other examples could include a correction in
-instrument calibration, compromised data transfer or unnoticed changes
-in the parameterisation of a model.
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/butterfly.Rmd b/vignettes/butterfly.Rmd
@@ -0,0 +1,119 @@
+---
+title: "butterfly"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{butterfly}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+The goal of butterfly is to aid in the quality assurance of continually updating and overwritten time-series data, where we expect new values over time, but want to ensure previous data remains unchanged. 
+
+```{r butterfly_diagram, echo=FALSE, out.width="100%", fig.cap=""}
+knitr::include_graphics("img/butterfly_diagram_light.png")
+```
+
+Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible.
+
+Unnoticed changes in previous data could have unintended consequences, such as invalidating DOIs, or altering future predictions if used as input in forecasting models.
+
+This package provides functionality that can be used as part of a data pipeline, to check and flag changes to previous data to prevent changes going unnoticed.
+
+## Installation
+
+You can install the development version of butterfly from [GitHub](https://github.com/) with:
+
+``` r
+# install.packages("devtools")
+devtools::install_github("thomaszwagerman/butterfly")
+```
+
+## How to use butterfly
+
+This is a basic example which shows you how to use butterfly:
+
+```{r simple_example}
+library(butterfly)
+
+# Imagine a continually updated dataset that starts in January and is updated once a month
+butterflycount$january
+
+# In February an additional row appears, all previous data remains the same
+butterflycount$february
+
+# In March an additional row appears again
+# ...but a previous value has unexpectedly changed
+butterflycount$march
+```
+
+We can use `butterfly::loupe()` to examine in detail whether previous values have changed.
+
+```{r butterfly_example}
+butterfly::loupe(
+  butterflycount$february,
+  butterflycount$january,
+  datetime_variable = "time"
+)
+
+butterfly::loupe(
+  butterflycount$march,
+  butterflycount$february,
+  datetime_variable = "time"
+)
+```
+
+`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old objects using a common unique identifier, which in a timeseries will be the timestep. `waldo::compare()` is then used to compare these and provide a detailed report of the differences.
+
+`butterfly` follows the `waldo` philosophy of erring on the side of providing too much information, rather than too little. It will give a detailed feedback message on the status between two objects.
+
+### Using butterfly for data wrangling
+You might want to return changed rows as a dataframe, or drop them altogether. For this `butterfly::catch()` and `butterfly::release()` are provided.
+
+Here, `butterfly::catch()` only returns rows which have **changed** from the previous version. It will not return new rows.
+
+```{r butterfly_catch}
+df_caught <- butterfly::catch(
+  butterflycount$march,
+  butterflycount$february,
+  datetime_variable = "time"
+)
+
+df_caught
+```
+
+Conversely, `butterfly::release()` drops all rows which have changed from the previous version. Note that it retains new rows, as these were expected.
+
+```{r butterfly_release}
+df_released <- butterfly::release(
+  butterflycount$march,
+  butterflycount$february,
+  datetime_variable = "time"
+)
+
+df_released
+```
+
+## Incorporating in a data pipeline
+
+Examples of applying butterfly in a pipeline.
+
+## Rationale
+
+There are a lot of other data comparison and QA/QC packages out there, why butterfly?
+
+This package was originally developed to deal with [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)'s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question. 
+
+Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can force a recalculation (for example for [09/21 - 12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth) and [07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685)), meaning previously published data differs from the final product.
+
+When publishing ERA5-derived datasets and minting them with a DOI, it is possible to continuously append without invalidating that DOI. However, recalculation would overwrite previously published data, thereby forcing a new publication and DOI to be minted. 
+
+We use the functionality in this package in an automated data processing pipeline to detect changes, stop data transfer and notify the user.
+
+This package has intentionally been generalised to accommodate other, but similar, use cases. Other examples could include a correction in instrument calibration, compromised data transfer or unnoticed changes in the parameterisation of a model.
diff --git a/vignettes/img/butterfly_diagram_light.png b/vignettes/img/butterfly_diagram_light.png
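
The vignette added in this commit describes `butterfly::loupe()` as a `dplyr::semi_join()` on a common identifier followed by `waldo::compare()`, and leaves the pipeline section as a placeholder. A minimal sketch of that mechanism, as it might be used to gate a pipeline step, is given below. It is an illustration under assumptions, not the package's implementation: the helper `check_previous_unchanged()` and the `previous` and `current` data frames are hypothetical, and the real `loupe()` may differ in ordering, messaging and return value.

```r
# Hypothetical example data (not the package's `butterflycount` dataset)
previous <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  count = c(10, 20, 30)
)

# One new row (April) plus an unexpected change to the February value
current <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01")),
  count = c(10, 25, 30, 40)
)

# Hypothetical helper: returns TRUE if rows shared with `previous` are unchanged
check_previous_unchanged <- function(current, previous, by = "time") {
  # Keep only rows of `current` whose identifier already exists in `previous`
  overlap <- dplyr::semi_join(current, previous, by = by)

  # waldo::compare() returns a character vector describing differences;
  # a length of zero means no differences were found
  differences <- waldo::compare(overlap, previous, x_arg = "new", y_arg = "old")

  if (length(differences) > 0) {
    print(differences)
    return(FALSE)
  }
  TRUE
}

unchanged <- check_previous_unchanged(current, previous)

if (!unchanged) {
  # In a real pipeline this is where a transfer might be stopped
  message("Previous values have changed; review before publishing.")
}
```

Returning a logical rather than calling `stop()` inside the helper leaves the decision to halt, warn or continue with the calling pipeline.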