Commit
add vignette
thomaszwagerman committed Oct 16, 2024
1 parent 3b46bcd commit 656ecba
Showing 7 changed files with 153 additions and 66 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@
.DS_Store
.quarto
*.excalidraw
inst/doc
3 changes: 3 additions & 0 deletions DESCRIPTION
@@ -14,8 +14,11 @@ Imports:
    dplyr,
    waldo
Suggests:
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends:
    R (>= 2.10)
LazyData: true
VignetteBuilder: knitr
27 changes: 8 additions & 19 deletions README.Rmd
@@ -20,16 +20,18 @@ knitr::opts_chunk$set(
[![Codecov test coverage](https://codecov.io/gh/thomaszwagerman/butterfly/branch/main/graph/badge.svg)](https://app.codecov.io/gh/thomaszwagerman/butterfly?branch=main)
<!-- badges: end -->

The goal of butterfly is to aid in the QA/QC of continually updating/overwritten time-series data where we expect new values over time, but where we want to ensure previous data remains unchanged.
The goal of butterfly is to aid in the quality assurance of continually updating and overwritten time-series data, where we expect new values over time, but want to ensure previous data remains unchanged.

```{r butterfly_diagram, echo=FALSE, out.width="100%", fig.cap=""}
knitr::include_graphics("man/figures/README-butterfly_diagram.png")
```

Data previously recorded or calculated might change due to equipment recalibration, discovery of human error in model code or a change in methodology.
This could have unintended consequences, as changes to previous input data may also alter future predictions in forecasting models.

The butterfly package aims to flag changes to previous data to prevent data changes going unnoticed.
Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible.

Unnoticed changes in previous data could have unintended consequences, such as invalidating DOIs, or altering future predictions if used as input in forecasting models.

This package provides functionality that can be used as part of a data pipeline, to check and flag changes to previous data to prevent changes going unnoticed.

## Installation

@@ -58,11 +60,9 @@ butterflycount$february
butterflycount$march
```

We can use `butterfly::loupe()` to check if our previous values have changed.
We can use `butterfly::loupe()` to examine in detail whether previous values have changed.

```{r butterfly_example}
# Let's use butterfly::loupe() to check if our previous values have changed
# And if so, where this change occurred
butterfly::loupe(
  butterflycount$february,
  butterflycount$january,
@@ -76,7 +76,7 @@ butterfly::loupe(
)
```

`butterfly::loupe()` uses `dplyr::semi_join()` to match the timesteps of your current dataframe, to the timesteps already present in the previous dataframe. `waldo::compare()` is then used to compare these and return the differences.
`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old objects using a common unique identifier, which in a timeseries will be the timestep. `waldo::compare()` is then used to compare these and provide a detailed report of the differences.

`butterfly` follows the `waldo` philosophy of erring on the side of providing too much information, rather than too little. It will give a detailed feedback message on the status between two objects.

@@ -116,14 +116,3 @@ There are other R packages and functions which handle object comparison, which m
* [diffdf](https://github.com/gowerc/diffdf)

Other functions include `all.equal()` or [dplyr](https://github.com/tidyverse/dplyr)'s `setdiff()`

## Rationale
There are a lot of other data comparison and QA/QC packages out there, why butterfly?

This package was originally developed to deal with [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)'s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question.

Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can (for example for [09/21 - 12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth), and [07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685)) force a recalculation, meaning previously published data differs from the final product.

When publishing ERA5-derived datasets, and minting it with a DOI, it is possible to continuously append without invalidating that DOI. However, recalculation would overwrite previously published data, thereby forcing a new publication and DOI to be minted. We use the functionality in this package to detect changes, stop data transfer and notify the user.

This package has intentionally been generalised to accommodate other, but similar, use cases. Other examples could include a correction in instrument calibration, compromised data transfer or unnoticed changes in the parameterisation of a model.
67 changes: 20 additions & 47 deletions README.md
@@ -10,20 +10,24 @@
coverage](https://codecov.io/gh/thomaszwagerman/butterfly/branch/main/graph/badge.svg)](https://app.codecov.io/gh/thomaszwagerman/butterfly?branch=main)
<!-- badges: end -->

The goal of butterfly is to aid in the QA/QC of continually
updating/overwritten time-series data where we expect new values over
time, but where we want to ensure previous data remains unchanged.
The goal of butterfly is to aid in the quality assurance of continually
updating and overwritten time-series data, where we expect new values
over time, but want to ensure previous data remains unchanged.

<img src="man/figures/README-butterfly_diagram.png" width="100%" />

Data previously recorded or calculated might change due to equipment
recalibration, discovery of human error in model code or a change in
methodology. This could have unintended consequences, as changes to
previous input data may also alter future predictions in forecasting
models.
Data previously recorded could change for a number of reasons, such as
discovery of an error in model code, a change in methodology or
instrument recalibration. Monitoring data sources for these changes is
not always possible.

The butterfly package aims to flag changes to previous data to prevent
data changes going unnoticed.
Unnoticed changes in previous data could have unintended consequences,
such as invalidating DOIs, or altering future predictions if used as
input in forecasting models.

This package provides functionality that can be used as part of a data
pipeline, to check and flag changes to previous data to prevent changes
going unnoticed.

## Installation

@@ -68,12 +72,10 @@ butterflycount$march
#> 5 2023-11-01 18
```

We can use `butterfly::loupe()` to check if our previous values have
changed.
We can use `butterfly::loupe()` to examine in detail whether previous
values have changed.

``` r
# Let's use butterfly::loupe() to check if our previous values have changed
# And if so, where this change occurred
butterfly::loupe(
  butterflycount$february,
  butterflycount$january,
@@ -106,10 +108,10 @@ butterfly::loupe(
#> `new$count`: 17 22 55 11
```

`butterfly::loupe()` uses `dplyr::semi_join()` to match the timesteps of
your current dataframe, to the timesteps already present in the previous
dataframe. `waldo::compare()` is then used to compare these and return
the differences.
`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old
objects using a common unique identifier, which in a timeseries will be
the timestep. `waldo::compare()` is then used to compare these and
provide a detailed report of the differences.

`butterfly` follows the `waldo` philosophy of erring on the side of
providing too much information, rather than too little. It will give a
@@ -198,32 +200,3 @@ which may suit your specific needs better:

Other functions include `all.equal()` or
[dplyr](https://github.com/tidyverse/dplyr)’s `setdiff()`

## Rationale

There are a lot of other data comparison and QA/QC packages out there,
why butterfly?

This package was originally developed to deal with
[ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)’s
initial release data, ERA5T. ERA5T data for a month is overwritten with
the final ERA5 data two months after the month in question.

Usually ERA5 and ERA5T are identical, but occasionally an issue with
input data can (for example for [09/21 -
12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth),
and
[07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685))
force a recalculation, meaning previously published data differs from
the final product.

When publishing ERA5-derived datasets, and minting it with a DOI, it is
possible to continuously append without invalidating that DOI. However,
recalculation would overwrite previously published data, thereby forcing
a new publication and DOI to be minted. We use the functionality in this
package to detect changes, stop data transfer and notify the user.

This package has intentionally been generalised to accommodate other,
but similar, use cases. Other examples could include a correction in
instrument calibration, compromised data transfer or unnoticed changes
in the parameterisation of a model.
2 changes: 2 additions & 0 deletions vignettes/.gitignore
@@ -0,0 +1,2 @@
*.html
*.R
119 changes: 119 additions & 0 deletions vignettes/butterfly.Rmd
@@ -0,0 +1,119 @@
---
title: "butterfly"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{butterfly}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The goal of butterfly is to aid in the quality assurance of continually updating and overwritten time-series data, where we expect new values over time, but want to ensure previous data remains unchanged.

```{r butterfly_diagram, echo=FALSE, out.width="100%", fig.cap=""}
knitr::include_graphics("img/butterfly_diagram_light.png")
```

Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible.

Unnoticed changes in previous data could have unintended consequences, such as invalidating DOIs, or altering future predictions if used as input in forecasting models.

This package provides functionality that can be used as part of a data pipeline, to check and flag changes to previous data to prevent changes going unnoticed.

## Installation

You can install the development version of butterfly from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("thomaszwagerman/butterfly")
```

## How to use butterfly

This is a basic example which shows you how to use butterfly:

```{r simple_example}
library(butterfly)
# Imagine a continually updated dataset that starts in January and is updated once a month
butterflycount$january
# In February an additional row appears, all previous data remains the same
butterflycount$february
# In March an additional row appears again
# ...but a previous value has unexpectedly changed
butterflycount$march
```

We can use `butterfly::loupe()` to examine in detail whether previous values have changed.

```{r butterfly_example}
butterfly::loupe(
  butterflycount$february,
  butterflycount$january,
  datetime_variable = "time"
)
butterfly::loupe(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
```

`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old objects using a common unique identifier, which in a timeseries will be the timestep. `waldo::compare()` is then used to compare these and provide a detailed report of the differences.
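
The matching and comparing steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: the data frames `df_previous` and `df_current` are made up for the example, and only the calls to `dplyr::semi_join()` and `waldo::compare()` are taken from the description.

```r
# Illustrative data (hypothetical): the current version has one new timestep
# and one previously recorded value that has changed.
df_previous <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  count = c(10, 12, 15)
)
df_current <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01")),
  count = c(10, 99, 15, 18)
)

# Keep only the rows of the current data whose timestep already exists in the
# previous data, so that new rows are not reported as differences.
df_matched <- dplyr::semi_join(df_current, df_previous, by = "time")

# Compare the previous data against the matched rows and print the report.
differences <- waldo::compare(df_previous, df_matched)
print(differences)
```

Because the new April row is dropped before the comparison, the report should point at the changed February value rather than at the newly added row.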

`butterfly` follows the `waldo` philosophy of erring on the side of providing too much information, rather than too little. It will give a detailed feedback message on the status between two objects.

### Using butterfly for data wrangling
You might want to return changed rows as a dataframe, or drop them altogether. For this `butterfly::catch()` and `butterfly::release()` are provided.

Here, `butterfly::catch()` only returns rows which have **changed** from the previous version. It will not return new rows.

```{r butterfly_catch}
df_caught <- butterfly::catch(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
df_caught
```

Conversely, `butterfly::release()` drops all rows which have changed from the previous version. Note that it retains new rows, as these were expected.

```{r butterfly_release}
df_released <- butterfly::release(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
df_released
```

## Incorporating butterfly in a data pipeline

Examples of applying butterfly in a data pipeline.
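
One possible pattern is a gate that halts the pipeline when previously recorded rows have changed. The sketch below is an assumption-laden illustration, not part of the package: `halt_if_changed()` is a hypothetical helper, and `df_caught` is a hand-built stand-in for a data frame of changed rows such as the one `butterfly::catch()` is described as returning. How `butterfly::catch()` behaves when nothing has changed is not covered here, so that should be checked before relying on this pattern.

```r
# Hypothetical pipeline step; not part of butterfly.
# `changed_rows` is a data frame of rows that differ from the previous
# version, for example the result of butterfly::catch().
halt_if_changed <- function(changed_rows) {
  if (nrow(changed_rows) > 0) {
    stop(
      nrow(changed_rows), " previously recorded row(s) have changed; ",
      "halting data transfer.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Stand-in for the output of butterfly::catch(), so the sketch runs on its own.
df_caught <- data.frame(time = as.Date("2024-02-01"), count = 99)

# Run the gate; on failure, report the problem instead of continuing.
transfer_ok <- tryCatch(
  halt_if_changed(df_caught),
  error = function(e) {
    # In a real pipeline, this is where the user would be notified.
    message("Pipeline stopped: ", conditionMessage(e))
    FALSE
  }
)
```

Downstream steps, such as transferring or publishing the data, would then run only when `transfer_ok` is `TRUE`.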

## Rationale

There are many other data comparison and QA/QC packages out there, so why butterfly?

This package was originally developed to deal with [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)'s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question.

Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can force a recalculation (for example for [09/21 - 12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth) and [07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685)), meaning previously published data differs from the final product.

When publishing ERA5-derived datasets and minting DOIs for them, it is possible to continuously append data without invalidating the DOI. However, recalculation would overwrite previously published data, thereby forcing a new publication and a new DOI to be minted.

We use the functionality in this package in an automated data processing pipeline to detect changes, stop data transfer and notify the user.

This package has intentionally been generalised to accommodate other, but similar, use cases. Other examples could include a correction in instrument calibration, compromised data transfer or unnoticed changes in the parameterisation of a model.
Binary file added vignettes/img/butterfly_diagram_light.png