Commit
1 parent 3b46bcd, commit 656ecba
Showing 7 changed files with 153 additions and 66 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,3 +5,4 @@ | |
.DS_Store | ||
.quarto | ||
*.excalidraw | ||
inst/doc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
*.html | ||
*.R |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
--- | ||
title: "butterfly" | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{butterfly} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
``` | ||
|
||
The goal of butterfly is to aid in the quality assurance of continually updated and overwritten time-series data, where we expect new values over time but want to ensure that previous data remains unchanged. | ||
|
||
```{r butterfly_diagram, echo=FALSE, out.width="100%", fig.cap=""} | ||
knitr::include_graphics("img/butterfly_diagram_light.png") | ||
``` | ||
|
||
Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible. | ||
|
||
Unnoticed changes in previous data could have unintended consequences, such as invalidating DOIs, or altering future predictions if used as input in forecasting models. | ||
|
||
This package provides functionality that can be used as part of a data pipeline to check for and flag changes to previous data, so that they do not go unnoticed. | ||
|
||
## Installation | ||
|
||
You can install the development version of butterfly from [GitHub](https://github.com/) with: | ||
|
||
``` r | ||
# install.packages("devtools") | ||
devtools::install_github("thomaszwagerman/butterfly") | ||
``` | ||
|
||
## How to use butterfly | ||
|
||
This is a basic example which shows you how to use butterfly: | ||
|
||
```{r simple_example} | ||
library(butterfly) | ||
# Imagine a continually updated dataset that starts in January and is updated once a month | ||
butterflycount$january | ||
# In February an additional row appears, all previous data remains the same | ||
butterflycount$february | ||
# In March an additional row appears again | ||
# ...but a previous value has unexpectedly changed | ||
butterflycount$march | ||
``` | ||
|
||
We can use `butterfly::loupe()` to examine in detail whether previous values have changed. | ||
|
||
```{r butterfly_example} | ||
butterfly::loupe( | ||
butterflycount$february, | ||
butterflycount$january, | ||
datetime_variable = "time" | ||
) | ||
butterfly::loupe( | ||
butterflycount$march, | ||
butterflycount$february, | ||
datetime_variable = "time" | ||
) | ||
``` | ||
|
||
`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old objects using a common unique identifier, which in a time series will be the timestep. `waldo::compare()` is then used to compare the matched rows and provide a detailed report of the differences. | ||
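
The matching-then-comparing step described above can be sketched with `dplyr` and `waldo` directly. This is a minimal illustration under stated assumptions, not butterfly's actual source: the data frames `old` and `new` and their values are made up for the example, and only `dplyr::semi_join()` and `waldo::compare()` are taken from the description.

``` r
library(dplyr)
library(waldo)

# Hypothetical data: `new` has one extra timestep and one changed value
old <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01")),
  count = c(10, 12)
)
new <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  count = c(10, 15, 20)
)

# Keep only the rows of `new` whose timestep also exists in `old`,
# so that newly appended rows are not reported as differences
new_matched <- dplyr::semi_join(new, old, by = "time")

# Compare the matched rows against the previous version;
# the result is empty when nothing has changed
differences <- waldo::compare(new_matched, old)
differences
```

Filtering with a semi join before comparing is what lets appended rows pass silently while a change to an existing timestep is still reported.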
|
||
`butterfly` follows the `waldo` philosophy of erring on the side of providing too much information, rather than too little. It gives a detailed feedback message on the differences between the two objects. | ||
|
||
### Using butterfly for data wrangling | ||
You might want to return changed rows as a dataframe, or drop them altogether. For this, `butterfly::catch()` and `butterfly::release()` are provided. | ||
|
||
Here, `butterfly::catch()` only returns rows which have **changed** from the previous version. It will not return new rows. | ||
|
||
```{r butterfly_catch} | ||
df_caught <- butterfly::catch( | ||
butterflycount$march, | ||
butterflycount$february, | ||
datetime_variable = "time" | ||
) | ||
df_caught | ||
``` | ||
|
||
Conversely, `butterfly::release()` drops all rows which have changed from the previous version. Note that it retains new rows, as these were expected. | ||
|
||
```{r butterfly_release} | ||
df_released <- butterfly::release( | ||
butterflycount$march, | ||
butterflycount$february, | ||
datetime_variable = "time" | ||
) | ||
df_released | ||
``` | ||
|
||
## Incorporating butterfly in a data pipeline | ||
|
||
Examples of applying butterfly in a pipeline. | ||
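
One possible shape for such a pipeline step is a small gate that stops when changed rows are present. The sketch below is an assumption for illustration only: `halt_if_changed()` is a hypothetical helper, not part of butterfly, and the `changed_rows` data frame is a hand-made stand-in for the output of `butterfly::catch()`.

``` r
# Hypothetical helper (not part of butterfly): halt the pipeline when a
# data frame of changed rows is non-empty.
halt_if_changed <- function(changed_rows) {
  if (nrow(changed_rows) > 0) {
    stop(
      nrow(changed_rows),
      " previously published row(s) have changed; stopping data transfer.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Stand-in for the result of
# butterfly::catch(new, old, datetime_variable = "time")
changed_rows <- data.frame(time = as.Date("2024-02-01"), count = 15)

# Capture the error message instead of aborting, so the message can be
# passed on to whoever needs to be notified
result <- tryCatch(
  halt_if_changed(changed_rows),
  error = function(e) conditionMessage(e)
)
result
```

Keeping the gate separate from the comparison means the same check can be reused regardless of how the changed rows were obtained.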
|
||
## Rationale | ||
|
||
There are a lot of other data comparison and QA/QC packages out there, so why butterfly? | ||
|
||
This package was originally developed to deal with [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)'s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question. | ||
|
||
Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can force a recalculation (for example for [09/21 - 12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth) and [07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685)), meaning previously published data differs from the final product. | ||
|
||
When publishing an ERA5-derived dataset and minting a DOI for it, it is possible to continuously append new data without invalidating that DOI. However, a recalculation would overwrite previously published data, thereby forcing a new publication and a new DOI to be minted. | ||
|
||
We use the functionality in this package in an automated data processing pipeline to detect changes, stop data transfer and notify the user. | ||
|
||
This package has intentionally been generalised to accommodate other, but similar, use cases. Other examples could include a correction in instrument calibration, compromised data transfer or unnoticed changes in the parameterisation of a model. |