README.Rmd

---
title: "senamhiR: A collection of functions to obtain Peruvian climate data in R"
output: 
  github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(senamhiR)
library(dplyr)
```

The package provides an automated solution for the acquisition of archived Peruvian climate and hydrology data directly within R. The data was compiled from the Senamhi website, and contains all of the data that was available as of April 10, 2018. This data was originally converted from HTML, and is now accessible via an API hosted by the package author. 

It is important to note that the info on the Senamhi website has not undergone quality control, however, this package includes a helper function to perform the most common quality control operations for the temperature variables. More functions will be added in the future.

## Installing

This package is under active development, and is not available from the official Comprehensive R Archive Network (CRAN). To make installation easier, I have written a script that should facilitate the installation of the package and its dependencies. Use the following command to run this script:
``` {r, eval = FALSE}
source("https://gitlab.com/snippets/1793256/raw");install("senamhiR")
```
_Note: It is always a good idea to review code before you run it. Click the URL in the above command to see the commands that we will run to install._

Once the packages have installed, load **senamhiR** by:
``` {r, eval = FALSE}
library(senamhiR)
```

## Basic workflow

The functions contained in the **senamhiR** functions allow for the discovery and visualization of meteorological and hydrological stations, and the acquisition of daily climate data from these stations.

### `station_search()`

To search for a station by name, use the `station_search()` function. For instance, to search for a station with the word 'Santa' in the station name, use the following code:

```{r}
station_search("Santa")
```

Note that the `tbl_df` object (a special sort of `data.frame`) won't print more than the first 10 rows by default. To see all of the results, you can wrap the command in `View()` so that it becomes `View(find_station("Santa"))`.

Note that you can also use wildcards as supported by the `glob2rx()` from the **utils** package by passing the argument `glob = TRUE`, as in the following example.

```{r}
station_search("San*", glob = TRUE)
```

You can filter your search results by region, by station type, by a given period, and by proximity to another station or a vector of coordinates. You can use any combination of these four filters in your search. The function is fully documented, so take a look at `?station_search`. Let's see some examples.

#### Find all stations in the San Martín Region
```{r}
station_search(region = "SAN MARTIN")
```
#### Find stations named "Santa", with data available between 1971 to 2000
```{r}
station_search("Santa", period = 1971:2000)
```
#### Find all stations between 0 and 100 km from Station No. 000401
```{r}
station_search(target = "000401", dist = 0:100)
```
#### Find all stations that are within 50 km of Machu Picchu
```{r}
station_search(target = c(-13.163333, -72.545556), dist = 0:50)
```

### Acquire data: `senamhiR()`

Once you have found your station of interest, you can download the daily data using the eponymous `senamhiR()` function. The function takes two arguments, station and year. If year is left blank, the function will return all available archived data. 

If I wanted to download data for Requena (station no. 000280) from 1981 to 2010, I could use: 

```{r}
requ <- senamhiR("000280", 1981:2010)
```
_Note: Since the StationID numbers contain leading zeros, any station that is less than six characters long will be padded with zeroes. i.e. 280 becomes 000280._

```{r}
requ
```

Make sure to use the assignment operator (`<-`) to save the data into an R object, otherwise the data will just print out to the console, and won't get saved anywhere in the memory. 

## For easier station visualization

### `map_stations()`

Sometimes a long list of stations is hard to visualize spatially. The `map_stations()` function helps to overcome this. This function takes a list of stations and shows them on a map powered by the [Leaflet](http://leafletjs.com/) library. Like the previous function, the map function is even smart enough to take a search as its list of stations as per the example below. Note that this mapping functionality requires the **leaflet** package to be installed, and it is not included as a dependency of **senamhiR**.

#### Show a map of all stations that are between 30 and 50 km of Machu Picchu
```{r, eval=FALSE}
map_stations(station_search(target = c(-13.163333, -72.545556), dist = 30:50), zoom = 7)
```

## Quality control functions

There are two functions included to perform some basic quality control. 

### `quick_audit()`

The `quick_audit()` function will return a tibble listing the percentage or number of missing values for a station. For instance, the following command will return the percentage of missing values in our 30-year Requena data set:

```{r}
quick_audit(requ, c("Tmax", "Tmin"))
```

Use `report = "n"` to show the _number_ of missing values. Use `by = "month"` or `by = "year"` to show missing data by month or year. For instance, the number of days for which Mean Temperature was missing at Tocache in 1980:

```{r}
toca <- senamhiR("000463", year = 1980)
quick_audit(toca, "Tmean", by = "month", report = "n")
```

### `qc()`

There is an incomplete and experimental function to perform automated quality control on data acquired through this package. Fow now, the package tests temperature and river level only. The logic used between these two types of data is different. Note that these methods are not necessarily statistically robust, and have not been subjected to rigourous testing. Your mileage may vary. In all cases, the original values are archived in an "observations" column, so you can always restore the original values manually. 

#### Temperature variables

##### Case 1: Missing decimal point
 
Any number above 100 °C or below -100 °C is tested: 

If the number appears to have missed a decimal place (e.g. 324 -> 32.4; 251 -> 25.1), we try to divide that number by 10. If the result is within 1.5 standard deviations of all values 30 days before and after the day in question, we keep the result, otherwise, we discard it.

If the number seems to be the result of some other typographical error (e.g. 221.2), we discard the data point. 

##### Case 2: _T_<sub>max</sub> < _T_<sub>min</sub>

We perform the same tests for both _T_<sub>max</sub> and _T_<sub>min</sub>. If the number is within 1.5 standard deviations of all values 30 days before and after the day in question, we leave the number alone. (Note: this is often the case for _T_<sub>min</sub> but seldom the case for _T_<sub>max</sub>). If the number does not fall within 1.5 standard deviations, we perform an additional level of testing to check if the number is the result of a premature decimal point (e.g. 3.4 -> 34.0; 3 -> 30.0). In this case, we try to multiply the number by 10. If this new result is within 1.5 standard deviations of all values 30 days before and after the day in question, we keep the result, otherwise, we discard it.

_I have less confidence in this solution than I do for Case 1._

##### Example:

```{r}
requ_dirty <- senamhiR("000280") #1960 to 2018
requ_qc <- qc(requ_dirty)
requ_qc %>% filter(Observations != "") %>% select(Fecha, `Tmax (C)`, `Tmin (C)`, `Tmean (C)`, Observations)
```

##### Cases that are currently missed:

 - Cases where _T_<sub>min</sub> is small because of a typo.
 - Cases where _T_<sub>max</sub> is small because of a typo, but not smaller than _T_<sub>min</sub>.
 
##### Cases where this function is plain wrong: 

 - When there are a number of similar errors within the 60-day period, bad data is sometimes considered okay. This is especially apparent at, for instance, Station 47287402.

#### River level:

##### Case 1: Suspected decimal place shift

The function first calculates the daily range in river level across the four daily observations. If any range is greater than the (somewhat arbitrary) value of ten times the average range, then we extract a slice of the level observations corresponding to two days before and two days after the day in question. We standardize the slice of data; if any _single_ standardized value is above 1 (below -1), we try to multiply (divide) the value by 10. If the new value falls within 1.5 standard deviations of the mean of the "good" values, then we keep the modified value and call it a decimal place error, otherwise, we set the value to missing and label it as an error.

##### Example:

```{r, echo=2:4}
options(tibble.width = Inf)
pico_dirty <- senamhiR("230715") #2003 to 2018
pico_qc <- qc(pico_dirty)
pico_qc %>% filter(Observations != "") %>% select(Fecha, starts_with("Nivel"), Observations)
```

##### Cases that are currently missed:

 - Cases where there is an error in more than one of the 20 river level observations used for context.
 - Cases where one of the river level observations is missing.

##### Cases where this function is plain wrong: 

 - It may be the case that there is a quick spike in the river level due to a precipitation event. This function may incorrectly detect such spikes as human entry errors. Any correction made by the function should be confirmed manually. 

#### Variables controlled for: 

 - _T_<sub>max</sub>
 - _T_<sub>min</sub>
 - _T_<sub>mean</sub>
 - River Level

__No other variables are currently tested; hydrological data is not tested. This data should not be considered "high quality", use of the data is your responsibility.__ Note that all values that are modified from their original values will be recorded in a new "Observations" column in the resultant tibble.

## Disclaimer

The package outlined in this document is published under the GNU General Public License, version 3 (GPL-3.0). The GPL is an open source, copyleft license that allows for the modification and redistribution of original works. Programs licensed under the GPL come with NO WARRANTY. In our case, a simple R package isn't likely to blow up your computer or kill your cat. Nonetheless, it is always a good idea to pay attention to what you are doing, to ensure that you have downloaded the correct data, and that everything looks ship-shape. 

## What to do if something doesn't work

If you run into an issue while you are using the package, you can email me and I can help you troubleshoot the issue. However, if the issue is related to the package code and not your own fault, you should contribute back to the open source community by reporting the issue. You can report any issues to me here on [GitLab](https://gitlab.com/ConorIA/senamhiR).

If that seems like a lot of work, just think about how much work it would have been to do all the work this package does for you, or how much time went in to writing these functions ... it is more than I'd like to admit!

## Senamhi terms of use

Senamhi's terms of use are [here](http://senamhi.gob.pe/?p=terminos_condiciones), but as of writing that link was redirecting to the Senamhi home page. An archived version is available [here](https://web.archive.org/web/20170822092538/http://senamhi.gob.pe/?p=terminos_condiciones). The terms allow for the free and public access to information on the Senamhi website, in both for-profit and non-profit applications. However, Senamhi stipulates that any use of the data must be accompanied by a disclaimer that Senamhi is the proprietor of the information. The following text is recommended (official text in Spanish):

- **Official Spanish:** _Información recopilada y trabajada por el Servicio Nacional de Meteorología e Hidrología del Perú. El uso que se le da a esta información es de mi (nuestra) entera responsabilidad._
- **English translation:** This information was compiled and maintained by Peru's National Meteorology and Hydrology Service (Senamhi). The use of this data is of my (our) sole responsibility.

A message similar to the English message above is printed to the R console whenever the package is loaded.