Add geography lookup data and functions (#81)
* Initial code to query ONS API

* WIP: add lookup data from ONS API

* Working wd_pcon_lad_la file, needs tidying, documentation improvements and tests

* still WIP, got mostly working but a few data issues (going to need to squash these commits!)

* got fetch_ons_api batching up okay

* update column order in description

* add verbose toggle and general WIP

* general package tidying

* removed multiple years and stopped returning time cols in fetch functions

* WIP: two todo's left to fix

* Everything finally works!

* add examples to readme, fix up pkgdown site

* Quick refactor of the create_project() code to prevent linting issues

* remove notes in contributing

* tidy up wordlist

* fix issue in api wrapper

* code to add country information

* add countries data set

* add rgn, gor, ctry to shorthands

* extend to join on region and country, add fetch region and ward

* tidy URLs in data set sources

* Increment version number to 0.5.0

* add regions

* add regions into pkgdown yml

* shuffle shuffle (to <family>_utils.R and binning helper_functions.R)

* typo fix in test file naming

* tighten up check_fetch_location_inputs

* Update countries and regions to use year variables at start of scripts

* Move to 4 digit years and improve documentation

* Add a section to the contributing guide about the geography data

* update function comments

* add internal link into contributing

* add rich's error guidance

* fix typo
cjrace authored Sep 16, 2024
1 parent 54cad63 commit 45a6834
Showing 68 changed files with 2,889 additions and 431 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -8,3 +8,4 @@
^CODE_OF_CONDUCT\.md$
^codecov\.yml$
^README\.Rmd$
^data-raw$
File renamed without changes.
78 changes: 78 additions & 0 deletions .github/CONTRIBUTING.md
@@ -177,6 +177,25 @@ lintr::lint_package()

[styler](https://CRAN.R-project.org/package=styler) will not fix all linting issues, so we recommend using that first, then using [lintr](https://lintr.r-lib.org/articles/lintr.html) to check for places you may need to manually fix styling issues such as line length or not using snake_case.

## Folder and script structure conventions

In an R package you are not allowed to have any sub directories in the R/ folder. Where possible we should have:

* one script per function
or
* one script per function family if a function belongs to a family

The script should share the name of the function or family. If needed, a `<family>_utils.R` script should be used for
internal-only functions that relate to a specific family.

Documentation for all data shipped with the package is kept in `R/datasets_documentation.R`. Scripts used for preparing
data used in the package are kept in the `data-raw/` folder rather than the R folder. Helper functions for these can be found
in the `R/datasets_utils.R` script, and more details on maintaining the data sets can be found under the [Package data](#package-data) header on this page.

`utils.R` should be used to hold any cross-package helpers that aren't exported as functions or specific to a family.

Every exported function or data set gets its own test script, called `test-<function-name>.R` or `test-data-<data-name>.R` if for a data set.

### Testing

We use [testthat](https://cran.r-project.org/package=testthat) for unit tests, and we expect all new functions to have some level of test coverage.
@@ -219,6 +238,65 @@ Vignettes can be found in the `vignettes/` folder as .Rmd files. To start a new
usethis::use_vignette("name_of_vignette")
```

## Package data

Our general workflow for data in the package is:

0. Make sure you have everything you need installed and the package is loaded with `devtools::load_all()`
1. Use the relevant script in data-raw/ to generate and save the data set
2. Document the data set in R/datasets_documentation.R
3. Update any relevant fetch_ functions that use the data if appropriate
4. Run all the usual package checks and re-documenting of the package as you would for any other update
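
As a rough illustration of steps 1 and 2, a data-raw/ script might look something like the sketch below. The data set name `example_lookup` and its columns are hypothetical and not part of the package. The save step is guarded so that it only runs from the root of a package project.

```r
# data-raw/example_lookup.R (hypothetical script name, for illustration only)

# Define the data in code so that it is fully reproducible
example_lookup <- data.frame(
  example_name = c("Alpha", "Beta"),
  example_code = c("X01", "X02"),
  stringsAsFactors = FALSE
)

# Save to data/ so the data set ships with the package. This only makes
# sense when run from the root of the package project, hence the guard.
if (file.exists("DESCRIPTION") && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_data(example_lookup, overwrite = TRUE)
}
```

The matching documentation block, listing the number of rows and every column, would then be added to R/datasets_documentation.R.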

Our general principle is that all data should be created through reproducible code, so if it's custom data you're defining, write it in code. If you're sourcing it from elsewhere, try to make use of API connections. This saves on unnecessary data storage bloat and makes our scripts nice and reproducible without external dependencies to worry about.

We try to keep the data-raw/ scripts as tidy as possible, so some helper functions have been created in R/datasets_utils.R. These are not exported for users of the package and are only used by scripts in the data-raw/ folder for the creation of data exported in the package.

Sometimes when running the scripts to create new data sets you might hit this error:

```
Error in `check_is_package()`:
i use_data() is designed to work with packages
X Project "some letters and numbers" is not an R package.
```

If you do, try restarting R, making sure you have the project open and the package loaded using `devtools::load_all()`, and then run the script again.

For more details on maintaining data with an R package generally, see [chapter 7 Data, from R packages by Hadley Wickham and Jennifer Bryan](https://r-pkgs.org/data.html).

### Geography data sets

In the package we export a number of data sets derived from the [ONS Open Geography portal](https://geoportal.statistics.gov.uk/) for easy reuse within DfE analysis. Whenever new data appears or we want to make updates to these data sets, we need to make those updates manually.

#### Source

Where we can, we use their API to get the data, so that we have completely reproducible pipelines for this (rather than saving static files manually and then having to check if updates have been made, or having to worry about file storage).

On the [ONS Open Geography portal](https://geoportal.statistics.gov.uk/), you will usually be looking for data published as a feature or feature layer, as these are the ones made available via the API connection. You'll be able to preview the data in the browser and do basic searching / filtering on the table if you want to visualise it. Any feature data should have an option somewhere for 'I want to use this data' (or something similar if they update their website design) where you can get to an API explorer that allows you to run a basic query in the browser. In here you can usually find the dataset_id and also the parameters you want to use to get the data you need.

We have a `get_ons_api_data()` function that acts as a wrapper to the ONS API. It does things like converting readable parameters into a query string, and it also handles batching and multiple requests if needed, so you get all of the data in one nice neat data frame (there's a limit on the rows per single query for the API).
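
To give a feel for what batching means here, the sketch below pages through a pretend remote table until a short page comes back and then binds the pages together. It is not the internals of `get_ons_api_data()`: `fetch_page()` and `fetch_all()` are hypothetical names, and the 2,500-row table and 1,000-row limit are made-up numbers.

```r
# Hypothetical stand-in for a single API request: returns rows
# offset + 1 to offset + limit of a pretend 2,500-row remote table
fetch_page <- function(offset, limit) {
  total_rows <- 2500
  if (offset >= total_rows) {
    return(data.frame(id = integer(0)))
  }
  data.frame(id = seq(offset + 1, min(offset + limit, total_rows)))
}

# Keep requesting pages until one comes back short of the batch size,
# then bind everything into a single data frame
fetch_all <- function(fetch_page, batch_size = 1000) {
  pages <- list()
  offset <- 0
  repeat {
    page <- fetch_page(offset, batch_size)
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < batch_size) break
    offset <- offset + batch_size
  }
  do.call(rbind, pages)
}

all_rows <- fetch_all(fetch_page)
```

In a real wrapper the stand-in would be replaced by an HTTP request, with the offset and batch size passed as query parameters.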

The way ONS publish has varied over their first few years of publishing, and on top of that each data set has an individual API connection for every year of boundaries. As there's no link over time from the ONS side, we have helper functions defined in R/datasets_utils.R that wrap these up into a single neat time series bundle for us. Given the likelihood of further variations, don't be too surprised if adding new years to the data sets results in errors first time around. Some manual fudgery is often needed, so roll up your sleeves and prepare to get elbow deep into the murky depths of the R/datasets_utils.R file!
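
The general idea of bundling separate yearly lookups into one time series can be sketched as below. This is a simplified base R illustration with made-up data, not the actual helpers in R/datasets_utils.R. The `code` and `name` columns and their values are hypothetical, while the two year columns mirror those documented for `wd_pcon_lad_la_rgn_ctry`.

```r
# Hypothetical per-year lookups, as if each came from its own API connection
lookup_2023 <- data.frame(code = c("A1", "B2"), name = c("Alpha", "Beta"))
lookup_2024 <- data.frame(code = c("B2", "C3"), name = c("Beta", "Gamma"))
yearly <- list("2023" = lookup_2023, "2024" = lookup_2024)

# Stack the years, recording which year each row came from
stacked <- do.call(rbind, lapply(names(yearly), function(yr) {
  cbind(year = as.integer(yr), yearly[[yr]])
}))

# Collapse to one row per location with the first and last year it appears
first_year <- aggregate(year ~ code + name, data = stacked, FUN = min)
last_year <- aggregate(year ~ code + name, data = stacked, FUN = max)
names(first_year)[names(first_year) == "year"] <- "first_available_year_included"
names(last_year)[names(last_year) == "year"] <- "most_recent_year_included"

time_series <- merge(first_year, last_year, by = c("code", "name"))
```

Here "Beta" appears in both years, so it ends up as a single row spanning 2023 to 2024, while "Alpha" and "Gamma" each cover one year only.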

There is also some data we just define ourselves in code as we curate that, like custom regions we publish in DfE or our own lookup table for the shorthands used in the column names by ONS.

#### Workflow for updating geography data

Our general workflow for updating geography data is:

1. Add a new year into the relevant script in data-raw/
2. Run the script to create a new data set
3. Run all package checks to make sure the data hasn't gone all funky on you
4. Update any fetch_ functions that use the data if appropriate
5. Document any changes to the data set in R/datasets_documentation.R if appropriate

Most data sets have tests that will fail as soon as the number of rows or columns changes. This is both to provide a reliable service to users and to remind us to maintain the documentation, as the row count and all column names are defined in R/datasets_documentation.R. If these tests fail, update the relevant documentation, and then (ONLY THEN!) update the test expectations to match the new documentation.

The fetch_ family of functions in R/fetch.R act as quick helpers that pull from the data sets we export, so users can ees-ily grab say a list of all Scottish Parliamentary Constituencies for 2024, rather than needing to pull in a whole data frame and process it.
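
As a minimal sketch of the kind of filtering a fetch_ style helper does, assuming a lookup with the year-range columns described in the data set documentation, something like the following would work. `fetch_example_pcons()` is a hypothetical function rather than the package's implementation, and apart from Glasgow East the rows and codes are invented for illustration.

```r
# Hypothetical miniature lookup using the documented year-range columns
example_lookup <- data.frame(
  first_available_year_included = c(2017, 2017, 2024),
  most_recent_year_included = c(2024, 2020, 2024),
  pcon_name = c("Glasgow East", "Old Seat", "New Seat"),
  pcon_code = c("S14000030", "S14000099", "S14000100"),
  stringsAsFactors = FALSE
)

# Return the unique constituencies that exist in a given year, optionally
# limited to one country by the first letter of the code (E / N / S / W)
fetch_example_pcons <- function(lookup, year, country_letter = NULL) {
  in_year <- lookup$first_available_year_included <= year &
    lookup$most_recent_year_included >= year
  result <- unique(lookup[in_year, c("pcon_code", "pcon_name")])
  if (!is.null(country_letter)) {
    result <- result[substr(result$pcon_code, 1, 1) == country_letter, ]
  }
  rownames(result) <- NULL
  result
}

scottish_2024 <- fetch_example_pcons(example_lookup, 2024, "S")
```

The caller gets back a small two-column data frame rather than the full lookup, which is the convenience the fetch_ family is aiming for.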

Often if adding a new year of data in, you will need to edit the year variables set near the start of the data-raw/ file and then also in the relevant fetch_ function @param year, as well as updating the public documentation of the data set in R/datasets_documentation.R.

## Code of Conduct

Please note that the dfeR project is released with a [Contributor Code of Conduct](CODE_OF_CONDUCT.md). By contributing to this project you agree to abide by its terms.
13 changes: 12 additions & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: dfeR
Title: Common DfE R tasks
Version: 0.4.1
Version: 0.5.0
Authors@R: c(
person("Cam", "Race", , "[email protected]", role = c("aut", "cre")),
person("Laura", "Selby", , "[email protected]", role = "aut"),
@@ -18,22 +18,33 @@ License: GPL (>= 3)
URL: https://dfe-analytical-services.github.io/dfeR/,
https://github.com/dfe-analytical-services/dfeR
BugReports: https://github.com/dfe-analytical-services/dfeR/issues
Depends:
R (>= 2.10)
Imports:
dplyr,
emoji,
httr,
jsonlite,
lifecycle,
magrittr,
renv,
rlang,
tidyselect,
usethis,
utils,
withr
Suggests:
knitr,
readxl,
rmarkdown,
spelling,
stringr,
testthat (>= 3.0.0)
VignetteBuilder:
knitr
Config/testthat/edition: 3
Encoding: UTF-8
Language: en-GB
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
10 changes: 10 additions & 0 deletions NAMESPACE
@@ -2,18 +2,28 @@

export(comma_sep)
export(create_project)
export(fetch_countries)
export(fetch_lads)
export(fetch_las)
export(fetch_pcons)
export(fetch_regions)
export(fetch_wards)
export(format_ay)
export(format_ay_reverse)
export(format_fy)
export(format_fy_reverse)
export(get_clean_sql)
export(get_ons_api_data)
export(pretty_filesize)
export(pretty_num)
export(pretty_time_taken)
export(round_five_up)
export(toggle_message)
import(renv, except = run)
importFrom(emoji,emoji)
importFrom(lifecycle,deprecated)
importFrom(magrittr,"%>%")
importFrom(rlang,.data)
importFrom(usethis,create_package)
importFrom(usethis,create_project)
importFrom(usethis,proj_set)
28 changes: 27 additions & 1 deletion NEWS.md
@@ -1,3 +1,29 @@
# dfeR 0.5.0

Add the following lookup data sets into the package:

- ons_geog_shorthands
- countries
- regions
- wd_pcon_lad_la_rgn_ctry

Add the following fetch_locations() functions:

- fetch_wards()
- fetch_pcons()
- fetch_lads()
- fetch_las()
- fetch_regions()
- fetch_countries()

Add wrapper for ONS Open Geography Portal API:

- get_ons_api_data()

Add helper for turning messages on or off:

- toggle_message()

# dfeR 0.4.1

Update comma_sep() function to allow non-numeric values instead of throwing an error, now returns them unchanged.
@@ -10,7 +36,7 @@ Add function which creates a DfE R project:

# dfeR 0.3.1

Fix bug in get_clean_sql() where using the additional settings would lose the original SQL statement
Fix bug in get_clean_sql() where using the additional settings would lose the original SQL statement.

# dfeR 0.3.0

26 changes: 14 additions & 12 deletions R/create_project.R
@@ -46,19 +46,21 @@ create_project <- function(
...) {
# Function parameter checks ---
# Check if the parameters are 1 length booleans
if (!is.logical(init_renv) || length(init_renv) != 1) {
stop("init_renv must be a boolean.")
} else if (!is.logical(include_structure_for_pkg) ||
length(include_structure_for_pkg) != 1) {
stop("include_structure_for_pkg must be a boolean.")
} else if (!is.logical(create_publication_proj) ||
length(create_publication_proj) != 1) {
stop("create_publication_proj must be a boolean.")
} else if (!is.logical(include_github_gitignore) ||
length(include_github_gitignore) != 1) {
stop("include_github_gitignore must be a boolean.")
}
# List of variables to check
variables <- list(
init_renv = init_renv,
include_structure_for_pkg = include_structure_for_pkg,
create_publication_proj = create_publication_proj,
include_github_gitignore = include_github_gitignore
)

# Loop through each variable and check if it's a boolean
for (var_name in names(variables)) {
var_value <- variables[[var_name]]
if (!is.logical(var_value) || length(var_value) != 1) {
stop(paste(var_name, "must be a boolean."))
}
}

# Project creation -----
usethis::create_project(path = path, open = FALSE)
106 changes: 106 additions & 0 deletions R/datasets_documentation.R
@@ -0,0 +1,106 @@
#' Lookup for ONS geography columns shorthands
#'
#' A lookup of ONS geography shorthands and their respective column names in
#' line with DfE open data standards.
#'
#' GOR (Government Office Region) was the predecessor to RGN.
#'
#' @format ## `ons_geog_shorthands`
#' A data frame with 7 rows and 3 columns:
#' \describe{
#' \item{ons_level_shorthands}{ONS shorthands used in their lookup files}
#' \item{name_column}{DfE names for geography name columns}
#' \item{code_column}{DfE names for geography code columns}
#' }
#' @source curated by explore.statistics@@education.gov.uk
"ons_geog_shorthands"

#' Ward to Constituency to LAD to LA to Region to Country lookup
#'
#' A lookup showing the hierarchy of ward to Westminster parliamentary
#' constituency to local authority district to local authority to region to
#' country for years 2017, 2019, 2020, 2021, 2022, 2023 and 2024.
#'
#' Changes we've made to the original lookup:
#' 1. The original lookup from ONS uses the Upper Tier Local Authority, we then
#' update this so that where there is a metropolitan local authority we use the
#' local authority district as the local authority to match how
#' DfE publish data for local authorities.
#'
#' 2. We have noticed that in the 2017 version, the Glasgow East constituency
#' had a code of S1400030 instead of the usual S14000030, we've assumed this
#' was an error and have changed this in our data so that Glasgow East is
#' S14000030 in 2017.
#'
#' 3. We have joined on regions using the Ward to LAD to County to Region file.
#'
#' 4. We have joined on countries based on the E / N / S / W at the start of
#' codes.
#'
#' 5. Scotland had no published regions in 2017, so given the rest of the years
#' have Scotland as the region, we've forced that in for 2017 too to complete
#' the data set.
#'
#' @format ## `wd_pcon_lad_la_rgn_ctry`
#' A data frame with 24,629 rows and 14 columns:
#' \describe{
#' \item{first_available_year_included}{
#' First year in the lookups that we see this location
#' }
#' \item{most_recent_year_included}{
#' Last year in the lookups that we see this location
#' }
#' \item{ward_name}{Ward name}
#' \item{pcon_name}{Parliamentary constituency name}
#' \item{lad_name}{Local authority district name}
#' \item{la_name}{Local authority name}
#' \item{region_name}{Region name}
#' \item{country_name}{Country name}
#' \item{ward_code}{9 digit ward code}
#' \item{pcon_code}{9 digit westminster constituency code}
#' \item{lad_code}{9 digit local authority district code}
#' \item{new_la_code}{9 digit local authority code}
#' \item{region_code}{9 digit region code}
#' \item{country_code}{9 digit country code}
#' }
#' @source https://geoportal.statistics.gov.uk/search?tags=lup_wd_pcon_lad_utla
#' and https://geoportal.statistics.gov.uk/search?q=lup_wd_lad_cty_rgn_gor_ctry
"wd_pcon_lad_la_rgn_ctry"

#' Lookup for valid country names and codes
#'
#' A lookup of ONS geography country names and codes, as well as some custom
#' DfE names and codes. This is used as the definitive list for the screening
#' of open data before it is published by the DfE.
#'
#' @format ## `countries`
#' A data frame with 10 rows and 2 columns:
#' \describe{
#' \item{country_name}{Country name}
#' \item{country_code}{Country code}
#' }
#' @source curated by explore.statistics@@education.gov.uk, ONS codes sourced
#' from
#' https://geoportal.statistics.gov.uk/search?q=countries%20names%20and%20codes
"countries"

#' Lookup for valid region names and codes
#'
#' A lookup of ONS geography region names and codes for England. In their
#' lookups Northern Ireland, Scotland and Wales are regions.
#'
#' Also includes the inner and outer London county split as DfE frequently publish
#' those as regions, as well as some custom DfE names and codes. This is used
#' as the definitive list for the screening of open data before it is published
#' by the DfE.
#'
#' @format ## `regions`
#' A data frame with 16 rows and 2 columns:
#' \describe{
#' \item{region_name}{Region name}
#' \item{region_code}{Region code}
#' }
#' @source curated by explore.statistics@@education.gov.uk, ONS codes sourced
#' from
#' https://geoportal.statistics.gov.uk/search?q=NAC_RGN
"regions"