-
Notifications
You must be signed in to change notification settings - Fork 0
/
Data-integration-R.Rmd
454 lines (287 loc) · 23.6 KB
/
Data-integration-R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
---
title: "Data Integration in R"
author: "Susan Jarvis and Laura Graham"
date: "May 2017"
output:
md_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
# Before you start!
The tutorial below relies on you having several R packages installed which are listed below. These can be installed using the `install.packages` command.
- [lattice](https://cran.r-project.org/web/packages/lattice/index.html)
- [spocc](https://cran.r-project.org/web/packages/spocc/index.html)
- [blighty](https://cran.r-project.org/web/packages/blighty/index.html)
- [raster](https://cran.r-project.org/web/packages/raster/index.html)
- [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html)
- [sqldf](https://cran.r-project.org/web/packages/sqldf/index.html)
# 1. Loading Data
R packages exist to load in pretty much any form of data you can think of. Some key examples include:
- [readr](https://cran.r-project.org/web/packages/readr/README.html) tends to work faster and have more functionality for flat files (.csv, .txt) than base R (useful for big files)
- [readxl](https://blog.rstudio.org/2015/04/15/readxl-0-1-0/) for Excel spreadsheets
- [RODBC](https://cran.r-project.org/web/packages/RODBC/RODBC.pdf) for many types of database including Access
- [RPostgreSQL](https://www.r-bloggers.com/getting-started-with-postgresql-in-r/) for PostgreSQL databases
- [googlesheets](https://cran.r-project.org/web/packages/googlesheets/googlesheets.pdf) to interface with Google sheets
- [raster](https://cran.r-project.org/web/packages/raster/raster.pdf) and [rgdal](https://cran.r-project.org/web/packages/rgdal/rgdal.pdf) for spatial data
- [ncdf4](https://cran.r-project.org/web/packages/ncdf4/ncdf4.pdf) for NetCDF files
- [RCurl](https://cran.r-project.org/web/packages/RCurl/RCurl.pdf) contains functions to fetch data from webpages (along with lots more functionality for interfacing with webpages)
## 1.1 Reading in .csv files
Most of the data we will use in this tutorial are stored in .csv files. Below we use the `read.csv` function to load these files and the `summary` function to inspect their contents.
*Hint* - The .csv files you require are in BES-QSIG/Data-integration-R folder on Github. To download the .csv files along with this document, click 'Clone or download' and select 'Download ZIP' to download a ZIP file with all the course contents. Unzip this and save the files to a location on your computer.
```{r, eval = FALSE}
#set working directory
setwd("Put the path to the csv files you downloaded from Github here e.g. C:/My name/Documents/")
```
```{r}
#read in csv files using read.csv
cs_data <- read.csv("LANDSCAPE_AREA_FEATURE_HAB_1978.csv")
cs_locs <- read.csv("CS_locations_false.csv")
ag_data <- read.csv("EnglandWales_1979_2k.csv")
```
```{r, eval = FALSE}
#Look at summaries of the three csv files you have loaded
summary(cs_data)
summary(cs_locs)
summary(ag_data)
```
# 2. Fetching data from online sources
Plenty of sources of data exist online, and a lot of these have related R packages to facilitate their downloading. The [ROpenSci](https://ropensci.org/) project have developed a suite of packages to do just this. Other useful R packages for downloading online data sources include [MODIS](https://cran.r-project.org/web/packages/MODIS/index.html) for downloading the MODIS remote sensing data and [bioclim](https://rforge.net/doc/packages/climates/bioclim.html) for recreating the 19 bioclimatic variables used by WorldClim.
## 2.1 Downloading species' occurrence data using [`spocc`](https://ropensci.org/tutorials/spocc_tutorial.html)
We are going to use the `spocc` package to download records for yellowhammer (*Emberiza citrinella*) from GBIF.
First, we will need a bounding box to limit the records to those within our study area. I have created one for the entire Great Britain using a [tool to create a Well Known Text (WKT) polygon](http://arthur-e.github.io/Wicket/sandbox-gmaps3.html).
```{r}
bb <- "POLYGON((-9.457034468650818 59.70101353199732,3.902340531349182 59.70101353199732,3.902340531349182 49.13859653703879,-9.457034468650818 49.13859653703879,-9.457034468650818 59.70101353199732))"
```
We can then load the `spocc` package and use the `occ` function to download records for a species of interest, here we will download occurrence records for the yellowhammer from the GBIF database.
```{r, message = F}
library(spocc)
yellowhammer <- occ(query = 'Emberiza citrinella', from = 'gbif', geometry = bb)
```
The `spocc` package also provides a function to convert the retrieved data to a dataframe
```{r}
yellowhammer_df <- occ2df(yellowhammer)
head(yellowhammer_df)
```
###
# 3. Visualising and exploring data:
Here we will look in more detail at the data from the .csv files you loaded in part 1. There are three tables:
- `ag_data` gives a range of Agricultural Census variables for year 1979 at 2km gridded resolution. A range of additional variables and/or years can be downloaded from [agcensus](https://access.edina.ac.uk/agcensus/)
- `cs_data` gives the Countryside Survey habitat areas for 256 1km squares surveyed in 1978. Additional datasets and/or years can be downloaded from the [EIDC](https://catalogue.ceh.ac.uk/eidc/documents#term=Countryside+Survey&page=1)
- `cs_locs` gives (incorrect) locations for the 256 CS squares. The locations are confidential which is why false locations are given for the purposes of this tutorial. True locations are available on request at 10km resolution via the [EIDC](https://catalogue.ceh.ac.uk/eidc/documents#term=Countryside+Survey&page=1)
Firstly, we will look at the `cs_data` table and do some exploratory analysis of this data, before moving on to look at integrating this information with the `cs_locs` and `ag_data` tables.
We can start by removing some columns that aren't useful from the CS data (year, ID code, land class and environmental zone which we won't be using in the analysis).
```{r}
cs_data <- subset(cs_data, select = -c(YEAR,ID,LAND_CLASS90,EZ_DESC_07))
```
The CS data holds data on habitat area mapped within each 1km square. We can make a simple boxplot to compare the average habitat area per 1km Countryside Survey sample square for each habitat.
```{r}
boxplot(cs_data$AREA ~ cs_data$BROAD_HABITAT)
```
This is very messy and not very informative! Let's focus on a few habitats of interest and group the priority habitats (e.g. saltmarsh, blanket bog) together in one category. Arable, improved grassland and woodland, along with the total priority habitat area, might be expected to be related to agricultural intensity so we'll focus on these habitats to start with.
Firstly, we'll create new columns for the area of each of our habitats of interest
```{r}
cs_data$arable_area <- 0 #create empty column to start
cs_data$arable_area[cs_data$BROAD_HABITAT == 4] <- cs_data$AREA[cs_data$BROAD_HABITAT == 4] #arable has BROAD_HABITAT code 4
cs_data$impgrass_area <- 0 #create empty column to start
cs_data$impgrass_area[cs_data$BROAD_HABITAT == 5] <- cs_data$AREA[cs_data$BROAD_HABITAT == 5] #improved grassland has BROAD_HABITAT code 5
cs_data$woodland_area <- 0 #create empty column to start
cs_data$woodland_area[cs_data$BROAD_HABITAT %in% c(1,2)] <- cs_data$AREA[cs_data$BROAD_HABITAT %in% c(1,2)] #this considers both broadleaf (code = 1) and coniferous (2) woodland
```
Now we will create a new column of priority habitat area
```{r}
cs_data$priority <- 0
cs_data$priority[cs_data$BROAD_HABITAT %in% 24:45] <- 1
cs_data$priority_area <- cs_data$AREA*cs_data$priority
```
Finally, we can aggregate the data by CS 1km square (SQUARE field) to calculate the sum of priority habitat for each square
```{r}
cs_areas <- aggregate(cbind(arable_area,impgrass_area,woodland_area,priority_area) ~ SQUARE + COUNTRY + COUNTY, data = cs_data,FUN = sum)
```
We can now look at some exploratory plots of the aggregated data. For example, we could look at how the average area of arable habitat in each CS square varies between England, Scotland and Wales.
```{r}
boxplot(cs_areas$arable_area ~ cs_areas$COUNTRY)
```
It is fairly clear that the area of arable land in a 1km CS square tends to be higher in England than in Scotland and Wales. You can swap `arable_area` for `priority_area` to see if a similar pattern holds for the area of priority habitats.
We can also look at the distributions of each habitat area by plotting a histogram.
```{r}
hist(cs_areas$arable_area)
```
Here we can see that the majority of the 256 CS squares have relatively little arable area, with a long tail of squares with high areas of arable land. Does the same pattern occur in the distributions of the other habitats?
## 3.1 Mapping the Countryside Survey data
To enable us to map the Countryside Survey data we need to join it to location data which sits in a separate file called `CS_locs`.
The next step is to join the habitat data to this location information. Note these locations are incorrect due to confidentiality over Countryside Survey locations.
```{r}
#add (false) locations where SQUARE column matches between data sets
cs_areas$EASTING <- cs_locs$Easting[match( cs_areas$SQUARE,cs_locs$SQUARE)]
cs_areas$NORTHING <- cs_locs$Northing[match( cs_areas$SQUARE,cs_locs$SQUARE)]
```
We will use the `blighty` package to map the data.
```{r, message = FALSE}
library(blighty)
blighty()
points(cs_areas$EASTING/1000, cs_areas$NORTHING/1000, cex = 1, pch = 20, col = "red")
```
Note the nonsense locations so some might be in the sea!
We can colour the points by the area of one of the habitats (in this case arable):
```{r, message = F}
blighty()
points(cs_areas$EASTING/1000, cs_areas$NORTHING/1000, pch = 20, col = ceiling(cs_areas$arable_area/100000)+1)
legend(550,1200, c("0-0.09", "0.1-0.19","0.2-0.29","0.3-0.39","0.4-0.49","0.5-0.59","0.6-0.69","0.7-0.79","0.8-0.89","0.9-0.99", "1"), col = 1:11, pch = 20, cex = 0.6, title = "Arable land (km2)")
```
Not much spatial pattern due to the false locations!
## 3.2 Exploring and mapping the Agcensus data
The Agcensus data in `ag_data` has a selection of data available from the [agcensus](https://access.edina.ac.uk/agcensus/) webpage (total number of holdings, total area of holdings, total area of crops and fallow land, area of rough grass, amount of woodland on holding - all areas in hectares). We can use the `lattice` package to easily look at relationships between different data.
```{r}
library(lattice)
pairs(ag_data[1:1000,3:7])
```
Here only data from the first 1000 rows are plotted to speed up drawing of the plot. Which variables are related to each other? Does this make sense?
We can then use the `blighty` package again to map the data.
```{r,message = FALSE}
blighty()
points(ag_data$x/1000, ag_data$y/1000)
```
Yikes! The map is full of points! A better solution is to convert to raster data. This makes sense as the data are gridded and have a set resolution (2 km). We can convert the data using the `raster` package.
Also note we only have Agcensus data for England and Wales in this example.
```{r, message = FALSE}
library(raster)
#look at total area of holdings
ag_raster_df <- ag_data[,c(1,2,4)]
coordinates(ag_raster_df) <- ~x+y
gridded(ag_raster_df) <- TRUE
ag_raster <- raster(ag_raster_df)
plot(ag_raster)
#get information about raster
ag_raster
#note resolution - 2km by 2km (units are m)
```
Obtaining the resolution of the Agcensus data is straightforward, however what is the resolution of the Countryside Survey data? The sample is at 1km resolution but the accuracy of location is at 10km! Because we don't have the 1km locations we will treat the CS data as if it is at 10km resolution in the next section, but it is important to consider what this means for interpretation later on.
# 4. Aggregating raster data
If we want to be able to compare the total area of agricultural holdings to the 10km CS squares, we need to aggregate the agricultural census data to 10km x 10km. We need to aggregate the original 2km x 2km raster by a factor of 5.
```{r}
ag_10km <- aggregate(ag_raster, fact = 5)
plot(ag_10km)
```
The `aggregate` function gives us the mean value for the aggregated raster by default. Sometimes we may want to apply a different function to aggregate a raster. For example, if we had elevation data, we may be more interested in the variation of the terrain, rather than a mean elevation. This can be done by using the argument `fun = var`.
# 5. Matching spatial data sources
Spatial data come associated with coordinate reference systems. This allows data sets stored or displayed with different CRS's to be compared to each other. The CS and agricultural census data are stored as British National Grid, and the species data are WGS84. We can get the string required by R for any CRS from [http://spatialreference.org/](http://spatialreference.org/).
```{r}
wgs84_proj <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
bng_proj <- CRS("+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs")
```
Now we shall convert the CS areas to points, so that we can spatially match them to the agricultural census data to extract the mean area of holdings for each CS square.
```{r}
cs_coords <- cs_areas[,c("EASTING", "NORTHING")]
cs_points <- SpatialPointsDataFrame(coords = cs_coords, data = cs_areas, proj4string = bng_proj)
```
The agricultural census raster currently does not have a CRS associated with it. We can set this using the `proj4string` function.
```{r}
proj4string(ag_10km) <- bng_proj
```
We can now plot the locations of the CS squares together with the agricultural census data.
```{r}
plot(ag_10km)
plot(cs_points, add=TRUE, pch = 16)
```
We can also extract information from the underlying agricultural census raster to each of the CS points using the `extract` function.
```{r}
cs_points$ag_holding <- extract(ag_10km, cs_points)
```
# 6. Joining datasets
Now that we have the agricultural holdings data for each CS location, we can attach that information to the CS data to create a complete dataset. When joining datasets, it's important to think about the kind of relationship between them. There are three such relationships:
- One-to-one: each element of dataset A is related to exactly one element of dataset B
- One-to-many: each element of dataset A is related to > 1 element of dataset B
- Many-to-many: >1 element of dataset A is related to > 1 element of dataset B
In our case, we want to join `cs_points` with `cs_data`. `cs_points` has one row per CS square, whereas `cs_data` has one row for each CS square and broad habitat combination. This means that we have a one-to-many relationship.
We use the function `merge` to join these two datasets. First we will select only the columns we want to join, as well as the ID column
```{r}
cs_points@data <- subset(cs_points@data, select=c("SQUARE", "ag_holding"))
cs_data <- merge(cs_data, cs_points, by="SQUARE")
```
Sometimes we may be interested in where there is not a match between two datasets. For example if we had a set of grid cell IDs and a set of species occurrences referenced by grid cell. We might want to keep all grid cells regardless of whether the species has been recorded or not. This can be done using the `all.x=TRUE`, `all.y=TRUE` or `all=TRUE` arguments. These are equivalent to what is known as a left join, right join or full outer join respectively. The join we did above is known as an inner join.
# 7. Summarising data
We may also want to summarise data in a non-spatial way. For example summarising from daily to monthly, or summarising by a categorical variable such as habitat. In the below example, we calculate the mean and total agricultural holding per broad habitat type. The [tidyr](https://cran.r-project.org/web/packages/tidyr/index.html) and [dplyr](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) packages offer many functions for manipulating and summarising data in a readable and user friendly way.
```{r, message = F}
library(dplyr)
ag_summary <- group_by(cs_data, BROAD_HABITAT_NAME) %>%
summarise(mean_ag_holding = mean(ag_holding, na.rm=TRUE),
total_ag_holding = sum(ag_holding, na.rm=TRUE))
ag_summary
```
# 8. Relationships between data
To look at relationships between habitat area and Agcensus data (specifically the total area of holdings) using the merged dataset created in section 6. However, we want to make two changes:
1. We want to add in the other Agcensus data (area of crops, grass and woodlands as well as total number of holdings)
2. We want to summarise the data by 1km square
We could repeat the process of creating an aggregated raster layer for each Agcensus data type and then extracting by CS square, but instead we will use a different process to demonstrate some alternative techniques.
Firstly, we add two new columns to the `ag_data` which give the 10km grid references. We can then use `aggregate` to calculate the mean agricultural data per 10km square (this is a non-spatial alternative to the raster aggregration process in section 4).
```{r}
ag_data$x_10km <- floor(ag_data$x/ 10000)*10000
ag_data$y_10km <- floor(ag_data$y/ 10000)*10000
ag_10km <- aggregate(. ~ x_10km + y_10km, data = ag_data, FUN = mean)
```
Secondly we will use another package called `sqldf` to use the SQL language to join the datasets together. This package is very useful if you are used to using SQL in other programs such as SAS. Here we are going to use a "left join", equivalent to `all.x=TRUE` using the `match` function. This means that all Countryside Survey squares are returned in the `cs_join_ag` table, even if they do not match any Agcensus data (e.g. they are in Scotland).
```{r, message = F, verbose = F}
library(sqldf)
cs_join_ag <- sqldf("select *
from cs_areas as t1
left join ag_10km as t2
on t1.EASTING = t2.x_10km
and t1.NORTHING = t2.y_10km")
```
However, we may need to remove rows where there is missing data for some analyses. We can do this with the `complete.cases` function. This returns only rows with no `NA` values.
```{r}
cs_join_agCC <- cs_join_ag[complete.cases(cs_join_ag),]
```
We can use the plot produced by the `pairs` function again to visualise relationships. We first use the `subset` function to remove unwanted variables.
```{r}
cs_join_agCC <- subset(cs_join_agCC, select = -c(x_10km,y_10km,x,y))
pairs(cs_join_agCC[,-c(1,2,3)]) # don't plot country/county columns
```
This gives us a lot of relationships to view! Try removing further columns in the `pairs` command to focus on a few of interest.
We can also use the `cor` function to produce a table of correlations.
```{r, eval = F}
cor(cs_join_agCC[,-c(1,2,3)])
```
We can think about relationships we might expect to occur between CS and Agcensus data. For example, we would expect woodland area in CS to be correlated with woodland area in ag census...
```{r}
plot(cs_join_agCC$woodland_area, cs_join_agCC$Woodland.on.holding..c8.)
cor(cs_join_agCC$woodland_area, cs_join_agCC$Woodland.on.holding..c8.)
```
In this example there isn't much evidence for a relationship, but that might not be surprising given the incorrect CS locations!
For most of the relationships between CS and Agcensus it might be appropriate to investigate correlations as it not clear which variable would affect which. The exception is relationships with Easting and Northing, it is clear that Easting can affect arable area but not the other way round! In this case, a regression model can be built.
Model with arable area measured in CS:
```{r}
lm1 <- lm(arable_area ~ EASTING, data = cs_join_agCC)
summary(lm1)
```
Model with crop area from Agcensus:
```{r}
lm2 <- lm(Total.Crops...Fallow..c35. ~ EASTING, data = cs_join_agCC)
summary(lm2)
```
We can see when using the Agcensus data and Easting that there is more crop and fallow land in the east of the UK (this should be what we expected!).
You can consider other relationships that can be investigated with regression relationships. Extensions to include multiple predictors or generalised linear models could be developed e.g. a zero inflated distribution could be used to describe the area of priority habitat.
One aspect that has not been investigated in this short tutorial but could be a useful exercise to continue improving your skills in integration would be to look at change over time. We looked at the earliest Countryside Survey and matched to an appropriate Agricultural Census but the survey has been repeated three times since then (1990, 1998 and 2007). You could look at:
- Whether the patterns in data were the same in 2007 as they were in 1978
- Whether change in e.g. area of priority habitats is related to changes in agricultural practice
- Whether additional covariates from either Agricultural Census or other sources (e.g. climate data) explain changes in habitat areas
Things to consider might be:
- How to choose appropriate data to match e.g. which Agricultural Census years correspond to Countryside Survey years?
- How to match up locations between years? Do you need a different process for Countryside Survey data and Agricultural Census data?
- How to define an appropriate model to look at change over time
# 9. Interpretation and further reading
When considering how to intepret an integrated analysis we need to think about the assumptions we made in the analysis. For example, in the analysis above we assumed:
1. It was logical and valid to join the two datasets. We justified this by using a similar temporal period and spatial scope (late 1970s, UK/England).
2. The data meanings have been interpreted and used correctly. This requires proper investigation of metadata to understand e.g. units, methods of measurement that will influence interpretation of the results. Metadata for the datasets used in this analysis is available online from [agcensus](https://access.edina.ac.uk/agcensus/) and [EIDC](https://catalogue.ceh.ac.uk/eidc/documents#term=Countryside+Survey&page=1). Note not all metadata is of comparable quality and it is important to investigate this before beginning an analysis.
3. The spatial scales (extent and resolution) are comparable. Here we had two problems to contend with: 1) the CS data locations were given at a lower resolution to preserve confidentiality, 2) the Agcensus data was then averaged to the scale of the low resolution CS data. Consider what this means for interpretation - what would you write in a paper?
Interpretation is more complex with an integrated analysis so put aside plenty of time to think about this.
## Further reading
[A framework for data fusion in species distribution modelling (Pacifici *et al*)](http://onlinelibrary.wiley.com/doi/10.1002/ecy.1710/full)
[Combining and aggregating environmental data (Maas-Hebner *et al*)](https://www.ncbi.nlm.nih.gov/pubmed/25893765)
[Ecological data sharing (Michener *et al*)](http://www.sciencedirect.com/science/article/pii/S1574954115001004)
[BES guide to Data Management in Ecology and Evolution](http://www.britishecologicalsociety.org/wp-content/uploads/Publ_Data-Management-Booklet.pdf)
## Contact us
This tutorial was put together for a Data Integration course due to run on 4th May 2017 which was unfortunately cancelled. Therefore we have not tested this course in a real life environment so there may be bugs and errors. Please let us know if you find any by contacting us at [email protected] or requesting access to the repository on Github. We are happy for you to re-use any part of this tutorial in any in-person or online course.
This course was developed by the British Ecological Society Quantitative Ecology and Agricultural Ecology Special Interest Groups. You can contact the SIGs by email ([email protected] and [email protected]) or on Twitter (@BES_QE_SIG and @bes_aeg).
# Happy Data Integration!!