-
Notifications
You must be signed in to change notification settings - Fork 0
/
DAISIEprep.qmd
325 lines (234 loc) · 16.7 KB
/
DAISIEprep.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
---
title: "DAISIEprep"
author: Luis Valente, Joshua W. Lambert, Lizzie Roeble
---
## Using DAISIEprep pack to extract and format data {#sec-daisieprep}
In this part of the practical, the main features of DAISIEprep are explained. DAISIEprep (Lambert et al. 2023) is an R package that facilitates formatting of data for subsequent use in DAISIE.
![](images/DAISIEprep_logo.png){fig-align="center"}
A typical DAISIEprep pipeline is as follows:
1. Identify island colonisation events for the taxonomic group of interest from time-calibrated phylogenetic trees
2. Assign an island endemicity status (endemic, non-endemic, not present) to each of the species
3. Automatically extract times of colonisation of the island and diversification within the island from the phylogenies
4. Add any missing species
5. Format data for DAISIE.
The DAISIEprep tutorial is divided into the following sections:
- [Single Phylogeny example] - Using a simulated phylogeny including island and non-island species, learn how to extract and format island data for running DAISIE.
- [Adding missing species] - Learn how to add missing species, lineages, etc, to your DAISIE data list.
- [Prepare object for analyses in DAISIE]
## Single Phylogeny Example
In this section we demonstrate a simple example of extracting and formatting data from a single phylogeny. We use a simulated phylogeny, which facilitates explaining how the data is structured.
### Load the required packages:
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.align = "center",
fig.height = 5,
fig.width = 7
)
```
```{r setup, warning=FALSE,message=FALSE}
library(DAISIEprep)
library(ape)
```
### Simulate example phylogeny
We now simulate a phylogeny using the package ape.
```{r simulate phylogeny}
set.seed(
4,
kind = "Mersenne-Twister",
normal.kind = "Inversion",
sample.kind = "Rejection"
)
phylo <- ape::rcoal(10)
```
**Important:** DAISIEprep requires the tip labels (taxon names) in the phylogeny to be formatted as genus name and species name separated by an underscore (e.g. "Canis_lupus", "Species_1").
Here we add tip labels (representing our species) to the simulated phylogeny. In this case, all taxa sampled are different plant species from the same genus.
```{r}
phylo$tip.label <- c("Plant_a", "Plant_b", "Plant_c", "Plant_d", "Plant_e",
"Plant_f", "Plant_g", "Plant_h", "Plant_i", "Plant_j")
```
Then we convert the phylogeny to a phylo4 class, defined in the package phylobase. This allows users to easily work with data for each tip in the phylogeny, for example whether they are endemic to the island or not.
```{r convert phylo to phylo4}
phylo <- phylobase::phylo4(phylo)
phylobase::plot(phylo)
```
Now we have a phylogeny in the phylo4 format to which we can easily append data.
### Add distribution information to the tips of the tree
Specify tips corresponding to species from your focal island by specifying that they are endemic and/or non-endemic to the island. In the example below, species Plant_h is found on the island but it not endemic, and species Plant_f and Plant_e are found on the island and are endemic to the island. All other species on the phylogeny are NOT present on the island.
```{r}
island_species <- data.frame(
tip_labels = c("Plant_h",
"Plant_f",
"Plant_e")
,
tip_endemicity_status = c("nonendemic","endemic","endemic"))
```
Now that we know which species are present on the island, we can assign an endemicity status to all the species in the phylogeny.
```{r}
endemicity_status <- create_endemicity_status(
phylo = phylo,
island_species = island_species
)
```
And append this information to the phylogeny object:
```{r}
phylod <- phylobase::phylo4d(phylo, endemicity_status)
```
We can now visualise our phylogeny with the island endemicity statuses plotted at the tips. This uses the ggtree and ggplot2 packages.
```{r plot phylogeny with tip data}
plot_phylod(phylod = phylod)
```
As you can see, in the example phylogeny, there appear to be 2 separate colonisation events of the island - one which has led to species Plant_f and Plant_e and other that has led to the non-endemic species Plant_h.
### Extract data using DAISIEprep
Now that we can see the tips that are present on the island, we can extract them to form our island community data set that can be used in the DAISIE R package to fit likelihood models of island colonisation and diversification.
Before we extract species, we first create an object to store all of the island colonists' information. This uses the `island_tbl` class introduced in this package (DAISIEprep). This `island_tbl` object can then easily be converted to a DAISIE data list using the function `create_daisie_data` (more information on this below).
```{r create island_tbl}
island_tbl <- island_tbl()
island_tbl
```
We can see that this is an object containing an empty data frame. In order to fill this data frame with information on the island colonisation and diversification events we can run:
```{r extract island data}
island_tbl_min <- extract_island_species(
phylod = phylod,
extraction_method = "min"
)
island_tbl_min
```
The function `extract_island_species()` is the main function in DAISIEprep to extract data from the phylogeny. In the example above, we used the "min" extraction algorithm. The "min" algorithm extracts island community data assuming no back-colonisation from the island to the mainland. Each row in the `island_tbl` corresponds to a separate colonisation of the island. **In this case, two colonist lineages were correctly identified using the 'min' extraction algorithm, one endemic and another non-endemic. Each row in the table corresponds to one colonisation event.**
However, if back-colonisation is frequent (for example, one species within a large endemic island radiation colonised another island or mainland), we recommend using the "asr" algorithm, which extracts the most likely colonisations inferred in an ancestral area reconstruction. To use the "asr" algorithm, we first need to know the probability of the ancestors of the island species being present on the island to determine the time of colonisation. To do this, we can fit one of many ancestral state reconstruction methods. Here we use the "mk" model (which uses a continuous-time markov model to reconstruct ancestral states), as it is a simple method that should prove reliable for reconstructing the ancestral species areas (i.e. on the island or not on the island) for most cases. First, we translate our extant species endemicity status to a numeric representation of whether that species is on the island.
```{r}
phylod <- add_asr_node_states(phylod = phylod, asr_method = "mk")
```
Now we can plot the phylogeny, which this time includes the node labels for the presence/absence on the island in ancestral nodes.
```{r plot phylogeny with tip and node data}
plot_phylod(phylod = phylod)
```
\
Sidenote (optional): if you are wondering what the probabilities are at each node and whether this should influence your decision to pick a preference for island or mainland when the likelihoods for each state are equal, we can plot the probabilities at the nodes to visualise the ancestral state reconstruction using `plot_phylod(phylod = phylod, node_pies = TRUE)`.
Now we can extract island colonisation and diversification times from the phylogeny with asr using the reconstructed ancestral states of island presence/absence.
```{r}
island_tbl_asr <- extract_island_species(
phylod = phylod,
extraction_method = "asr"
)
```
```{r display asr island_tbl}
island_tbl_asr
```
By comparing `island_tbl_min` and `island_tbl_asr` you can see the differences in the 2 extraction methods.\
```{r compare}
all.equal(island_tbl_min,island_tbl_asr)
```
In this case, the results using "asr" and "min" are exactly the same. For the purposes of this practical, we will proceed with the results from "asr" `(island_tbl_asr`). For convenience, let's rename the object to `island_tbl`.
```{r}
island_tbl<-island_tbl_asr
```
You can visualize the contents of the island_tbl object to try and figure out what they mean.
```{r eval=FALSE, include=TRUE}
View(island_tbl)
```
For example, the code below shows you the names of the species included in each lineage:
```{r echo=TRUE}
island_tbl@island_tbl$species
```
## Adding missing species
It is often the case that phylogenetic data is not available for some island species or even for entire lineages present in the island community. But we can still include these species in our DAISIE analyses using DAISIEprep. This section is about the tools that DAISIEprep provides in order to handle missing data, and generally to handle species that are missing and need to be input into the data manually.
For this section, as with the previous section, the core data structure we are going to work with is the `island_tbl`. We will use the `island_tbl` for the example plant dataset produced in the last section.
### Adding missing species to an island clade that has been sampled in the phylogeny
This option is for cases in which a clade has been sampled in the phylogeny, and at least 1 colonisation or 1 branching time is available in the datalist, but 1 or more species from that clade were not sampled. For this example, we imagine that 2 species have not been sampled, and that we want to add them as missing species to the lineage containing species "Plant_e" that is sampled in the phylogeny. To assign two missing species to this clade we use the function `add_missing_species()` The argument `species_to_add_to` uses a representative sampled species from that island clade to work out which colonist in the `island_tbl` to assign the specified number of missing species (`num_missing_species`) to. You can find out which species are already sampled in each lineage using:
```{r}
island_tbl@island_tbl$species
```
Now run the code:
```{r}
island_tbl <- add_missing_species(
island_tbl = island_tbl,
# num_missing_species equals total species missing
num_missing_species = 2,
# name of a sampled species you want to "add" the missing to
# it can be any in the clade
species_to_add_to = "Plant_e"
)
```
The new island table now has missing species added to the lineage we wanted to:
```{r}
island_tbl@island_tbl$missing_species
```
With the new missing species added to the `island_tbl` we can repeat the conversion steps above using `create_daisie_data()` to produce data accepted by the DAISIE model.
```{r}
data_list <- create_daisie_data(
data = island_tbl,
island_age = 12,
num_mainland_species = 100,
precise_col_time = TRUE
)
```
### Adding an entire lineage when a phylogeny is not available for the lineage.
For cases when the missing species belong to a lineage that has not been sampled in the phylogeny. Assuming we did not have any phylogenetic data or colonisation time estimate for the island clade, we can insert species as missing and specify them as a separate lineage. Because no colonisation and branching times are known for this lineage, when this lineage later gets processed by the DAISIE inference model it will be assumed it colonised the island any time between the island's formation and the present.
The input needed are:
- `island_tbl` to add to an existing `island_tbl`
- `clade_name` a name to represent the new clade addeed, can either be a specific species from the clade or a genus name, or another name that represent those species
- `status` either "endemic" or "nonendemic"
- `species` a vector of species names contained within colonist (if applicable)
- `missing_species` **In the case of a lineage with just 1 species (i.e. not an island radiation) the number of missing species is zero, as by adding the colonist it already counts as one automatically. In the case of an island clade of more than one species, the number of missing species in this case should be `n-1`.**
Example for adding lineage with 1 species:
```{r}
island_tbl <- add_island_colonist(
island_tbl = island_tbl,
clade_name = "Plant_y",
status = "endemic",
# clade with just 1 species, missing_species = 0
# because adding the lineage already counts as 1
missing_species = 0,
col_time = NA_real_,
col_max_age = FALSE,
branching_times = NA_real_,
min_age = NA_real_,
clade_type = 1,
species = "Plant_a"
)
```
Example for adding lineage with 5 species ("Plant_a", "Plant_b", "Plant_c", "Plant_d", "Plant_e") to a lineage called "Plant_radiation"
```{r}
island_tbl <- add_island_colonist(
island_tbl = island_tbl,
clade_name = "Plant_radiation",
status = "endemic",
# the total species is 5 and all are missing
# but we add missing_species = 4 because
# adding the lineage already counts as 1
missing_species = 4,
col_time = NA_real_,
col_max_age = FALSE,
branching_times = NA_real_,
min_age = NA_real_,
clade_type = 1,
species = c("Plant_a", "Plant_b", "Plant_c",
"Plant_d", "Plant_e")
)
```
## Prepare object for analyses in DAISIE
**Convert to data object to be used in DAISIE**
Now that we have the `island_tbl` we can convert this to the DAISIE data list to be used by the DAISIE inference model. To convert to the DAISIE data list (i.e. the input data of the DAISIE inference model) we use `create_daisie_data()`, providing the `island_tbl` as input. We also need to specify:
- The age of the island or archipelago. Here we use an island age of twelve million years (`island_age = 12`).
- Whether the colonisation times extracted from the phylogenetic data should be considered precise (`precise_col_time = TRUE`). We will not discuss the details of this here, but briefly by setting this to `TRUE` the data will tell the DAISIE model that the colonisation times are known without error. Setting `precise_col_time = FALSE` will change tell the DAISIE model that the colonisation time is uncertain and should interpret this as the upper limit to the time of colonisation and integrate over the uncertainty between this point and either the present time or to the first branching point (either speciation or divergence into subspecies).
- The number of species in the mainland source pool. Here we set it to 100 (`num_mainland_species = 100`). This will be used to calculate the number of species that could have potentially colonised the island but have not. When we refer to the mainland pool, this does not necessarily have to be a continent, it could be a different island if the source of species immigrating to an island is largely from another nearby island (a possible example of this could be Madagascar being the source of species colonising Comoros). This information is used by the DAISIE model to calculate the colonisation rate of the island.
```{r}
data_list <- create_daisie_data(
data = island_tbl,
island_age = 12,
num_mainland_species = 100,
precise_col_time = TRUE
)
```
Below we show two elements of the DAISIE data list produced. The first element `data_list[[1]]` in every DAISIE data list is the island community metadata, containing the island age and the number of species in the mainland pool that did not leave descendants on the island at the present day. This is important information for DAISIE inference, as it is possible some mainland species colonised the island but went extinct leaving no trace of their island presence.
```{r}
data_list[[1]]
```
Next is the first element containing information on island colonists (every element `data_list[[x]]` in the list after the metadata contains information on individual island colonists). This contains the name of the colonist, the number of missing species, and the branching times, which is a vector containing the age of the island, the colonisation time and the times of any cladogenesis events. Confusingly, it may be that the branching times vector contains no branching times: when there are only two numbers in the vector these are the island age followed by the colonisation time. Then there is the stac, which stands for status of colonist. This is a number which tells the DAISIE model how to identify the endemicity and colonisation uncertainty of the island colonist ([these are explained here if you are interested](https://cran.r-project.org/package=DAISIE/vignettes/stac_key.html)). Lastly, the type1or2 defines which macroevolutionary regime an island colonist is in. By macroevolutionary regime we mean the set of rates of colonisation, speciation and extinction for that colonist. Most applications will assume all island clades have the same regime and thus all are assigned type 1. However, if there is **a priori** expectation that one clade significantly different from the rest, e.g. the Galápagos finches amongst the other terrestrial birds of the Galápagos archipelago this clade can be set to type 2.
```{r}
data_list[[2]]
```
This `datalist` is now ready to be used in the DAISIE maximum likelihood inference model from the R package DAISIE.
End of the DAISIEprep tutorial!