- 1 Introduction
- 2 Preprocessing
- 3 Generating Stacks Catalogs and Calling SNPs
- 4 Analysis
- 4.1 NLCD Data
- 4.2 Make maps of sampling locations
- 4.3 Continental population structure: population statistics by species
- 4.4 Continental population structure: Structure software results
- 4.5 Continental population structure: Structure plots
- 4.6 Validation of Structure results with sNMF
- 4.7 AMOVA
- 4.8 Local: $F_{IS}$ - Homozygosity within population
- 4.9 Local: $\rho$ - Pairwise comparison
- 4.10 Local: $\bar{r}_d$ - Linkage disequilibrium
- 4.11 Isolation by distance
- 4.12 Isolation by environment
- 4.13 Correlation between Urbanness and Admixture
- 5 Appendix
In this experiment, we used quaddRAD library prep to prepare the sample DNA. This means that each sample carried both two unique outer barcodes (typical Illumina barcodes) and two unique inner barcodes (random barcode bases inside the adapters), across more than 1,700 samples in total.
The sequencing facility demultiplexes samples based on the outer barcodes (typically called i5nn and i7nn). Once this is done, each file still contains a mix of the inner barcodes. We will refer to these as “sublibraries” because they are sort of halfway demultiplexed. We separate them out bioinformatically later.
Here’s a bit of information on the file name convention. The typical raw file looks like this:
AMH_macro_1_1_12px_S1_L001_R1_001.fastq.gz
- `AMH_macro`: These are the author initials, and “macro” stands for “Macrosystems”. These are on every file.
- `1_1`: The first number is the i5nn barcode for the given sublibrary. We know all these samples have an i5nn barcode “1”, so that narrows down what they can be. The second number is the i7nn barcode for the given sublibrary. We know all these samples have an i7nn barcode “1”, so that further narrows down what they can be.
- `12px`: This refers to how many samples are in the sublibrary. “12px” means 12-plexed, or 12 samples. In other words, we will use the inner barcodes to further distinguish 12 unique samples in this sublibrary.
- `S1`: This is a unique sublibrary name. S1 = 1 i5nn and 1 i7nn.
- `L001`: This means this particular file came from lane 1 of the NovaSeq. There are four lanes. All samples should appear across all four lanes.
- `R1`: This is the first (R1) of two paired-end reads (R1 and R2).
- `001.fastq.gz`: The last part doesn’t mean anything; it was just added automatically before the file suffix (`fastq.gz`).
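Because later steps depend on this naming convention, it can be handy to split a raw file name into its parts programmatically. The following is a minimal sketch in POSIX shell and is not part of the original pipeline; the function name `parse_raw_name` and the output labels are hypothetical.

```shell
# Hypothetical helper: split a raw file name into the fields described above.
# Assumes exactly nine underscore-separated fields before ".fastq.gz".
parse_raw_name() (
  name=$1
  base=${name##*/}            # drop any leading directory
  base=${base%.fastq.gz}      # drop the file suffix
  IFS=_                       # split on underscores
  set -f                      # do not expand wildcards while splitting
  set -- $base                # fields become $1..$9
  if [ "$#" -ne 9 ]; then
    echo "unexpected file name: $name" >&2
    exit 1
  fi
  # $1=AMH $2=macro $3=i5nn $4=i7nn $5=plex $6=sublibrary $7=lane $8=read $9=001
  printf 'i5nn=%s i7nn=%s plex=%s sublibrary=%s lane=%s read=%s\n' \
    "$3" "$4" "$5" "$6" "$7" "$8"
)
```

For example, `parse_raw_name AMH_macro_1_1_12px_S1_L001_R1_001.fastq.gz` prints one line of labelled fields, and a name that does not follow the convention returns a non-zero status.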
There are three main systems at play for file transfer: the local machine, the sequencing facility’s (GRCF) Aspera server, and MARCC. The Aspera server is where the data were/are stored immediately after sequencing. MARCC is where we plan to do preprocessing and analysis. Scripts and text files are easy for me to edit on my local machine. We used Globus to transfer these small files from my local machine to MARCC.
Midway through these analyses, we transitioned to another cluster, JHU’s Rockfish. Scripts below, with the exception of file transfer from the Aspera server, should reflect the new filesystem, though you will have to adjust the file paths accordingly.
Referred to throughout the files as “Step 1”. Files can be found in the `01_transfer_files/` directory.
This directory contains files named in this convention: `01-aspera_transfer_n.txt`. These are text files containing the names of the `fastq.gz` files that we wanted to transfer from the sequencing facility’s Aspera server to the computing cluster (MARCC). Listing the files this way made it easy to transfer only certain files at a time, since transferring could take a long time, and we did the transfer piecemeal. Possible file names are shown in Aspera Transfer File Names.
There are multiple of these files so that we could parallelize (replace n with the correct number in the command used below). Each text file will need to be uploaded to your scratch directory in MARCC.
Files were then transferred using the following commands. Before starting, make sure you are in a data transfer node. Then, load the aspera module. Alternatively, you can install the Aspera transfer software and use that.
module load aspera
Initiate the transfer from within your scratch directory:
ascp -T -l8G -i /software/apps/aspera/3.9.1/etc/asperaweb_id_dsa.openssh \
  --file-list=01-aspera_transfer_n.txt \
  --mode=recv --user=<aspera-user> --host=<aspera-IP> /scratch/users/<me>@jhu.edu
Referred to throughout the files as “Step 2”. Files can be found in the `02_concatenate_and_check/` directory.
Step 2a. We ran our samples across the whole flow cell of the NovaSeq, so results came in 8 files for each demultiplexed sublibrary (4 lanes × 2 paired reads). For example, for sublibrary 1_1, we’d see the following 8 files:
AMH_macro_1_1_12px_S1_L001_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L001_R2_001.fastq.gz
AMH_macro_1_1_12px_S1_L002_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L002_R2_001.fastq.gz
AMH_macro_1_1_12px_S1_L003_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L003_R2_001.fastq.gz
AMH_macro_1_1_12px_S1_L004_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L004_R2_001.fastq.gz
The `02_concatenate_and_check/02-concat_files_across4lanes.sh` script finds all files in the working directory with the name pattern `*_L001_*.fastq.gz` and then concatenates across lanes 001, 002, 003, and 004 so they can be managed further. The “L001” part of the filename is then eliminated. For example, the 8 files above would become:
AMH_macro_1_1_12px_S1_R1.fastq.gz
AMH_macro_1_1_12px_S1_R2.fastq.gz
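The concatenation logic can be sketched as follows. This is an illustration under stated assumptions, not the contents of `02-concat_files_across4lanes.sh`: the function name `concat_lanes` is hypothetical, the original lane files are left in place, and the sketch relies on the fact that gzip files concatenated with `cat` remain valid gzip.

```shell
# Hypothetical sketch: merge lanes L001-L004 for every read file in a directory.
concat_lanes() (
  cd "$1" || exit 1
  for first in *_L001_*.fastq.gz; do
    [ -e "$first" ] || continue        # no matches: the pattern stays literal
    prefix=${first%%_L001_*}           # e.g. AMH_macro_1_1_12px_S1
    rest=${first#*_L001_}              # e.g. R1_001.fastq.gz
    read_id=${rest%%_*}                # e.g. R1
    out="${prefix}_${read_id}.fastq.gz"
    # Lanes are written in order; a missing lane file aborts with an error.
    cat "${prefix}_L001_${rest}" "${prefix}_L002_${rest}" \
        "${prefix}_L003_${rest}" "${prefix}_L004_${rest}" > "$out" || exit 1
  done
)
```

Calling `concat_lanes .` in a directory holding the 8 files above would produce the two merged files with the lane and `001` parts dropped from the name.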
Rockfish uses Slurm to manage jobs. To run the script, use the `sbatch` command. For example:
sbatch ~/code/02-concat_files_across4lanes.sh
This command will run the script from within the current directory, but will look for and pull the script from the code directory. This will concatenate all files within the current directory that match the loop pattern.
Step 2b. On Rockfish, Stacks will need to be downloaded to each user’s code directory. Stacks, and software in general, should be compiled in an interactive mode or loaded via module. For more information on interactive mode, see `interact --usage`.
interact -p debug -g 1 -n 1 -c 1
module load gcc
Now download Stacks. We used version 2.60.
wget http://catchenlab.life.illinois.edu/stacks/source/stacks-2.60.tar.gz
tar xfvz stacks-2.60.tar.gz
Next, go into the stacks-2.60 directory and run the following commands:
./configure --prefix=/home/<your_username>/code4-<PI_username>
make
make install
export PATH=$PATH:/home/<your_username>/code4-<PI_username>/stacks-2.60
The filesystem patterns on your cluster might be different, and you should change these file paths accordingly.
Referred to throughout the files as “Step 3”. Files can be found in the `03_clone_filter/` directory.
Step 3a. The `03-clone_filter.sh` script runs `clone_filter` from Stacks. The program was run with options `--inline_inline --oligo_len_1 4 --oligo_len_2 4`. The `--oligo_len_x 4` options indicate that a 4-base-pair degenerate sequence was included on the outside of the barcodes for detecting PCR duplicates. The script uses the file name prefixes listed for each single sub-pooled library in `03-clone_filter_file_names.txt` and loops to run `clone_filter` on all of them. Possible file names are shown in `clone_filter` File Names.
Step 3b. If you want to extract descriptive statistics from the `clone_filter` output, you can use the `03.5-parse_clone_filter.py` script to do so. It can be run on your local terminal after transferring the `clone_filter.out` logs to your local computer.
source("03_clone_filter/examine_clones.R")
make_cloneplot()
Files can be found in the `04_demux_filter/` directory.
The `04-process_radtags.sh` script runs `process_radtags` from Stacks. The program was run with options `-c -q --inline_inline --renz_1 pstI --renz_2 mspI --rescue --disable_rad_check`.
The script uses the same file prefixes as Step 3 (`03-clone_filter.sh`). Each sub-pooled library has a forward and reverse read file that was filtered in the previous step. As in the previous step, the script uses the file name prefixes listed for each single sub-pooled library in `04-process_radtags_file_names.txt` and loops to run `process_radtags` on all of them. Possible file names are shown in `clone_filter` File Names.
Each sub-pooled library also has a demultiplexing file (`04-demux/` directory) that contains the sample names and inner (i5 and i7) barcodes. For example, for sublibrary 1_1, we’d see the following barcode file:
ATCACG AGTCAA DS.BA.PIK.U.1
CGATGT AGTTCC DS.BA.PIK.U.2
TTAGGC ATGTCA DS.BA.PIK.U.3
TGACCA CCGTCC DS.BA.PIK.U.4
ACAGTG GTCCGC DS.BA.PIK.U.5
GCCAAT GTGAAA DS.BA.DHI.U.1
CAGATC GTGGCC DS.BA.DHI.U.2
ACTTGA GTTTCG DS.BA.DHI.U.3
GATCAG CGTACG DS.BA.DHI.U.4
TAGCTT GAGTGG DS.BA.DHI.U.5
GGCTAC ACTGAT DS.BA.GA.U.1
CTTGTA ATTCCT DS.BA.GA.U.2
The ‘process_radtags’ command will demultiplex the data by separating out each sublibrary into the individual samples. It will then clean the data, and will remove low quality reads and discard reads where a barcode was not found.
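Since reads are discarded when a barcode is not found, a mistyped barcode file can quietly cost a sample. A quick sanity check before running `process_radtags` might look like the sketch below. The function name `check_barcodes` is hypothetical, and the checks (three whitespace-separated columns, barcodes made of A/C/G/T only, no repeated barcode pair or sample name) are assumptions about what a well-formed file looks like rather than rules taken from the pipeline.

```shell
# Hypothetical sanity check for one barcode file: barcode1 barcode2 sample_name.
check_barcodes() {
  awk '
    NF != 3 {
      printf "line %d: expected 3 columns\n", NR; bad = 1; next
    }
    $1 !~ /^[ACGT]+$/ || $2 !~ /^[ACGT]+$/ {
      printf "line %d: barcode contains characters other than A, C, G, T\n", NR; bad = 1
    }
    seen[$1 SUBSEP $2]++ {
      printf "line %d: duplicate barcode pair\n", NR; bad = 1
    }
    names[$3]++ {
      printf "line %d: duplicate sample name\n", NR; bad = 1
    }
    END { exit bad }   # non-zero exit status if any problem was reported
  ' "$1"
}
```

The function prints one message per problem and returns a non-zero status if anything was flagged, so it can gate a job submission.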
In a new directory, make sure the files are organized by species. In the `process_radtags` script, we specified that files be sent to `~/scratch/demux/*sublibrary_name*` (the reasoning for this is in Step 4c), but files should be manually organized into species folders (i.e., `~/scratch/demux/*SPP*`) after `process_radtags` is performed. For example, the file “DS.MN.L01-DS.M.1.1.fq.gz” should be sent to the `~/scratch/demux/DS` directory.
Note: this is not automated yet, but it would be nice to automate the file-moving process so that it is not forgotten.
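One way the file-moving step could be automated is sketched below. This is an assumption-laden illustration rather than an existing pipeline script: the function name `sort_by_species` is hypothetical, and it assumes the species code is everything before the first dot in the file name and that demultiplexed files end in `.fq.gz`.

```shell
# Hypothetical sketch: move demultiplexed reads from sublibrary folders
# into one folder per species code.
sort_by_species() (
  demux=$1
  for f in "$demux"/*/*.fq.gz; do
    [ -e "$f" ] || continue                    # nothing matched the pattern
    name=${f##*/}
    spp=${name%%.*}                            # text before the first dot, e.g. DS
    [ "${f%/*}" = "$demux/$spp" ] && continue  # already in its species folder
    mkdir -p "$demux/$spp"
    mv "$f" "$demux/$spp/" || exit 1
  done
)
```

Log files are left in their sublibrary folders, and running the function a second time is harmless because files already sitting in a species folder are skipped.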
In the script for Step 4, we have specified that a new output folder be created for each sublibrary. The output folder is where all sample files and the log file will be dumped for each sublibrary. It is important to specify a different output folder if you have multiple sublibraries because we will be assessing the output log for each sublibrary individually (and otherwise, the log is overwritten when the script loops to a new sublibrary).
The utility `stacks-dist-extract` can be used to extract data from the log file. First, we examined the library-wide statistics to identify sublibraries where barcodes may have been misentered or where sequencing error may have occurred. We used:
stacks-dist-extract process_radtags.log total_raw_read_counts
to pull out data on the total number of sequences, the number of low-quality reads, whether barcodes were found or not, and the total number of retained reads per sublibrary. Look over these to make sure there are no outliers or sublibraries that need to be checked and rerun.
Next, we used:
stacks-dist-extract process_radtags.log per_barcode_raw_read_counts
to analyze how well each sample performed. There are three important statistics to consider for each sample.
- The proportion of reads per sample for each sublibrary indicates how much of the sublibrary’s sequencing each individual received. This is important to consider, as cases where a single sample dominates the sublibrary may indicate contamination.
- The number of reads retained for each sample can be an indicator of coverage. It is most likely a good idea to remove samples with a very low number of reads. Where you decide to place the cutoff for low-coverage samples depends on your dataset. For example, a threshold of 1 million reads is often used, but this is not universal.
- The proportion of reads retained for each sample can also indicate low-quality samples and will give an idea of the variation in coverage across samples.
Output for sublibraries for this step is summarized in `process_radtags-library_output.csv`.
Output for individual samples for this step is summarized in `process_radtags-sample_output.csv`.
The script `04c-process_radtags_stats.R` was used to create many plots for easily assessing each statistic. Output from this step can be found in `figures/process_radtags/`, where figures are organized by species.
source("04_demux_filter/04c-radtags_filter_summary.R")
make_filterplot()
downstream analysis
Using the same output log and the above statistics, we removed low-coverage and low-quality samples that may skew downstream analyses.
Samples were identified and removed via the following procedure:
- First, samples that represented less than 1% of the sequenced sublibrary were identified and removed. These correspond to low-read and low-coverage samples.
- Next, a threshold of 1 million retained reads per sample was used to remove any remaining low-read samples. Low-read samples have low coverage and lack enough raw reads to contribute to downstream analyses.
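The two-stage rule can be expressed as a small filter. The sketch below is only an illustration: the function name `classify_samples` is hypothetical, and it assumes a comma-separated input with a header row and the columns sample name, proportion of the sublibrary (as a fraction, so 1% is 0.01), and retained reads, which may not match the layout of the summary CSV files in this project.

```shell
# Hypothetical sketch of the two-stage filter.
# Assumed columns: sample,proportion_of_sublibrary,retained_reads
classify_samples() {
  awk -F, -v OFS=, '
    NR == 1      { next }                                        # skip the header row
    $2 < 0.01    { print $1, "discard_low_proportion"; next }    # under 1% of the sublibrary
    $3 < 1000000 { print $1, "discard_low_reads"; next }         # under 1 million retained reads
                 { print $1, "keep" }
  ' "$1"
}
```

Each sample is reported once with the first rule it fails, which mirrors the order of the procedure above.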
Good/kept samples are summarized in `process_radtags-kept_samples.csv`.
Discarded samples are summarized in `process_radtags-discarded_samples.csv`.
source("04_demux_filter/04c-radtags_filter_summary.R")
make_manual_discard_plot()
Note: At this point, we started using Stacks 2.62 for its multi-threading capabilities. Functionality of the previous steps should be the same, however.
Files can be found in the `05_ustacks_and_params/` directory.
Going forward, when we use the term metapopulation, we are referring to the collection of all samples within species among all cities where the species was present.
It is important to conduct preliminary analyses that will identify an optimal set of parameters for the dataset (see Step 5a). Following the parameter optimization, the program `ustacks` can be run to assemble loci in each sample, which are later merged into a catalog.
Stack assembly will differ based on several aspects of the dataset (such as the study species, the RAD-seq method used, and/or the quality and quantity of DNA used), so it is important to use parameters that will maximize the amount of biological data obtained from Stacks.
There are three main parameters to consider when doing this:
- m = controls the minimum number of raw reads required to form a stack (implemented in `ustacks`)
- M = controls the number of mismatches allowed between stacks to merge them into a putative locus (implemented in `ustacks`)
- n = controls the number of mismatches allowed between stacks to merge into the catalog (implemented in `cstacks`)
There are two main ways to optimize parameterization:
- an iterative method where you sequentially change each parameter while keeping the other parameters fixed (described in Paris et al. 2017), or
- an iterative method where you sequentially change the values of M and n (keeping M = n) while fixing m = 3, and then test m = 2, 4 once the optimal M = n is determined (described in Rochette and Catchen 2017, Catchen 2020).
We performed the second method and used the `denovo_map.sh` script to run the `denovo_map.pl` command to perform iterations. This script requires that we first choose a subset of samples to run the iterations on. The samples should be representative of the overall dataset, meaning they should include all populations and have similar read coverage numbers. Read coverage numbers can be assessed by looking at the descriptive statistics produced from Step 4c.
Place these samples in a text file (`popmap_test_samples.txt`) with the name of the sample and specify that all samples belong to the same population. For example, `popmap_test_samples.txt` should look like:
DS.BA.GA.U.1 A
DS.PX.BUF.M.5 A
DS.BO.HC4.M.1 A
...
It is important to have all representative samples treated as one population because you will assess outputs found across 80% of the individuals. The script will read this text file from the `--popmap` argument.
The script also requires that you specify an output directory after `-o`. This should be unique to the parameter you are testing. For example, if you are testing M = 3, then you could make a subdirectory labeled `stacks.M3` where all outputs from `denovo_map.sh` will be placed. Otherwise, for each iteration, the outputs will be overwritten and you will lose the log from the previous iteration. The `denovo_map.sh` script also requires that you direct it toward where your samples are stored, which is your directory built in Step 4b. Make sure to pass the `--min-samples-per-pop 0.80` argument.
To decide which parameters to use, examine the following from each iteration:
- the average sample coverage: This is obtained from the summary log in the `ustacks` section of `denovo_map.log`. If samples have a coverage <10x, you will have to rethink the parameters you use here.
- the number of assembled loci shared by 80% of samples: This can be found in `populations.haplotypes.tsv` by counting the number of loci: `cat populations.haplotypes.tsv | grep -v "^#" | wc -l`
- the number of polymorphic loci shared by 80% of samples: This can be found in `populations.sumstats.tsv` or by counting `populations.hapstats.tsv`: `cat populations.hapstats.tsv | grep -v "^#" | wc -l`
- the number of SNPs per locus shared by 80% of samples: This can be found in `denovo_map.log` or by counting the number of SNPs in `populations.sumstats.tsv`: `cat populations.sumstats.tsv | grep -v "^#" | wc -l`
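These counts can be gathered across iterations in one pass. The sketch below is a minimal illustration, assuming each iteration wrote its output to a subdirectory named `stacks.M<value>` (as in the `stacks.M3` example) and that `populations.haplotypes.tsv` is present in each; the function name `summarize_iterations` is hypothetical.

```shell
# Hypothetical sketch: tabulate shared loci per parameter iteration.
# Prints "<directory><TAB><number of non-comment lines>" for each iteration.
summarize_iterations() (
  cd "$1" || exit 1
  for d in stacks.M*; do
    f="$d/populations.haplotypes.tsv"
    [ -f "$f" ] || continue                       # skip iterations without output
    printf '%s\t%s\n' "$d" "$(grep -vc '^#' "$f")"
  done
)
```

The same loop works for `populations.hapstats.tsv` or `populations.sumstats.tsv` by changing the file name. Note that shell globbing sorts directory names as text, so `stacks.M10` is listed before `stacks.M2`.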
The script `05a-param_opt-figures_script.R` was used to create plots for assessing the change in shared loci across parameter iterations.
Based on this optimization step, we used the following parameters:
Species | M (locus mismatches) | n (catalog mismatches) | m (minimum reads) |
---|---|---|---|
CD | 8 | 8 | 3 |
DS | 10 | 10 | 3 |
EC | 8 | 8 | 3 |
LS | 7 | 7 | 3 |
PA | 5 | 5 | 3 |
TO | 6 | 6 | 3 |
Final parameter optimization values for the Stacks pipeline.
`ustacks` builds de novo loci in each individual sample. We have designed the `ustacks` script so that the process requires three files:
- `05-ustacks_n.sh`: the shell script that executes `ustacks`
- `05-ustacks_id_n.txt`: the sample ID number
- `05-ustacks_samples_n.txt`: the sample names that correspond to the sample IDs
The sample ID should be derived from the `order_id` column (first column) on the master spreadsheet. It is unique (1-1736) across all of the samples.
The sample name is the corresponding name for each sample ID in the spreadsheet. E.g., sample ID “9” corresponds to sample name “DS.BA.DHI.U.4”. Sample naming convention is species.city.site.management_type.replicate_plant.
`05-ustacks_n.sh` should have an out_directory (`-o` option) that will be used for all samples (e.g., `stacks/ustacks`). Files can be processed piecemeal into this directory. There should be three files for every sample in the output directory:
<samplename>.alleles.tsv.gz
<samplename>.snps.tsv.gz
<samplename>.tags.tsv.gz
Multiple versions of the `05-ustacks_n.sh` script can be run in parallel (simply replace n in the three files above with the correct number).
A small number of samples (13) were discarded at this stage as the `ustacks` tool was unable to form any primary stacks corresponding to loci. See `output/ustacks-discarded_samples.csv`.
This step contains a script, `05b-fix_filenames.sh`, which uses some simple regex to fix filenames that are output in previous steps. Stacks adds an extra “1” at some point at the end of the sample name, which is not meaningful. The following files:
- DS.MN.L02-DS.M.3.1.alleles.tsv.gz
- DS.MN.L03-DS.U.2.1.tags.tsv.gz
- DS.MN.L09-DS.U.1.1.snps.tsv.gz
become:
- DS.MN.L02-DS.M.3.alleles.tsv.gz
- DS.MN.L03-DS.U.2.tags.tsv.gz
- DS.MN.L09-DS.U.1.snps.tsv.gz
The script currently gives some strange log output, so it can probably be optimized/improved. The script should be run from the directory where the changes need to be made. Files that have already been fixed will not be changed.
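For illustration, the renaming could be done along the following lines. This is a sketch under stated assumptions and not the contents of `05b-fix_filenames.sh`: the function name `fix_filenames` is hypothetical, and it assumes every sample name has exactly five dot-separated fields (species.city.site.management_type.replicate_plant), so an extra sixth field of `1` is the one to strip. That assumption is what keeps an already-fixed name such as `DS.BA.GA.U.1.snps.tsv.gz` from being shortened again.

```shell
# Hypothetical sketch: drop the extra ".1" that follows a five-field sample name.
fix_filenames() (
  cd "$1" || exit 1
  for f in *.tsv.gz; do
    [ -e "$f" ] || continue              # nothing matched the pattern
    # Keep the first five dot-separated fields, drop ".1", keep the file type.
    new=$(printf '%s\n' "$f" |
      sed -E 's/^(([^.]+\.){4}[^.]+)\.1\.(alleles|snps|tags)\.tsv\.gz$/\1.\3.tsv.gz/')
    if [ "$new" != "$f" ] && [ ! -e "$new" ]; then
      mv "$f" "$new"
    fi
  done
)
```

Files that do not match the pattern, including ones that were already fixed, are left alone, and an existing target file is never overwritten.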
In the next step, we will choose the files we want to go into the catalog. This involves a few steps:
- Create a meaningful directory name. This could be the date (e.g., `stacks_22_01_25`).
- Copy the `ustacks` output for all of the files you want to use in the reference from Step 5b. Remember this includes three files per sample. So if you have 20 samples you want to include in the reference catalog, you will transfer 3 x 20 = 60 files into the meaningful directory name. The three files per sample should follow this convention:
<samplename>.alleles.tsv.gz
<samplename>.snps.tsv.gz
<samplename>.tags.tsv.gz
- Remember the meaningful directory name. You will need it in Step 6.
Files can be found in the `06_cstacks/` directory.
`cstacks` builds the locus catalog from all the samples specified. The accompanying script, `cstacks_SPECIES.sh`, is relatively simple since it points to the directory containing all the sample files. It follows this format to point to that directory:
cstacks -P ~/directory ...
Make sure that you use the meaningful directory from Step 5c and that you have copied all the relevant files over. Otherwise this causes problems downstream.
For example, you might edit the code to point to `~/scratch/stacks/stacks_22_01_25`.
cstacks -P ~/scratch/stacks/stacks_22_01_25 ...
The tricky thing is ensuring enough compute memory to run the entire process successfully. There is probably space to optimize this process.
The `cstacks` method uses a “population map” file, which in this project is `cstacks_popmap_SPECIES.txt`. This file specifies which samples to build the catalog from and categorizes them into your ‘populations’, or in this case, cities, using two tab-delimited columns, e.g.:
DS.BA.GA.U.1 Baltimore
DS.BA.GA.U.2 Baltimore
DS.BA.GA.U.3 Baltimore
DS.BA.GA.U.4 Baltimore
DS.BA.GA.U.5 Baltimore
...
Make sure the samples in this file correspond to the input files located in e.g., `~/scratch/stacks/stacks_22_01_25`.
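This correspondence can be checked before submitting the `cstacks` job. The following is a minimal sketch, not part of the pipeline: the function name `check_popmap` is hypothetical, and it assumes the population map has the sample name in its first column and that each sample should have the three `ustacks` output files in the given directory.

```shell
# Hypothetical sketch: confirm every sample in a population map has its
# three ustacks files (alleles, snps, tags) in the catalog input directory.
check_popmap() (
  popmap=$1
  dir=$2
  status=0
  while read -r sample _pop || [ -n "$sample" ]; do
    [ -n "$sample" ] || continue               # skip blank lines
    for kind in alleles snps tags; do
      if [ ! -f "$dir/$sample.$kind.tsv.gz" ]; then
        echo "missing: $sample.$kind.tsv.gz"
        status=1
      fi
    done
  done < "$popmap"
  exit "$status"                               # non-zero if anything is missing
)
```

A call such as `check_popmap cstacks_popmap_SPECIES.txt ~/scratch/stacks/stacks_22_01_25` lists any missing files and returns a non-zero status if there are any.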
`cstacks` builds three files for use in all your samples (in this pipeline run), mirroring the sample files output by `ustacks`:
catalog.alleles.tsv.gz
catalog.snps.tsv.gz
catalog.tags.tsv.gz
Sample | Species | City |
---|---|---|
DS.BA.PIK.U.1 | DS | BA |
DS.BA.GA.U.4 | DS | BA |
DS.BA.LH-1.M.4 | DS | BA |
DS.BA.LH-3.M.1 | DS | BA |
DS.BA.WB.U.2 | DS | BA |
DS.BA.LL-4.M.5 | DS | BA |
DS.BA.LH-2.M.5 | DS | BA |
DS.BA.TRC.U.3 | DS | BA |
DS.BA.W3.M.2 | DS | BA |
DS.BA.RG-1.M.1 | DS | BA |
DS.BA.LL-3.M.3 | DS | BA |
DS.BA.RG-2.M.4 | DS | BA |
DS.BO.HC1.M.3 | DS | BO |
DS.BO.HC4.M.5 | DS | BO |
DS.BO.LC1.M.3 | DS | BO |
DS.BO.LC2.M.2 | DS | BO |
DS.BO.LC3.M.5 | DS | BO |
DS.BO.WL1.M.2 | DS | BO |
DS.BO.WL2.M.1 | DS | BO |
DS.BO.WL3.M.5 | DS | BO |
DS.BO.I4.U.1 | DS | BO |
DS.BO.R1.U.4 | DS | BO |
DS.BO.R2.U.2 | DS | BO |
DS.BO.R4.U.4 | DS | BO |
DS.MN.L05-DS.M.3 | DS | MN |
DS.MN.L09-DS.M.3 | DS | MN |
DS.MN.L11-DS.M.1 | DS | MN |
DS.MN.L02-DS.U.1 | DS | MN |
DS.MN.L02-DS.M.4 | DS | MN |
DS.MN.L03-DS.U.3 | DS | MN |
DS.MN.L04-DS.U.5 | DS | MN |
DS.MN.L06-DS.U.3 | DS | MN |
DS.MN.L07-DS.U.3 | DS | MN |
DS.MN.L09-DS.U.3 | DS | MN |
DS.MN.L11-DS.U.1 | DS | MN |
DS.MN.L11-DS.U.5 | DS | MN |
DS.PX.BUF.M.1 | DS | PX |
DS.PX.PIE.M.2 | DS | PX |
DS.PX.ALA.M.1 | DS | PX |
DS.PX.MTN.M.6 | DS | PX |
DS.PX.LAP.M.3 | DS | PX |
DS.PX.NUE.M.4 | DS | PX |
DS.PX.WES.M.2 | DS | PX |
DS.PX.DF1.M.1 | DS | PX |
DS.PX.ENC.M.1 | DS | PX |
DS.PX.DOW.M.1 | DS | PX |
DS.PX.DOW.M.4 | DS | PX |
DS.PX.DF2.M.3 | DS | PX |
CD.BA.LA.U.2 | CD | BA |
CD.BA.TRC.U.3 | CD | BA |
CD.BA.WGP.M.2 | CD | BA |
CD.BA.LH-2.M.2 | CD | BA |
CD.BA.LL-4.M.1 | CD | BA |
CD.BA.PIK.U.2 | CD | BA |
CD.BA.WB.U.2 | CD | BA |
CD.BA.CP.U.4 | CD | BA |
CD.BA.FH.U.1 | CD | BA |
CD.BA.PSP.M.4 | CD | BA |
CD.BA.AA.U.4 | CD | BA |
CD.BA.RG-1.M.2 | CD | BA |
CD.BA.W3.M.3 | CD | BA |
CD.BA.GA.U.3 | CD | BA |
CD.BA.WBO.U.5 | CD | BA |
CD.LA.WHI.M.3 | CD | LA |
CD.LA.SEP.M.3 | CD | LA |
CD.LA.SEP.M.4 | CD | LA |
CD.LA.ROS.M.5 | CD | LA |
CD.LA.MR2.M.2 | CD | LA |
CD.LA.ALL.M.2 | CD | LA |
CD.LA.ALL.M.5 | CD | LA |
CD.LA.VAL.M.5 | CD | LA |
CD.LA.HAR.M.4 | CD | LA |
CD.LA.LUB.M.3 | CD | LA |
CD.LA.GLO.M.4 | CD | LA |
CD.LA.ZOO.M.3 | CD | LA |
CD.LA.NWH.M.5 | CD | LA |
CD.LA.KIN.M.3 | CD | LA |
CD.LA.KIN.M.5 | CD | LA |
CD.PX.CAM.U.5 | CD | PX |
CD.PX.MON.U.5 | CD | PX |
CD.PX.PKW.U.5 | CD | PX |
CD.PX.LAP.M.4 | CD | PX |
CD.PX.NES.U.4 | CD | PX |
CD.PX.PAL.M.3 | CD | PX |
CD.PX.ASU.M.1 | CD | PX |
CD.PX.NUE.M.5 | CD | PX |
CD.PX.WES.M.3 | CD | PX |
CD.PX.MAN.M.4 | CD | PX |
CD.PX.CLA.M.3 | CD | PX |
CD.PX.DF1.M.5 | CD | PX |
CD.PX.COY.M.5 | CD | PX |
CD.PX.RPC.M.3 | CD | PX |
CD.PX.ENC.M.2 | CD | PX |
EC.BA.LH-2.M.2 | EC | BA |
EC.BA.WBO.U.4 | EC | BA |
EC.BA.WB.U.5 | EC | BA |
EC.BA.FH.U.3 | EC | BA |
EC.BA.CP.U.2 | EC | BA |
EC.BA.TRC.U.3 | EC | BA |
EC.BA.LL-4.M.4 | EC | BA |
EC.BA.WB.U.1 | EC | BA |
EC.BA.PIK.U.5 | EC | BA |
EC.BA.PSP.M.4 | EC | BA |
EC.BA.GA.U.2 | EC | BA |
EC.BA.LL-3.M.3 | EC | BA |
EC.BA.ML.U.1 | EC | BA |
EC.BA.TRC.U.5 | EC | BA |
EC.BA.ML.U.3 | EC | BA |
EC.LA.SGB.U.2 | EC | LA |
EC.LA.SGB.U.5 | EC | LA |
EC.LA.DUR.U.2 | EC | LA |
EC.LA.HOW.U.2 | EC | LA |
EC.LA.SAN.U.2 | EC | LA |
EC.LA.VER.U.1 | EC | LA |
EC.LA.VER.U.4 | EC | LA |
EC.LA.VB2.U.4 | EC | LA |
EC.LA.AC2.U.2 | EC | LA |
EC.LA.AC1.U.1 | EC | LA |
EC.LA.VB1.U.1 | EC | LA |
EC.LA.VB1.U.3 | EC | LA |
EC.LA.SGR.U.4 | EC | LA |
EC.LA.SGR.U.5 | EC | LA |
EC.LA.HOW.U.3 | EC | LA |
EC.PX.BUF.M.1 | EC | PX |
EC.PX.BUF.M.3 | EC | PX |
EC.PX.ALA.M.3 | EC | PX |
EC.PX.MTN.M.2 | EC | PX |
EC.PX.WES.M.1 | EC | PX |
EC.PX.WES.M.2 | EC | PX |
EC.PX.MAN.M.1 | EC | PX |
EC.PX.CLA.M.1 | EC | PX |
EC.PX.PSC.M.1 | EC | PX |
EC.PX.DF1.M.1 | EC | PX |
EC.PX.DOW.M.1 | EC | PX |
EC.PX.DOW.M.2 | EC | PX |
EC.PX.COY.M.2 | EC | PX |
EC.PX.COY.M.3 | EC | PX |
EC.PX.ALA.M.5 | EC | PX |
LS.BA.WB.U.1 | LS | BA |
LS.BA.WB.U.2 | LS | BA |
LS.BA.DHI.U.2 | LS | BA |
LS.BA.GA.U.1 | LS | BA |
LS.BA.PIK.U.3 | LS | BA |
LS.BA.PIK.U.5 | LS | BA |
LS.BA.CP.U.2 | LS | BA |
LS.BA.ML.U.2 | LS | BA |
LS.BA.WBO.U.3 | LS | BA |
LS.BO.WL3.M.4 | LS | BO |
LS.BO.I1.U.1 | LS | BO |
LS.BO.I2.U.1 | LS | BO |
LS.BO.WL2.M.2 | LS | BO |
LS.BO.R1.U.2 | LS | BO |
LS.BO.R2.U.4 | LS | BO |
LS.BO.R3.U.3 | LS | BO |
LS.BO.HC4.M.3 | LS | BO |
LS.BO.LC4.M.2 | LS | BO |
LS.LA.VET.M.4 | LS | LA |
LS.LA.SSV.M.1 | LS | LA |
LS.LA.NAV.M.4 | LS | LA |
LS.LA.SHO.M.2 | LS | LA |
LS.LA.WES.M.3 | LS | LA |
LS.LA.GLO.M.3 | LS | LA |
LS.LA.HOW.U.5 | LS | LA |
LS.LA.SAN.U.2 | LS | LA |
LS.LA.ARR.U.2 | LS | LA |
LS.MN.L06-LS.U.2 | LS | MN |
LS.MN.L06-LS.U.5 | LS | MN |
LS.MN.L07-LS.U.4 | LS | MN |
LS.MN.L08-LS.U.5 | LS | MN |
LS.MN.L09-LS.U.3 | LS | MN |
LS.MN.L01-LS.M.4 | LS | MN |
LS.MN.L01-LS.U.3 | LS | MN |
LS.MN.L02-LS.U.1 | LS | MN |
LS.MN.L05-LS.U.2 | LS | MN |
LS.PX.MON.U.2 | LS | PX |
LS.PX.PKW.U.5 | LS | PX |
LS.PX.PIE.M.4 | LS | PX |
LS.PX.ALA.M.3 | LS | PX |
LS.PX.PAL.M.3 | LS | PX |
LS.PX.MAN.M.2 | LS | PX |
LS.PX.NUE.M.1 | LS | PX |
LS.PX.ENC.M.4 | LS | PX |
LS.PX.COY.M.3 | LS | PX |
PA.BA.PIK.U.1 | PA | BA |
PA.BA.LH-3.M.2 | PA | BA |
PA.BA.LH-3.M.3 | PA | BA |
PA.BA.WB.U.1 | PA | BA |
PA.BA.AA.U.1 | PA | BA |
PA.BA.WGP.M.3 | PA | BA |
PA.BA.LL-4.M.3 | PA | BA |
PA.BA.LA.U.2 | PA | BA |
PA.BA.LH-2.M.2 | PA | BA |
PA.BA.W3.M.3 | PA | BA |
PA.BA.RG-1.M.2 | PA | BA |
PA.BA.LL-3.M.5 | PA | BA |
PA.BO.I2.U.3 | PA | BO |
PA.BO.HC1.M.4 | PA | BO |
PA.BO.R3.U.2 | PA | BO |
PA.BO.HC4.M.5 | PA | BO |
PA.BO.R4.U.2 | PA | BO |
PA.BO.WL2.M.5 | PA | BO |
PA.BO.WL4.M.4 | PA | BO |
PA.BO.LC4.M.4 | PA | BO |
PA.BO.HC2.M.1 | PA | BO |
PA.BO.R1.U.2 | PA | BO |
PA.BO.WL1.M.1 | PA | BO |
PA.BO.I1.U.5 | PA | BO |
PA.LA.ALL.M.5 | PA | LA |
PA.LA.SEP.M.1 | PA | LA |
PA.LA.SEP.M.5 | PA | LA |
PA.LA.WHI.M.2 | PA | LA |
PA.LA.ROS.M.5 | PA | LA |
PA.LA.LUB.M.2 | PA | LA |
PA.LA.GLO.M.2 | PA | LA |
PA.LA.ZOO.M.4 | PA | LA |
PA.LA.ZOO.M.5 | PA | LA |
PA.LA.NWH.M.2 | PA | LA |
PA.LA.KIN.M.4 | PA | LA |
PA.LA.POP.M.4 | PA | LA |
PA.PX.BUF.M.3 | PA | PX |
PA.PX.PIE.M.4 | PA | PX |
PA.PX.LAP.M.5 | PA | PX |
PA.PX.ALA.M.1 | PA | PX |
PA.PX.PAP.M.2 | PA | PX |
PA.PX.PAP.M.5 | PA | PX |
PA.PX.DF1.M.2 | PA | PX |
PA.PX.RPP.U.3 | PA | PX |
PA.PX.ENC.M.4 | PA | PX |
PA.PX.ENC.M.5 | PA | PX |
PA.PX.COY.M.1 | PA | PX |
PA.PX.BUF.M.2 | PA | PX |
TO.BA.WBO.U.4 | TO | BA |
TO.BA.CP.U.1 | TO | BA |
TO.BA.FH.U.1 | TO | BA |
TO.BA.LH-3.M.4 | TO | BA |
TO.BA.WGP.M.3 | TO | BA |
TO.BA.GA.U.4 | TO | BA |
TO.BA.PIK.U.4 | TO | BA |
TO.BA.PSP.M.1 | TO | BA |
TO.BA.RG-2.M.2 | TO | BA |
TO.BO.HC1.M.4 | TO | BO |
TO.BO.HC2.M.5 | TO | BO |
TO.BO.HC3.M.1 | TO | BO |
TO.BO.HC4.M.5 | TO | BO |
TO.BO.LC1.M.1 | TO | BO |
TO.BO.LC2.M.5 | TO | BO |
TO.BO.LC3.M.1 | TO | BO |
TO.BO.WL2.M.1 | TO | BO |
TO.BO.I2.U.3 | TO | BO |
TO.LA.WHI.M.5 | TO | LA |
TO.LA.HAR.M.4 | TO | LA |
TO.LA.MR1.M.1 | TO | LA |
TO.LA.GLO.M.5 | TO | LA |
TO.LA.ZOO.M.1 | TO | LA |
TO.LA.NWH.M.4 | TO | LA |
TO.LA.VNS.M.2 | TO | LA |
TO.LA.PEP.M.5 | TO | LA |
TO.LA.COM.M.4 | TO | LA |
TO.MN.L11-TO.M.3 | TO | MN |
TO.MN.L02-TO.U.1 | TO | MN |
TO.MN.L04-TO.U.1 | TO | MN |
TO.MN.L06-TO.U.2 | TO | MN |
TO.MN.L08-TO.U.5 | TO | MN |
TO.MN.L09-TO.U.2 | TO | MN |
TO.MN.L11-TO.U.3 | TO | MN |
TO.MN.L05-TO.M.5 | TO | MN |
TO.MN.L08-TO.M.5 | TO | MN |
TO.PX.BUF.M.1 | TO | PX |
TO.PX.ALA.M.2 | TO | PX |
TO.PX.LAP.M.4 | TO | PX |
TO.PX.WES.M.1 | TO | PX |
TO.PX.CLA.M.1 | TO | PX |
TO.PX.DF1.M.1 | TO | PX |
TO.PX.DF2.M.1 | TO | PX |
TO.PX.COY.M.1 | TO | PX |
TO.PX.COY.M.6 | TO | PX |
Subset of samples used in SNP catalog creation.
Files can be found in the `07_sstacks/` directory.
All samples in the population (or all samples you want to include in the analysis) are matched against the catalog produced in `cstacks` with `sstacks`, run in the scripts `stacks_SPECIES.sh` and `stacks_SPECIES_additional.sh`. These run off of the samples in the output directory and the samples listed in `sstacks_samples_SPECIES.txt` and `sstacks_samples_SPECIES_additional.txt` (respectively), so make sure all your files (sample and catalog, etc.) are there and match.
`sstacks_samples_SPECIES.txt` takes the form:
DS.BA.GA.U.1
DS.BA.GA.U.2
DS.BA.GA.U.3
DS.BA.GA.U.4
DS.BA.GA.U.5
...
There should be a new file produced at this step for every sample in the output directory:
<samplename>.matches.tsv.gz
A small number of samples generated very few matches to the catalog (such as only 4 loci matching, obviously not enough to draw any conclusions) and therefore aren’t used in the next step. See output/sstacks-discarded_samples.csv.
Files can be found in the `08_polyRAD/` directory.
We used the polyRAD package to call genotypes because many of our species are polyploid or have historical genome duplication. PolyRAD takes the catalog output (`catalog.alleles.tsv.gz`) and accompanying matches to the catalog (e.g., `CD.BA.AA.U.1.matches.tsv.gz`) to create genotype likelihoods for species with diploidy and/or polyploidy. We used the catalog and match files to create a RADdata object class in R for each species. We ran this on the Rockfish compute cluster, with the `make_polyRAD_<spp>.R` script doing the brunt of the work. The R script was wrapped by `polyrad_make_<spp>.sh` to submit the script to the SLURM scheduler.
Relevant Parameters:
- `min.ind.with.reads` was set to 20% of samples. This means we discarded any loci not found in at least 20% of samples for each species.
- `min.ind.with.minor.allele` was set to `2`. This means a locus must have at least this many samples with reads for the minor allele in order to be retained.
Requires:
- `popmap_<spp>_polyrad.txt`, a list of samples and population
- output from sstacks
Outputs:
- `<spp>_polyRADdata.rds`, RDS object (the RADdata object)
Next, we calculated overdispersion using the `polyRAD_overdispersion_<spp>.R` script, wrapped by `polyrad_overd_<spp>.sh` to submit the script to the SLURM scheduler.
Requires:
- `popmap_<spp>_polyrad.txt`, a list of samples and population
- `<spp>_polyRADdata.rds`, RDS object (the RADdata object) output from the previous step
Outputs:
- `<spp>_overdispersion.rds`, RDS object (the overdispersion test output)
Next, we filtered loci based on the expected Hind/He statistic and estimated population structure/genotypes using the `polyRAD_filter_<spp>.R` script, wrapped by `polyrad_filt_<spp>.sh` to submit the script to the SLURM scheduler.
We used the table in this tutorial to estimate inbreeding from the ploidy, optimal overdispersion value, and mean Hind/He. These values are hardcoded in `polyRAD_filter_<spp>.R`.
Requires:
- popmap_<spp>_polyrad.txt, a list of samples and populations
- <spp>_polyRADdata.rds, RDS object (the RADdata object) output from the previous step
- <spp>_overdispersion.rds, RDS object (the overdispersion test output) output from the previous step
Outputs:
- <spp>_filtered_RADdata.rds, RDS object (RADdata object filtered for appropriate Hind/He)
- <spp>_IteratePopStructPCA.csv, data output from the genotype estimate PCA, suitable for plotting
- <spp>_estimatedgeno_RADdata.rds, RDS object (RADdata object with genotype estimates)
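For intuition about how mean Hind/He relates to inbreeding, the sketch below uses the simple expectation Hind/He = (ploidy - 1) / ploidy × (1 - F). This deliberately ignores overdispersion, which the tutorial table accounts for, so the table's values will differ. The function names are hypothetical and the inputs are illustrative, not values from this study.

```r
# Expected Hind/He for a given ploidy and inbreeding coefficient F,
# ignoring overdispersion (a simplifying assumption of this sketch).
expected_hind_he <- function(ploidy, inbreeding = 0) {
  (ploidy - 1) / ploidy * (1 - inbreeding)
}

# Invert the same relationship to get F from an observed mean Hind/He.
inbreeding_from_hind_he <- function(hind_he, ploidy) {
  1 - hind_he * ploidy / (ploidy - 1)
}

expected_hind_he(2)                        # 0.5 for an outbred diploid
expected_hind_he(4)                        # 0.75 for an outbred tetraploid
inbreeding_from_hind_he(0.4, ploidy = 2)   # about 0.2
```

Loci with Hind/He well above the expected value are the ones a filter of this kind would typically flag.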
The output <spp>_estimatedgeno_RADdata.rds needs to be converted to genind and structure format for further analysis and steps. There is a little cleanup involved so the population information is retained. For example, Structure needs the population identity to be an integer, not a string. This set of functions can be run on a laptop.
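The integer recoding mentioned above can be sketched in base R as follows. The sample IDs are hypothetical, and reading the city from the second dot-separated field is an assumption based on the naming pattern of the sample IDs in this document; convert_genomics.R may do this differently.

```r
# Hypothetical sample IDs following a species.city.site.type.number pattern.
sample_ids <- c("CD.BA.siteA.M.1", "CD.LA.siteB.U.2",
                "CD.PX.siteC.M.1", "CD.BA.siteA.U.3")

# Pull out the city code (assumed to be the second field).
city <- vapply(strsplit(sample_ids, ".", fixed = TRUE), `[`, character(1), 2)

# Recode city strings as integers; sorting the levels keeps the mapping stable.
city_levels <- sort(unique(city))
pop_int <- as.integer(factor(city, levels = city_levels))

data.frame(sample = sample_ids, city = city, pop = pop_int)
```

Keeping city_levels around makes it possible to map the integers back to city codes when plotting.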
At this stage, we also visually assessed coverage (see check_coverage inside the convert_genomics.R script). We removed the following samples from further analysis:
Sample |
---|
CD.BA.PSP.M.1 |
CD.BA.DHI.U.2 |
CD.BA.DHI.U.3 |
CD.BA.RG-1.M.5 |
CD.BA.RG-1.M.4 |
DS.BO.WL1.M.4 |
DS.BO.I1.U.3 |
EC.BO.R4.U.1 |
LS.BO.HC2.M.5 |
LS.BO.LC4.M.3 |
LS.BO.R2.U.4 |
LS.BO.R2.U.1 |
PA.BA.LH-3.M.4 |
PA.BA.AA.U.3 |
PA.BA.AA.U.4 |
PA.PX.RPP.U.2 |
PA.BO.HC2.M.4 |
PA.PX.RPP.U.1 |
TO.BA.TRC.U.1 |
TO.BA.TRC.U.3 |
TO.BO.R4.U.1 |
TO.BA.TRC.U.2 |
TO.BO.R4.U.2 |
TO.BO.R2.U.2 |
Subset of samples discarded after genotype estimation using polyRAD.
source("08_polyRAD/convert_genomics.R")
convert_all()
Files, including model parameters, can be found in the 09_structure/ directory.
Structure documentation can be found here.
polyRAD outputs genotype probabilities in a format suitable for Structure. These files were named as:
CD_estimatedgeno.structure
DS_estimatedgeno.structure
EC_estimatedgeno.structure
LS_estimatedgeno.structure
PA_estimatedgeno.structure
TO_estimatedgeno.structure
We ran all species using a naive approach (not using prior information) across a range of K values (the MAXPOPS argument). To search for the most appropriate K, we ran Structure through 5 replicate runs for each combination of species and K, with 10000 iterations discarded as burn-in and 20000 iterations retained. These runs created files that look like:
structure_out_CD1_naive_f // K = 1, rep 1
structure_out_CD1_naive_rep2_f // K = 1, rep 2
structure_out_CD1_naive_rep3_f // K = 1, rep 3
structure_out_CD1_naive_rep4_f // K = 1, rep 4
structure_out_CD1_naive_rep5_f // K = 1, rep 5
structure_out_CD2_naive_f // K = 2, rep 1
structure_out_CD2_naive_rep2_f // K = 2, rep 2
structure_out_CD2_naive_rep3_f // K = 2, rep 3
...
Within each species, we compressed the result files for all K and reps and submitted to Structure Harvester to choose the optimal K using the Delta-K method (see https://link.springer.com/article/10.1007/s12686-011-9548-7). Once the optimal K was selected per species, we re-ran Structure using a greater number of iterations (100000) for final output and plotting.
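To make the Delta-K idea concrete, here is a minimal base R sketch: Delta-K at each K is the absolute second difference of the mean log-likelihood across replicates, divided by the standard deviation of the log-likelihood at that K. The log-likelihood values are made up for illustration, the function name delta_k() is hypothetical, and Structure Harvester's own implementation may differ in detail.

```r
# Made-up log-likelihoods: one row per K, one column per replicate run.
lnp <- rbind(
  "1" = c(-5000, -5002, -4998),
  "2" = c(-4500, -4501, -4499),
  "3" = c(-4400, -4410, -4390),
  "4" = c(-4380, -4420, -4340)
)

delta_k <- function(lnp) {
  k_values <- as.integer(rownames(lnp))
  mean_l <- rowMeans(lnp)          # mean L(K) across replicates
  sd_l <- apply(lnp, 1, sd)        # spread of L(K) across replicates
  inner <- 2:(nrow(lnp) - 1)       # Delta-K is undefined at the first and last K
  second_diff <- abs(mean_l[inner + 1] - 2 * mean_l[inner] + mean_l[inner - 1])
  data.frame(K = k_values[inner], delta_k = unname(second_diff / sd_l[inner]))
}

res <- delta_k(lnp)
res
best_k <- res$K[which.max(res$delta_k)]
best_k  # 2 for these made-up values
```

Because the statistic needs both neighbours, it cannot select the smallest or largest K that was run.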
From the USGS:
The U.S. Geological Survey (USGS), in partnership with several federal agencies, has developed and released four National Land Cover Database (NLCD) products over the past two decades: NLCD 1992, 2001, 2006, and 2011. This one is for data from 2016 and describes urban imperviousness.
NLCD imperviousness products represent urban impervious surfaces as a percentage of developed surface over every 30-meter pixel in the United States. NLCD 2016 updates all previously released versions of impervious products for CONUS (NLCD 2001, NLCD 2006, NLCD 2011) along with a new date of impervious surface for 2016. New for NLCD 2016 is an impervious surface descriptor layer. This descriptor layer identifies types of roads, core urban areas, and energy production sites for each impervious pixel to allow deeper analysis of developed features.
First, we trimmed the large data. This makes a smaller .rds file for each city.
source("R/trim_NLCD_spatial_data.R")
create_spatial_rds_files()
Next, we made plots for each city’s sampling locations. Note that these only include sites that had viable SNPs.
source("R/plot_map_of_samples.R")
make_all_urban_site_plots()
We used polysat::calcPopDiff() to calculate continental population statistics for each species.
source("R/calc_continental_stats.R")
do_all_continental_stats()
# CD as an example
read.csv("output/population_stats/CD_continental_stats.csv")
## X statistic value
## 1 1 JostD 0.30579677
## 2 2 Gst 0.02735719
## 3 3 Fst 0.02812163
Within each species, we compressed the result files for all K and reps and submitted to Structure Harvester to choose the optimal K using the Delta-K method (see https://link.springer.com/article/10.1007/s12686-011-9548-7).
The results were:
CD: K=3
DS: K=3
EC: K=2
LS: K=3
PA: K=4
TO: K=3
# This file contains Structure output for the various values of K.
read_csv("output/structure/structure_k_Pr.csv")
The code below generates plots for Structure results.
source("R/plot_structure.R")
make_structure_multi_plot()
We ran sNMF as an alternative to Structure to validate the results. We
coerced all polyploid data to diploid data to make the file types
compatible with the sNMF function in R. The snmf() function computes an
entropy criterion that evaluates the quality of fit of the statistical
model to the data by using a cross-validation technique. We plotted the
cross-entropy criterion for K=[2:10] for all species. Using the best
K, we then selected the best of 10 runs in each K using the which.min() function.
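The selection step can be sketched with a small table of cross-entropy values. The numbers below are hypothetical, and choosing K by the lowest mean cross-entropy is an assumption of this sketch. With LEA, the per-run values for a given K would come from cross.entropy(project, K = k).

```r
# Hypothetical cross-entropy values: rows are K, columns are replicate runs.
ce <- matrix(
  c(0.62, 0.61, 0.63,
    0.55, 0.54, 0.56,
    0.58, 0.57, 0.59),
  nrow = 3, byrow = TRUE,
  dimnames = list(as.character(2:4), as.character(1:3))
)

# Assumed rule: the best K has the lowest mean cross-entropy across runs.
best_k <- as.integer(names(which.min(rowMeans(ce))))

# Within that K, keep the single run with the lowest cross-entropy.
best_run <- unname(which.min(ce[as.character(best_k), ]))

c(best_k = best_k, best_run = best_run)
```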
source("R/sNMF.R")
The following runs sNMF and generates the figure. Note that in the pdf version of this document, the figure might appear on the next page.
do_all_sNMF()
As with the Structure analysis, horseweed and prickly lettuce appear to have the most population structure. Phoenix crabgrass, horseweed, and prickly lettuce appear unique. In general, sNMF produced larger K for most species, which will create more sensitivity to admixture.
We performed hierarchical analysis of molecular variance (AMOVA; using GenoDive 3.06) based on the Rho statistic, which is based on a ploidy-independent infinite allele model. AMOVA is under the “Analysis” menu.
We used GenoDive v. 3.0.6 to calculate $F_{IS}$ (homozygosity within population). This can be run in GenoDive by selecting Analysis > Hardy-Weinberg > Heterozygosity-based (Nei) method.
head(read.csv("output/population_stats/genodive_output_Fis.csv"))
## Species Population n Fis
## 1 CD BA 55 0.166
## 2 CD LA 48 0.186
## 3 CD PX 82 0.200
## 4 CD Overall NA 0.187
## 5 DS BA 55 0.208
## 6 DS BO 52 0.252
We used GenoDive v. 3.0.6 to calculate pairwise $\rho$ between cities. This can be run in GenoDive by selecting Pairwise Differentiation from the Analysis menu and selecting the “rho” statistic from the dropdown.
We used the following script to clean up the results.
source("R/rho.R")
compile_rho_table()
Species | City1 | City2 | rho | p-value | adjusted p-value |
---|---|---|---|---|---|
CD | PX | BA | 0.050 | 0.001 | 0.0010 |
CD | LA | BA | 0.046 | 0.001 | 0.0010 |
CD | PX | LA | 0.015 | 0.001 | 0.0010 |
DS | MN | BA | 0.031 | 0.001 | 0.0015 |
DS | BO | BA | 0.018 | 0.001 | 0.0015 |
DS | PX | BA | 0.012 | 0.001 | 0.0015 |
DS | MN | BO | 0.007 | 0.001 | 0.0015 |
DS | PX | BO | -0.002 | 0.875 | 0.9550 |
DS | PX | MN | -0.002 | 0.955 | 0.9550 |
EC | PX | BA | 0.098 | 0.001 | 0.0010 |
EC | PX | LA | 0.087 | 0.001 | 0.0010 |
EC | LA | BA | 0.038 | 0.001 | 0.0010 |
LS | PX | BA | 0.077 | 0.001 | 0.0011 |
LS | PX | MN | 0.069 | 0.001 | 0.0011 |
LS | PX | LA | 0.061 | 0.001 | 0.0011 |
LS | PX | BO | 0.056 | 0.001 | 0.0011 |
LS | MN | LA | 0.039 | 0.001 | 0.0011 |
LS | LA | BA | 0.038 | 0.001 | 0.0011 |
LS | BO | BA | 0.032 | 0.001 | 0.0011 |
LS | MN | BO | 0.021 | 0.001 | 0.0011 |
LS | LA | BO | 0.010 | 0.001 | 0.0011 |
LS | MN | BA | 0.009 | 0.002 | 0.0020 |
PA | PX | BO | 0.028 | 0.001 | 0.0015 |
PA | LA | BO | 0.024 | 0.001 | 0.0015 |
PA | PX | BA | 0.015 | 0.001 | 0.0015 |
PA | LA | BA | 0.011 | 0.001 | 0.0015 |
PA | BO | BA | 0.008 | 0.002 | 0.0024 |
PA | PX | LA | -0.002 | 0.972 | 0.9720 |
TO | PX | BA | 0.023 | 0.001 | 0.0014 |
TO | PX | MN | 0.015 | 0.001 | 0.0014 |
TO | PX | BO | 0.013 | 0.002 | 0.0025 |
TO | LA | BA | 0.011 | 0.001 | 0.0014 |
TO | LA | BO | 0.009 | 0.001 | 0.0014 |
TO | MN | LA | 0.009 | 0.001 | 0.0014 |
TO | PX | LA | 0.009 | 0.027 | 0.0300 |
TO | BO | BA | 0.008 | 0.001 | 0.0014 |
TO | MN | BO | 0.008 | 0.001 | 0.0014 |
TO | MN | BA | 0.001 | 0.098 | 0.0980 |
Rho statistics for pairwise comparison between cities.
We used poppr::ia() to calculate the standardized index of association of loci in the dataset (rbarD). We use the standardized index of association to avoid the influence of different sample sizes, as described by Agapow and Burt 2001.
When p.rD is small (<0.05) and rbarD is (relatively) higher, that is a sign that the population could be in linkage disequilibrium.
An interesting note from the documentation:
It has been widely used as a tool to detect clonal reproduction within populations. Populations whose members are undergoing sexual reproduction, whether it be selfing or out-crossing, will produce gametes via meiosis, and thus have a chance to shuffle alleles in the next generation. Populations whose members are undergoing clonal reproduction, however, generally do so via mitosis. This means that the most likely mechanism for a change in genotype is via mutation. The rate of mutation varies from species to species, but it is rarely sufficiently high to approximate a random shuffling of alleles. The index of association is a calculation based on the ratio of the variance of the raw number of differences between individuals and the sum of those variances over each locus. You can also think of it as the observed variance over the expected variance.
There is a nice description here.
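The "observed variance over expected variance" idea can be illustrated with a toy calculation. The sketch below computes the standardized index of association for a small haploid, biallelic genotype matrix. The function name rbar_d() and the toy data are hypothetical, and poppr::ia() handles ploidy, missing data, and the permutation test, none of which are covered here.

```r
# Toy haploid data: rows are individuals, columns are loci (alleles coded 0/1).
linked   <- cbind(l1 = c(0, 0, 1, 1), l2 = c(0, 0, 1, 1), l3 = c(0, 0, 1, 1))
shuffled <- cbind(l1 = c(0, 1, 0, 1), l2 = c(0, 0, 1, 1))

rbar_d <- function(geno) {
  pairs <- combn(nrow(geno), 2)
  # Per-locus distance (0 = same allele, 1 = different) for every pair of individuals.
  d <- apply(geno, 2, function(locus) {
    as.numeric(locus[pairs[1, ]] != locus[pairs[2, ]])
  })
  var_j <- apply(d, 2, var)          # variance of distances at each locus
  v_obs <- var(rowSums(d))           # observed variance of the total distance
  cross <- outer(sqrt(var_j), sqrt(var_j))
  denom <- 2 * sum(cross[upper.tri(cross)])
  (v_obs - sum(var_j)) / denom
}

rbar_d(linked)    # loci that always co-vary give the maximum value
rbar_d(shuffled)  # loci that vary independently give a much lower value
```

Completely linked loci reach the upper bound of 1, which is why relatively high values are read as a sign of linkage disequilibrium.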
source("R/rbarD.R")
calc_rbarD()
head(read.csv("output/population_stats/rbarD.csv"))
## spp city n Ia p.Ia rbarD p.rD
## 1 CD BA 55 664.7655 0.001 0.2950534 0.001
## 2 CD LA 48 470.5913 0.001 0.2070064 0.001
## 3 CD PX 82 634.5787 0.001 0.2792566 0.001
## 4 DS BA 55 557.7334 0.001 0.2123881 0.001
## 5 DS BO 52 896.0906 0.001 0.3398873 0.001
## 6 DS MN 81 578.2494 0.001 0.2192197 0.001
We assessed isolation by distance by comparing genetic distance to
geographic distance. Specifically, we took the traditional approach of
comparing a geographic dissimilarity matrix (based on latitude and
longitude) to a genetic dissimilarity matrix. We calculated the genetic dissimilarity matrix with the dist.genpop function in the adegenet package. We used the Cavalli-Sforza distance metric, or the method = 2 argument for the dist.genpop function.
Note that for this analysis, we treated each sampling site as a distinct location. There would not be enough power to do a distance matrix among 3-5 cities. Code for generating stats and figures from the Mantel test can be found in the source code below.
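For readers unfamiliar with the Mantel test, the following base R sketch shows the permutation logic on simulated data: correlate the two distance matrices, then repeatedly shuffle the site labels of one matrix to build a null distribution. The coordinates, the simulated "genetic" distances, and the function name mantel_perm() are all hypothetical. This is not the code in R/isolation_by_distance.R, and it uses planar coordinates rather than latitude and longitude.

```r
set.seed(1)
n_sites <- 12

# Simulated site coordinates and a noisy "genetic" distance that tracks geography.
coords <- cbind(x = runif(n_sites, 0, 10), y = runif(n_sites, 0, 10))
geo_dist <- as.matrix(dist(coords))
gen_pos <- coords * 0.01 + matrix(rnorm(2 * n_sites, sd = 0.005), ncol = 2)
gen_dist <- as.matrix(dist(gen_pos))

mantel_perm <- function(m1, m2, nrepet = 999) {
  lower <- lower.tri(m1)
  obs <- cor(m1[lower], m2[lower])
  # Null distribution: permute rows and columns of m1 together.
  sims <- replicate(nrepet, {
    idx <- sample(nrow(m1))
    cor(m1[idx, idx][lower], m2[lower])
  })
  list(
    obs = obs,
    expectation = mean(sims),
    variance = var(sims),
    std_obs = (obs - mean(sims)) / sd(sims),
    p_value = (sum(sims >= obs) + 1) / (nrepet + 1)  # one-sided, "greater"
  )
}

result <- mantel_perm(gen_dist, geo_dist)
str(result)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, so shuffling individual cells would give a misleading null distribution.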
source("R/isolation_by_distance.R")
extract_ibd_stats_and_plots()
Below are the results of the Mantel test. Note that there is a p-value correction for testing multiple cities (species are treated as independent, however).
Species | Observation | Std.Obs | Expectation | Variance | p-value |
---|---|---|---|---|---|
Bermuda grass (CD) | 0.4476430 | 11.782345 | 0.0004658 | 0.0014404 | 1e-04 |
crabgrass (DS) | 0.3299992 | 8.210204 | 0.0009116 | 0.0016066 | 1e-04 |
horseweed (EC) | 0.4028339 | 9.240682 | -0.0000991 | 0.0019013 | 1e-04 |
prickly lettuce (LS) | 0.1939607 | 7.794608 | -0.0000002 | 0.0006192 | 1e-04 |
bluegrass (PA) | 0.2821562 | 8.264578 | -0.0006003 | 0.0011705 | 1e-04 |
dandelion (TO) | 0.3101457 | 7.250584 | -0.0000930 | 0.0018308 | 1e-04 |
Statistics from running 9999 permutations (‘Reps’) via Mantel test, limited to genomic versus distance comparisons. Hypothesis for all tests is ‘greater’.
We also repeated this within city. Note that there is a p-value correction for testing multiple cities (species are treated as independent, however).
Species | Observation | Std.Obs | Expectation | Variance | p-value | Adjusted p-value | City |
---|---|---|---|---|---|---|---|
Bermuda grass (CD) | 0.2945315 | 1.9546056 | -0.0029856 | 0.0231689 | 0.0436 | 0.0654000 | BA |
Bermuda grass (CD) | -0.1400172 | -0.8703840 | -0.0020079 | 0.0251417 | 0.7907 | 0.7907000 | LA |
Bermuda grass (CD) | 0.3063556 | 2.5384776 | 0.0026243 | 0.0143164 | 0.0128 | 0.0384000 | PX |
crabgrass (DS) | 0.1864927 | 1.3126728 | 0.0012624 | 0.0199118 | 0.1142 | 0.4568000 | BA |
crabgrass (DS) | -0.1472240 | -0.9380806 | 0.0012481 | 0.0250501 | 0.8230 | 0.9188000 | BO |
crabgrass (DS) | 0.0756460 | 0.5534710 | -0.0004163 | 0.0188864 | 0.2807 | 0.5614000 | MN |
crabgrass (DS) | -0.2825720 | -1.2882736 | -0.0015179 | 0.0475953 | 0.9188 | 0.9188000 | PX |
horseweed (EC) | 0.1702378 | 0.9531766 | 0.0005356 | 0.0316977 | 0.1752 | 0.5256000 | BA |
horseweed (EC) | -0.1334229 | -0.8704015 | -0.0005180 | 0.0233154 | 0.8025 | 0.8025000 | LA |
horseweed (EC) | 0.0106386 | 0.0858594 | -0.0031786 | 0.0258977 | 0.4472 | 0.6708000 | PX |
prickly lettuce (LS) | 0.1456205 | 0.5659122 | 0.0018806 | 0.0645144 | 0.2994 | 0.5571667 | BA |
prickly lettuce (LS) | -0.0725099 | -0.4629508 | -0.0004308 | 0.0242409 | 0.6496 | 0.7684000 | BO |
prickly lettuce (LS) | 0.4413406 | 2.2995233 | 0.0001766 | 0.0368065 | 0.0530 | 0.2650000 | LA |
prickly lettuce (LS) | 0.0845050 | 0.4603606 | -0.0014208 | 0.0348378 | 0.3343 | 0.5571667 | MN |
prickly lettuce (LS) | -0.1679338 | -0.8275834 | 0.0000745 | 0.0412133 | 0.7684 | 0.7684000 | PX |
bluegrass (PA) | -0.0149195 | -0.1337558 | 0.0007146 | 0.0136622 | 0.5099 | 0.6981333 | BA |
bluegrass (PA) | -0.2168452 | -1.6352688 | 0.0023646 | 0.0179697 | 0.9988 | 0.9988000 | BO |
bluegrass (PA) | -0.0431048 | -0.2310866 | 0.0020106 | 0.0381153 | 0.5236 | 0.6981333 | LA |
bluegrass (PA) | 0.3830110 | 1.7229224 | -0.0029655 | 0.0501869 | 0.0286 | 0.1144000 | PX |
dandelion (TO) | -0.0020598 | -0.0233426 | 0.0016880 | 0.0257782 | 0.4539 | 0.7635000 | BA |
dandelion (TO) | -0.1937755 | -1.2563379 | -0.0007760 | 0.0235993 | 0.9342 | 0.9342000 | BO |
dandelion (TO) | 0.1682753 | 1.2689800 | 0.0003895 | 0.0175032 | 0.1023 | 0.5115000 | LA |
dandelion (TO) | 0.0036501 | 0.0644593 | -0.0009095 | 0.0050035 | 0.4581 | 0.7635000 | MN |
dandelion (TO) | -0.1823350 | -0.7473653 | 0.0000244 | 0.0595375 | 0.7378 | 0.9222500 | PX |
Statistics from running 9999 permutations (‘Reps’) via Mantel test, limited to within city for genomic versus distance comparisons. Hypothesis for all tests is ‘greater’.
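The text does not name the correction method. As an inference only, the adjusted p-values for Bermuda grass (CD) in the within-city table are consistent with a Benjamini-Hochberg correction applied across cities within a species, which base R's p.adjust() reproduces:

```r
# Raw within-city p-values for Bermuda grass (CD), taken from the table above.
cd_within_city <- data.frame(
  city    = c("BA", "LA", "PX"),
  p_value = c(0.0436, 0.7907, 0.0128)
)

# Benjamini-Hochberg adjustment across the three cities (an inferred choice).
cd_within_city$adjusted <- p.adjust(cd_within_city$p_value, method = "BH")
cd_within_city
```

The resulting values (0.0654, 0.7907, 0.0384) match the "Adjusted p-value" column for those rows.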
Environmental variables include the monthly averages in the middle of the day for:
- air temperature at 5cm above ground
- air temperature at 1.2m above ground
- soil temperature at 2.5cm below ground
- RH (relative humidity) at 5cm above ground
- RH at 1.2m above ground
Variables were extracted from historic datasets and modeled using a microclimate model. More information can be found on the NicheMapR website (how the model works, what variables can be manipulated and what you can model, vignettes for running models in R).
This method was chosen because it takes data from global datasets (you can use both historic and current or pick specific years) but then accounts for site-specific variables (we can change the % shade, the slope or aspect of the landscape, and it considers elevation, average cloud cover, etc.). Here’s the list of all the different models/datasets we’re able to pull from. It’s meant for mechanistic niche modeling.
Variables in the file site_data_DUC_environvars.csv are all for the monthly averages at noon (12pm - hottest part of the day!) and are extreme. In other words, they are maximums.
Note that this Stack Overflow post is helpful with installing NicheMapR.
# devtools::install_github('mrke/NicheMapR')
# library(NicheMapR)
#
# NicheMapR expects loc = c(longitude, latitude), as in the micro_global() example below
# test_site_coords <- c(sites[1,]$long, sites[1,]$lat)
# test_distance_to_city_center_km <- sites[1,]$distance_to_city_center_km
# micros_ <- micro_usa(loc = test_site_coords)
#
# loc <- c(-89.40, 43.07)
# micro <- micro_global(loc = loc)
We assessed isolation by environment by comparing genetic distance to environmental distance, or the difference among sites. Genetic distance was generated the same way as isolation by distance (IBD) above. Code for generating stats from the Mantel test can be found in the source code below.
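As a small illustration of what "environmental distance" means here, the sketch below builds a pairwise distance matrix from a single environmental variable. The site names and values are hypothetical, and only the column name nlcd_urban_pct comes from this document. The actual construction lives in R/isolation_by_environment.R and may differ.

```r
# Hypothetical sites with a single environmental variable.
sites <- data.frame(
  site           = c("A", "B", "C"),
  nlcd_urban_pct = c(10, 55, 80)
)

# Keep the variable as a one-column matrix so site names label the distances.
env <- as.matrix(sites[, "nlcd_urban_pct", drop = FALSE])
rownames(env) <- sites$site

# With one variable, Euclidean distance is the absolute difference between sites.
env_dist <- dist(env)
as.matrix(env_dist)
```

This matrix plays the same role in the Mantel test that the geographic distance matrix plays in the isolation-by-distance analysis.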
source("R/isolation_by_environment.R")
The following function calls generate statistics for the environmental variables. By default, the function runs for “nlcd_urban_pct”, which is the percent urban cover of a site within a city. The four calls cover the four environmental variables mentioned in the main manuscript, although more environmental variables are present in the raw data.
extract_ibe_stats_and_plots()
extract_ibe_stats_and_plots(env_var_to_use = "distance_to_city_center_km")
extract_ibe_stats_and_plots(env_var_to_use = "soiltemp_2.5cm_Jul_12pm")
extract_ibe_stats_and_plots(env_var_to_use = "soiltemp_2.5cm_Apr_12pm")
Below are the results of the Mantel test. Note that there is a p-value correction for testing multiple environmental variables (species are treated as independent, however).
Species | Observation | Std.Obs | Expectation | Variance | p-value | Adjusted p-value | Env.Var |
---|---|---|---|---|---|---|---|
Bermuda grass (CD) | 0.0063593 | 0.1835877 | -0.0005072 | 0.0013989 | 0.3941 | 0.3941000 | nlcd_urban_pct |
Bermuda grass (CD) | 0.1376533 | 3.0448142 | 0.0002070 | 0.0020377 | 0.0040 | 0.0053333 | distance_to_city_center_km |
Bermuda grass (CD) | 0.4752489 | 13.8424955 | 0.0006095 | 0.0011757 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Apr_12pm |
Bermuda grass (CD) | 0.4323268 | 14.6854153 | 0.0004020 | 0.0008651 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Jul_12pm |
bluegrass (PA) | 0.0002138 | 0.0011500 | 0.0001657 | 0.0017543 | 0.4588 | 0.4737000 | nlcd_urban_pct |
bluegrass (PA) | -0.0069195 | -0.0996185 | 0.0000543 | 0.0049007 | 0.4737 | 0.4737000 | distance_to_city_center_km |
bluegrass (PA) | 0.2551951 | 6.3497889 | -0.0000994 | 0.0016165 | 0.0001 | 0.0004000 | soiltemp_2.5cm_Apr_12pm |
bluegrass (PA) | 0.2065044 | 3.9951073 | 0.0002376 | 0.0026656 | 0.0008 | 0.0016000 | soiltemp_2.5cm_Jul_12pm |
crabgrass (DS) | 0.0084529 | 0.2008775 | -0.0000271 | 0.0017821 | 0.3993 | 0.5324000 | nlcd_urban_pct |
crabgrass (DS) | -0.1048469 | -1.5781202 | 0.0001499 | 0.0044266 | 0.9724 | 0.9724000 | distance_to_city_center_km |
crabgrass (DS) | 0.3342972 | 6.0271540 | 0.0011639 | 0.0030550 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Apr_12pm |
crabgrass (DS) | 0.3007634 | 5.2868618 | 0.0011398 | 0.0032119 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Jul_12pm |
dandelion (TO) | -0.0799521 | -1.5694516 | -0.0002883 | 0.0025765 | 0.9486 | 0.9486000 | nlcd_urban_pct |
dandelion (TO) | 0.0153535 | 0.1977167 | -0.0012246 | 0.0070304 | 0.3992 | 0.5322667 | distance_to_city_center_km |
dandelion (TO) | 0.4279786 | 7.0596781 | 0.0001653 | 0.0036723 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Apr_12pm |
dandelion (TO) | 0.4638919 | 6.4145637 | 0.0006164 | 0.0052161 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Jul_12pm |
horseweed (EC) | 0.0159108 | 0.3024347 | 0.0008836 | 0.0024688 | 0.3163 | 0.3163000 | nlcd_urban_pct |
horseweed (EC) | 0.0946052 | 1.7061014 | 0.0003245 | 0.0030538 | 0.0619 | 0.0825333 | distance_to_city_center_km |
horseweed (EC) | 0.7172402 | 17.0412662 | 0.0002351 | 0.0017703 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Apr_12pm |
horseweed (EC) | 0.8921599 | 19.8503117 | 0.0003372 | 0.0020185 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Jul_12pm |
prickly lettuce (LS) | 0.0430785 | 1.0699199 | -0.0004009 | 0.0016515 | 0.1371 | 0.1828000 | nlcd_urban_pct |
prickly lettuce (LS) | -0.0891865 | -1.1853224 | -0.0007151 | 0.0055710 | 0.8831 | 0.8831000 | distance_to_city_center_km |
prickly lettuce (LS) | 0.4268723 | 14.7446772 | 0.0000286 | 0.0008380 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Apr_12pm |
prickly lettuce (LS) | 0.5214303 | 10.8973662 | -0.0000095 | 0.0022896 | 0.0001 | 0.0002000 | soiltemp_2.5cm_Jul_12pm |
Statistics from running 9999 permutations (‘Reps’) via Mantel test, for genomic versus environmental comparisons. Hypothesis for all tests is ‘greater’.
We also repeated the Mantel tests within city. Note that there is a p-value correction for testing multiple environmental variables and cities (species are treated as independent, however).
Species | Observation | Std.Obs | Expectation | Variance | p-value | Adjusted p-value | Env.Var | City |
---|---|---|---|---|---|---|---|---|
Bermuda grass (CD) | 0.3372190 | 2.8474161 | -0.0012071 | 0.0141262 | 0.0091 | 0.0546000 | nlcd_urban_pct | BA |
Bermuda grass (CD) | 0.1188409 | 0.9270429 | -0.0023793 | 0.0170982 | 0.1821 | 0.4370400 | distance_to_city_center_km | BA |
Bermuda grass (CD) | -0.2441035 | -1.4754900 | -0.0008758 | 0.0271740 | 0.9656 | 0.9656000 | soiltemp_2.5cm_Jul_12pm | BA |
Bermuda grass (CD) | -0.1570643 | -1.0130521 | 0.0002126 | 0.0241027 | 0.8431 | 0.9197455 | soiltemp_2.5cm_Apr_12pm | BA |
Bermuda grass (CD) | 0.0802221 | 0.4793677 | 0.0009244 | 0.0273643 | 0.3174 | 0.4761000 | nlcd_urban_pct | LA |
Bermuda grass (CD) | 0.1995251 | 1.4117523 | -0.0014271 | 0.0202613 | 0.0952 | 0.3808000 | distance_to_city_center_km | LA |
Bermuda grass (CD) | -0.0061106 | -0.0151598 | -0.0031778 | 0.0374262 | 0.4586 | 0.6114667 | soiltemp_2.5cm_Jul_12pm | LA |
Bermuda grass (CD) | -0.0805969 | -0.6239987 | -0.0007061 | 0.0163918 | 0.7126 | 0.8551200 | soiltemp_2.5cm_Apr_12pm | LA |
Bermuda grass (CD) | 0.0906880 | 0.9167417 | 0.0007042 | 0.0096346 | 0.1746 | 0.4370400 | nlcd_urban_pct | PX |
Bermuda grass (CD) | 0.3757125 | 3.0579121 | 0.0019998 | 0.0149357 | 0.0049 | 0.0546000 | distance_to_city_center_km | PX |
Bermuda grass (CD) | 0.0395164 | 0.3334686 | 0.0005637 | 0.0136448 | 0.3078 | 0.4761000 | soiltemp_2.5cm_Jul_12pm | PX |
Bermuda grass (CD) | 0.0417285 | 0.3405412 | 0.0005393 | 0.0146294 | 0.2844 | 0.4761000 | soiltemp_2.5cm_Apr_12pm | PX |
crabgrass (DS) | 0.1601026 | 1.2452588 | 0.0013308 | 0.0162565 | 0.1173 | 0.6256000 | nlcd_urban_pct | BA |
crabgrass (DS) | -0.1187959 | -0.6719382 | 0.0028106 | 0.0327533 | 0.7248 | 0.9411000 | distance_to_city_center_km | BA |
crabgrass (DS) | -0.1582690 | -0.8561975 | 0.0027774 | 0.0353797 | 0.7854 | 0.9411000 | soiltemp_2.5cm_Jul_12pm | BA |
crabgrass (DS) | -0.2451437 | -1.3695752 | 0.0005281 | 0.0321765 | 0.9337 | 0.9411000 | soiltemp_2.5cm_Apr_12pm | BA |
crabgrass (DS) | 0.0241445 | 0.2053813 | -0.0009766 | 0.0149608 | 0.3947 | 0.9411000 | nlcd_urban_pct | BO |
crabgrass (DS) | -0.1112305 | -0.6469967 | 0.0018740 | 0.0305602 | 0.7065 | 0.9411000 | distance_to_city_center_km | BO |
crabgrass (DS) | -0.0427338 | -0.2772193 | -0.0011589 | 0.0224914 | 0.5703 | 0.9411000 | soiltemp_2.5cm_Jul_12pm | BO |
crabgrass (DS) | -0.0600640 | -0.4084107 | -0.0003231 | 0.0213968 | 0.6217 | 0.9411000 | soiltemp_2.5cm_Apr_12pm | BO |
crabgrass (DS) | 0.0480884 | 0.4750376 | -0.0007043 | 0.0105500 | 0.3085 | 0.9411000 | nlcd_urban_pct | MN |
crabgrass (DS) | 0.0823907 | 0.5522340 | -0.0009400 | 0.0227701 | 0.2734 | 0.9411000 | distance_to_city_center_km | MN |
crabgrass (DS) | 0.3191330 | 2.1117687 | 0.0028693 | 0.0224288 | 0.0229 | 0.3664000 | soiltemp_2.5cm_Jul_12pm | MN |
crabgrass (DS) | 0.2069916 | 1.2882169 | 0.0053125 | 0.0245100 | 0.1127 | 0.6256000 | soiltemp_2.5cm_Apr_12pm | MN |
crabgrass (DS) | -0.0614409 | -0.2643071 | -0.0013096 | 0.0517588 | 0.4933 | 0.9411000 | nlcd_urban_pct | PX |
crabgrass (DS) | -0.2758803 | -1.3193515 | -0.0028122 | 0.0428372 | 0.9230 | 0.9411000 | distance_to_city_center_km | PX |
crabgrass (DS) | -0.2275997 | -1.0754253 | -0.0032313 | 0.0435274 | 0.8600 | 0.9411000 | soiltemp_2.5cm_Jul_12pm | PX |
crabgrass (DS) | -0.2533099 | -1.2102037 | -0.0027972 | 0.0428492 | 0.9411 | 0.9411000 | soiltemp_2.5cm_Apr_12pm | PX |
horseweed (EC) | -0.1903379 | -1.1707289 | 0.0019864 | 0.0269871 | 0.8784 | 0.9861000 | nlcd_urban_pct | BA |
horseweed (EC) | 0.1148969 | 0.6372828 | 0.0008399 | 0.0320316 | 0.2403 | 0.9861000 | distance_to_city_center_km | BA |
horseweed (EC) | -0.2773508 | -1.7132248 | -0.0010091 | 0.0260174 | 0.9805 | 0.9861000 | soiltemp_2.5cm_Jul_12pm | BA |
horseweed (EC) | -0.3260510 | -1.7866924 | -0.0014809 | 0.0330003 | 0.9861 | 0.9861000 | soiltemp_2.5cm_Apr_12pm | BA |
horseweed (EC) | -0.0694567 | -0.5193465 | 0.0002874 | 0.0180344 | 0.6319 | 0.9861000 | nlcd_urban_pct | LA |
horseweed (EC) | -0.1854615 | -1.2016425 | 0.0001369 | 0.0238560 | 0.9058 | 0.9861000 | distance_to_city_center_km | LA |
horseweed (EC) | -0.3092654 | -1.5703327 | 0.0019847 | 0.0392858 | 0.9581 | 0.9861000 | soiltemp_2.5cm_Jul_12pm | LA |
horseweed (EC) | 0.0814879 | 0.4078862 | 0.0014242 | 0.0385294 | 0.2957 | 0.9861000 | soiltemp_2.5cm_Apr_12pm | LA |
horseweed (EC) | 0.0533446 | 0.3463454 | -0.0014662 | 0.0250445 | 0.3303 | 0.9861000 | nlcd_urban_pct | PX |
horseweed (EC) | 0.2460485 | 1.4670912 | -0.0023983 | 0.0286783 | 0.0795 | 0.9540000 | distance_to_city_center_km | PX |
horseweed (EC) | -0.0941541 | -0.4764958 | -0.0008660 | 0.0383296 | 0.6275 | 0.9861000 | soiltemp_2.5cm_Jul_12pm | PX |
horseweed (EC) | -0.1007252 | -0.5128665 | -0.0014423 | 0.0374749 | 0.6366 | 0.9861000 | soiltemp_2.5cm_Apr_12pm | PX |
prickly lettuce (LS) | 0.2275756 | 0.8970634 | -0.0005680 | 0.0646801 | 0.2049 | 0.8528421 | nlcd_urban_pct | BA |
prickly lettuce (LS) | -0.0007177 | -0.0050873 | 0.0004469 | 0.0524045 | 0.5032 | 0.8528421 | distance_to_city_center_km | BA |
prickly lettuce (LS) | -0.0230189 | -0.1469705 | -0.0002974 | 0.0239010 | 0.5440 | 0.8528421 | soiltemp_2.5cm_Jul_12pm | BA |
prickly lettuce (LS) | -0.2125019 | -0.8382982 | -0.0005803 | 0.0639077 | 0.7748 | 0.8528421 | soiltemp_2.5cm_Apr_12pm | BA |
prickly lettuce (LS) | -0.1076683 | -0.8169817 | 0.0019126 | 0.0179906 | 0.7839 | 0.8528421 | nlcd_urban_pct | BO |
prickly lettuce (LS) | -0.0764992 | -0.4279106 | -0.0003930 | 0.0316325 | 0.6318 | 0.8528421 | distance_to_city_center_km | BO |
prickly lettuce (LS) | 0.0019421 | 0.0116429 | -0.0000526 | 0.0293529 | 0.4734 | 0.8528421 | soiltemp_2.5cm_Jul_12pm | BO |
prickly lettuce (LS) | -0.1207036 | -0.7310420 | 0.0008312 | 0.0276386 | 0.7451 | 0.8528421 | soiltemp_2.5cm_Apr_12pm | BO |
prickly lettuce (LS) | -0.0772755 | -0.5237519 | -0.0025257 | 0.0203690 | 0.6499 | 0.8528421 | nlcd_urban_pct | LA |
prickly lettuce (LS) | 0.2393495 | 1.4098461 | 0.0000782 | 0.0288030 | 0.1068 | 0.8100000 | distance_to_city_center_km | LA |
prickly lettuce (LS) | 0.5090590 | 3.0381437 | -0.0005762 | 0.0281386 | 0.0042 | 0.0840000 | soiltemp_2.5cm_Jul_12pm | LA |
prickly lettuce (LS) | 0.0502661 | 0.3156152 | -0.0040791 | 0.0296488 | 0.3191 | 0.8528421 | soiltemp_2.5cm_Apr_12pm | LA |
prickly lettuce (LS) | 0.0596842 | 0.3218968 | 0.0017251 | 0.0324198 | 0.3751 | 0.8528421 | nlcd_urban_pct | MN |
prickly lettuce (LS) | 0.2307428 | 1.2084445 | -0.0013445 | 0.0368850 | 0.1215 | 0.8100000 | distance_to_city_center_km | MN |
prickly lettuce (LS) | -0.1463485 | -0.9646066 | -0.0015993 | 0.0225181 | 0.8102 | 0.8528421 | soiltemp_2.5cm_Jul_12pm | MN |
prickly lettuce (LS) | -0.1569714 | -0.9013417 | -0.0006673 | 0.0300720 | 0.7874 | 0.8528421 | soiltemp_2.5cm_Apr_12pm | MN |
prickly lettuce (LS) | 0.0048140 | 0.0371630 | -0.0000159 | 0.0168916 | 0.4261 | 0.8528421 | nlcd_urban_pct | PX |
prickly lettuce (LS) | -0.2384487 | -1.5636104 | 0.0012078 | 0.0234921 | 0.9841 | 0.9841000 | distance_to_city_center_km | PX |
prickly lettuce (LS) | -0.0884762 | -0.4310103 | -0.0017385 | 0.0404986 | 0.5993 | 0.8528421 | soiltemp_2.5cm_Jul_12pm | PX |
prickly lettuce (LS) | -0.1131637 | -0.5322267 | -0.0015588 | 0.0439717 | 0.6554 | 0.8528421 | soiltemp_2.5cm_Apr_12pm | PX |
bluegrass (PA) | 0.2590523 | 1.7915057 | -0.0003314 | 0.0209628 | 0.0504 | 0.6586667 | nlcd_urban_pct | BA |
bluegrass (PA) | 0.0837403 | 0.4954254 | -0.0006743 | 0.0290321 | 0.3140 | 0.7319273 | distance_to_city_center_km | BA |
bluegrass (PA) | 0.0474162 | 0.2515628 | 0.0006727 | 0.0345263 | 0.3951 | 0.7319273 | soiltemp_2.5cm_Jul_12pm | BA |
bluegrass (PA) | -0.0469166 | -0.3069994 | 0.0000839 | 0.0234385 | 0.5914 | 0.7885333 | soiltemp_2.5cm_Apr_12pm | BA |
bluegrass (PA) | -0.0211154 | -0.1686075 | -0.0016706 | 0.0133001 | 0.5032 | 0.7319273 | nlcd_urban_pct | BO |
bluegrass (PA) | -0.1839908 | -1.2359879 | 0.0020825 | 0.0226641 | 0.9591 | 0.9591000 | distance_to_city_center_km | BO |
bluegrass (PA) | 0.0497172 | 0.3308238 | 0.0014752 | 0.0212646 | 0.3519 | 0.7319273 | soiltemp_2.5cm_Jul_12pm | BO |
bluegrass (PA) | 0.1285994 | 0.8795620 | 0.0008966 | 0.0210799 | 0.1880 | 0.7319273 | soiltemp_2.5cm_Apr_12pm | BO |
bluegrass (PA) | -0.1798066 | -1.1444740 | -0.0016225 | 0.0242396 | 0.8360 | 0.8917333 | nlcd_urban_pct | LA |
bluegrass (PA) | 0.0298291 | 0.1604297 | 0.0022274 | 0.0296008 | 0.4126 | 0.7319273 | distance_to_city_center_km | LA |
bluegrass (PA) | -0.0614415 | -0.2935904 | 0.0015465 | 0.0460291 | 0.4510 | 0.7319273 | soiltemp_2.5cm_Jul_12pm | LA |
bluegrass (PA) | -0.1578118 | -0.8920280 | 0.0013053 | 0.0318183 | 0.7664 | 0.8917333 | soiltemp_2.5cm_Apr_12pm | LA |
bluegrass (PA) | -0.1804107 | -0.7813029 | -0.0041644 | 0.0508863 | 0.8055 | 0.8917333 | nlcd_urban_pct | PX |
bluegrass (PA) | -0.0252203 | -0.1243909 | 0.0012535 | 0.0452954 | 0.4579 | 0.7319273 | distance_to_city_center_km | PX |
bluegrass (PA) | 0.4310886 | 1.5503133 | -0.0027320 | 0.0783035 | 0.1186 | 0.6586667 | soiltemp_2.5cm_Jul_12pm | PX |
bluegrass (PA) | 0.4553606 | 1.5631826 | -0.0006383 | 0.0850958 | 0.1235 | 0.6586667 | soiltemp_2.5cm_Apr_12pm | PX |
dandelion (TO) | -0.2054537 | -1.6876982 | -0.0019329 | 0.0145421 | 0.9770 | 0.9841000 | nlcd_urban_pct | BA |
dandelion (TO) | -0.1162303 | -0.6877884 | 0.0009578 | 0.0290307 | 0.7455 | 0.9841000 | distance_to_city_center_km | BA |
dandelion (TO) | -0.2552180 | -1.7044053 | -0.0004910 | 0.0223359 | 0.9841 | 0.9841000 | soiltemp_2.5cm_Jul_12pm | BA |
dandelion (TO) | -0.1781216 | -1.2260365 | -0.0000668 | 0.0210912 | 0.9035 | 0.9841000 | soiltemp_2.5cm_Apr_12pm | BA |
dandelion (TO) | 0.0264439 | 0.2249632 | 0.0005884 | 0.0132094 | 0.3798 | 0.7596000 | nlcd_urban_pct | BO |
dandelion (TO) | -0.1317608 | -0.7825541 | -0.0007820 | 0.0280139 | 0.7562 | 0.9841000 | distance_to_city_center_km | BO |
dandelion (TO) | 0.0495102 | 0.3167951 | -0.0017255 | 0.0261570 | 0.3347 | 0.7596000 | soiltemp_2.5cm_Jul_12pm | BO |
dandelion (TO) | 0.0566830 | 0.3766265 | -0.0016718 | 0.0240066 | 0.3146 | 0.7596000 | soiltemp_2.5cm_Apr_12pm | BO |
dandelion (TO) | 0.0429612 | 0.3098228 | 0.0005572 | 0.0187321 | 0.3507 | 0.7596000 | nlcd_urban_pct | LA |
dandelion (TO) | 0.1619736 | 1.2080728 | 0.0005408 | 0.0178565 | 0.1118 | 0.7596000 | distance_to_city_center_km | LA |
dandelion (TO) | 0.0498208 | 0.3481755 | -0.0008254 | 0.0211592 | 0.3337 | 0.7596000 | soiltemp_2.5cm_Jul_12pm | LA |
dandelion (TO) | 0.0396090 | 0.2603843 | 0.0008168 | 0.0221953 | 0.3376 | 0.7596000 | soiltemp_2.5cm_Apr_12pm | LA |
dandelion (TO) | -0.0259176 | -0.1965684 | -0.0002874 | 0.0170011 | 0.5574 | 0.9841000 | nlcd_urban_pct | MN |
dandelion (TO) | -0.0689391 | -0.7453822 | -0.0003591 | 0.0084652 | 0.7590 | 0.9841000 | distance_to_city_center_km | MN |
dandelion (TO) | -0.1163927 | -0.9129396 | 0.0016218 | 0.0167104 | 0.8052 | 0.9841000 | soiltemp_2.5cm_Jul_12pm | MN |
dandelion (TO) | 0.1077444 | 0.8015352 | 0.0004217 | 0.0179283 | 0.2205 | 0.7596000 | soiltemp_2.5cm_Apr_12pm | MN |
dandelion (TO) | -0.2898283 | -1.4806284 | 0.0001428 | 0.0383546 | 0.9451 | 0.9841000 | nlcd_urban_pct | PX |
dandelion (TO) | -0.2290678 | -1.2581586 | 0.0012770 | 0.0335186 | 0.9379 | 0.9841000 | distance_to_city_center_km | PX |
dandelion (TO) | 0.0834362 | 0.2591188 | 0.0014711 | 0.1000599 | 0.3051 | 0.7596000 | soiltemp_2.5cm_Jul_12pm | PX |
dandelion (TO) | 0.1422441 | 0.4269951 | 0.0004292 | 0.1103057 | 0.2530 | 0.7596000 | soiltemp_2.5cm_Apr_12pm | PX |
Statistics from running 9999 permutations (‘Reps’) via Mantel test, limited to within city for genomic versus environmental comparisons. Hypothesis for all tests is ‘greater’.
The following function makes the main text figure.
ibe_mega_plot()
We tested the relationship between urbanness and the extent of admixture by running a correlation test between percent impervious surface and the extent of admixture using the cor.test() function.
source("R/plot_structure.R")
run_make_urban_admix_corr()
estimate | statistic | p.value | parameter | conf.low | conf.high | alternative |
---|---|---|---|---|---|---|
-0.1107532 | -1.5279720 | 0.1282002 | 188 | -0.2491780 | 0.0321063 | two.sided |
-0.0460642 | -0.6901589 | 0.4908087 | 224 | -0.1755096 | 0.0849468 | two.sided |
-0.1165779 | -1.2084826 | 0.2295507 | 106 | -0.2989656 | 0.0740269 | two.sided |
-0.0569564 | -0.7780453 | 0.4375309 | 186 | -0.1984491 | 0.0868618 | two.sided |
-0.0498366 | -0.6731691 | 0.5016936 | 182 | -0.1931055 | 0.0955130 | two.sided |
0.0285504 | 0.4443212 | 0.6572074 | 242 | -0.0973846 | 0.1535855 | two.sided |
Pearson’s product-moment correlation test results comparing percent impervious surface and extent of admixture.
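As an illustration of where the columns in this table come from, the sketch below runs `cor.test()` on simulated data and collects the same fields from the returned object. The vectors `pct_impervious` and `admixture` are hypothetical stand-ins, not the project data, and this is not the body of `run_make_urban_admix_corr()`.

```r
set.seed(42)

# Simulated stand-ins for the real per-sample values (hypothetical).
pct_impervious <- runif(100, min = 0, max = 100)
admixture <- runif(100)

ct <- cor.test(pct_impervious, admixture, method = "pearson")

# Collect the pieces of the htest object into one row.
result <- data.frame(
  estimate    = unname(ct$estimate),   # Pearson's r
  statistic   = unname(ct$statistic),  # t statistic
  p.value     = ct$p.value,
  parameter   = unname(ct$parameter),  # degrees of freedom, n - 2
  conf.low    = ct$conf.int[1],        # 95% confidence interval for r
  conf.high   = ct$conf.int[2],
  alternative = ct$alternative
)
print(result)
```

For a Pearson test the `parameter` column is the degrees of freedom, n - 2, so a value of 188 in the table corresponds to 190 paired observations.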
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] adegenet_2.1.10 ade4_1.7-22 LEA_3.16.0 ggh4x_0.2.8
## [5] here_1.0.1 lubridate_1.9.3 forcats_1.0.0 purrr_1.0.2
## [9] tibble_3.2.1 tidyverse_2.0.0 polysat_1.7-7 cowplot_1.1.3
## [13] viridis_0.6.5 viridisLite_0.4.2 raster_3.6-26 sp_2.1-4
## [17] stringr_1.5.1 readr_2.1.5 polyRAD_2.0.0 dplyr_1.1.4
## [21] magrittr_2.0.3 tidyr_1.3.1 ggplot2_3.5.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 fastmap_1.2.0 promises_1.3.0
## [5] digest_0.6.35 timechange_0.3.0 mime_0.12 lifecycle_1.0.4
## [9] cluster_2.1.6 terra_1.7-78 compiler_4.4.0 rlang_1.1.4
## [13] tools_4.4.0 igraph_2.0.3 utf8_1.2.4 yaml_2.3.8
## [17] knitr_1.47 labeling_0.4.3 bit_4.0.5 plyr_1.8.9
## [21] withr_3.0.0 grid_4.4.0 fansi_1.0.6 xtable_1.8-4
## [25] colorspace_2.1-0 scales_1.3.0 MASS_7.3-60.2 cli_3.6.2
## [29] rmarkdown_2.27 vegan_2.6-6.1 crayon_1.5.2 ragg_1.3.2
## [33] generics_0.1.3 rstudioapi_0.16.0 reshape2_1.4.4 tzdb_0.4.0
## [37] ape_5.8 splines_4.4.0 parallel_4.4.0 vctrs_0.6.5
## [41] Matrix_1.7-0 hms_1.1.3 bit64_4.0.5 seqinr_4.2-36
## [45] systemfonts_1.1.0 glue_1.7.0 codetools_0.2-20 stringi_1.8.4
## [49] gtable_0.3.5 later_1.3.2 munsell_0.5.1 pillar_1.9.0
## [53] htmltools_0.5.8.1 R6_2.5.1 textshaping_0.4.0 rprojroot_2.0.4
## [57] vroom_1.6.5 evaluate_0.24.0 shiny_1.8.1.1 lattice_0.22-6
## [61] highr_0.11 httpuv_1.6.15 Rcpp_1.0.12 fastmatch_1.1-4
## [65] permute_0.9-7 gridExtra_2.3 nlme_3.1-165 mgcv_1.9-1
## [69] xfun_0.44 pkgconfig_2.0.3
All data files for the Macrosystems project are permanently stored under Meghan Avolio’s group resources in the Johns Hopkins University Rockfish computing cluster. Files are stored under the ‘data’ directory under the following subdirectories:
- `01-raw_data`: This folder contains the raw, unprocessed data files obtained directly from the sequencing server. There are eight `fastq.gz` files per sublibrary, corresponding to the four sequencing lanes for each read direction.
- `02-concatenated_data`: This folder contains the concatenated, unprocessed files for each sublibrary (i.e., the files containing the sequences for each lane were combined to create one file per read direction).
- `03-pcr_filtered_data`: This folder contains the output of the `clone_filter` program, in which PCR replicates/clones have been removed from the raw sequences. There are two `fq.gz` files per sublibrary.
- `04-process_radtags`: This folder contains subdirectories for the `process_radtags` program, which demultiplexes and cleans the data. The `demux_txt_files` folder contains the .txt files used to identify barcodes and separate the individual samples in each sublibrary. The output of `process_radtags` is separated by individual and can be found in the relevant species folder (i.e., CD, DS, EC, LS, PA, TE, TO). Each individual sample has four data files: `sampleID.1.fq.gz` and `sampleID.2.fq.gz` contain the forward and reverse reads for each sample, and `sampleID.rem.1.fq.gz` and `sampleID.rem.2.fq.gz` contain the remainder reads that were cleaned and removed from the data sequence.
- `05-ustacks-denovo_data`: This folder contains species subdirectories that store the output of the `ustacks` program for each individual. There are three files per individual: `sampleID.alleles.tsv.gz`, `sampleID.snps.tsv.gz`, and `sampleID.tags.tsv.gz`. These files should be permanently stored here and copied to a new directory whenever a new catalog is built or a group of samples is aligned to a new catalog.
- `catalogs_by_city`: For any given species within a city, there is likely to be a slightly different set of SNPs compared to the whole metapopulation of five cities. We examined 24 species-city combinations. These catalogs are permanently stored here.
- `catalogs_by_species`: Metapopulation catalogs are stored within this folder for each species. Each metapopulation catalog was created using samples from all populations to create a national catalog.

Some notes about catalog directories:

- Catalogs contain three files: `catalog.alleles.tsv.gz`, `catalog.snps.tsv.gz`, and `catalog.tags.tsv.gz`. If you would like to use the catalog on a new project, you will need to copy all three files to a new project folder.
- You can determine which individuals were used to create the catalog by looking at the `cstacks_popmap.txt` file found within each folder. For the metapopulation catalogs, this information is also found in cstacks-metapop-catalog_samples-included.csv.
- You can determine which individuals were subsequently aligned to the catalog and used in the subsequent Stacks analysis by looking at the `popmap*.txt` file found within each folder.
- Each folder also contains the relevant ustacks and Stacks pipeline scripts and output files (i.e., from `cstacks`, `gstacks`, `sstacks`, `tsv2bam`, and `populations`).
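Copying a catalog to a new project can be sketched in a few lines of base R. The helper `copy_catalog()` and the demo directories are hypothetical; the three file names follow the Stacks catalog naming convention (`catalog.alleles.tsv.gz`, `catalog.snps.tsv.gz`, `catalog.tags.tsv.gz`). The demo creates empty placeholder files in a temporary directory so that it runs without access to the cluster.

```r
# The three files that together make up a Stacks catalog.
catalog_files <- c("catalog.alleles.tsv.gz",
                   "catalog.snps.tsv.gz",
                   "catalog.tags.tsv.gz")

# Hypothetical helper: copy all three catalog files, failing early if any
# of them is missing from the source directory.
copy_catalog <- function(catalog_dir, new_dir) {
  src <- file.path(catalog_dir, catalog_files)
  missing <- catalog_files[!file.exists(src)]
  if (length(missing) > 0) {
    stop("Missing catalog file(s): ", paste(missing, collapse = ", "))
  }
  dir.create(new_dir, recursive = TRUE, showWarnings = FALSE)
  invisible(file.copy(src, new_dir, overwrite = FALSE))
}

# Demo with empty placeholder files in temporary directories.
demo_src <- tempfile("demo_catalog_")
demo_dst <- tempfile("demo_new_project_")
dir.create(demo_src)
file.create(file.path(demo_src, catalog_files))

copied <- copy_catalog(demo_src, demo_dst)
list.files(demo_dst)
```

Checking for all three files before copying avoids ending up with a partial catalog in the new project folder.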
See data/aspera_transfer_file_names.csv. Preview:
readLines("data/aspera_transfer_file_names.csv", 10)
## [1] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L001_R1_001.fastq.gz"
## [2] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L001_R2_001.fastq.gz"
## [3] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L002_R1_001.fastq.gz"
## [4] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L002_R2_001.fastq.gz"
## [5] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L003_R1_001.fastq.gz"
## [6] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L003_R2_001.fastq.gz"
## [7] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L004_R1_001.fastq.gz"
## [8] "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L004_R2_001.fastq.gz"
## [9] "/Hoffman_macrosystems/AMH_macro_1_10_8px_S10_L001_R1_001.fastq.gz"
## [10] "/Hoffman_macrosystems/AMH_macro_1_10_8px_S10_L001_R2_001.fastq.gz"
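The file name convention can be unpacked with a regular expression. The sketch below, which is illustrative and not part of the project scripts, splits the ten names in the preview into their fields and groups the lane files that belong to the same sublibrary and read direction. The column names and the reading of `S1` as a sample-sheet number and `L001` as a lane are assumptions.

```r
raw_files <- c(
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L001_R1_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L001_R2_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L002_R1_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L002_R2_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L003_R1_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L003_R2_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L004_R1_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_1_12px_S1_L004_R2_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_10_8px_S10_L001_R1_001.fastq.gz",
  "/Hoffman_macrosystems/AMH_macro_1_10_8px_S10_L001_R2_001.fastq.gz"
)

# Fields: i5 barcode, i7 barcode, plex level, sample number, lane, read.
pattern <- paste0("^AMH_macro_([0-9]+)_([0-9]+)_([0-9]+)px_",
                  "(S[0-9]+)_(L[0-9]+)_(R[12])_001\\.fastq\\.gz$")
files <- basename(raw_files)
parts <- regmatches(files, regexec(pattern, files))

parsed <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
colnames(parsed) <- c("file", "i5", "i7", "plex", "sample_no", "lane", "read")

# Sublibrary prefix: everything before the lane and read fields.
parsed$sublibrary <- sub("_L[0-9]+_R[12]_001\\.fastq\\.gz$", "", parsed$file)

# Lane files to combine into one file per sublibrary and read direction.
groups <- split(parsed$file, list(parsed$sublibrary, parsed$read), drop = TRUE)
lengths(groups)
```

The `sublibrary` prefix produced here has the same form as the names listed in data/clone_filter_file_names.csv, and each group holds the lane files that would be concatenated into a single file for that read direction.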
See data/clone_filter_file_names.csv. Preview:
readLines("data/clone_filter_file_names.csv", 10)
## [1] "AMH_macro_1_1_12px_S1" "AMH_macro_1_10_8px_S10" "AMH_macro_1_11_8px_S11"
## [4] "AMH_macro_1_12_8px_S12" "AMH_macro_1_13_8px_S13" "AMH_macro_1_14_8px_S14"
## [7] "AMH_macro_1_2_12px_S2" "AMH_macro_1_3_12px_S3" "AMH_macro_1_4_12px_S4"
## [10] "AMH_macro_1_5_8px_S5"