Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
  • Loading branch information
avanlinden committed Apr 3, 2024
2 parents b4ed364 + 0699b14 commit 295b0e8
Show file tree
Hide file tree
Showing 11 changed files with 196 additions and 211 deletions.
5 changes: 2 additions & 3 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -29,9 +29,8 @@ vignettes/*.pdf
# R project
*.Rproj
# downloaded files
code-tutorials/files/*
files/*
DESeq2_tutorial_R/outputs/*
files/
outputs/
.DS_Store

/.quarto/
144 changes: 69 additions & 75 deletions 5XFAD_data_R_tutorial.qmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: 'AD Knowledge Portal Workshop: Download and explore 5XFAD mouse data in RStudio'
author: "Laura Heath & Abby Vander Linden, Sage Bionetworks"
author: "Abby Vander Linden & Laura Heath, Sage Bionetworks"
date: "`r Sys.Date()`"
format:
html:
Expand Down Expand Up @@ -31,28 +31,22 @@ If you haven't already, install `synapser` (the [Synapse R client](https://r-doc

```{r install-synapser, eval = FALSE}
# install synapser
install.packages("synapser", repos = c("http://ran.synapse.org", "http://cran.fhcrc.org"))
install.packages("synapser", repos = c("http://ran.synapse.org"))
# install tidyverse if you don't already have it
install.packages("tidyverse")
```

We will also use the BioconductoR package manager to install `biomaRt`, which will help us with gene count data later.

```{r install-biomaRt, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
install.packages("dplyr", "purrr", "readr", "lubridate", "stringr", "tibble")
```

Load libraries

```{r load-libraries, message=FALSE, warning=FALSE}
```{r load-libraries, message=FALSE, warning=TRUE}
library(synapser)
library(tidyverse)
library(biomaRt)
library(dplyr)
library(purrr)
library(readr)
library(lubridate)
library(stringr)
library(tibble)
```

### Login to Synapse
Expand All @@ -63,7 +57,7 @@ Next, you will need to log in to your Synapse account.

If you are logged into the [Synapse](https://www.synapse.org/) web browser, `synapser` will automatically use your login credentials to log you in during your R session! All you have to do is:

```{r synlogin, results = "hide"}
```{r synlogin, include = FALSE}
synLogin()
```
Expand Down Expand Up @@ -95,7 +89,7 @@ This filters the table to a single file. In the "Id" column for this `htseqcount

![](images/Screenshot%202023-05-07%20at%2011.10.31%20AM.png){width="448"}

We can then use that synID to download the file.
We can then use that synID to download the file. Some system about the file and its storage location within Synapse is printed to the R console when we call `synGet`.

```{r single-synGet, results = "hide"}
Expand Down Expand Up @@ -158,18 +152,12 @@ We can see from the `download_table` we got during the bulk download step that w

```{r explore-download-table}
#
download_table %>%
download_table |>
dplyr::select(name, metadataType, assay)
```

We are only interested in RNAseq data, so we will only read in the individual, biospecimen, and RNAseq assay metadata files.

```{r read-data-files, message=FALSE, warning=FALSE}
# counts matrix
counts <- read_tsv("files/htseqcounts_5XFAD.txt", show_col_types = FALSE)
```

##### A note on file versions!

All files in the AD Portal are versioned, meaning that if the file represented by a particular synID changes, a new version will be created. You can access a specific versions by using the `version` argument in `synGet()`. This will come in handy here, because the Jax.IU.Pitt_5XFAD study recently made some changes to their individual metadata file that included an inadvertent error. While our curators sort out that error, we will use a a previous version of this file (version 3) to ensure that our metadata joins properly. More info on versions in the AD Portal and the Synapse platform can be found [here](https://help.synapse.org/docs/Versioning-Files.2667708547.html).
Expand All @@ -190,7 +178,7 @@ ind_meta_object$properties$isLatestVersion

Now we can read all the metadata files in to R as data frames.

```{r}
```{r read-metadata-files}
#individual metadata
ind_meta <- read_csv("files/Jax.IU.Pitt_5XFAD_individual_metadata.csv", show_col_types = FALSE)
Expand All @@ -206,12 +194,17 @@ Let's examine the data and metadata files a bit before we begin our analyses.

#### Counts data

```{r read-data-files, message=FALSE, warning=FALSE}
# counts matrix
counts <- read_tsv("files/htseqcounts_5XFAD.txt", show_col_types = FALSE)
```

```{r view-counts}
# Calling a tibble object will print the first ten rows in a nice tidy output; doing the same for a base R dataframe will print the whole thing until it runs out of memory. If you want to inspect a large dataframe, use `head(df)`
head(counts)
```

The data file has a column of ENSEMBL gene ids and then a bunch of columns with count data, where the column headers correspond to the specimenIDs. These specimenIDs should all be in the RNAseq assay metadata file, so let's check.
The counts data file has a column of ENSEMBL gene ids and then a bunch of columns with count data, where the column headers correspond to the specimenIDs. These specimenIDs should all be in the RNAseq assay metadata file, so let's check.

```{r view-assay}
# what does the RNAseq assay metadata look like?
Expand Down Expand Up @@ -273,12 +266,12 @@ distinct(ind_meta, genotype)

#### Joining metadata

We use the three-file structure for our metadata because it allows us to store metadata for each study in a tidy format. Every line in the assay and biospecimen files represents a unique specimen, and every line in the individual file represents a unique individual. This means the files can be easily joined by specimenID and individualID to get all levels of metadata that apply to a particular data file. We will use the `left_join()` function from the `dplyr` package, and the `%>%` operator from the `magrittr` package. *If you are unfamiliar with the pipe, think of it as a shorthand for "take this (the preceding object) and do that (the subsequent command)". See [here](https://magrittr.tidyverse.org/) for more info on piping in R.*
We use the three-file structure for our metadata because it allows us to store metadata for each study in a tidy format. Every line in the assay and biospecimen files represents a unique specimen, and every line in the individual file represents a unique individual. This means the files can be easily joined by specimenID and individualID to get all levels of metadata that apply to a particular data file. We will use the `left_join()` function from the `dplyr` package, and the base R pipe operator `|>`. *If you are unfamiliar with the pipe, think of it as a shorthand for "take this (the preceding object) and do that (the subsequent command)". See [here](https://r4ds.hadley.nz/data-transform.html#sec-the-pipe) for more info on piping in R.*

```{r join-metadata}
# join all the rows in the assay metadata that have a match in the biospecimen metadata
joined_meta <- rna_meta %>% #start with the rnaseq assay metadata
left_join(bio_meta, by = "specimenID") %>% #join rows from biospecimen that match specimenID
joined_meta <- rna_meta |> #start with the rnaseq assay metadata
left_join(bio_meta, by = "specimenID") |> #join rows from biospecimen that match specimenID
left_join(ind_meta, by = "individualID") # join rows from individual that match individualID
joined_meta
Expand Down Expand Up @@ -313,7 +306,7 @@ The file annotations let us see which study the file is associated with (Jax.IU.
```{r join-annotations-to-metadata}
# find records belonging to the individual this file maps to in our joined metadata
joined_meta %>%
joined_meta |>
filter(individualID == fastq_annotations$individualID[[1]])
```

Expand Down Expand Up @@ -350,7 +343,7 @@ Once you've downloaded all the files in the `id` column, you can link those file
fastq <- synGet(random_fastq, downloadFile = FALSE)
# filter the annotations table to rows that match the fastq filename
annotations_table %>%
annotations_table |>
filter(name == fastq$properties$name)
```

Expand All @@ -362,19 +355,19 @@ If we look at the processed data files in the table of 5XFAD RNAseq file annotat

```{r filter-multispecimen-files}
#
annotations_table %>%
filter(fileFormat == "txt") %>%
annotations_table |>
filter(fileFormat == "txt") |>
dplyr::select(name, individualID, specimenID, isMultiSpecimen)
```

The multispecimen file should contain a row or column of specimenIDs that correspond to the specimenIDs used in a study's metadata, as we have seen with the 5XFAD counts file.

```{r join-multispecimen-metadata}
# In this example, we take a slice of the counts data to reduce computation, transpose it so that each row represents a single specimen, and then join it to the joined metadata by the specimenID
counts %>%
slice_head(n = 5) %>%
t() %>%
as_tibble(rownames = "specimenID") %>%
counts |>
slice_head(n = 5) |>
t() |>
as_tibble(rownames = "specimenID") |>
left_join(joined_meta, by = "specimenID")
```

Expand Down Expand Up @@ -409,10 +402,10 @@ The MODEL-AD individual mouse metadata contains columns with birth date and deat
library(lubridate)
# convert columns of strings to month-date-year format
joined_meta_time <- joined_meta %>%
mutate(dateBirth = mdy(dateBirth), dateDeath = mdy(dateDeath)) %>%
joined_meta_time <- joined_meta |>
mutate(dateBirth = mdy(dateBirth), dateDeath = mdy(dateDeath)) |>
# create a new column that subtracts dateBirth from dateDeath in days, then divide by 30 to get months
mutate(timepoint = as.numeric(difftime(dateDeath, dateBirth, units ="days"))/30) %>%
mutate(timepoint = as.numeric(difftime(dateDeath, dateBirth, units ="days"))/30) |>
# convert numeric ages to timepoint categories
mutate(timepoint = case_when(timepoint > 10 ~ "12 mo",
timepoint < 10 & timepoint > 5 ~ "6 mo",
Expand All @@ -423,8 +416,8 @@ We now have balanced samples across sex, genotype, and age:

```{r group-metadata-by-covars}
# check that the timepoint column looks ok
joined_meta_time %>%
group_by(sex, genotype, timepoint) %>%
joined_meta_time |>
group_by(sex, genotype, timepoint) |>
count()
```

Expand All @@ -434,7 +427,7 @@ To reduce the width of the dataframe, we will subset only the columns that conta

```{r subset-covars}
# many packages have a "select" function that masks dplyr so we have to specify
covars <- joined_meta_time %>%
covars <- joined_meta_time |>
dplyr::select(individualID, specimenID, sex, genotype, timepoint)
# check the result
Expand All @@ -447,7 +440,7 @@ Return to the gene counts matrix we read in earlier.

```{r check-non-mouse-genes}
# check how many gene_ids are NOT from the mouse genome by searching for the string "MUS" (as in Mus musculus) in the gene_id column
counts %>%
counts |>
filter(!str_detect(gene_id, "MUS"))
```
Expand All @@ -458,6 +451,18 @@ counts %>%

OPTIONAL: Transform the ensemblIDs in the matrix to common gene names, using the R package `biomaRt` (note: must specify to use the mouse database, although the two genes in the 5XFAD model we identified above are humanized and won't be translated by the program).

For this option, use the BioconductoR package manager to install `biomaRt`.

```{r install-biomaRt, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)
```

**The next two code chunks will not automatically execute in this notebook because they can take a long time -- the code is included if you'd like to try this on your own.**

We will use the two custom functions below to convert ensemblIDs to gene names:
Expand Down Expand Up @@ -494,17 +499,15 @@ Call the `Make.Gene.Symb()` function to add a new column with short gene names t

```{r convert-gene-ids, eval=FALSE, include=FALSE}
# use the mutate function from dplyr
named_counts <- counts %>%
named_counts <- counts |>
mutate(gene_name = Make.Gene.Symb(gene_id))
```

##### Pre-translated option:

For this demonstration, instead of running biomaRt, which can be unreliable at times and take a long time to render, we will append a dataframe to our counts matrix with short gene names already translated

```{r}
#ensembl_to_gene <- subset(named_counts, select=c(gene_id, gene_name))
#write.csv(ensembl_to_gene, file="ensembl_translation_key.csv", row.names=FALSE)
```{r read-preconverted-ensembl-counts}
ensembl_to_gene <- read.csv(file="ensembl_translation_key.csv")
named_counts <- dplyr::left_join(counts, ensembl_to_gene, by="gene_id")
Expand All @@ -527,9 +530,9 @@ named_counts[1, "gene_name"] <- "PSEN1"
named_counts[2, "gene_name"] <- "APP"
#make all gene names unique and remove unneeded column
named_counts <- named_counts %>%
mutate(gene_name = make.unique(gene_name)) %>%
dplyr::select(-gene_id) %>%
named_counts <- named_counts |>
mutate(gene_name = make.unique(gene_name)) |>
dplyr::select(-gene_id) |>
column_to_rownames(var = "gene_name")
```
Expand All @@ -540,9 +543,9 @@ Now we can transpose the dataframe so that each row contains count data cross al

```{r transpose-counts}
#
counts_tposed <- named_counts %>%
t() %>% #transposing forces the df to a matrix
as_tibble(rownames = "specimenID") %>% #reconvert to tibble and specify rownames
counts_tposed <- named_counts |>
t() |> #transposing forces the df to a matrix
as_tibble(rownames = "specimenID") |> #reconvert to tibble and specify rownames
left_join(covars, by = "specimenID") #join covariates by specimenID
```

Expand All @@ -562,26 +565,26 @@ counts_tposed$timepoint <- factor(counts_tposed$timepoint, levels=c("4 mo","6 mo

Use ggplot to plot gene counts for each specimen by age, sex, and genotype.

```{r plot-trem2}
```{r plot-app}
# load ggplot2
library(ggplot2)
#Look at Trem2 levels
g <- counts_tposed %>%
ggplot(aes(x=timepoint, y=Trem2, color=genotype)) +
#Look at APP levels -- this model is the 5X FAD mutant, so we expect it to be high!
g <- counts_tposed |>
ggplot(aes(x=timepoint, y=APP, color=genotype)) +
geom_boxplot() +
geom_point(position=position_jitterdodge()) +
facet_wrap(~sex)
g
```

Examine any gene of interest by setting the y argument in the `ggplot(aes()` mapping equal to the gene name. Ex: `y = Cst7`
Examine any gene of interest by setting the y argument in the `ggplot(aes()` mapping equal to the gene name. Ex: `y = Trem2`

```{r plot-Cst7}
```{r plot-Trem2}
#
g <- counts_tposed %>%
ggplot(aes(x=timepoint, y=Cst7, color=genotype)) +
g <- counts_tposed |>
ggplot(aes(x=timepoint, y=Trem2, color=genotype)) +
geom_boxplot() +
geom_point(position=position_jitterdodge()) +
facet_wrap(~sex)
Expand All @@ -593,7 +596,7 @@ Ex: `y = Apoe`

```{r plot-Apoe}
#
g <- counts_tposed %>%
g <- counts_tposed |>
ggplot(aes(x=timepoint, y=Apoe, color=genotype)) +
geom_boxplot() +
geom_point(position=position_jitterdodge()) +
Expand All @@ -602,15 +605,6 @@ g <- counts_tposed %>%
g
```

Ex: `y = Kirrel2`
\-\--

```{r plot-Kirrel2}
#
g <- counts_tposed %>%
ggplot(aes(x=timepoint, y=Kirrel2, color=genotype)) +
geom_boxplot() +
geom_point(position=position_jitterdodge()) +
facet_wrap(~sex)
g
```
![](images/ADKP_logo.png){width="236"}
9 changes: 5 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,13 @@

- [Working with AD portal Data in R](https://sage-bionetworks.github.io/ADPortalWorkshops/5XFAD_data_R_tutorial.html)
- [Working with AD portal Data in python](https://sage-bionetworks.github.io/ADPortalWorkshops/5XFADdata_python_tutorial.html)
- [Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.nb.html)
- [Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.html)

### Versions of the tutorials you can download and run interactively from RStudio or VSCode

- [Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.nb.html)

- [Quarto notebook - Working with AD portal Data in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/5XFAD_data_R_tutorial.qmd)
- [Quarto notebook - Working with AD portal Data in python](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/5XFADdata_python_tutorial.qmd)
- [R notebook - Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.nb.html)

### Next upcoming workshop: July 14, 2023 at AAIC!

[Synapse workshop project](https://www.synapse.org/#!Synapse:syn26165425/wiki/612752)
Loading

0 comments on commit 295b0e8

Please sign in to comment.