Merge branch 'main' of https://github.com/Sage-Bionetworks/ADPortalWo…

…rkshops
Sage-Bionetworks · Apr 3, 2024 · 295b0e8 · 295b0e8
2 parents b4ed364 + 0699b14
commit 295b0e8
Show file tree

Hide file tree

Showing 11 changed files with 196 additions and 211 deletions.
diff --git a/.gitignore b/.gitignore
@@ -29,9 +29,8 @@ vignettes/*.pdf
 # R project
 *.Rproj
 # downloaded files
-code-tutorials/files/*
-files/*
-DESeq2_tutorial_R/outputs/*
+files/
+outputs/
 .DS_Store
 
 /.quarto/
diff --git a/5XFAD_data_R_tutorial.qmd b/5XFAD_data_R_tutorial.qmd
@@ -1,6 +1,6 @@
 ---
 title: 'AD Knowledge Portal Workshop: Download and explore 5XFAD mouse data in RStudio'
-author: "Laura Heath & Abby Vander Linden, Sage Bionetworks"
+author: "Abby Vander Linden & Laura Heath, Sage Bionetworks"
 date: "`r Sys.Date()`"
 format: 
   html: 
@@ -31,28 +31,22 @@ If you haven't already, install `synapser` (the [Synapse R client](https://r-doc
 
 ```{r install-synapser, eval = FALSE}
 # install synapser
-install.packages("synapser", repos = c("http://ran.synapse.org", "http://cran.fhcrc.org"))
+install.packages("synapser", repos = c("http://ran.synapse.org"))
 
 # install tidyverse if you don't already have it
-install.packages("tidyverse")
-```
-
-We will also use the BioconductoR package manager to install `biomaRt`, which will help us with gene count data later.
-
-```{r install-biomaRt, eval = FALSE}
-
-if (!requireNamespace("BiocManager", quietly = TRUE))
-    install.packages("BiocManager")
-
-BiocManager::install("biomaRt")
+install.packages("dplyr", "purrr", "readr", "lubridate", "stringr", "tibble")
 ```
 
 Load libraries
 
-```{r load-libraries, message=FALSE, warning=FALSE}
+```{r load-libraries, message=FALSE, warning=TRUE}
 library(synapser)
-library(tidyverse)
-library(biomaRt)
+library(dplyr)
+library(purrr)
+library(readr)
+library(lubridate)
+library(stringr)
+library(tibble)
 ```
 
 ### Login to Synapse
@@ -63,7 +57,7 @@ Next, you will need to log in to your Synapse account.
 
 If you are logged into the [Synapse](https://www.synapse.org/) web browser, `synapser` will automatically use your login credentials to log you in during your R session! All you have to do is:
 
-```{r synlogin, results = "hide"}
+```{r synlogin, include = FALSE}
 
 synLogin()
 ```
@@ -95,7 +89,7 @@ This filters the table to a single file. In the "Id" column for this `htseqcount
 
 ![](images/Screenshot%202023-05-07%20at%2011.10.31%20AM.png){width="448"}
 
-We can then use that synID to download the file.
+We can then use that synID to download the file. Some system about the file and its storage location within Synapse is printed to the R console when we call `synGet`.
 
 ```{r single-synGet, results = "hide"}
 
@@ -158,18 +152,12 @@ We can see from the `download_table` we got during the bulk download step that w
 
 ```{r explore-download-table}
 #
-download_table %>% 
+download_table |> 
   dplyr::select(name, metadataType, assay)
 ```
 
 We are only interested in RNAseq data, so we will only read in the individual, biospecimen, and RNAseq assay metadata files.
 
-```{r read-data-files, message=FALSE, warning=FALSE}
-# counts matrix
-counts <- read_tsv("files/htseqcounts_5XFAD.txt", show_col_types = FALSE)
-
-```
-
 ##### A note on file versions!
 
 All files in the AD Portal are versioned, meaning that if the file represented by a particular synID changes, a new version will be created. You can access a specific versions by using the `version` argument in `synGet()`. This will come in handy here, because the Jax.IU.Pitt_5XFAD study recently made some changes to their individual metadata file that included an inadvertent error. While our curators sort out that error, we will use a a previous version of this file (version 3) to ensure that our metadata joins properly. More info on versions in the AD Portal and the Synapse platform can be found [here](https://help.synapse.org/docs/Versioning-Files.2667708547.html).
@@ -190,7 +178,7 @@ ind_meta_object$properties$isLatestVersion
 
 Now we can read all the metadata files in to R as data frames.
 
-```{r}
+```{r read-metadata-files}
 #individual metadata
 ind_meta <- read_csv("files/Jax.IU.Pitt_5XFAD_individual_metadata.csv", show_col_types = FALSE)
 
@@ -206,12 +194,17 @@ Let's examine the data and metadata files a bit before we begin our analyses.
 
 #### Counts data
 
+```{r read-data-files, message=FALSE, warning=FALSE}
+# counts matrix
+counts <- read_tsv("files/htseqcounts_5XFAD.txt", show_col_types = FALSE)
+```
+
 ```{r view-counts}
 # Calling a tibble object will print the first ten rows in a nice tidy output; doing the same for a base R dataframe will print the whole thing until it runs out of memory. If you want to inspect a large dataframe, use `head(df)`
 head(counts)
 ```
 
-The data file has a column of ENSEMBL gene ids and then a bunch of columns with count data, where the column headers correspond to the specimenIDs. These specimenIDs should all be in the RNAseq assay metadata file, so let's check.
+The counts data file has a column of ENSEMBL gene ids and then a bunch of columns with count data, where the column headers correspond to the specimenIDs. These specimenIDs should all be in the RNAseq assay metadata file, so let's check.
 
 ```{r view-assay}
 # what does the RNAseq assay metadata look like?
@@ -273,12 +266,12 @@ distinct(ind_meta, genotype)
 
 #### Joining metadata
 
-We use the three-file structure for our metadata because it allows us to store metadata for each study in a tidy format. Every line in the assay and biospecimen files represents a unique specimen, and every line in the individual file represents a unique individual. This means the files can be easily joined by specimenID and individualID to get all levels of metadata that apply to a particular data file. We will use the `left_join()` function from the `dplyr` package, and the `%>%` operator from the `magrittr` package. *If you are unfamiliar with the pipe, think of it as a shorthand for "take this (the preceding object) and do that (the subsequent command)". See [here](https://magrittr.tidyverse.org/) for more info on piping in R.*
+We use the three-file structure for our metadata because it allows us to store metadata for each study in a tidy format. Every line in the assay and biospecimen files represents a unique specimen, and every line in the individual file represents a unique individual. This means the files can be easily joined by specimenID and individualID to get all levels of metadata that apply to a particular data file. We will use the `left_join()` function from the `dplyr` package, and the base R pipe operator `|>`. *If you are unfamiliar with the pipe, think of it as a shorthand for "take this (the preceding object) and do that (the subsequent command)". See [here](https://r4ds.hadley.nz/data-transform.html#sec-the-pipe) for more info on piping in R.*
 
 ```{r join-metadata}
 # join all the rows in the assay metadata that have a match in the biospecimen metadata
-joined_meta <- rna_meta %>% #start with the rnaseq assay metadata
-  left_join(bio_meta, by = "specimenID") %>%  #join rows from biospecimen that match specimenID 
+joined_meta <- rna_meta |> #start with the rnaseq assay metadata
+  left_join(bio_meta, by = "specimenID") |>  #join rows from biospecimen that match specimenID 
   left_join(ind_meta, by = "individualID") # join rows from individual that match individualID
 
 joined_meta
@@ -313,7 +306,7 @@ The file annotations let us see which study the file is associated with (Jax.IU.
 ```{r join-annotations-to-metadata}
 
 # find records belonging to the individual this file maps to in our joined metadata
-joined_meta %>% 
+joined_meta |> 
   filter(individualID == fastq_annotations$individualID[[1]])
 ```
 
@@ -350,7 +343,7 @@ Once you've downloaded all the files in the `id` column, you can link those file
 fastq <- synGet(random_fastq, downloadFile = FALSE)
 
 # filter the annotations table to rows that match the fastq filename
-annotations_table %>% 
+annotations_table |> 
   filter(name == fastq$properties$name)
 ```
 
@@ -362,19 +355,19 @@ If we look at the processed data files in the table of 5XFAD RNAseq file annotat
 
 ```{r filter-multispecimen-files}
 #
-annotations_table %>% 
-  filter(fileFormat == "txt") %>% 
+annotations_table |> 
+  filter(fileFormat == "txt") |> 
   dplyr::select(name, individualID, specimenID, isMultiSpecimen)
 ```
 
 The multispecimen file should contain a row or column of specimenIDs that correspond to the specimenIDs used in a study's metadata, as we have seen with the 5XFAD counts file.
 
 ```{r join-multispecimen-metadata}
 # In this example, we take a slice of the counts data to reduce computation, transpose it so that each row represents a single specimen, and then join it to the joined metadata by the specimenID
-counts %>% 
-  slice_head(n = 5) %>% 
-  t() %>% 
-  as_tibble(rownames = "specimenID") %>% 
+counts |> 
+  slice_head(n = 5) |> 
+  t() |> 
+  as_tibble(rownames = "specimenID") |> 
   left_join(joined_meta, by = "specimenID")
 ```
 
@@ -409,10 +402,10 @@ The MODEL-AD individual mouse metadata contains columns with birth date and deat
 library(lubridate)
 
 # convert columns of strings to month-date-year format
-joined_meta_time <- joined_meta %>% 
-  mutate(dateBirth = mdy(dateBirth), dateDeath = mdy(dateDeath)) %>% 
+joined_meta_time <- joined_meta |> 
+  mutate(dateBirth = mdy(dateBirth), dateDeath = mdy(dateDeath)) |> 
   # create a new column that subtracts dateBirth from dateDeath in days, then divide by 30 to get months
-  mutate(timepoint = as.numeric(difftime(dateDeath, dateBirth, units ="days"))/30) %>% 
+  mutate(timepoint = as.numeric(difftime(dateDeath, dateBirth, units ="days"))/30) |> 
   # convert numeric ages to timepoint categories
   mutate(timepoint = case_when(timepoint > 10 ~ "12 mo",
                                timepoint < 10 & timepoint > 5 ~ "6 mo",
@@ -423,8 +416,8 @@ We now have balanced samples across sex, genotype, and age:
 
 ```{r group-metadata-by-covars}
 # check that the timepoint column looks ok
-joined_meta_time %>% 
-  group_by(sex, genotype, timepoint) %>% 
+joined_meta_time |> 
+  group_by(sex, genotype, timepoint) |> 
   count()
 ```
 
@@ -434,7 +427,7 @@ To reduce the width of the dataframe, we will subset only the columns that conta
 
 ```{r subset-covars}
 # many packages have a "select" function that masks dplyr so we have to specify
-covars <- joined_meta_time %>% 
+covars <- joined_meta_time |> 
   dplyr::select(individualID, specimenID, sex, genotype, timepoint)
 
 # check the result
@@ -447,7 +440,7 @@ Return to the gene counts matrix we read in earlier.
 
 ```{r check-non-mouse-genes}
 # check how many gene_ids are NOT from the mouse genome by searching for the string "MUS" (as in Mus musculus) in the gene_id column
-counts %>% 
+counts |> 
   filter(!str_detect(gene_id, "MUS"))
 
 ```
@@ -458,6 +451,18 @@ counts %>%
 
 OPTIONAL: Transform the ensemblIDs in the matrix to common gene names, using the R package `biomaRt` (note: must specify to use the mouse database, although the two genes in the 5XFAD model we identified above are humanized and won't be translated by the program).
 
+For this option, use the BioconductoR package manager to install `biomaRt`.
+
+```{r install-biomaRt, eval = FALSE}
+
+if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
+
+BiocManager::install("biomaRt")
+
+library(biomaRt)
+```
+
 **The next two code chunks will not automatically execute in this notebook because they can take a long time -- the code is included if you'd like to try this on your own.**
 
 We will use the two custom functions below to convert ensemblIDs to gene names:
@@ -494,17 +499,15 @@ Call the `Make.Gene.Symb()` function to add a new column with short gene names t
 
 ```{r convert-gene-ids, eval=FALSE, include=FALSE}
 # use the mutate function from dplyr
-named_counts <- counts %>% 
+named_counts <- counts |> 
   mutate(gene_name = Make.Gene.Symb(gene_id))
 ```
 
 ##### Pre-translated option:
 
 For this demonstration, instead of running biomaRt, which can be unreliable at times and take a long time to render, we will append a dataframe to our counts matrix with short gene names already translated
 
-```{r}
-#ensembl_to_gene <- subset(named_counts, select=c(gene_id, gene_name))
-#write.csv(ensembl_to_gene, file="ensembl_translation_key.csv", row.names=FALSE)
+```{r read-preconverted-ensembl-counts}
 
 ensembl_to_gene <- read.csv(file="ensembl_translation_key.csv")
 named_counts <- dplyr::left_join(counts, ensembl_to_gene, by="gene_id")
@@ -527,9 +530,9 @@ named_counts[1, "gene_name"] <- "PSEN1"
 named_counts[2, "gene_name"] <- "APP"
 
 #make all gene names unique and remove unneeded column
-named_counts <- named_counts %>% 
-  mutate(gene_name = make.unique(gene_name)) %>% 
-  dplyr::select(-gene_id) %>% 
+named_counts <- named_counts |> 
+  mutate(gene_name = make.unique(gene_name)) |> 
+  dplyr::select(-gene_id) |> 
   column_to_rownames(var = "gene_name")
 
 ```
@@ -540,9 +543,9 @@ Now we can transpose the dataframe so that each row contains count data cross al
 
 ```{r transpose-counts}
 #
-counts_tposed <- named_counts %>% 
-  t() %>%  #transposing forces the df to a matrix
-  as_tibble(rownames = "specimenID") %>% #reconvert to tibble and specify rownames
+counts_tposed <- named_counts |> 
+  t() |>  #transposing forces the df to a matrix
+  as_tibble(rownames = "specimenID") |> #reconvert to tibble and specify rownames
   left_join(covars, by = "specimenID") #join covariates by specimenID
 ```
 
@@ -562,26 +565,26 @@ counts_tposed$timepoint <- factor(counts_tposed$timepoint, levels=c("4 mo","6 mo
 
 Use ggplot to plot gene counts for each specimen by age, sex, and genotype.
 
-```{r plot-trem2}
+```{r plot-app}
 # load ggplot2
 library(ggplot2)
 
-#Look at Trem2 levels
-g <- counts_tposed %>% 
-  ggplot(aes(x=timepoint, y=Trem2, color=genotype)) +
+#Look at APP levels -- this model is the 5X FAD mutant, so we expect it to be high!
+g <- counts_tposed |> 
+  ggplot(aes(x=timepoint, y=APP, color=genotype)) +
     geom_boxplot() + 
     geom_point(position=position_jitterdodge()) +
     facet_wrap(~sex)
 
 g
 ```
 
-Examine any gene of interest by setting the y argument in the `ggplot(aes()` mapping equal to the gene name. Ex: `y = Cst7`
+Examine any gene of interest by setting the y argument in the `ggplot(aes()` mapping equal to the gene name. Ex: `y = Trem2`
 
-```{r plot-Cst7}
+```{r plot-Trem2}
 #
-g <- counts_tposed %>% 
-  ggplot(aes(x=timepoint, y=Cst7, color=genotype)) +
+g <- counts_tposed |> 
+  ggplot(aes(x=timepoint, y=Trem2, color=genotype)) +
     geom_boxplot() + 
     geom_point(position=position_jitterdodge()) +
     facet_wrap(~sex)
@@ -593,7 +596,7 @@ Ex: `y = Apoe`
 
 ```{r plot-Apoe}
 #
-g <- counts_tposed %>% 
+g <- counts_tposed |> 
   ggplot(aes(x=timepoint, y=Apoe, color=genotype)) +
     geom_boxplot() + 
     geom_point(position=position_jitterdodge()) +
@@ -602,15 +605,6 @@ g <- counts_tposed %>%
 g
 ```
 
-Ex: `y = Kirrel2`
+\-\--
 
-```{r plot-Kirrel2}
-#
-g <- counts_tposed %>% 
-  ggplot(aes(x=timepoint, y=Kirrel2, color=genotype)) +
-    geom_boxplot() + 
-    geom_point(position=position_jitterdodge()) +
-    facet_wrap(~sex)
-
-g
-```
+![](images/ADKP_logo.png){width="236"}
diff --git a/README.md b/README.md
@@ -6,12 +6,13 @@
 
 -   [Working with AD portal Data in R](https://sage-bionetworks.github.io/ADPortalWorkshops/5XFAD_data_R_tutorial.html)
 -   [Working with AD portal Data in python](https://sage-bionetworks.github.io/ADPortalWorkshops/5XFADdata_python_tutorial.html)
--   [Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.nb.html)
+-   [Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.html)
 
 ### Versions of the tutorials you can download and run interactively from RStudio or VSCode
-
--   [Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.nb.html)
-
+-   [Quarto notebook - Working with AD portal Data in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/5XFAD_data_R_tutorial.qmd)
+-   [Quarto notebook - Working with AD portal Data in python](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/5XFADdata_python_tutorial.qmd)
+-   [R notebook - Differential expression with DESeq2 in R](https://github.com/Sage-Bionetworks/ADPortalWorkshops/blob/main/Workshop2_DESeq.nb.html)
+
 ### Next upcoming workshop: July 14, 2023 at AAIC!
 
 [Synapse workshop project](https://www.synapse.org/#!Synapse:syn26165425/wiki/612752)