Merge pull request #72 from ARTbio/RNAseqIOC-1

RNAseq ioc 1
ARTbio · Sep 14, 2023 · 0e0eb4e
2 parents af44065 + fb4b583
commit 0e0eb4e

Showing 69 changed files with 1,567 additions and 0 deletions.
diff --git a/docs/bulk_RNAseq-IOC/01_IOC_RNAseq.md b/docs/bulk_RNAseq-IOC/01_IOC_RNAseq.md
@@ -0,0 +1,55 @@
+##  Introduction to the IOC ARTbio 064: Bulk RNAseq Analyses
+**November 2023**
+
+In this Interactive Online Companionship (IOC), we will train to perform analyses
+of bulk RNAseq data.
+
+### Program / Schedule
+
+### Week 1
+
+  **3-hour Zoom video-conference with:**
+
+  1. Introduction of the Companions and Instructors (10 min)
+  - Presentation of the IOC general workflow (Scheme) (15 min)
+  - Presentation of the IOC tools (2 hours)
+      1. Zoom (5 min)
+      - Starbio (5 min)
+      - Slack (10 min)
+      - GitHub (20 min)
+      - Psilo storage (15 min)
+      - Galaxy (65 min)
+<!-- At this point we are at 2:25; make a schedule in Google Sheets -->
+<ol start=4>
+  <li> Import data from Psilo to Galaxy
+  <li> Program of the week 2
+  <ol start="a">
+    <li> Presentation of exercises with digital tools
+    <li> Presentation of pretreatment and metadata organisation, and of the related tasks to be done
+  </ol>
+</ol>
+
+### Week 2
+1. Questions on Week 2
+    1. Data upload
+    2. Quality control
+- Program of Week 3
+    1. Reference datasets (GTF, genome, subset, UCSC tables, Ensembl BioMart)
+### Week 3
+2. Questions on Week 2
+    1. Reference
+    - GTF manipulation
+- Program of the Week 3
+    1. Mapping and mappers
+    2. Inspection of BAM files
+
+3. Analysis of the differential gene expression
+    1. Count the number of reads per annotated gene
+    2. Viewing datasets side by side using the Scratchbook
+    3. Identification of the differentially expressed features
+    4. Visualization of the differentially expressed genes
+    5. Analysis of functional enrichment among the differentially expressed genes
+
+Some parts of this IOC were inspired by
+[Reference-based RNAseq analysis](https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html)
+of the Galaxy Training Network (GTN).
diff --git a/docs/bulk_RNAseq-IOC/Cutadapt.md b/docs/bulk_RNAseq-IOC/Cutadapt.md
@@ -0,0 +1,64 @@
+![](images/galaxylogo.png)
+
+# Filtering datasets to remove or trim low quality sequences
+
+## This step is optional and should be performed by 50% of attendees.
+
+## Cutadapt with single reads ![](images/tool_small.png)
+
+----
+1. Create a new history `Cutadapt` (`wheel` --> `Create New`) ![](images/wheel.png)
+2. Copy the fastq files from the RNAseq data library to this new history (`wheel` --> `Copy datasets`)
+3. Select the `Cutadapt` tool
+4. Start by selecting `Single-end` in the `Single-end or Paired-end reads?` menu
+5. Select the multiple datasets button for this menu
+6. Cmd-click (Ctrl-click on Windows/Linux) to select the non-adjacent `single` fastq.gz files (3 datasets)
+7. `Filter Options`
+    - `Minimum length`: 20
+8. `Read Modification Options`
+    - `Quality cutoff`: 20
+9. `Output Options`
+    - `Report`: Yes
+10. Do not change the other available parameters and click `Execute`
+----
+
+## Cutadapt with paired-end reads ![](images/tool_small.png)
+
+----
+Repeat the same procedure as above, but select `Paired-end` in step 4:
+re-run the tool using the re-run button of one Cutadapt dataset and just select `Paired-end`
+instead of `Single-end`.
+
+- You then have two input boxes, one for file #1 and one for file #2.
+
+- In the box `file #1`, click the `multiple datasets` button and carefully select
+the fastq.gz files with the `_1` suffix
+
+- In the box `file #2`, click the `multiple datasets` button and carefully select
+the fastq.gz files with the `_2` suffix
+
+- Do not change the other parameters (they keep the same values as before because
+you used the re-run button).
+
+- Click the `Execute` button
+
+----
+
+## Run MultiQC on Cutadapt jobs ![](images/tool_small.png)
+
+----
+1. Select `MultiQC` tool
+2. Select `Cutadapt/Trim Galore!` in the menu `Which tool was used generate logs?`
+3. Cmd-select the `Report` datasets generated by Cutadapt
+4. Press `Execute`
+5. Now, the boring but essential job: carefully rename the `Output` datasets generated
+by Cutadapt. To do so, use the `Info` button at the bottom of the green dataset
+boxes. ![](images/info.png)
+
+    Example: Rename `Cutadapt on data 10 and data 9: Read 2 Output` to `GSM461181_2_treat_paired.fastq.gz`
+
+6. Trash the 11 original (unfiltered, untrimmed) fastq.gz files. This is important to avoid mixing
+filtered and unfiltered datasets in the next steps.
+----
+
+
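Note on `Cutadapt.md`: the `Quality cutoff` and `Minimum length` settings used in the steps above can be illustrated with a minimal Python sketch. It follows the 3' quality-trimming rule as described in the Cutadapt documentation (subtract the cutoff from each quality, take partial sums from the 3' end, cut where the sum is minimal). The function names `quality_trim_3prime` and `trim_and_filter` are hypothetical, and the real tool may differ in details.

```python
def quality_trim_3prime(qualities, cutoff):
    """Return how many 5' bases to keep after 3' quality trimming.

    Sketch of the documented rule: subtract the cutoff from every quality,
    accumulate partial sums from the 3' end, and cut at the index where
    the partial sum is minimal (no trimming if it never goes below zero).
    """
    running = 0
    best_sum = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            keep = i
    return keep


def trim_and_filter(sequence, qualities, cutoff=20, min_length=20):
    """Trim a read, then discard it (return None) if it became too short."""
    keep = quality_trim_3prime(qualities, cutoff)
    trimmed = sequence[:keep]
    return trimmed if len(trimmed) >= min_length else None
```

With the defaults above (both set to 20, as in the tutorial), a 24-base read whose last 14 bases have quality 2 is trimmed to 10 bases and then discarded by the minimum-length filter.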
diff --git a/docs/bulk_RNAseq-IOC/DEDESeq2.md b/docs/bulk_RNAseq-IOC/DEDESeq2.md
@@ -0,0 +1,36 @@
+![](images/galaxylogo.png)
+
+# `DESeq2`
+
+----
+![](images/tool_small.png)
+
+  1. Let's create a clean fresh history (`wheel` --> `Create New`) and name it DESeq2 ![](images/wheel.png)
+  2. Copy the `.Counts` datasets from your `STAR`/`HISAT2` history to this new history
+  (`wheel` --> `Copy datasets`)
+  3. Select the `DESeq2` tool with the following parameters:
+      1. `how`: Select group tags corresponding to levels
+      2. In `Factor`:
+          1. In `1: Factor`
+              - `Specify a factor name`: Treatment
+              - In `Factor level`:
+                  - In `1: Factor level`:
+                      - `Specify a factor level`: treated
+                      - `Counts file(s)`: the 3 gene count files with `treat` in their name
+                  - In `2: Factor level`:
+                      - `Specify a factor level`: untreated
+                      - `Counts file(s)`: the 4 gene count files with `untreat` in their name
+          2. Click on `Insert Factor` (not on `Insert Factor level`)
+          3. In `2: Factor`
+              - `Specify a factor name`: Sequencing
+              - In `Factor level`:
+                  - In `1: Factor level`:
+                      - `Specify a factor level`: Paired
+                      - `Counts file(s)`: the 4 gene count files with `paired` in their name
+                  - In `2: Factor level`:
+                      - `Specify a factor level`: Single
+                      - `Counts file(s)`: the 3 gene count files with `single` in their name
+      3. `Files have header?`: Yes
+      4. `Output normalized counts table`: Yes
+      5. `Execute`
+
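Note on `DEDESeq2.md`: when picking count files by name, remember that `treat` is also a substring of `untreat`. The following minimal Python sketch assigns the two factor levels by matching whole name tokens. The dataset names are hypothetical, patterned on the `GSM461181_2_treat_paired.fastq.gz` example from the Cutadapt instructions, and the helper names are illustrative only.

```python
from collections import defaultdict

# Hypothetical dataset names; only the treat/untreat and paired/single
# tokens matter for this sketch.
HYPOTHETICAL_NAMES = [
    "s1_untreat_single.counts",
    "s2_untreat_paired.counts",
    "s3_untreat_paired.counts",
    "s4_treat_single.counts",
    "s5_treat_paired.counts",
    "s6_treat_paired.counts",
    "s7_untreat_single.counts",
]


def factor_levels(dataset_name):
    """Return (treatment level, sequencing level) from name tokens."""
    tokens = dataset_name.lower().replace(".", "_").split("_")
    # Match whole tokens so that "untreat" is never mistaken for "treat".
    if "untreat" in tokens:
        treatment = "untreated"
    elif "treat" in tokens:
        treatment = "treated"
    else:
        raise ValueError(f"no treatment token in {dataset_name!r}")
    if "paired" in tokens:
        sequencing = "Paired"
    elif "single" in tokens:
        sequencing = "Single"
    else:
        raise ValueError(f"no sequencing token in {dataset_name!r}")
    return treatment, sequencing


def group_by_level(names):
    """Group dataset names under each level of the two factors."""
    groups = {"Treatment": defaultdict(list), "Sequencing": defaultdict(list)}
    for name in names:
        treatment, sequencing = factor_levels(name)
        groups["Treatment"][treatment].append(name)
        groups["Sequencing"][sequencing].append(name)
    return groups
```

Calling `group_by_level(HYPOTHETICAL_NAMES)` yields 3 treated and 4 untreated files, and 4 paired and 3 single files, matching the counts stated in the tool parameters above.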
diff --git a/docs/bulk_RNAseq-IOC/DE_intro.md b/docs/bulk_RNAseq-IOC/DE_intro.md
@@ -0,0 +1,36 @@
+# Analysis of the differential gene expression using `DESeq2`
+
+![](images/lamp.png)
+
+----
+
+DESeq2 is a widely used tool for Differential Gene Expression (DGE) analysis.
+It takes read counts and combines them into a table (with genes in the rows and samples in the columns).
+Importantly, it applies size factor normalization by:
+
+- Computing, for each gene, the geometric mean of its read counts across all samples
+- Dividing each gene count in a sample by that gene's geometric mean across samples
+- Using the median of these ratios within a sample as that sample’s size factor for normalization
+
+Multiple factors with several levels can then be incorporated in the analysis.
+After normalization we can compare the response of the expression of any gene to
+the presence of different levels of a factor in a statistically reliable way.
+
+In our example, we have samples with two varying factors that can contribute to
+differences in gene expression:
+
+- Treatment (either treated or untreated)
+- Sequencing type (paired-end or single-end)
+
+Here, treatment is the primary factor that we are interested in.
+
+The sequencing type is further information we know about the data that might affect
+the analysis. Multi-factor analysis allows us to assess the effect of the treatment,
+while taking the sequencing type into account too.
+
+```
+We recommend that you add as many factors as you think may affect gene expression in
+your experiment. It can be the sequencing type, as here, but it can also be the
+experimenter (if different people were involved in the library preparation),
+other batch effects, etc…
+```
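Note on `DE_intro.md`: the three size-factor steps listed above can be sketched in plain Python as follows. This is a minimal illustration of the median-of-ratios idea, not DESeq2's actual code. The function name `size_factors` and the input layout (a dict of per-sample count lists in the same gene order) are assumptions made for the example. Genes with a zero count in any sample are skipped, since their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw gene counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Step 1: per-gene geometric mean across samples (computed in log space).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)  # gene excluded from the estimate

    # Steps 2 and 3: per-sample ratios to the geometric mean, then the median.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors
```

For two samples where one has exactly twice the counts of the other, the size factors come out as 1/√2 and √2, so dividing each sample's counts by its size factor brings both to the same scale.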
diff --git a/docs/bulk_RNAseq-IOC/DEseq2visu.md b/docs/bulk_RNAseq-IOC/DEseq2visu.md
@@ -0,0 +1,117 @@
+![](images/galaxylogo.png)
+
+# Visualisation of differential expression
+
+Now we would like to extract the genes most differentially expressed due to the treatment,
+and then visualize them using a heatmap of the normalized counts and also
+of the Z-score for each sample.
+
+We will proceed in several steps:
+
+- Extract the most differentially expressed genes using the DESeq2 summary file
+- Extract the normalized counts for these genes for each sample, using the normalized count file generated by DESeq2
+- Plot the heatmap of the normalized counts
+- Compute the Z score of the normalized counts
+- Plot the heatmap of the Z score of the genes
+
+## Extract the most differentially expressed genes
+
+----
+![](images/tool_small.png)
+
+1. Select the tool `Filter data on any column using simple expressions` to extract genes with a significant change in gene expression (adjusted p-value below 0.05) between treated and untreated samples:
+    1. `Filter`: the DESeq2 result file
+    2. `With following condition`: c7<0.05
+
+The file with the independently filtered results can be used for further downstream analysis:
+it excludes genes with only a few read counts, as these genes will not be considered significantly differentially expressed.
+
+The generated file contains too many genes (632/STAR, ) to give a meaningful heatmap. Therefore, in the next step,
+we will keep only the genes with an absolute fold change > 2 (abs(log2(fold change)) > 1).
+
+----
+![](images/tool_small.png)
+
+1. Select the tool `Filter data on any column using simple expressions`
+    1. `Filter`: the differentially expressed genes (output of previous `Filter` tool)
+    2. `With following condition`: abs(c3)>1
+
+We now have a table with 84/STAR, /HISAT2 lines corresponding to the most differentially expressed genes.
+For each gene, we have its ID, its mean normalized count (averaged over all
+samples from both conditions), its log2FC and other information.
+
+We could plot the log2FC for the different genes, but here we would like to look at a
+heatmap of expression for these genes in the different samples. So we need to extract the
+normalized counts for these genes.
+
+We will join the normalized count table generated by DESeq2 with the table we just generated,
+to keep only the lines corresponding to the most differentially expressed genes.
+
+## Extract the normalized counts of the most differentially expressed genes
+
+----
+![](images/tool_small.png)
+
+- Create a Pasted Entry from the header line of the Filter output:
+
+    1. Copy the header of the final Filter output
+    2. Using the Upload tool select Paste/Fetch data and paste the copied data
+    3. *Set the Type to tabular* and select Start to upload a new Pasted Entry
+
+----
+![](images/tool_small.png)
+
+- Use the Concatenate datasets tool to add this header line to the Filter output:
+    1. Select the `Concatenate datasets tail-to-head` tool
+    2. Select the Pasted Entry dataset
+    3. `+ Insert Dataset`
+    4. Select the final `Filter output`
+
+This ensures that the table of most differentially expressed genes has a header line and can be used in the next step.
+
+----
+![](images/tool_small.png)
+
+- Join the normalized count table generated by DESeq2 with the table we just generated,
+to keep only the lines corresponding to the most differentially expressed genes
+
+    1. select the `Join two Datasets side by side on a specified field` tool
+        - `Join`: the Normalized counts file (output of DESeq2 tool)
+        - `using column`: Column: 1
+        - `with`: most differentially expressed genes (output of the Concatenate tool)
+        - `and column`: Column: 1
+        - `Keep lines of first input that do not join with second input`: No
+        - `Keep the header lines`: Yes
+        
+The generated file has more columns than we need for the heatmap. In addition to the columns
+with mean normalized counts, there is the log2FC and other information.
+We need to remove the extra columns.
+
+----
+![](images/tool_small.png)
+
+- Use the Cut tool to extract the columns with the gene IDs and normalized counts:
+
+    1. Select the `Cut columns from a table` tool
+        - `Cut columns`: c1-c8
+        - `Delimited by`: Tab
+        - `From`: the joined dataset (output of Join two Datasets tool)
+
+We now have a table with 85 lines (a header line plus the 84 most differentially expressed genes)
+and the normalized counts for these genes in the 7 samples.
+
+----
+![](images/tool_small.png)
+
+- Plot the heatmap of the normalized counts of these genes for the samples
+
+    1. Select the `heatmap2` tool to plot the heatmap:
+        - `Input should have column headers`: the generated table (output of Cut tool)
+        - `Data transformation`: **Log2(value+1)** transform my data
+        - `Enable data clustering`: Yes
+        - `Labeling columns and rows`: Label columns and not rows
+        - `Coloring groups`: Blue to white to red
+
+You should obtain something similar to:
+
+![](images/cluster.png)
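Note on `DEseq2visu.md`: the two `Filter` conditions above (`c7<0.05`, then `abs(c3)>1`) and the Z-score step announced in the plan can be sketched in plain Python. This is a minimal illustration, not the Galaxy tools' code. It assumes rows are lists of strings in the DESeq2 result layout used above (column 3 is the log2FC, column 7 the adjusted p-value). The function names are hypothetical, and the use of the sample standard deviation for the Z-score is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev


def filter_deseq2_rows(rows, padj_max=0.05, min_abs_log2fc=1.0):
    """Keep rows with c7 < padj_max and abs(c3) > min_abs_log2fc.

    Rows whose values cannot be parsed as numbers (for example "NA")
    are skipped.
    """
    kept = []
    for row in rows:
        try:
            log2fc = float(row[2])  # c3
            padj = float(row[6])    # c7
        except (ValueError, IndexError):
            continue
        if padj < padj_max and abs(log2fc) > min_abs_log2fc:
            kept.append(row)
    return kept


def z_scores(values):
    """Z-scores of one gene's normalized counts across samples."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (assumption)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]
```

Applied row by row to the normalized count table, `z_scores` puts every gene on a comparable scale, so that the heatmap shows the relative pattern across samples rather than absolute expression levels.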
diff --git a/docs/bulk_RNAseq-IOC/GO-intro.md b/docs/bulk_RNAseq-IOC/GO-intro.md
@@ -0,0 +1,17 @@
+![](images/lamp.png)
+
+# Analysis of functional enrichment among the differentially expressed genes
+
+We have extracted genes that are differentially expressed in treated (Pasilla gene-depleted)
+samples compared to untreated samples. We would like to know if there are categories of
+genes that are enriched among the differentially expressed genes.
+
+Gene Ontology (GO) analysis is widely used to reduce complexity and highlight biological
+processes in genome-wide expression studies.
+
+However, standard methods give biased results on RNA-seq data due to over-detection
+of differential expression for long and highly-expressed transcripts.
+
+The goseq tool provides methods for performing GO analysis of RNA-seq data,
+taking length bias into account. The methods and software used by goseq are equally
+applicable to other category-based tests of RNA-seq data, such as KEGG pathway analysis.