R project for the Bioinformatic Resources course, taught by Alessandro Romanel (A.Y. 2022-2023).
Topic: Analyze the selected dataset, which contains RNA-seq count data extracted from one of the cancer datasets of The Cancer Genome Atlas (TCGA). From the original TCGA data, 50 cases (tumor samples) and 50 controls (normal samples) were randomly selected.
Dataset selected: Thyroid carcinoma
Project developed by:
- Andrea Tonina @iamandreatonina
- Gloria Lugoboni @GloriaLugoboni
- Load the RData file.
- Extract only protein-coding genes.
- Perform differential expression analysis using the `edgeR` package and select up- and down-regulated genes using a p-value cutoff of 0.01, a log fold change ratio > 1.5 for up-regulated genes and < -1.5 for down-regulated genes, and a log CPM > 1.
- Perform gene set enrichment analysis using `clusterProfiler`.
- Visualize one pathway found to be enriched in the up-regulated gene list using `pathview`.
- Identify which transcription factors (TFs) have enriched scores in the promoters of all up-regulated genes.
- Select one of the top enriched TFs, compute the empirical distributions of scores for all PWMs available in MotifDb for the selected TF, and determine for each of them the distribution (log2) threshold cutoff at 99.75%.
- Identify which up-regulated genes have a region in their promoter with binding scores above the computed thresholds for any of the previously selected PWMs.
- Find protein-protein interactions (PPIs) among the differentially expressed genes using the STRING database and export the network in TSV format.
- Import the network using the `igraph` package, then identify and plot the largest connected component (we also decided to use `ggnet2` from the `GGally` package).
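
As a rough illustration of the gene selection criteria listed above, the following base R sketch applies the three thresholds to a small results table. The table and gene names are invented for the example. The column names (`logFC`, `logCPM`, `PValue`) are the ones `edgeR` reports in its `topTags()` output. Applying the 0.01 cutoff to the raw p-value rather than to an adjusted one is an assumption, not something the task description specifies.

```r
# Toy edgeR-style results table (hypothetical genes and values, illustration only).
# Column names mirror those reported by edgeR's topTags(): logFC, logCPM, PValue.
de_table <- data.frame(
  gene   = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  logFC  = c(2.3, -1.9, 0.4, 3.1, -2.5),
  logCPM = c(4.2, 2.8, 5.0, 0.5, 3.3),
  PValue = c(0.001, 0.004, 0.200, 0.002, 0.030),
  stringsAsFactors = FALSE
)

# Thresholds taken from the task description.
p_cutoff      <- 0.01
logfc_cutoff  <- 1.5
logcpm_cutoff <- 1

# Filters shared by both gene sets: significance and minimum expression.
# Assumption: the cutoff is applied to the raw PValue column.
passes_common <- de_table$PValue < p_cutoff & de_table$logCPM > logcpm_cutoff

# Split by the direction of the log fold change.
up_genes   <- de_table$gene[passes_common & de_table$logFC >  logfc_cutoff]
down_genes <- de_table$gene[passes_common & de_table$logFC < -logfc_cutoff]

print(up_genes)
print(down_genes)
```

If the cutoff is meant to apply to adjusted p-values, the same filter can be run on the `FDR` column of the `edgeR` output instead of `PValue`.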