forked from ksiegmund/PM579
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathw5code1-qvalue-analysis.Rmd
91 lines (63 loc) · 3.09 KB
/
w5code1-qvalue-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
---
title: "Estimating FDR with qvalues"
author: "ks"
date: "6/17/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
if(!require("qvalue")){BiocManager::install("qvalue")}
```
## p-values
We load the list of 3,701 p-values from Storey & Tibshirani's analysis of the Hedenfalk et al. data that are provided in the qvalue package.
```{r estfdr}
library(qvalue)
data(hedenfalk)
names(hedenfalk)
```
```{r pvalue}
length(hedenfalk$p)
```
There were 3,170 pvalues. Let's plot them in a histogram.
```{r histp}
hist(hedenfalk$p,probability=TRUE)
abline(1,0,lty=2)
```
The histogram of the p-values shows an excess of small p-values, with the rectangular shape of a uniform distribution for p>0.5.
How many p-values are < 0.05?
```{r p0.05}
sum(hedenfalk$p < 0.05)
```
Using a regular p<0.05 cutoff, we'd find 605 genes differentially expressed. We know that because of multiple testing, we'd expect 3,107 x 5\% = 159 significant tests (26% of 605). That's a high error rate. Let's control the false-discovery rate at 5\% using the Benjamini and Hochberg adjusted p-value.
```{r bhadj}
bhadjp <- p.adjust(hedenfalk$p,method="BH")
sum(bhadjp < 0.05)
```
Using a 5\% false-discovery rate cutoff, we'd find 94 genes significant (Benjamini & Hochberg adjusted p < 0.05).
## q-values
We compute the q-values from the list of pvalues, and create several figures summarizing the results.
```{r qvalue}
qv <- qvalue(hedenfalk$p)
plot(qv)
```
The analysis results are summarized in 4 figures. The top left fits a cubic spline (smooth curve) to the estimates of $\hat{\pi}_0 (\lambda)$. The dots become more variable as $\lambda$ approaches 1, but the estimate $\hat{\pi}_0 (\lambda)$ from the curve is more stable. This curve tells us $\hat{\pi}_0 (1)$ = 0.67.
The top right figure is the # significant tests as a function of the q-value cutoff (estimate of FDR). The large steps for the q-values <0.025 show cutoffs for which a number of tests become significant.
The bottom left figure tells us the q-value for each p-value cutoff (0<p<0.015). If you rejected your tests with unadjusted p<0.015, you would expect 10% of the tests to be false.
The final figure (bottom-right) gives the expected number of false-positive results given the number of top hits (0-300). If you reject 300 tests, you would expect fewer than 30 errors (false-discovery rate < 10%).
The summary function tabulates the results to count significant genes at a series of standard cut-offs.
```{r summaryqvalue}
summary(qv)
```
This gives the estimate that 67\% (pi0) of the genes are not differentially expressed. The row named p-value is the count for the raw (unadjusted) pvalues. We see 605 pvalues are less than 0.05, the same number obtained above. A q-value cutoff of 5% yields 162 significant tests.
We can also compute the number of significant tests directly, using the following command.
```{r qvals}
sum(qv$qvalues <0.05)
```
Since the q-value = BH-adjusted p x $\hat{\pi}_0$, we can check this result.
```{r computeqvals}
sum( bhadjp * qv$pi0 <0.05)
```
Yes, it's the same!
```{r sessioninfo}
sessionInfo()
```