-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathbios6643_01.Rmd
390 lines (234 loc) · 13 KB
/
bios6643_01.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
---
title: "BIOS6643 Longitudinal"
author: "EJC"
subtitle: "L1 Introduction"
institute: Department of Biostatistics & Informatics
output:
beamer_presentation:
theme: Goettingen
colortheme: rose
slide_level: 2
toc: true
keep_tex: true
latex_engine: xelatex
dev: cairo_pdf
fig_caption: no
fig_crop: no
fontsize: 9pt
make149: yes
header-includes:
- \AtBeginSubsection{}
- \AtBeginSection{}
---
```{r setup, include=FALSE, cache=F, message=F, warning=F, results="hide"}
knitr::opts_chunk$set(cache = TRUE,
echo = FALSE,
message = FALSE,
warning = FALSE)
knitr::opts_chunk$set(fig.height = 4,
fig.width = 5,
out.width = '50%',
fig.align='center')
knitr::opts_chunk$set(fig.path = 'figs_L1/',
cache.path = 'cache/')
```
# L1 Introduction
## Learning Objectives
1. Understand why we need special methods
2. Discuss example datasets that are longitudinal
3. Discuss time series vs. longitudinal; formats for longitudinal data
4. Understand the assumptions for longitudinal models
5. Review analyses of longitudinal data with two time points
\vspace{\baselineskip}
### Questions
- What makes longitudinal data different, so that we need special methods?
- What are clustered data?
- What are benefits of longitudinal models? (Or models for clustered data)
- Why are longitudinal methods not used more?
## Longitudinal designs
- Designed experiments and observational studies can be applied to cross-sectional or longitudinal settings. Here, they are defined for the latter.
- A controlled experiment involves an intervention, while an observational study does not.
- In many cases a controlled experiment will have one or more true treatment groups, along with a 'control' group that either receives some type of placebo, or does not receive any treatment.
- **See the course notes for more detail on designed experiments versus observational studies.**
# Time series and longitudinal data
## Time series and longitudinal data
### Time series methods (generally)...
- focus on modeling one process over time (i.e., one observation taken at each time point, across time).
- focus on predicting values of future occurrences.
Generally, time series data can be found everywhere, including: stock prices, temperature, birth and mortality rates, health data for individuals (e.g., blood pressure), just to name a few areas.
### Longitudinal methods (generally)...
- Involve measurements on multiple subjects.
- Assume that the correlation structure is the same across subjects but that responses are independent between subjects.
Often fewer time points for longitudinal data than time series data.
Although analytic methods for time series and longitudinal data differ, they do have common elements, and the underlying processes that generate the data are often similar.
# Time series data types and examples
## Time series data types and examples
### Stationary processes
- A stationary process $\{Y_t\}$ has a constant mean (expected value) and finite 2nd moment for all times $t$, and the correlation between $Y_t$ and $Y_{t+h}$ does not depend on $t$, for all $h$.
- Below, data for stationary processes were simulated for the model, $Y_t = \mu + \epsilon_t$ where $\mu$ is the mean and $\epsilon_t$ are errors that are identically but not necessarily independently distributed.
### Example 1: Stationary process (iid error)
For the simulated data, $\mu=0$ and $\epsilon_t \sim \mathcal {N} (0,\ 0.46)$ for all $t$.
```{r "stationary", fig.align='center'}
library("tidyverse")
set.seed(6643)
mu <- 0
t <- 1:100
beta0 <- 0
beta1 <- -0.05
e0 <- rnorm(100, mean = 0, sd = 0.46)
e1 <- arima.sim(list(order = c(1,0,0), ar = .25), n = 100)
e2 <- arima.sim(list(order = c(1,0,0), ar = .50), n = 100)
e3 <- arima.sim(list(order = c(1,0,0), ar = .75), n = 100)
e4 <- arima.sim(list(order = c(1,0,0), ar = .99), n = 100)
Y1 <- mu + e1
Y2 <- mu + e2
Y3 <- mu + e3
Y4 <- mu + e4
Y0 <- mu + e0
Y5 <- beta0 + beta1 * t + e2
data <- cbind(Y0, Y1, Y2, Y3, Y4, Y5, t) %>%
as.data.frame()
ggplot(data, aes(t, Y0)) +
geom_line() +
theme_classic()
```
##
### Example 2: Stationary process (correlated error)
- Data below were generated using $\mu=0$ and errors that followed a first-order autoregressive $\big(AR(1)\big)$ process: $\epsilon_t = \phi\epsilon_{t-1} + Z_t$ and Specifically, $Z_t \stackrel {iid} \sim \mathcal {N} (0,\ 0.46)$, for all $t$.
- **Notes on AR(1) processes**:
1. Errors $\epsilon_t$ are identically distributed but not independent
2. Must have $|\phi| < 1$ for stationary process
3. The higher the value of $|\phi|$, the higher degree of correlation between responses from day to day
##
### Example 2: Stationary process (correlated error)
```{r "AR1", out.width="90%", fig.align='center'}
ar25 <- ggplot(data, aes(t, Y1)) +
geom_line() +
theme_classic() +
ggtitle(expression(paste(phi, " = 0.25")))
ar50 <- ggplot(data, aes(t, Y2)) +
geom_line() +
theme_classic() +
ggtitle(expression(paste(phi, " = 0.5")))
ar75 <- ggplot(data, aes(t, Y3)) +
geom_line() +
theme_classic() +
ggtitle(expression(paste(phi, " = 0.75")))
ar99 <- ggplot(data, aes(t, Y4)) +
geom_line() +
theme_classic() +
ggtitle(expression(paste(phi, " = 0.99")))
gridExtra::grid.arrange(ar25, ar50, ar75, ar99, nrow = 2)
```
##
### Example 3: Processes with trend and correlated errors
- $AR(1)$ process with linear time trend.
- $Y_t = \beta_0 + \beta_1t + \epsilon_t$,
$\beta_0 = 0$, $\beta_1 = -0.05$, $\epsilon_t \sim AR(1)$
(as in **Example 2**, last page, with $\phi\ =\ 0.25$)
```{r "linear time trend", fig.align='center'}
ggplot(data, aes(t, Y5)) +
geom_line() +
theme_classic()
```
**Random walks - see course notes (includes Example 5)**
# Longitudinal data types and examples
##
### Example 4: Retrospective observational studies
- Longitudinal trajectories of pseudomonas (PA) in EPIC study
- 1734 children enrolled in the EPIC observational study who were pseudomonas negative (PA-) for early pseudomonas infection control observational study (EPIC)
- Median followup time is 7.8 years ($Q1 - Q3:\ 6.3 - 8.3$)
- One of the questions of interest was finding factors associated with progression of PA; A secondary outcome of interest: time to first pulmonary exacerbation in EPIC trial
```{r "pseudomonas", out.width='50%', fig.align='center'}
knitr::include_graphics('figs_L1/L1-f1.png')
```
##
### Example 5: Prospective observational studies
STEPPED-CARE randomized trial.
- A behavioral intervention was tested versus usual care in patients with lung or head and neck cancer.
\alert {*need some figures*}
##
### Example 6: Chronic fatigue syndrome study
Complement levels and Chronic fatigue syndrome (CFS) data (Sorensen et al., 2003).
- Involved measuring complement split products (biological markers) over time.
- In this case, groups involved those with or without CFS, and thus repeated measures only involved time.
- A special covariance structure was used to model the repeated measures since measurement times were unequally spaced.
##
### Example 7: Growth curve data
- Graphs for height as a function of age for boys and girls aged 2 to 20 years
- Constructed in R after obtaining growth data from the [CDC](http://www.cdc.gov/). For more information, please see [growcharts](http://www.cdc.gov/growthcharts/).
- These data show that girls approach their maximum height much more quickly than boys. The y-axis scales were made the same for easier comparison between graphs.
- Each curve is a percentile estimate as a function of age. We could create confidence bands for each percentile curve.
- If the curves are estimated using a lot of data, the widths of the bands should be narrow. Doctors look for dramatic changes between visits.
- The curves here may not be representative of all populations (e.g., differences due to race).
##
### Example 7: Growth curve data
```{r "growth", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics('figs_L1/L1-f2.png')
```
**See course notes for more data examples**
## Formats for longitudinal data
This section covers
- how to set up data for "univariate" versus "multivariate" analysis
- types of variables and notation for variables in longitudinal data versus "factorial" data
- **See course notes for more information**
##
### Example 8, 9, & 10: Cluster data
**Example 8**: After an exercise challenge performed on 20 subjects,
resting heart rates are monitored at 5 minute intervals for one hour.
How are data clustered?
**Example 9**: Families are selected to participate in a survey regarding health insurance.
Each member of the family will be included in the study.
**Example 10**: arm length and leg length growth are measured for subjects once a year for 10 years,
and then modeled with a linear mixed model.
# Simple clustered/longitudinal analyses
## What we've already done!
- Experiments with pre-post measurements have 2 measurements on each subject over time. When there are only 2 measurements, the analysis simplifies when the difference is considered, as the analysis is reduced to one measurement per subject. Simple methods can then be used (e.g., paired t-test).
- Longitudinal models can still be beneficial here! But we'll discuss that later. For now, we consider simplified models.
- Let's take a closer look at the underlying models when we use a difference score or take the baseline-as-covariate approach.
## Change-score model and Baseline-as-covariate model
### Change-score model
$$
Y_{i1} = Score_{pre}; \ Y_{i2} = Score_{post}
$$
\vspace{-5mm}
$$
\Delta_i = Y_{i1} - Y_{i2} = \beta_0 + \beta_1 x_i + \epsilon_i
$$
### Baseline-as-covariance model
$$
Y_{i2} = \beta_0 +\beta_1Y_{i1} + \beta_2 x_i + \epsilon_i
$$
We allow the slope of the baseline value to be anything (based on fit).
### Example for discussion: cholesterol data
Any other type of simple clustering, with 2 responses per cluster can be analyzed similarly.
(E.g., pairing by married couple, pairing by year of measurement.)
# Usual assumptions for longitudinal models
##
- Assumption 1: Responses between subjects are independent.
- If there are clear violations to the assumption, and data are available, then a random term could be added to deal with this non-independence.
- For example, if a class is used for the sample, and there are several pairs of siblings in the class, a random term identifying family could be added to the model. (Lack of fit and lack of independence are related!)
- Assumption 2: There is a common covariance structure between all subjects, and the covariance parameters have the same value between subjects.
- This assumption is usually not tested. However, to properly estimate covariance parameters, several subjects are needed (just as data for several subjects are needed to estimate a common population mean).
- In some cases, homogeneous groups within the study may be identified (but heterogeneous between groups). With sufficient group sample sizes, group-specific covariance parameters can be put in the model and estimated.
# Longitudinal designs and power - an initial glimpse
## Longitudinal designs and power - an initial glimpse
- Consider an experiment designed to compare two treatments. Two common approaches:
- A1: Use independent samples (randomly assign some subjects one treatment, and some the other). For A1, we often use a 2-independent sample t-test
- A2: Have all subjects have one treatment and then have them all take the other (e.g., use a crossover design to eliminate confounding effects related to time). For A2, a paired t-test.
- A study/experiment involving changes within subjects (e.g., analyzed with a paired t-test) is often more powerful than a study using independent samples.
##
- The general formula for the variance for the difference in means suggests why this may be expected (when correlations between responses within subjects are positive):
$$
Var[\bar Y_1 - \bar Y_2] = Var[\bar Y_1] + Var[\bar Y_2] - 2Cov[\bar Y_1,\ \bar Y_2]
$$
- Often there are many factors not of interest that distinguish the two independent samples, while for the paired data, the difference in responses is due more to the treatment alone and not to other factors, since we're using the same subjects.
- The same principle generalizing to multiple times and longitudinal data in general (e.g., air pollution study);
subject serve as their own controls.
- But paired/longitudinal designs may not always be better. In some cases a short cross-sectional study/experiment involving many subjects may be more feasible and cost-effective.
## Summary
1. Why do we need special methods?
2. Discussed example datasets that are longitudinal
3. Discuss time series vs. longitudinal; formats for longitudinal data
4. Assumptions for longitudinal models
5. Analyses of longitudinal data with two time points