# Statistics in R
In this chapter, we use R to add statistics to plots and to carry out statistical analyses.
We need ``dplyr``, ``ggplot2``, and the Minneapolis ACS dataset.
```{r, message=FALSE}
library(dplyr)
library(ggplot2)
minneapolis <- read.csv('minneapolis.csv')
```
## Plotting statistics
### Fitting the general trend of points
You could fit a line to show the general trend of a scatter plot using ``stat_smooth()`` in ``ggplot2``.
```{r}
## select observations from 2017
minneapolis_2017 <- filter(minneapolis, YEAR == 2017)
p <- ggplot(minneapolis_2017, aes(AGE, INCTOT)) +
geom_point()
p + stat_smooth(method = 'lm') # fit a linear regression line to the points
```
```{r}
p + stat_smooth(method = 'loess') # fit a non-linear (locally weighted) smoothing line
```
It seems that the non-linear relationship makes more sense.
### Plotting distribution
You could draw a histogram of the data with ``geom_histogram()``. With the ``binwidth`` argument, we can set the width of each bin.
```{r}
p <- ggplot(minneapolis_2017, aes(AGE))
p + geom_histogram(binwidth = 5) # the width of each category is 5
```
Instead of the bin width, we could set the number of bins with ``bins``.
```{r}
p + geom_histogram(bins = 30) # the number of bins is 30
```
By setting ``aes(y = ..density..)`` in ``geom_histogram()``, the plot shows density instead of counts.
```{r}
p + geom_histogram(aes(y = ..density..), bins = 30) # use density instead of count
```
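Note that recent versions of ggplot2 deprecate the ``..density..`` syntax in favor of ``after_stat()``; the equivalent call would be:
```{r}
p + geom_histogram(aes(y = after_stat(density)), bins = 30) # same plot, newer syntax
```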
### Density plot of distribution
You could fit a density line to the histogram, a smooth curve that shows the distribution of the data.
```{r}
p + geom_histogram(aes(y = ..density..), binwidth = 5) +
geom_density(alpha = 0.2, fill = 'Grey')
```
``geom_freqpoly()`` serves a similar purpose.
```{r}
p + geom_freqpoly(binwidth = 5)
```
### Box plot of distribution
Another way to show more statistics about your data is the box plot, which we introduced before.
```{r}
p <- ggplot(minneapolis_2017, aes(factor(SEX), INCTOT))
p + geom_boxplot()
```
## Annotation of statistics
We can use ``geom_text()`` to annotate the statistics in the plot.
### Bar plot
```{r}
ggplot(minneapolis_2017, aes(x = RACE)) +
geom_bar() +
geom_text(aes(label = ..count..), stat = 'count', vjust = 1.5, colour = 'white')
```
```{r}
## count the number of respondents based on RACE
minneapolis_race <- minneapolis_2017 %>%
group_by(RACE) %>%
summarise(count = n())
ggplot(minneapolis_race, aes(RACE, count)) +
geom_col() +
geom_text(aes(label = count), vjust = -0.3)
```
### Line plot
```{r}
## calculate the average personal income for each year
minneapolis_year <- minneapolis %>%
group_by(YEAR) %>%
summarise(AvgInc = mean(INCTOT, na.rm = TRUE))
ggplot(minneapolis_year, aes(YEAR, AvgInc)) +
geom_line(linetype= 'dashed') +
geom_point() +
geom_text(aes(label = round(AvgInc)), vjust = -0.5)
```
```{r message=FALSE}
## calculate the average personal income for each racial group in each year
minneapolis_year_race <- minneapolis %>%
group_by(YEAR, RACE) %>%
summarise(AvgInc = mean(INCTOT, na.rm = TRUE))
## line plot of the trend of average personal income
ggplot(minneapolis_year_race, aes(YEAR, AvgInc, linetype = factor(RACE))) +
geom_line() +
geom_text(aes(label = round(AvgInc)), vjust = -0.4)
```
## Simple statistics
We have covered some of these functions before. For example, we could use ``mean()`` to compute the average of a set of numbers.
### Mean and median
We could use the base functions to do some simple statistical analysis directly.
```{r}
mean(minneapolis_2017$AGE) # mean
median(minneapolis_2017$AGE) # median
```
### Minimum and maximum value
```{r}
min(minneapolis_2017$AGE) # minimum value
max(minneapolis_2017$AGE) # maximum value
```
### Quartiles
```{r}
x <- quantile(minneapolis_2017$AGE)
x # list of quartiles
x[2] # select the value by its index
```
You could pass a value from 0 to 1 to ``quantile()`` to find a specific quantile, for example, the 40th percentile.
```{r}
quantile(minneapolis_2017$AGE, 0.4) # the 40th percentile
```
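As a shortcut, the base ``summary()`` function reports the minimum, the quartiles, the mean, and the maximum in one call:
```{r}
summary(minneapolis_2017$AGE) # min, quartiles, mean, and max at once
```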
## Linear regression
Before we start, let's review some related knowledge first.
Regression is used to examine the linear relationship between a dependent variable and one or more independent variables. The dependent variable is the one to be explained, and the independent variables (also called regressors, predictors, or explanatory variables) are those that may influence the dependent variable. For example, the dependent variable could be personal income, and the independent variables could be education, gender, age, *etc*.

Independent variables come in two types: continuous and categorical (including dummy variables). A continuous variable takes continuous values, such as income or age. A categorical variable takes a limited set of levels, which are not continuous. A dummy variable is a categorical variable with exactly two levels; for example, gender could be coded as 1 for male and 0 for female.
Suppose we have a dependent variable $Y$ and two independent variables $X_1$ and $X_2$; the assumed regression model can be expressed as
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon,$$
where $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the coefficients for $X_1$ and $X_2$, and $\epsilon$ is the error term, i.e., the part of the dependent variable that cannot be explained by the intercept and the independent variables. Linear regression estimates the values of $\beta_0$, $\beta_1$, and $\beta_2$ and tests their significance. Each coefficient gives the amount by which the dependent variable changes when the corresponding independent variable changes by one unit, holding the other variables constant.
The estimated model could be expressed as,
$$\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2$$
The terms with hats are the estimated values.
Besides the estimated values of the intercept and coefficients, regression also reports the significance of these estimates through p-values: the smaller the p-value, the more significant the estimate. In R, the output marks each estimate to indicate the confidence level at which it is statistically significant.
| Mark | Significance level |
|------|--------------------|
| . | 90% |
| * | 95% |
| ** | 99% |
| *** | 99.9% |
To quantify the goodness of fit of the model, we use $R^2$, which ranges from 0 to 1. The closer $R^2$ is to 1, the better the model fits the dataset. $R^2$ has the property that adding more independent variables to the regression model always increases it. Another index, the adjusted $R^2$, accounts for the number of variables in the model.
Our example uses the Minneapolis ACS population dataset, and we want to explore the relationship between ``EDUC`` (education level) and ``AGE`` (age). Let's draw a scatter plot to see their distribution.
```{r message = F}
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_point()
```
Because many points overlap, we could add some noise (jitter) to the positions of the points to show them all.
```{r}
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_jitter(size = 0.5) ## add some noise to the positions of the points and set the size of the points to be 0.5
```
Based on the plot, there seems to be a positive relationship between the two variables. We add a fitted linear line with ``geom_smooth()``.
```{r}
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_jitter(size = 0.5) +
geom_smooth(method = 'lm')
```
To quantify this linear relationship, we could use the ``lm()`` function to fit it and the ``summary()`` function to examine the result. In ``lm()``, the formula specifies the assumed model. Here, our assumed model is,
$$EDUC = \beta_0 + \beta_1 \times AGE + \epsilon$$
When we code this model in R, we do
```
EDUC ~ AGE
```
We only write the names of the dependent and independent variables, connected by ``~``; there is no need to write the intercept or error term.
We also need to indicate the name of the dataset in the function.
```{r}
lm_fit <- lm(EDUC ~ AGE, # formula
data = minneapolis_2017) # dataset
summary(lm_fit) # check result
```
The summarized result provides details about the fitted model, such as the coefficients and their p-values, the model's $R^2$, *etc.*
From this output, we learn that the estimated intercept is 4.04, with a p-value of $< 2e-16$ and the mark $***$, showing it is significant at the 99.9% level. The estimated coefficient of ``AGE`` is 0.08, also with a p-value of $<2e-16$ and the mark $***$, significant at the 99.9% level.
We can also see that the $R^2$ is 0.224 and the adjusted $R^2$ is 0.2237.
We could use the code below to check the $R^2$ of the model directly.
```{r}
summary(lm_fit)$r.squared # value of R2
```
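The adjusted $R^2$ is stored in the same summary object:
```{r}
summary(lm_fit)$adj.r.squared # value of adjusted R2
```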
We can also get the values of the coefficients directly.
```{r}
coefficients(lm_fit) # only check the coefficient
```
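If we also want interval estimates, the base ``confint()`` function returns confidence intervals (95% by default) for the intercept and the coefficients:
```{r}
confint(lm_fit) # 95% confidence intervals
```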
The model can be improved. The scatter plot of ``EDUC`` against ``AGE`` shows that many respondents reach their highest level of education before the age of 25; after 25, ``AGE`` has little influence on ``EDUC``. We therefore focus on the relationship before the age of 25.
```{r}
minneapolis_age <- filter(minneapolis_2017, AGE <= 25)
ggplot(minneapolis_age, aes(AGE, EDUC)) +
geom_jitter(size = 0.5) +
geom_smooth(method = 'lm')
```
```{r}
lm_fit <- lm(EDUC ~ AGE,
data = minneapolis_age)
summary(lm_fit)
```
The model now fits the dataset better.
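With the fitted model object, we could also compute predicted education levels at chosen ages using the base ``predict()`` function. A minimal sketch, with the ages 18, 22, and 25 picked purely for illustration:
```{r}
## predicted education levels at several illustrative ages
predict(lm_fit, newdata = data.frame(AGE = c(18, 22, 25)))
```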
Most of the time, we need to examine the relationship between the dependent variable and more than one independent variable. The example below examines the relationship between ``EDUC`` and ``AGE``, ``SEX``, and ``RACE``.
Education level and age.
```{r}
ggplot(minneapolis_2017, aes(AGE, EDUC)) +
geom_jitter(size = 0.5)
```
Education level and gender.
```{r}
ggplot(minneapolis_2017, aes(factor(SEX), EDUC)) +
geom_boxplot()
```
Education level and racial group.
```{r}
ggplot(minneapolis_2017, aes(factor(RACE), EDUC)) +
geom_boxplot()
```
When there is more than one independent variable, we use ``+`` to connect them in the formula. ``RACE`` and ``SEX`` are both categorical variables, so we transform them to factors with ``factor()``; R then automatically converts a factor with $k$ levels into $k-1$ dummy variables, using the first level as the reference.
```{r}
mlm_fit <- lm(EDUC ~ AGE + factor(SEX) + factor(RACE),
minneapolis_2017)
summary(mlm_fit)
summary(mlm_fit)$r.squared
```
Without a careful research design, the relationships fitted by a regression model are correlations, not causation.
## Logistic regression
The above two examples both use continuous variables as their dependent variables. What if the dependent variable is a binomial variable (i.e., a dummy variable taking 0 or 1)? Then we need logistic regression, and several functions can accomplish this. When interpreting a logistic regression result, each coefficient stands for the change in the log odds of the dependent variable being 1 when the corresponding independent variable increases by one unit.
```{r}
## create a variable to indicate the poverty status of the respondents
minneapolis_poverty <- minneapolis_2017 %>%
filter(!is.na(INCTOT)) %>%
mutate(poverty = ifelse(INCTOT <= 12228, 1, 0))
```
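A quick way to inspect the new variable is to tabulate it:
```{r}
table(minneapolis_poverty$poverty) # 0 = above the income threshold, 1 = at or below
```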
Poverty level and age.
```{r}
ggplot(minneapolis_poverty, aes(AGE, fill = factor(poverty))) +
geom_histogram()
```
Poverty and education level.
```{r}
ggplot(minneapolis_poverty, aes(EDUC, fill = factor(poverty))) +
geom_bar()
```
Poverty and gender.
```{r}
ggplot(minneapolis_poverty, aes(SEX, fill = factor(poverty))) +
geom_bar()
```
Poverty and racial group.
```{r}
ggplot(minneapolis_poverty, aes(RACE, fill = factor(poverty))) +
geom_bar()
```
Here, we introduce the ``glm()`` function, for which we need to specify ``family = binomial``.
```{r}
## logistic regression
logit_reg <- glm(poverty ~ AGE + EDUC + factor(SEX) + factor(RACE),
minneapolis_poverty,
family = binomial)
summary(logit_reg)
```
Most of the information is similar to that of linear regression, except that logistic regression does not report $R^2$ or adjusted $R^2$. Instead, it uses the AIC (Akaike information criterion) to indicate the goodness of fit of the model: among comparable models, the one with the smaller AIC is better.
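Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. The base ``AIC()`` function also extracts the criterion directly. A minimal sketch:
```{r}
exp(coef(logit_reg)) # odds ratios for each independent variable
AIC(logit_reg) # extract the AIC of the model directly
```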