-
Notifications
You must be signed in to change notification settings - Fork 0
/
07-Exploratory_Data_Analysis.Rmd
185 lines (140 loc) · 6 KB
/
07-Exploratory_Data_Analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
# Exploratory Data Analysis
We are going to use what we have learnt so far to do some exploratory data analysis, which is also the practice for data manipulation and visualization.
## Introduction of the dataset and research question
The dataset we used here is from the General Social Survey (GSS) carried out by the National Opinion Research Center of the University of Chicago. The dataset shows the results of the vocabulary tests for several years. It contains eight variables [@GSSvocab]. Their descriptions are listed below.
| Variable | Descriptions |
|------------|--------------------------------------------------------|
| year | The test year |
| gender | Gender |
| nativeBorn | Was the respondent born in the US? |
| ageGroup | Grouped age of the respondent |
| educGroup | Grouped education level of the respondent |
| vocab | Number of words out of 10 correct on a vocabulary test |
| age | Age of the respondent in years |
| educ | Years of education of the respondent |
Our research question is **how do the average vocabulary test results change over time?**
Let's get start. We first library the packages we need for this task.
```{r message = F}
library(dplyr)
library(ggplot2)
library(readr)
```
## Data munipulation
We could the import tool to import our dataset and copy the code.
```{r}
GSSvocab <- read_csv("GSSvocab.csv")
```
There are some warning messages when we import the dataset. The reason for this warning message is that we did not specify the type of the variables. Therefore, R tries to give every variable a type based on its own understanding of the data.
Let's check the dataset to see what it looks like.
```{r}
head(GSSvocab, 5)
```
Except the first variable, all other variables seem to be imported in the right type. Since the first variable is not useful for our data analysis, we do not need to care about it.
Then, based on the initial research questions, we need to calculate the average vocabulary test results for each year. We apply the ``group_by()`` function here.
```{r}
year_vocab <- GSSvocab %>%
group_by(year) %>%
summarize(avgVocab = mean(vocab, na.rm = T)) # indicate na.rm = T to deal with missing value
head(year_vocab, 5)
```
## Data visualization
Then, we apply ``ggplot()`` to visualize the dataset ``year_vocab`` we have just generated.
```{r}
ggplot(year_vocab, aes(x = year, y = avgVocab)) +
geom_line()
```
For a complete plot, we have to add more infomation such as title, x label, and y label, etc.
```{r}
ggplot(year_vocab, aes(x = year, y = avgVocab)) +
geom_line() +
labs(title = 'Average vocabulary test result for each year',
y = 'Average vocabulary test score',
x = 'Year') +
theme(plot.title = element_text(size = 20),
axis.text = element_text(size = 10, colour = 'black'),
axis.title = element_text(size = 12))
```
## Combine them togethor
We could put those codes together.
```{r}
GSSvocab %>%
group_by(year) %>%
summarize(avgVocab = mean(vocab, na.rm = T)) %>%
ggplot(aes(x = year, y = avgVocab)) + # we do not put name of the dataset here
geom_line() +
labs(title = 'Average vocabulary test result for each year',
y = 'Average vocabulary test score',
x = 'Year') +
theme(plot.title = element_text(size = 20),
axis.text = element_text(size = 10, colour = 'black'),
axis.title = element_text(size = 12))
```
## More examples
Let's make the research more challenging. Let's study the topic that **how do the average vocabulary test results for male and female change over time?**
```{r}
year_gender_vocab <- GSSvocab %>%
group_by(year, gender) %>%
summarize(avgVocab = mean(vocab, na.rm = T))
head(year_gender_vocab, 5)
```
```{r}
ggplot(year_gender_vocab, aes(x = year, y = avgVocab, col = gender)) +
geom_line() +
labs(title = 'Average vocabulary test result for male and female in each year',
y = 'Average vocabulary test score',
x = 'Year') +
theme(plot.title = element_text(size = 15),
axis.text = element_text(size = 10, colour = 'black'),
axis.title = element_text(size = 12))
```
Again, we could put them together.
```{r}
GSSvocab %>%
group_by(year, gender) %>%
summarize(avgVocab = mean(vocab, na.rm = T)) %>%
ggplot(aes(x = year, y = avgVocab, col = gender)) +
geom_line() +
labs(title = 'Average vocabulary test result for male and female in each year',
y = 'Average vocabulary test score',
x = 'Year') +
theme(plot.title = element_text(size = 15),
axis.text = element_text(size = 10, colour = 'black'),
axis.title = element_text(size = 12))
```
## More more example
We could also do the same thing to each age group and even plot them in different plots with ``facet_wrap()`` function.
```{r}
GSSvocab %>%
group_by(year, ageGroup) %>%
filter(!is.na(ageGroup)) %>% # delete the the obs with missing values in ageGroup
summarize(avgVocab = mean(vocab, na.rm = T)) %>%
ggplot(aes(x = year, y = avgVocab)) +
geom_line() +
facet_wrap(~ageGroup, labeller = label_both) +
labs(title = 'Average vocabulary test result for different age group in each year',
y = 'Average vocabulary test score',
x = 'Year') +
theme(plot.title = element_text(size = 15),
axis.text = element_text(size = 10, colour = 'black'),
axis.title = element_text(size = 12))
```
Or boxplots.
```{r}
GSSvocab %>%
filter(!is.na(ageGroup)) %>%
ggplot(aes(x = ageGroup, y = vocab)) +
geom_boxplot() +
facet_wrap(~year) +
theme(axis.text.x = element_text(angle = 90))
```
Or bar plots.
```{r}
GSSvocab %>%
filter(!is.na(ageGroup)) %>%
group_by(ageGroup, year) %>%
summarize(avgVocab = mean(vocab, na.rm = T)) %>%
ggplot(aes(x = ageGroup, y = avgVocab)) +
geom_col() +
facet_wrap(~year) +
theme(axis.text.x = element_text(angle = 90))
```