generated from jtr13/EDAVtemplate
# Data
## Sources
The NHANES dataset is collected by staff in mobile examination centers equipped to take measurements, collect samples, and administer questionnaires. The methodology is extensively documented [here](https://www.cdc.gov/nchs/data/series/sr_02/sr02-184-508.pdf). NHANES contains far more data than we could use in our project, so we worked with a small subset of it. Out of the many surveys, we chose the demographics, body measures, nutrients, alcohol, disabilities, conditions, work, physical activity, and smoking datasets. Not every person completed every survey. The demographics dataset is the baseline with 9254 respondents; the other surveys have fewer, and the alcohol dataset has the fewest with 5533. One challenge we faced came from the design of the survey: not answering a question is not the same as answering no. For example, not answering the smoking survey does not necessarily mean a participant doesn't smoke. We had to factor this into our analysis throughout the project.
### Dataset columns
We chose these columns because we wanted to explore whether they were related to one another. We started with a much larger set of columns, then brainstormed possible associations to trim the list.
The columns we use in the demographics dataset are: ID, country of birth, level of education, household size, age (topcoded at 80), gender, household income, military status, and marital status.
The columns used in the body measures dataset are: ID, weight (kg), and height (cm).
The columns used in the nutrients dataset are: ID, calories (kcal), protein (g), carbs (g), sugar (g), fiber (g), total fat (g), cholesterol (mg), vitamin C (mg), sodium (mg), caffeine (mg), and water (g).
The columns from the alcohol dataset are: ID and number of drinks per day.
The columns used in the disabilities dataset are: ID and how often the respondent feels depressed.
The columns we use from the conditions dataset are: ID and whether the respondent has ever had cancer.
The columns used in the work dataset are: ID and hours worked a week.
From the physical activity dataset we use: ID, minutes of vigorous recreational activity per day, minutes of moderate recreational activity per day, and minutes of sedentary activity per day.
Finally, the columns we use from the smoking dataset are: ID, number of cigarettes per day last month and age started smoking regularly.
## Cleaning / transformation
We download our data with a script based on the starter script NHANES provides [here](https://wwwn.cdc.gov/nchs/data/Tutorials/Code/DB303_R.r). We download each dataset separately and then merge them. Every entry in each dataset has a SEQN respondent ID, which we left join on, starting from the demographics dataset, to create our total dataset. This ensures that each person gets a row in the merged data even if they didn't answer every question. We do our preprocessing on a graph-by-graph basis because each graph uses a different portion of the dataset, so that code appears alongside the graphs. Many of our columns are categorical with fixed response codes, which we look up in each dataset's documentation on the NHANES website [here](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017). We also process these individually for the sake of ease. Some codes, such as 77,777 and 99,999, stand for "refused" or "don't know", and we have to handle these in preprocessing as well.
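
As a minimal sketch of this kind of recoding, the chunk below turns non-response codes into `NA` and maps a categorical code to labels. The toy data frame, the helper `codes_to_na`, and the label mapping are assumptions made for illustration; the actual codes for each variable should be taken from its NHANES documentation page.

```{r}
library(dplyr)

# Toy data standing in for a slice of the merged data (values are made up)
toy_raw <- data.frame(
  SEQN       = 1:5,
  work_hours = c(40, 77777, 35, 99999, NA),
  gender     = c(1, 2, 2, 1, 2)
)

# Hypothetical helper: replace the given non-response codes with NA
codes_to_na <- function(x, codes) {
  replace(x, x %in% codes, NA)
}

toy_clean <- toy_raw %>%
  mutate(
    work_hours = codes_to_na(work_hours, c(77777, 99999)),
    # Assumed mapping for illustration; confirm against the variable's codebook
    gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female"))
  )
```

Recoding before any summary matters because a code like 99999 would otherwise be treated as a real (and very large) number of hours.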
```{r}
#install.packages("survey")
library(dplyr)
library(survey)
library(redav)
library(tidyr)
library(ggplot2)
library(forcats)
```
```{r}
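read_data <- function(file_link, var_names) {
  # Download the SAS transport (.XPT) file to a temporary location
  tf <- tempfile(fileext = ".XPT")
  on.exit(unlink(tf), add = TRUE)
  download.file(file_link, tf, mode = "wb")
  # Read the file and keep only the requested columns
  foreign::read.xport(tf)[, var_names]
}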
```
```{r}
# Demographics
demographics <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT", c("SEQN","DMDBORN4","DMDEDUC2","DMDHHSIZ","RIDAGEYR","RIAGENDR", "INDHHIN2", "DMQMILIZ", "DMDMARTL")) %>%
rename(country_of_birth="DMDBORN4", level_education="DMDEDUC2", household_size="DMDHHSIZ", age="RIDAGEYR", gender="RIAGENDR", household_income="INDHHIN2", military_status="DMQMILIZ", marital_status="DMDMARTL")
# Body measures
body_measures <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BMX_J.XPT", c("SEQN", "BMXWT", "BMXHT")) %>%
rename(weight="BMXWT", height="BMXHT")
# Nutrient intake
nutrients <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DR1TOT_J.XPT", c("SEQN", "DR1TKCAL", "DR1TPROT", "DR1TCARB", "DR1TSUGR", "DR1TFIBE", "DR1TTFAT", "DR1TCHOL", "DR1TVC", "DR1TSODI", "DR1TCAFF", "DR1_320Z")) %>%
rename(calories="DR1TKCAL", protein="DR1TPROT", carbs="DR1TCARB", sugar="DR1TSUGR", fiber="DR1TFIBE", fat="DR1TTFAT", cholesterol="DR1TCHOL", vitc="DR1TVC", sodium="DR1TSODI", caffeine="DR1TCAFF", water="DR1_320Z")
# Alcohol usage
alcohol <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/ALQ_J.XPT", c("SEQN", "ALQ130")) %>%
rename(drinks_day="ALQ130")
# Disabilities
disabilities <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DLQ_J.XPT", c("SEQN", "DLQ140")) %>%
rename(depression="DLQ140")
# Conditions
conditions <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/MCQ_J.XPT", c("SEQN", "MCQ220")) %>%
rename(cancer="MCQ220")
# Work
work <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/OCQ_J.XPT", c("SEQN", "OCQ180")) %>%
rename(work_hours="OCQ180")
# Physical Activity
physical <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/PAQ_J.XPT", c("SEQN", "PAD660", "PAD675", "PAD680")) %>%
rename(act_mins_vig="PAD660", act_mins_mod="PAD675", act_mins_sed="PAD680")
# Smoking
smoking <- read_data("https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/SMQ_J.XPT", c("SEQN", "SMD650", "SMD030")) %>%
rename(num_cigs_day="SMD650", smoke_age="SMD030")
```
```{r}
total = left_join(demographics, alcohol, by="SEQN")
total = left_join(total, body_measures, by="SEQN")
total = left_join(total, conditions, by="SEQN")
total = left_join(total, disabilities, by="SEQN")
total = left_join(total, nutrients, by="SEQN")
total = left_join(total, physical, by="SEQN")
total = left_join(total, smoking, by="SEQN")
total = left_join(total, work, by="SEQN")
```
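
The left joins above should leave exactly one row per demographics respondent, with `NA` filled in wherever a respondent is absent from a component file. A minimal sketch of that invariant on made-up data (the `toy_*` names are hypothetical and not part of our pipeline):

```{r}
library(dplyr)

toy_demo    <- data.frame(SEQN = c(1, 2, 3), age = c(34, 80, 52))
toy_alcohol <- data.frame(SEQN = c(1, 3), drinks_day = c(2, 1))
toy_smoking <- data.frame(SEQN = c(2), num_cigs_day = c(10))

# Left-join every component onto the demographics baseline
toy_total <- Reduce(
  function(acc, df) left_join(acc, df, by = "SEQN"),
  list(toy_alcohol, toy_smoking),
  toy_demo
)

# The row count is preserved as long as SEQN is unique in each component
stopifnot(!anyDuplicated(toy_alcohol$SEQN), !anyDuplicated(toy_smoking$SEQN))
stopifnot(nrow(toy_total) == nrow(toy_demo))
```

The same two checks could be run on the real data frames as a guard against duplicated respondent IDs inflating the merged data.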
```{r}
write.csv(total, "nhanesdata.csv", row.names=FALSE)
```
## Missing value analysis
### NAs per Column
To analyze missing values we take a less conventional approach, because our data merges many columns from different datasets. First, we show the number of NAs in each column.
```{r}
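# Count NAs per column and reshape to one row per column
na_count <- total %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(cols = everything())
# Dot plot of NA counts, with columns ordered by their NA count
ggplot(na_count, aes(x = value, y = fct_reorder(name, value))) +
  geom_point(color = "blue") +
  ggtitle("Number of NAs for each Column") +
  ylab("") +
  xlab("Number of NA") +
  theme_linedraw()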
```
Columns such as number of cigarettes per day have high NA counts because they belong to narrower parts of the survey that fewer respondents fill out, whereas the nutrition, body measurement, and demographic columns have few or no NAs. However, we cannot conclude from an NA that someone doesn't smoke. We have to drop those rows instead, because the respondent may be a smoker who did not answer the question.
We then look at a few of our datasets individually to see trends in their missing values.
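
To make the distinction between a missing answer and a "no" concrete, the sketch below uses made-up cigarette counts to compare dropping missing answers with (incorrectly) treating them as zero. The numbers are illustrative only and are not taken from NHANES.

```{r}
# Made-up responses: NA means the question was not answered
toy_cigs <- c(10, NA, 20, NA, 0)

# Dropping NAs: average over respondents who actually answered
mean_answered <- mean(toy_cigs, na.rm = TRUE)

# Treating NA as "does not smoke" silently adds non-smokers and lowers the mean
mean_na_as_zero <- mean(ifelse(is.na(toy_cigs), 0, toy_cigs))

c(answered = mean_answered, na_as_zero = mean_na_as_zero)
```

With these toy values the mean is 10 when missing answers are dropped and 6 when they are treated as zero.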
### Pattern Analysis for Demographics
We start with the demographics dataset.
```{r}
dem_mis <- demographics %>% rename(country=country_of_birth, edu=level_education, hsize=household_size, age=age, gender=gender, inc=household_income, mil=military_status, mar=marital_status)
plot_missing(dem_mis, percent = TRUE)
```
Level of education and marital status are tied for the most missing values, and in 4 of our 5 missing patterns both fields are missing together; no pattern has one missing without the other. Perhaps the survey places these two questions close together. Military status also has a sizable share of missing values, and income has a very small share. The most common missing pattern is education, marital status, and military status missing together, which leads us to believe that the missingness is in fact due to how the survey groups these questions.
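
The pattern counts that `plot_missing` visualizes can also be tabulated directly. A minimal sketch on made-up data, using a small hypothetical data frame `toy_dem` in place of `dem_mis`:

```{r}
library(dplyr)

# Made-up values; only the positions of the NAs matter here
toy_dem <- data.frame(
  edu = c(3, NA, NA, 4, NA),
  mar = c(1, NA, NA, 2, NA),
  mil = c(2, NA, 2, 2, NA),
  age = c(40, 10, 18, 65, 7)
)

# Flag each cell as missing, then count how often each combination occurs
toy_patterns <- toy_dem %>%
  mutate(across(everything(), is.na)) %>%
  group_by(across(everything())) %>%
  summarise(n_rows = n(), .groups = "drop") %>%
  arrange(desc(n_rows))

toy_patterns
```

Each row of `toy_patterns` is one missing pattern (`TRUE` marks a missing column), so a table like this can be used to cross-check the counts read off the plot.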
### Pattern Analysis for Nutrition
Next we look at the nutrition dataset.
```{r}
nut_mis <- nutrients %>% rename(kcal=calories, prot=protein, carbs=carbs, sugar=sugar, fiber=fiber, fat=fat, chol=cholesterol, vitc=vitc, sodium=sodium, caf=caffeine, water=water)
plot_missing(nut_mis, percent = TRUE)
```
The vast majority of the data is complete, and there are only 2 missing patterns: one where none of the fields are filled, and one where only water is filled. This may be because respondents did not want to tally all of their nutrients, whereas water intake is easy to estimate.