Lily McMullen Module4_Assignment1.Rmd
---
title: "Module 4 Assignment 1"
author: "Ellen Bledsoe, Lily McMullen"
date: "`r Sys.Date()`"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Assignment Details
### Purpose
The goal of this assignment is to assess your ability to interpret correlation coefficients and regression analyses.
### Task
Write R code that produces the correct answers, and correctly interpret the plots and results.
### Criteria for Success
- Code is within the provided code chunks
- Code is commented with brief descriptions of what the code does
- Code chunks run without errors
- Code produces the correct result
- Code that produces the correct answer will receive full credit
- Code attempts with logical direction will receive partial credit
- Written answers address the questions in sufficient detail
### Due Date
December 6 at midnight MST
# Assignment Questions
In this assignment, we are going to continue using the hair grass data set from class. The first lesson [Roads and Regressions](https://posit.cloud/spaces/269799/content/4970753) will be particularly helpful to you in completing this assignment.
We are going to look at the relationship between hair grass density and two other variables: phosphorus content and the average summer temperature.
## Set-Up
As always, we must get organized before we can do anything!
First load the tidyverse and read in the hair grass data set.
```{r}
# Load the tidyverse and read in the hair grass data
library(tidyverse)
hairgrass <- read_csv("hairgrass_data.csv")
```
## Phosphorus Content
1. Calculate the mean and standard deviation of the measured phosphorus content.
```{r}
# Mean and standard deviation of phosphorus content
hairgrass %>%
  summarise(mean_P_content = mean(P_content),
            sd_P_content = sd(P_content))
```
2. Which variable is the independent variable? Which is the dependent?
*Independent: Phosphorus content*
*Dependent: Hairgrass density*
3. Create a scatter plot of hair grass density and phosphorus content. Be sure to make the labels easier to understand.
```{r}
# Scatter plot of hair grass density against phosphorus content
ggplot(hairgrass, aes(x = P_content, y = hairgrass_density_m2)) +
  geom_point() +
  xlab("Phosphorus Content") +
  ylab("Hairgrass Density") +
  theme_bw()
```
4. Write 1-2 sentences interpreting the plot above. Is this a positive relationship, negative relationship or no relationship at all? Based on your prediction, do you think the correlation coefficient will be positive, negative, or zero?
*Answer: I don't see a clear relationship between phosphorus content and hairgrass density. Because of this, I predict that the correlation coefficient will be close to 0.*
5. Calculate the correlation coefficient, `r`.
```{r}
r <- cor(x = hairgrass$P_content, y = hairgrass$hairgrass_density_m2)
r
```
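As a check on what `cor()` computes, here is a minimal sketch of Pearson's `r` calculated by hand. The vectors `x` and `y` are made-up illustration values, not the hair grass data.

```r
# Hypothetical toy data (NOT from hairgrass_data.csv)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in equivalent
r_builtin <- cor(x, y)

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```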
6. Calculate the `r^2` value. Write a one sentence interpretation of what the `r^2` value means in the context of these two variables.
```{r}
r^2 * 100
```
*Interpretation: About 0.58% of the variation in hairgrass density is explained by phosphorus content.*
7. What are the null and alternative hypotheses regarding the relationship between these two variables? (2 pts)
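For a regression with a single predictor, squaring `r` should give the same number as the R-squared reported by `summary()` on the model. The sketch below illustrates this on made-up values; `x`, `y`, and `toy_model` are hypothetical names, not part of the assignment data.

```r
# Hypothetical toy data (NOT from hairgrass_data.csv)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Correlation coefficient and a one-predictor linear model on the same data
r <- cor(x, y)
toy_model <- lm(y ~ x)

# With one predictor, r^2 matches the model's R-squared
r_squared <- r^2
model_r_squared <- summary(toy_model)$r.squared
all.equal(r_squared, model_r_squared)

# Expressed as the percentage of variation in y explained by x
r_squared * 100
```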
**Null: There is no significant relationship between phosphorus content and hairgrass density.**
**Alternative: There is a significant relationship between phosphorus content and hairgrass density.**
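One common way to state these hypotheses formally, assuming the simple linear model fitted with `lm()` in this assignment, is in terms of the slope. The symbols $\beta_0$, $\beta_1$, and $\varepsilon$ are notation introduced here for illustration, with $x$ as phosphorus content and $y$ as hairgrass density.

$$
y = \beta_0 + \beta_1 x + \varepsilon, \qquad H_0: \beta_1 = 0, \qquad H_A: \beta_1 \neq 0
$$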
8. Create the scatter plot that includes the line of best fit (have `ggplot2` calculate the linear equation for you)
```{r}
# Scatter plot with a linear line of best fit added by geom_smooth()
ggplot(hairgrass, aes(P_content, hairgrass_density_m2)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Phosphorus Content") +
  ylab("Hairgrass Density") +
  theme_bw()
```
9. Using code, create the regression model in R and obtain the summary of it.
```{r}
phosphorus_model <- lm(hairgrass_density_m2 ~ P_content, data = hairgrass)
summary(phosphorus_model)
```
10. Write out the equation for the line of best fit.
*Answer: y = 0.03x + 7.31*
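Rather than copying the numbers from the summary by eye, the slope and intercept can be pulled out of a fitted model with `coef()`. This is a minimal sketch on made-up values; `toy` and `toy_model` are hypothetical stand-ins for the hair grass data and model.

```r
# Hypothetical toy data (NOT from hairgrass_data.csv)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, 5, 4, 5))
toy_model <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" and the slope for "x"
b <- coef(toy_model)
intercept <- unname(b["(Intercept)"])
slope <- unname(b["x"])

# Assemble the equation in the form y = (slope)x + (intercept)
equation <- sprintf("y = %.2fx + %.2f", slope, intercept)
equation
```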
11. Interpret the model summary. What is the p-value for our variable of interest? Do we accept or reject the null hypothesis regarding the relationship between these two variables? What can we conclude then about building a road? (2 pts)
*Answer:*
The p-value is 0.0943, which is above our cutoff of 0.05, so we fail to reject the null hypothesis. There is no significant relationship between phosphorus content and hairgrass density, which means we do not need to take phosphorus content into consideration when deciding where to build our road.
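The p-value can also be read out of the model summary in code and compared against the cutoff. The sketch below uses made-up values; `toy`, `toy_model`, and `alpha` are hypothetical names, and the 0.05 cutoff is the one used in the answer above.

```r
# Hypothetical toy data (NOT from hairgrass_data.csv)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, 5, 4, 5))
toy_model <- lm(y ~ x, data = toy)

# The coefficient table has one row per term; "Pr(>|t|)" holds the p-values
coef_table <- summary(toy_model)$coefficients
p_value <- coef_table["x", "Pr(>|t|)"]

# Compare the slope's p-value against the significance cutoff
alpha <- 0.05
decision <- if (p_value < alpha) {
  "reject the null hypothesis"
} else {
  "fail to reject the null hypothesis"
}
decision
```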
## Summer Temperature
Now let's do the same thing for the average summer temperatures.
12. Create a scatter plot of hair grass density and average summer temperature. Remember to improve the axes labels!
```{r}
# Scatter plot of hair grass density against average summer temperature
ggplot(hairgrass, aes(x = avg_summer_temp, y = hairgrass_density_m2)) +
  geom_point() +
  xlab("Average Summer Temperature") +
  ylab("Hairgrass Density") +
  theme_bw()
```
13. Write 1-2 sentences interpreting the plot above. Is this a positive relationship, negative relationship or no relationship at all? Based on your prediction, do you think the correlation coefficient will be positive, negative, or zero?
*Answer: There appears to be a positive relationship between average summer temperature and hairgrass density. Because of this, I predict the correlation coefficient will be positive.*
14. Calculate the correlation coefficient, `r`.
```{r}
# Correlation between average summer temperature and hair grass density
r <- cor(x = hairgrass$avg_summer_temp, y = hairgrass$hairgrass_density_m2)
r
```
15. Calculate the `r^2` value. Write a one sentence interpretation of what the `r^2` value means in the context of these two variables.
```{r}
r^2 * 100
```
*Interpretation:* About 15.46% of the variation in hairgrass density is explained by average summer temperature.
16. Create the scatter plot that includes the line of best fit (have `ggplot2` calculate the linear equation for you).
```{r}
# Scatter plot with a linear line of best fit added by geom_smooth()
ggplot(hairgrass, aes(avg_summer_temp, hairgrass_density_m2)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Average Summer Temperature") +
  ylab("Hairgrass Density") +
  theme_bw()
```
17. Using code, create the regression model in R and obtain the summary of it.
```{r}
summer_temp_model <- lm(hairgrass_density_m2 ~ avg_summer_temp, data = hairgrass)
summary(summer_temp_model)
```
18. Interpret the model summary. What is the p-value for our variable of interest? Do we accept or reject the null hypothesis regarding the relationship between these two variables? What can we conclude then about building a road? (2 points)
*Answer:*
The p-value for our variable of interest is `<2e-16`, which is far below our cutoff of 0.05, so we reject the null hypothesis. There is a significant positive relationship between average summer temperature and hairgrass density. When building our road, we should avoid areas with high average summer temperatures so that we do not disturb areas with high hairgrass density.