diff --git a/modules/Statistics/Statistics.Rmd b/modules/Statistics/Statistics.Rmd
index 2ba4bec4..867c1a0f 100644
--- a/modules/Statistics/Statistics.Rmd
+++ b/modules/Statistics/Statistics.Rmd
@@ -90,9 +90,11 @@ Function `cor()` computes correlation in R.
cor(x, y = NULL, use = c("everything", "complete.obs"),
method = c("pearson", "kendall", "spearman"))
```
-- provide two numeric vectors of the same length (arguments `x`, `y`), or
-- provide a data.frame / tibble with numeric columns only
-- by default, Pearson correlation coefficient is computed
+
+
+- provide two numeric vectors of the same length (arguments `x`, `y`), or
+- provide a data.frame / tibble with numeric columns only
+- by default, Pearson correlation coefficient is computed
## Correlation test
@@ -111,27 +113,28 @@ cor.test(x, y = NULL, alternative(c("two.sided", "less", "greater")),
- less means true correlation coefficient is < 0 (negative relationship)
-## Correlation {.small}
+## Correlation {.codesmall}
Let's look at the dataset of yearly CO2 emissions by country.
```{r cor1, comment="", message = FALSE}
-yearly_co2 <- read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
+yearly_co2 <-
+ read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
```
## Correlation for two vectors
-First, we compute correlation by providing two vectors.
+First, we create two vectors.
```{r}
# x and y must be numeric vectors
-y1980 <- yearly_co2_emissions %>% pull(`1980`)
-y1985 <- yearly_co2_emissions %>% pull(`1985`)
+y1980 <- yearly_co2 %>% pull(`1980`)
+y1985 <- yearly_co2 %>% pull(`1985`)
```
-Like other functions, if there are `NA`s, you get `NA` as the result. But if you specify use only the complete observations, then it will give you correlation using the non-missing data.
+Like other functions, `cor()` returns `NA` if either vector contains `NA`s. If you specify `use = "complete.obs"`, it computes the correlation using only the pairs where both values are non-missing.
```{r}
cor(y1980, y1985, use = "complete.obs")
@@ -156,32 +159,16 @@ glimpse(cor_result)
## Correlation for two vectors with plot{.codesmall}
-In plot form... `geom_smooth()` and `annotate()` can help.
+In plot form, `geom_smooth()` adds a smoothed trend line and `annotate()` places the correlation label on the plot.
```{r, warning = F}
corr_value <- pull(cor_result, estimate) %>% round(digits = 4)
cor_label <- paste0("R = ", corr_value)
-yearly_co2_emissions %>%
+yearly_co2 %>%
ggplot(aes(x = `1980`, y = `1985`)) + geom_point(size = 1) + geom_smooth() +
annotate("text", x = 2000000, y = 4000000, label = cor_label)
```
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
## Correlation for data frame columns
We can compute correlation for all pairs of columns of a data frame / matrix. This is often called, *"computing a correlation matrix"*.
@@ -189,13 +176,14 @@ We can compute correlation for all pairs of columns of a data frame / matrix. Th
Columns must be all numeric!
```{r}
-co2_subset <- yearly_co2_emissions %>%
+co2_subset <- yearly_co2 %>%
select(c(`1950`, `1980`, `1985`, `2010`))
head(co2_subset)
```
## Correlation for data frame columns
+
We can compute correlation for all pairs of columns of a data frame / matrix. This is often called, *"computing a correlation matrix"*.
```{r}
@@ -226,12 +214,13 @@ knitr::include_graphics(here::here("images/lyme_and_fried_chicken.png"))
## T-test
-The commonly used are:
+The commonly used t-tests are:
-- **one-sample t-test** -- used to test mean of a variable in one group
-- **two-sample t-test** -- used to test difference in means of a variable between two groups (if the "two groups" are data of the *same* individuals collected at 2 time points, we say it is two-sample paired t-test)
+- **one-sample t-test** -- used to test the mean of a variable in one group
+- **two-sample t-test** -- used to test the difference in means of a variable between two groups
+ - if the "two groups" are data of the *same* individuals collected at 2 time points, we say it is a two-sample paired t-test
-The `t.test()` function in R is one to address the above.
+The `t.test()` function does both.
```
t.test(x, y = NULL,
@@ -302,7 +291,7 @@ See [here](https://www.nature.com/articles/nbt1209-1135) for more about multiple
## Some other statistical tests
- `wilcox.test()` -- Wilcoxon signed rank test, Wilcoxon rank sum test
-- `shapiro.test()` -- Shapiro test
+- `shapiro.test()` -- Shapiro-Wilk test of normality
- `ks.test()` -- Kolmogorov-Smirnov test
- `var.test()`-- Fisher’s F-Test
- `chisq.test()` -- Chi-squared test
@@ -548,8 +537,10 @@ Maybe we want to use the age group "65+ years" as our reference. We can relevel
Relative to the level is not listed.
```{r}
-er_temps <- er_temps %>% mutate(age = factor(age,
- levels = c("65+ years old", "35-64 years old", "15-34 years old", "5-14 years old", "0-4 years old")
+er_temps <-
+ er_temps %>%
+ mutate(age = factor(age,
+ levels = c("65+ years", "35-64 years", "15-34 years", "5-14 years", "0-4 years")
))
fit4 <- glm(visits ~ highest_temp + year + age, data = er_temps)