diff --git a/help.html b/help.html index a3023f49..eec0cc9c 100644 --- a/help.html +++ b/help.html @@ -353,14 +353,14 @@

Why are my changes not taking effect? It’s making my results look

Here we are creating a new object from an existing one:

new_rivers <- sample(rivers, 5)
 new_rivers
-
## [1] 314 529 710 450 605
+
## [1]  135  470  407  610 1171

Using just this will only print the result and not actually change new_rivers:

new_rivers + 1
-
## [1] 315 530 711 451 606
+
## [1]  136  471  408  611 1172

If we want to modify new_rivers and save that modified version, then we need to reassign new_rivers like so:

new_rivers <- new_rivers + 1
 new_rivers
-
## [1] 315 530 711 451 606
+
## [1]  136  471  408  611 1172

If we forget to reassign, subsequent steps may not work as expected, because we will still be working with the unmodified data.
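A minimal sketch of the difference, using a fixed vector in place of the random sample so the values are reproducible (the name `unchanged` is only for illustration):

```r
# Fixed values stand in for sample(rivers, 5) so the result is reproducible
new_rivers <- c(135, 470, 407, 610, 1171)

# Evaluating the expression computes a result but leaves new_rivers untouched
new_rivers + 1
unchanged <- new_rivers

# Reassigning stores the modified values back into new_rivers
new_rivers <- new_rivers + 1
new_rivers
```

After the first expression `new_rivers` still holds the original values; only the reassignment changes what later steps will see.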


@@ -409,7 +409,7 @@

Error: object ‘X’ not found

Make sure you run something like this, with the <- operator:

rivers2 <- new_rivers + 1
 rivers2
-
## [1] 316 531 712 452 607
+
## [1]  137  472  409  612 1173
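As a rough illustration of where this error comes from, the sketch below uses an object that was never assigned. The names `undefined_rivers` and `defined_rivers` are hypothetical and exist only for this example.

```r
# `undefined_rivers` is a hypothetical name that was never assigned with <-
msg <- tryCatch(
  undefined_rivers + 1,
  error = function(e) conditionMessage(e)
)
msg  # the error message names the missing object

# Assigning with <- first makes the object available to later steps
defined_rivers <- c(136, 471, 408, 611, 1172) + 1
defined_rivers
```

Checking the Environment pane (or running `ls()`) is a quick way to confirm that the object named in the error was actually created.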

diff --git a/index.html b/index.html index 2cc5a6eb..c9e6025e 100644 --- a/index.html +++ b/index.html @@ -351,7 +351,7 @@

Testimonials from our other courses:

Find an Error!?


Feel free to submit typos/errors/etc via the GitHub repository associated with the class: https://github.com/fhdsl/DaSEH

-

This page was last updated on 2024-10-03.

+

This page was last updated on 2024-10-08.

Creative Commons License

diff --git a/modules/Data_Visualization/Data_Visualization.Rmd b/modules/Data_Visualization/Data_Visualization.Rmd index 760b90c7..82ce8c7b 100644 --- a/modules/Data_Visualization/Data_Visualization.Rmd +++ b/modules/Data_Visualization/Data_Visualization.Rmd @@ -14,10 +14,10 @@ opts_chunk$set(echo = TRUE, fig.height = 4, fig.width = 7, comment = "") -library(dasehr) library(tidyverse) library(tidyr) -library(emo) +install.packages('emoji', repos='http://cran.us.r-project.org', dependencies=TRUE) +library(emoji) ``` ## Recap @@ -470,10 +470,26 @@ er_visits_4 %>% ggplot(aes(x = year, - `scale_x_continuous()` and `scale_y_continuous()` can modify the scale of the axes - by default, `ggplot()` removes points with missing values from plots. +## GUT CHECK: If we get an empty plot what might we need to do? + +A. Add a `plot_` layer like `plot_point()` + +B. Add a `geom_` layer like `geom_point()` + + +## GUT CHECK: How do we add more layers in ggplot2 plots? + +A. `%>%` + +B. `&` + +C. `+` + ## Lab 1 -🏠 [Class Website](https://daseh.org/)\ -💻 [Lab](https://daseh.org/modules//Data_Visualization/lab/Data_Visualization_Lab.Rmd) +🏠 [Class Website](https://daseh.org/) +💻 [Lab](https://daseh.org/modules//Data_Visualization/lab/Data_Visualization_Lab.Rmd) +📃[Day 6 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-6.pdf) ## theme() function: @@ -729,7 +745,7 @@ er_bar + ## Tip - Check what you plot {.codesmall} -`r emo::ji("warning")` May not be plotting what you think you are! `r emo::ji("warning")` +`r emoji("warning")` May not be plotting what you think you are! 
`r emoji("warning")` ```{r, fig.width=5 , fig.height=3, fig.align='center'} ggplot(er_visits_4, aes(x = county, @@ -937,6 +953,19 @@ ggplotly(lots_of_lines) Also check out the [`ggiraph` package](https://www.rdocumentation.org/packages/ggiraph/versions/0.6.1) +## `patchwork` package + +Great for combining plots together + +Also check out the [`patchwork` package](https://patchwork.data-imaginist.com/) + +```{r, out.width= "80%", fig.align='center'} +#install.packages("patchwork") +library(patchwork) +lots_of_lines + rp_fac_plot + +``` + # Saving plots ## Saving a ggplot to file @@ -953,6 +982,20 @@ ggsave(filename = "saved_plot.png", # will save in working directory width = 6, height = 3.5) # by default in inches ``` +## GUT CHECK: How do we make sure that the boxplots are filled with color instead of just the outside border? + +A. Use the `fill` argument in the `aes` specification + +B. Use `color` argument in `geom_boxplot()` + +## GUT CHECK: If our plot is too complicated to read, what might be a good option to fix this? + +A. add more `theme()` layers + +B. 
Use `facet_grid()` to split the plot up + + + ## Summary - The `theme()` function helps you specify aspects about your plot @@ -973,6 +1016,8 @@ Check out this [guide](https://jhudatascience.org/tidyversecourse/dataviz.html#m 🏠 [Class Website](https://daseh.org/)\ 💻 [Lab](https://daseh.org/modules//Data_Visualization/lab/Data_Visualization_Lab.Rmd) +📃[Day 6 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-6.pdf) +📃[Posit's theme cheatsheet](https://github.com/claragranell/ggplot2/blob/main/ggplot_theme_system_cheatsheet.pdf) ```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'} knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg")) @@ -1150,15 +1195,4 @@ library(directlabels) direct.label(lots_of_lines, method = list("angled.boxes")) ``` -## `patchwork` package - -Great for combining plots together - -Also check out the [`patchwork` package](https://patchwork.data-imaginist.com/) -```{r, out.width= "50%", fig.align='center'} -#install.packages("patchwork") -library(patchwork) -(plt1 + plt2)/plt2 - -``` diff --git a/modules/Data_Visualization/Data_Visualization.html b/modules/Data_Visualization/Data_Visualization.html index 23e690b4..b0827b9a 100644 --- a/modules/Data_Visualization/Data_Visualization.html +++ b/modules/Data_Visualization/Data_Visualization.html @@ -311,23 +311,25 @@

Data to plot

-

Type ?er_CO_statewide for more information.

+

Let’s plot the CO heat-related ER visits dataset we’ve been working with. First, we’ll only consider data from Boulder county.

Is the data tidy? Is it in long format?

-
er_state <- er_CO_statewide
+
er <- 
+  read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
+er_Boulder <- er %>% filter(county == "Boulder")
 
-head(er_state)
+head(er_Boulder)
-
# A tibble: 6 × 5
-   rate lower95cl upper95cl visits  year
-  <dbl>     <dbl>     <dbl>  <dbl> <dbl>
-1  6.51      5.80      7.23    323  2011
-2  6.58      5.88      7.29    339  2012
-3  5.82      5.16      6.49    302  2013
-4  4.44      3.87      5.01    237  2014
-5  6.55      5.86      7.25    355  2015
-6  8.46      7.68      9.23    467  2016
+
# A tibble: 6 × 6
+  county   rate lower95cl upper95cl visits  year
+  <chr>   <dbl>     <dbl>     <dbl>  <dbl> <dbl>
+1 Boulder  4.03      2.05      6.67     12  2011
+2 Boulder  4.08      2.15      6.62     13  2012
+3 Boulder  3.79      1.90      6.32     12  2013
+4 Boulder  6.29      3.71      9.54     19  2014
+5 Boulder  4.76      2.57      7.61     14  2015
+6 Boulder  5.68      3.31      8.67     18  2016

First plot with ggplot2 package

@@ -349,9 +351,9 @@

ggplot({data_to plot}, aes(x = {var in data to plot}, y = {var in data to plot}))
-
ggplot(er_state, aes(x = year, y = rate))
+
ggplot(er_Boulder, aes(x = year, y = rate))
-

+

Next layer code with ggplot2 package

@@ -389,10 +391,10 @@

y = {var in data to plot})) + geom_{type of plot}</div> -
ggplot(er_state, aes(x = year, y = rate)) +
+
ggplot(er_Boulder, aes(x = year, y = rate)) +
   geom_point()
-

+

Read as: using CO statewide ER heat visits data, and provided aesthetic mapping, add points to the plot

@@ -400,60 +402,63 @@

Having the + sign at the beginning of a line will not work!

-
ggplot(er_state, aes(x = year,
+
ggplot(er_Boulder, aes(x = year,
                            y = rate,
                            fill = item_categ))  
  + geom_boxplot()

Pipes will also not work in place of +!

-
ggplot(er_state, aes(x = year,
+
ggplot(er_Boulder, aes(x = year,
                            y = rate,
                            fill = item_categ))  %>%
 geom_boxplot()
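For contrast, here is a sketch of forms that should work: the `+` goes at the end of the line, and `%>%` is used only to pass data into `ggplot()`. The data frame `toy_er` and its values are made up for illustration and are not the course data.

```r
library(ggplot2)
library(dplyr)

# Hypothetical data, invented for this sketch
toy_er <- data.frame(
  year = c(2011, 2012, 2013, 2014),
  rate = c(4.0, 4.1, 3.8, 6.3)
)

# `+` ends the line, so R knows the expression continues on the next line
p_plus <- ggplot(toy_er, aes(x = year, y = rate)) +
  geom_point()

# `%>%` can feed data into ggplot(), but layers are still added with `+`
p_pipe <- toy_er %>%
  ggplot(aes(x = year, y = rate)) +
  geom_point()
```

Both objects describe the same plot; the pipe only changes how the data reaches `ggplot()`.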
-

Plots can be assigned as an object

+

Plots can be assigned as an object

-
plt1 <- ggplot(er_state, aes(x = year, y = rate)) +
+
plt1 <- ggplot(er_Boulder, aes(x = year, y = rate)) +
           geom_point()
 
 plt1
-

+

Examples of different geoms

-
plt1 <- ggplot(er_state, aes(x = year, y = rate)) +
+
plt1 <- ggplot(er_Boulder, aes(x = year, y = rate)) +
           geom_point()
 
-plt2 <- ggplot(er_state, aes(x = year, y = rate)) +
+plt2 <- ggplot(er_Boulder, aes(x = year, y = rate)) +
           geom_line()
 
 plt1 # fig.show = "hold" makes plots appear
 plt2 # next to one another in the chunk settings
-

+

Specifying plot layers: combining multiple layers

Layer a plot on top of another plot with +

-
ggplot(er_state, aes(x = year, y = rate)) +
+
ggplot(er_Boulder, aes(x = year, y = rate)) +
   geom_point() +
   geom_line()
-

+

Adding color - can map color to a variable

-
er_visits_4 <- er_CO_county %>%
+

Let’s map ER visit rates for four CO counties on the same plot

+ +
set.seed(123)
+er_visits_4 <- er %>% 
   filter(county %in% c("Denver", "Weld", "Pueblo", "Jackson"))
 
 ggplot(er_visits_4, aes(x = year, y = rate, color = county)) +
   geom_point() +
   geom_line()
-

+

@@ -475,18 +480,18 @@

-

Customize the look of the plot

+

Customize the look of the plot

You can change the look of the whole plot using theme_*() functions.

There are also size, color, alpha, and linetype arguments.

-
ggplot(er_state, aes(x = year, y = rate)) +
+
ggplot(er_Boulder, aes(x = year, y = rate)) +
   geom_point(size = 5, color = "green", alpha = 0.5) +
   geom_line(size = 0.8, color = "blue", linetype = 2) +
   theme_dark()
-

+

More themes!

@@ -510,18 +515,18 @@

-

Adding labels

+

Adding labels

The labs() function can help you add or modify titles on your plot. The title argument specifies the title. The x argument specifies the x axis label. The y argument specifies the y axis label.

-
ggplot(er_state, aes(x = year, y = rate)) +
+
ggplot(er_Boulder, aes(x = year, y = rate)) +
             geom_point(size = 5, color = "red", alpha = 0.5) +
             geom_line(size = 0.8, color = "brown", linetype = 2) +
-            labs(title = "My plot of Heat-Related ER Visits in CO",
+            labs(title = "Heat-Related ER Visits:Boulder",
               x = "Year",
               y = "Age-adjusted Visit Rate")
-

+

@@ -543,21 +548,17 @@

-

Changing axis: specifying axis scale

+

Changing axis: specifying axis scale

scale_x_continuous() and scale_y_continuous() can change how the axis is plotted. You can use the breaks argument to specify where the axis ticks should go.

-
range(pull(er_visits_4, year))
- -
[1] 2011 2022
- -
plot_scale <- ggplot(er_state, aes(x = year, y = rate)) +
-                geom_point(size = 5, color = "green", alpha = 0.5) +
+
plot_scale <- ggplot(er_Boulder, aes(x = year, y = rate)) + 
+                geom_point(size = 5, color = "green", alpha = 0.5) + 
                 geom_line(size = 0.8, color = "blue", linetype = 2) +
                 scale_x_continuous(breaks = seq(from = 2011, to = 2022, by = 1))
 plot_scale
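One possible way to avoid typing the years by hand is to derive the breaks from the data. This is only a sketch: `toy_er`, `year_range`, `year_breaks`, and `toy_scale` are hypothetical names, and the rates are invented.

```r
library(ggplot2)

# Hypothetical yearly rates, invented for this sketch
toy_er <- data.frame(
  year = 2011:2022,
  rate = c(4.0, 4.1, 3.8, 6.3, 4.8, 5.7, 5.1, 6.0, 5.5, 4.9, 6.4, 6.1)
)

# Derive the breaks from the range of the data
year_range <- range(toy_er$year)
year_breaks <- seq(from = year_range[1], to = year_range[2], by = 1)
year_breaks

# One tick per year on the x axis
toy_scale <- ggplot(toy_er, aes(x = year, y = rate)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = year_breaks)
```

Changing `by = 1` to `by = 2` would label every other year, which can help when the tick labels start to overlap.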
-

+

@@ -581,17 +582,17 @@

-

Modifying plot objects

+

Modifying plot objects

You can add to a plot object to make changes! Note that we can save our plots as an object like plt1 below. And now if we reference plt1 again our plot will print out!

-
plt1 <- ggplot(er_state, aes(x = year, y = rate,)) +
-            geom_point(size = 5, color = "green", alpha = 0.5) +geom_line(size = 0.8, color = "blue", linetype = 2) +
-  labs(title = "My plot of Heat-Related ER Visits in CO", x = "Year", y = "Age-adjusted Visit Rate")
+
plt1 <- ggplot(er_Boulder, aes(x = year, y = rate,)) +
+            geom_point(size = 5, color = "green", alpha = 0.5) +geom_line(size = 0.8, color = "blue", linetype = 2) + 
+  labs(title = "Heat-Related ER Visits:Boulder", x = "Year", y = "Age-adjusted Visit Rate")
 
 plt1 + theme_minimal()
-

+

Removing the legend label

@@ -603,7 +604,7 @@

geom_line(size = 0.8) + theme(legend.position = "none") -

+

Overwriting specifications

@@ -614,7 +615,7 @@

color = county)) + geom_line(size = 0.8) -

+

Overwriting specifications

@@ -625,7 +626,7 @@

color = county)) + geom_line(size = 0.8, color = "black") -

+

Summary

@@ -640,21 +641,35 @@

  • by default, ggplot() removes points with missing values from plots.
  • +

    GUT CHECK: If we get an empty plot what might we need to do?

    + +

    A. Add a plot_ layer like plot_point()

    + +

    B. Add a geom_ layer like geom_point()

    + +

    GUT CHECK: How do we add more layers in ggplot2 plots?

    + +

    A. %>%

    + +

    B. &

    + +

    C. +

    +

    Lab 1

    theme() function:

    The theme() function can help you modify various elements of your plot. Here we will adjust the font size of the plot title.

    -
    ggplot(er_state, aes(x = year, y = rate)) +
    +
    ggplot(er_Boulder, aes(x = year, y = rate)) +
                 geom_point(size = 5, color = "green", alpha = 0.5) +
    -            geom_line(size = 0.8, color = "blue", linetype = 2) +
    -  labs(title = "My plot of Heat-Related ER Visits in CO") +
    +            geom_line(size = 0.8, color = "blue", linetype = 2) + 
    +  labs(title = "Heat-Related ER Visits:Boulder") +
       theme(plot.title = element_text(size = 20))
    -

    +

    theme() function

    @@ -677,24 +692,25 @@

    The theme() function can help you modify various elements of your plot. Here we will adjust the horizontal justification (hjust) of the plot title.

    -
    ggplot(er_state, aes(x = year, y = rate)) +
    +
    ggplot(er_Boulder, aes(x = year, y = rate)) +
                 geom_point(size = 5, color = "green", alpha = 0.5) +
    -            geom_line(size = 0.8, color = "blue", linetype = 2) +
    -  labs(title = "My plot of Heat-Related ER Visits in CO") +
    +            geom_line(size = 0.8, color = "blue", linetype = 2) + 
    +  labs(title = "Heat-Related ER Visits:Boulder") +
    +
       theme(plot.title = element_text(hjust = 0.5, size = 20))
    -

    +

    theme() function: change title and axis format

    -
    ggplot(er_state, aes(x = year, y = rate)) +
    +
    ggplot(er_Boulder, aes(x = year, y = rate)) +
                 geom_point(size = 5, color = "green", alpha = 0.5) +
    -            geom_line(size = 0.8, color = "blue", linetype = 2) +
    -  labs(title = "My plot of Heat-Related ER Visits in CO") +
    +            geom_line(size = 0.8, color = "blue", linetype = 2) + 
    +  labs(title = "Heat-Related ER Visits: Boulder") +
       theme(plot.title = element_text(hjust = 0.5, size = 20),
             axis.title = element_text(size = 16))
    -

    +

    @@ -804,7 +820,7 @@

    y = rate)) + geom_line()
    -

    +

    If it looks confusing to you, try again

    @@ -819,7 +835,7 @@

    group = county)) + geom_line() -

    +

    Adding color will automatically group the data

    @@ -829,75 +845,72 @@

    geom_line()+ theme(legend.position = "bottom") -

    +

    Tips!

    Let’s talk additional tricks and tips for making ggplots!

    -

    We are going to use some other data about ER visits that has to do with gender. Note that gender was recorded as binary, which we know isn’t really accurate. This is something you might encounter. Please see this article about ways to measure gender in a more inclusive way: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6526522/.

    - -

    Tips - Color vs Fill

    +

    Tips - Color vs Fill

    • color is needed for points and lines
    • fill is generally needed for boxes and bars
    -
    er_visits_gender <- CO_heat_ER_bygender
    -ggplot(er_visits_gender, aes(x = gender,
    -                 y = rate,
    -                 color = gender)) + #color creates an outline
    +
    ggplot(er_visits_4, aes(x = county,
    +                 y = visits,
    +                 color = county)) + #color creates an outline
       geom_boxplot()
     
    -ggplot(er_visits_gender, aes(x = gender,
    +ggplot(er_visits_4, aes(x = county,
                      y = rate,
    -                 fill = gender)) + # fills the boxplot
    +                 fill = county)) + # fills the boxplot
       geom_boxplot()
    -

    +

    Tip - Good idea to add jitter layer to top of box plots

    You can add the width argument to make the jitter narrower.

    -
    ggplot(er_visits_gender, aes(x = gender,
    +
    ggplot(er_visits_4, aes(x = county,
                      y = rate,
    -                 fill = gender)) +
    +                 fill = county)) +
       geom_boxplot() +
       geom_jitter(width = .06)
    -

    +

    Tip - be careful about colors for color vision deficiency

    scale_fill_viridis_d() for discrete/categorical data; scale_fill_viridis_c() for continuous data

    -
    ggplot(er_visits_gender, aes(x = gender,
    +
    ggplot(er_visits_4, aes(x = county,
                      y = rate,
    -                 fill = gender)) +
    +                 fill = county)) +
       geom_boxplot() +
       geom_jitter(width = .06) +
       scale_fill_viridis_d()
    -

    +

    Tip - can pipe data after wrangling into ggplot()

    -
    er_bar <- er_visits_gender %>%
    -  group_by(gender) %>%
    +
    er_bar <- er_visits_4 %>%
    +  group_by(county) %>%
       summarize("max_rate" = max(rate, na.rm=T)) %>%
     
    -ggplot(aes(x = gender,
    +ggplot(aes(x = county,
                y = max_rate,
    -           fill = gender)) +
    +           fill = county)) +
       scale_fill_viridis_d()+
       geom_col() +
       theme(legend.position = "none")
     
     er_bar
    -

    +

    Tip - color outside of aes()

    @@ -906,91 +919,76 @@

    er_bar +
        geom_col(color = "black")
    -

    +

    -

    Tip - col vs bar

    +

    Tip - col vs bar

    geom_bar(x = ) can only use one aes mapping; geom_col(x = , y = ) can have two.
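A small sketch of the difference, assuming a made-up data frame `toy_visits` (not the course data): `geom_bar()` counts rows, while `geom_col()` draws the `y` values it is given.

```r
library(ggplot2)

# Hypothetical data, invented for this sketch
toy_visits <- data.frame(
  county = c("A", "A", "B"),
  visits = c(10, 20, 5)
)

# geom_bar() takes only x and counts the rows in each group
bar_plot <- ggplot(toy_visits, aes(x = county)) +
  geom_bar()

# geom_col() takes x and y and draws the y values themselves
col_plot <- ggplot(toy_visits, aes(x = county, y = visits)) +
  geom_col()

# Row counts per county, as computed by geom_bar()
layer_data(bar_plot)$count
```

Here the bar for county "A" in `bar_plot` reflects two rows, not the number of visits.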

    -

    Tip - Check what you plot

    +

    Tip - Check what you plot

    ⚠️ May not be plotting what you think you are! ⚠️

    -
    ggplot(er_visits_gender, aes(x = gender,
    +
    ggplot(er_visits_4, aes(x = county,
                      y = visits,
    -                 fill = gender)) +
    +                 fill = county)) +
       geom_col()
    -

    +

    -

    What did we plot? Always good to check it is correct!

    +

    What did we plot? Always good to check it is correct!

    -
    head(er_visits_gender, n = 3)
    +
    head(er_visits_4, n = 3)
    -
    # A tibble: 3 × 7
    -  county  rate lower95cl upper95cl visits  year gender
    -  <chr>  <dbl>     <dbl>     <dbl>  <dbl> <dbl> <chr> 
    -1 Adams   7.60      4.38     11.7      17  2011 Female
    -2 Adams  NA        NA        NA        NA  2012 Female
    -3 Adams   6.22      3.37      9.93     14  2013 Female
    +
    # A tibble: 3 × 6
    +  county  rate lower95cl upper95cl visits  year
    +  <chr>  <dbl>     <dbl>     <dbl>  <dbl> <dbl>
    +1 Denver  7.11      4.89      9.34     42  2011
    +2 Denver  6.79      4.62      8.97     40  2012
    +3 Denver  2.95      1.75      4.46     19  2013
    -
    er_visits_gender %>% group_by(gender) %>%
    +
    er_visits_4 %>% group_by(county) %>%
       summarize(sum = sum(visits, na.rm=T))
    -
    # A tibble: 2 × 2
    -  gender   sum
    -  <chr>  <dbl>
    -1 Female  2556
    -2 Male    4331
    +
    # A tibble: 4 × 2
    +  county    sum
    +  <chr>   <dbl>
    +1 Denver    402
    +2 Jackson     0
    +3 Pueblo    336
    +4 Weld      324
    -

    Try that again

    +

    Try that again

    -
    er_visits_gender %>% group_by(gender, county) %>%
    +
    er_visits_4 %>% group_by(county) %>%
       summarize(mean_visits = mean(visits, na.rm=T))
    -
    # A tibble: 20 × 3
    -# Groups:   gender [2]
    -   gender county    mean_visits
    -   <chr>  <chr>           <dbl>
    - 1 Female Adams            15.8
    - 2 Female Arapahoe         14.4
    - 3 Female Cheyenne          0  
    - 4 Female Denver           14.4
    - 5 Female El Paso          15.3
    - 6 Female Jefferson        14.1
    - 7 Female Larimer          13.5
    - 8 Female Pueblo           12.7
    - 9 Female Statewide       142. 
    -10 Female Weld             15  
    -11 Male   Adams            18.9
    -12 Male   Arapahoe         17.3
    -13 Male   Cheyenne          0  
    -14 Male   Denver           22.5
    -15 Male   El Paso          23.1
    -16 Male   Jefferson        16.3
    -17 Male   Larimer          20.7
    -18 Male   Pueblo           17.1
    -19 Male   Statewide       225. 
    -20 Male   Weld             17.5
    - -

    Try that again

    - -
    er_visits_gender %>% group_by(gender, county) %>%
    +
    # A tibble: 4 × 2
    +  county  mean_visits
    +  <chr>         <dbl>
    +1 Denver         33.5
    +2 Jackson         0  
    +3 Pueblo         28  
    +4 Weld           27  
    + +

    Try that again

    + +
    er_visits_4 %>% group_by(county) %>%
       summarize(mean_visits = mean(visits, na.rm=T)) %>%
     
    -ggplot(aes(x = gender,
    +ggplot(aes(x = county,
                y = mean_visits,
    -           fill = gender)) +
    +           fill = county)) +
       geom_col()
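The reason the first attempt is misleading can be sketched with toy numbers: when there are several rows per group, `geom_col()` stacks them, so the bar height is the sum. The data frame `toy_visits` and the object names below are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Hypothetical data with several rows (years) per county
toy_visits <- data.frame(
  county = c("A", "A", "A", "B", "B"),
  visits = c(10, 20, 30, 4, 6)
)

# geom_col() stacks every row, so each bar's height is the sum per county
stacked <- ggplot(toy_visits, aes(x = county, y = visits)) +
  geom_col()

# Summarizing first leaves one row per county, so the bar shows the mean
toy_means <- toy_visits %>%
  group_by(county) %>%
  summarize(mean_visits = mean(visits, na.rm = TRUE))

mean_bars <- ggplot(toy_means, aes(x = county, y = mean_visits)) +
  geom_col()

toy_means
```

Comparing the summarized table with the plot is a quick check that the bars show the statistic you intended.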
    -

    +

    Tip - make sure labels aren’t too small

    er_bar +
       theme(text = element_text(size = 20))
    -

    +

    @@ -1052,11 +1050,13 @@

    -

    Sometimes we have many lines and it is hard to see what is happening

    +

    Sometimes we have many lines and it is hard to see what is happening

    + +

    Let’s look at visit rates for 9 CO counties.

    -
    er_visits_9 <- er_CO_county %>%
    -  filter(county %in% c("Denver", "Weld", "Pueblo", "Jackson",
    -                       "San Juan", "Mesa", "Jefferson", "Larimer", "Statewide"))
    +
    er_visits_9 <- er %>% 
    +  filter(county %in% c("Denver", "Weld", "Pueblo", "Jackson", 
    +                       "San Juan", "Mesa", "Jefferson", "Larimer", "Boulder"))
     
     lots_of_lines <- ggplot(er_visits_9, aes(x = year,
                      y = rate,
    @@ -1064,9 +1064,9 @@ 

    geom_line() lots_of_lines
    -

    +

    -

    Adding a facet can help make it easier to see what is happening

    +

    Adding a facet can help make it easier to see what is happening

    Sometimes we have too many lines and it can get difficult to see what is happening; facets can help!

    @@ -1085,9 +1085,9 @@

    theme(legend.position = "none") + theme(axis.text.x = element_text(angle = 90)) -

    +

    -

    facet_wrap()

    +

    facet_wrap()

    • more flexible - arguments ncol and nrow can specify layout
    • @@ -1100,7 +1100,7 @@

      rp_fac_plot
      -

      +

      @@ -1160,11 +1160,23 @@

      library(plotly) # creates interactive plots! ggplotly(lots_of_lines) -
      - +
      +

      Also check out the ggiraph package

      +

    patchwork package

    + +

    Great for combining plots together

    + +

    Also check out the patchwork package

    + +
    #install.packages("patchwork")
    +library(patchwork)
    +lots_of_lines + rp_fac_plot
    + +

    +

    Saving plots

    Saving a ggplot to file

    @@ -1181,6 +1193,18 @@

    plot = rp_fac_plot, width = 6, height = 3.5) # by default in inches +

    GUT CHECK: How do we make sure that the boxplots are filled with color instead of just the outside border?

    + +

    A. Use the fill argument in the aes specification

    + +

    B. Use color argument in geom_boxplot()

    + +

    GUT CHECK: If our plot is too complicated to read, what might be a good option to fix this?

    + +

    A. add more theme() layers

    + +

    B. Use facet_grid() to split the plot up

    +

    Summary

      @@ -1207,7 +1231,7 @@

    Lab 2

    -

    🏠 Class Website
    💻 Lab

    +

    🏠 Class Website
    💻 Lab 📃Day 6 Cheatsheet 📃Posit’s theme cheatsheet

    The End

    @@ -1219,11 +1243,11 @@

    You can change the look of each layer separately. Note the arguments like alpha and linetype, which change the opacity of the points and the style of the line, respectively.

    -
    ggplot(er_state, aes(x = year, y = rate)) +
    +
    ggplot(er_Boulder, aes(x = year, y = rate)) +
       geom_point(size = 5, color = "red", alpha = 0.5) +
       geom_line(size = 0.8, color = "black", linetype = 2)
    -

    +

    linetype can be given as a number. See the docs for what numbers correspond to what linetype!

    @@ -1231,42 +1255,42 @@

    You can change the look of the whole plot, or of specific elements such as the font and font size, or even use more fonts

    -
    ggplot(er_state, aes(x = year, y = rate)) +
    +
    ggplot(er_Boulder, aes(x = year, y = rate)) +
       geom_point(size = 5, color = "green", alpha = 0.5) +
       geom_line(size = 0.8, color = "blue", linetype = 2) +
       theme_bw() +
       theme(text=element_text(size=16,  family="Comic Sans MS"))
    -

    +

    -

    Adding labels line break

    +

    Adding labels line break

    Line breaks can be specified using \n within the labs() function to have a label with multiple lines.

    -
    ggplot(er_state, aes(x = year, y = rate)) +
    +
    ggplot(er_Boulder, aes(x = year, y = rate)) +
                 geom_point(size = 5, color = "red", alpha = 0.5) +
                 geom_line(size = 0.8, color = "brown", linetype = 2) +
    -  labs(title = "My plot of Heat-Related ER Visits in CO: \n age-adjusted visit rate by year",
    +  labs(title = "Heat-Related ER Visits in Boulder Co.: \n age-adjusted visit rate by year",
                   x = "Year",
                   y = "Age-adjusted Visit Rate")
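To see where the break will fall before plotting, the title string can be inspected on its own. This is a sketch with invented data; `toy_er`, `two_line_title`, and `toy_labs` are hypothetical names.

```r
library(ggplot2)

# Hypothetical data, invented for this sketch
toy_er <- data.frame(year = c(2011, 2012, 2013), rate = c(4.0, 4.1, 3.8))

# "\n" inside the string starts a new line in the rendered title
two_line_title <- "Heat-Related ER Visits:\nage-adjusted visit rate by year"

# cat() prints the string with the line break applied
cat(two_line_title)

toy_labs <- ggplot(toy_er, aes(x = year, y = rate)) +
  geom_point() +
  labs(title = two_line_title, x = "Year", y = "Age-adjusted Visit Rate")
```

The same `\n` approach should also work for the `x` and `y` labels when an axis title is too long for one line.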
    -

    +

    Changing axis: specifying axis limits

    xlim() and ylim() can specify the limits for each axis

    -
    ggplot(er_state, aes(x = year, y = rate)) +
    +
    ggplot(er_Boulder, aes(x = year, y = rate)) +
                 geom_point(size = 5, color = "green", alpha = 0.5) +
                 geom_line(size = 0.8, color = "blue", linetype = 2) +
    -  labs(title = "My plot of Heat-Related ER Visits in CO",
    +  labs(title = "Heat-Related ER Visits in Boulder Co.",
                   x = "Year",
                   y = "Age-adjusted Visit Rate") +
       ylim(0, max(pull(er_visits_4, rate)))
    -

    +

    -

    theme() function: moving (or removing) legend

    +

    theme() function: moving (or removing) legend

    If specifying position - use: “top”, “bottom”, “right”, “left”, “none”

    @@ -1277,7 +1301,7 @@

    geom_line() + theme(legend.position = "bottom") -

    +

    Keys for specifications

    @@ -1294,11 +1318,11 @@

  • these include values that are strings like “blank”, “solid”, “dashed”, “dotdash”, “longdash”, and “twodash”
  • -
    er_state %>% ggplot(aes(x = year,
    +
    er_Boulder %>% ggplot(aes(x = year,
                           y = rate)) +
       geom_line(size = 0.8, linetype = "twodash")
    -

    +

    Keys for specifications

    @@ -1315,11 +1339,11 @@

  • these include numeric values (don’t need quotes for these) and some character values (need quotes for these)
  • -
    er_state %>% ggplot(aes(x = year,
    +
    er_Boulder %>% ggplot(aes(x = year,
                           y = rate)) +
       geom_point(size = 2, shape = 12)
    -

    +

    Can make your own theme to use on plots!

    @@ -1331,16 +1355,16 @@

    However, it can be helpful if your plot is getting stretched to accommodate plotting an outlier. You can always say in the figure legend what you removed.

    -
    er_no_out1 <- ggplot(er_visits_gender, aes(y = visits, x = gender)) +
    +
    er_no_out1 <- ggplot(er_visits_4, aes(y = visits, x = county)) +
       geom_boxplot()
     
    -er_no_out2 <- ggplot(er_visits_gender, aes(y = visits, x = gender)) +
    +er_no_out2 <- ggplot(er_visits_4, aes(y = visits, x = county)) +
       geom_boxplot(outlier.shape = NA) +
       ylim(0,40)
    -

    +

    -

    Tip - NA Values

    +

    Tip - NA Values

    • if it is a numeric value it will just get dropped from the graph and you will see a warning
    • @@ -1357,7 +1381,7 @@

      ggplot( aes(x = flavor)) + geom_bar() + theme(text=element_text(size=24)) -

      +

    Extensions

    @@ -1369,19 +1393,7 @@

    library(directlabels) direct.label(lots_of_lines, method = list("angled.boxes")) -

    - -

    patchwork package

    - -

    Great for combining plots together

    - -

    Also check out the patchwork package

    - -
    #install.packages("patchwork")
    -library(patchwork)
    -(plt1 + plt2)/plt2
    - -

    +

    diff --git a/modules/Data_Visualization/Data_Visualization.pdf b/modules/Data_Visualization/Data_Visualization.pdf index 52a23517..03a127b4 100644 Binary files a/modules/Data_Visualization/Data_Visualization.pdf and b/modules/Data_Visualization/Data_Visualization.pdf differ diff --git a/modules/Data_Visualization/lab/Data_Visualization_Lab.Rmd b/modules/Data_Visualization/lab/Data_Visualization_Lab.Rmd index 6ab57c68..39d8aad7 100644 --- a/modules/Data_Visualization/lab/Data_Visualization_Lab.Rmd +++ b/modules/Data_Visualization/lab/Data_Visualization_Lab.Rmd @@ -17,30 +17,24 @@ Load the libraries library(readr) library(ggplot2) library(dplyr) -library(dasehr) ``` -Open the Nitrate exposure via WA public waterways data from the `dasehr` package. - -(You can also access it at the link www.daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv) - -Then, use the provided code to compute a data frame `nitrate` with aggregate summary of exposure level: average exposed population (`pop_exposed_to_exceedances`) for each year (`year`). +Load the CalEnviroScreen data from the link www.daseh.org/data/CalEnviroScreen_data.csv and subset it so that you only have data from Fresno, Merced, Placer, Sonoma, and Yolo counties. ```{r} - -nitrate_agg <- nitrate %>% - group_by(year) %>% - summarise(exposed_pop_avg = mean(pop_exposed_to_exceedances)) - -nitrate_agg +ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") +ces_sub <- ces %>% filter(CaliforniaCounty %in% c("Fresno", "Merced", "Placer", "Sonoma", "Yolo")) ``` ### 1.1 -Use `ggplot2` package make plot of average exposed population (`exposed_pop_avg`; y-axis) for each year (`year`; x-axis). You can use lines layer (`+ geom_line()`) or points layer (`+ geom_point()`), or both! +Use the `ggplot2` package to make a plot of how diesel particulate concentration (`DieselPM`; y-axis) is associated with traffic density values (`Traffic`; x-axis). 
You can use lines layer (`+ geom_line()`) or points layer (`+ geom_point()`), or both! Assign the plot to variable `my_plot`. Type `my_plot` in the console to have it displayed. +`DieselPM`: Diesel PM emissions from on-road and non-road sources +`Traffic`: Traffic density in vehicle-kilometers per hour per road length, within 150 meters of the census tract boundary + ``` # General format ggplot(???, aes(x = ???, y = ???)) + @@ -62,7 +56,8 @@ ggplot(???, aes(x = ???, y = ???)) + ### 1.3 -Use the `scale_x_continuous()` function to plot the x axis with the following breaks `c(1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019)`. +Use the `scale_x_continuous()` function to plot the x axis with the following breaks `c(250, 750, 1250, 1750, 2250)`. + ``` # General format @@ -92,7 +87,10 @@ my_plot + theme_bw() ### P.1 -Create a boxplot (with the `geom_boxplot()` function) using the `nitrate` data, where `quarter` is plotted on the x axis and `pop_on_sampled_PWS` is plotted on the y axis. +Create a boxplot (with the `geom_boxplot()` function) using the `ces_sub` data, where `CaliforniaCounty` is plotted on the x axis and `DrinkingWater` is plotted on the y axis. + +`DrinkingWater`: Drinking water contaminant index for selected contaminants. A higher value means drinking water contains a greater volume of contaminants. + ```{r P1response} @@ -102,21 +100,10 @@ Create a boxplot (with the `geom_boxplot()` function) using the `nitrate` data, # Part 2 ### 2.1 +Let's look at the plot of traffic density and diesel particulate matter again, -Use the provided code to compute a data frame `nitrate_agg_2` with aggregate summary of WA Nitrate data: population exposed to less than 10 ug/L of nitrate in the water (sum of `pop_0-3ug/L`, `pop_>3-5ug/L`, and `pop_>5-10ug/L`) -- separately for each year (`year`) and for each quarter (`quarter`. 
- -```{r} - -nitrate_agg_2 <- nitrate %>% - group_by(year, quarter) %>% - summarise(pop_less_than_10ug_perL = sum(`pop_0-3ug/L`, `pop_>3-5ug/L`, `pop_>5-10ug/L`)) - -nitrate_agg_2 -``` - -### 2.2 +Use `ggplot2` package make plot of how diesel particulate concentration (`DieselPM`; y-axis) is associated with traffic density values (`Traffic`; x-axis), where each county (`CaliforniaCounty`) has a different color (hint: use `color = type` in mapping). -Use `ggplot2` package to make a plot showing trajectories of total population exposed to less than 10 ug/L of nitrate (`pop_less_than_10ug_perL`; y-axis) over year (`year`; x-axis), where each quarter type has a different color (hint: use `color = type` in mapping). ``` # General format @@ -129,25 +116,26 @@ ggplot(???, aes( geom_point() ``` -```{r 2.2response} +```{r 2.1response} ``` -### 2.3 +### 2.2 + +Redo the above plot by adding a faceting (`+ facet_wrap( ~ CaliforniaCounty, ncol = 3)`) to have data for quarter in a separate plot panel. -Redo the above plot by adding a faceting (`+ facet_wrap( ~ quarter, ncol = 2)`) to have data for quarter in a separate plot panel. Assign the new plot as an object called `facet_plot`. -```{r 2.3response} +```{r 2.2response} ``` -### 2.4 +### 2.3 Observe what happens when you remove either `geom_line()` OR `geom_point()` from one of your plots above. -```{r 2.4response} +```{r 2.3response} ``` @@ -156,7 +144,8 @@ Observe what happens when you remove either `geom_line()` OR `geom_point()` from ### P.2 -Modify `facet_plot` to remove the legend (hint use `theme()` and the `legend.position` argument) and change the names of the axis titles to be "Population exposed to less than 10 ug/L of nitrate in water" for the y axis and "Year" for the x axis. +Modify `facet_plot` to remove the legend (hint use `theme()` and the `legend.position` argument) and change the names of the axis titles to be "Diesel particulate matter" for the y axis and "Traffic density" for the x axis. 
+ ```{r P.2response} diff --git a/modules/Data_Visualization/lab/Data_Visualization_Lab.html b/modules/Data_Visualization/lab/Data_Visualization_Lab.html index 110516d4..7b9b76af 100644 --- a/modules/Data_Visualization/lab/Data_Visualization_Lab.html +++ b/modules/Data_Visualization/lab/Data_Visualization_Lab.html @@ -170,34 +170,23 @@

    Part 1

    Load the libraries

    library(readr)
     library(ggplot2)
    -library(dplyr)
    -library(dasehr)
    -

    Open the Nitrate exposure via WA public waterways data from the dasehr package.

    -

    (You can also access it at the link www.daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv)

    -

    Then, use the provided code to compute a data frame nitrate with aggregate summary of exposure level: average exposed population (pop_exposed_to_exceedances) for each year (year).

    -
    nitrate_agg <- nitrate %>%
    -  group_by(year) %>%
    -  summarise(exposed_pop_avg = mean(pop_exposed_to_exceedances))
    -
    -nitrate_agg
    -
    ## # A tibble: 22 × 2
    -##     year exposed_pop_avg
    -##    <dbl>           <dbl>
    -##  1  1999            15  
    -##  2  2000             0  
    -##  3  2001            76.8
    -##  4  2002          3976. 
    -##  5  2003          3636. 
    -##  6  2004          3138. 
    -##  7  2005          2968. 
    -##  8  2006          2618. 
    -##  9  2007          1916. 
    -## 10  2008          1728. 
    -## # ℹ 12 more rows
    +library(dplyr) +

    Load the CalEnviroScreen data from the link www.daseh.org/data/CalEnviroScreen_data.csv) and subset it so that you only have data from Fresno, Merced, Placer, Sonoma, and Yolo counties.

    +
    ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
    +
    ## Rows: 8035 Columns: 67
    +## ── Column specification ────────────────────────────────────────────────────────
    +## Delimiter: ","
    +## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
    +## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
    +## 
    +## ℹ Use `spec()` to retrieve the full column specification for this data.
    +## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    +
    ces_sub <- ces %>% filter(CaliforniaCounty == c("Fresno", "Merced", "Placer", "Sonoma", "Yolo"))
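The subsetting line above compares `CaliforniaCounty` to a five-element vector with `==`. As a hedged aside: R recycles the shorter vector position by position in that comparison, so a membership test with `%in%` is the usual way to keep every row from a set of values. The sketch below uses a made-up `county` vector (an assumption for illustration, not the real CalEnviroScreen column) to show the difference in base R.

```r
# Toy data (hypothetical values, not taken from the CalEnviroScreen file)
county <- c("Yolo", "Fresno", "Merced", "Fresno", "Alameda",
            "Fresno", "Yolo", "Sonoma", "Placer", "Merced")
keep <- c("Fresno", "Merced", "Placer", "Sonoma", "Yolo")

# `==` lines `keep` up against `county` element by element (recycling `keep`),
# so a row matches only if it happens to sit at the "right" position
eq_match <- county == keep

# `%in%` asks, for each element of `county`, whether it appears anywhere in `keep`
in_match <- county %in% keep

sum(eq_match)  # rows kept by the positional comparison
sum(in_match)  # rows kept by the membership test
```

With `dplyr`, the same idea would read `filter(CaliforniaCounty %in% keep)`; whether the lab intends that form is not stated in this diff.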

    1.1

    -

    Use ggplot2 package make plot of average exposed population (exposed_pop_avg; y-axis) for each year (year; x-axis). You can use lines layer (+ geom_line()) or points layer (+ geom_point()), or both!

    +

    Use the ggplot2 package to make a plot of how diesel particulate concentration (DieselPM; y-axis) is associated with traffic density values (Traffic; x-axis). You can use lines layer (+ geom_line()) or points layer (+ geom_point()), or both!

    Assign the plot to variable my_plot. Type my_plot in the console to have it displayed.

    +

    DieselPM: Diesel PM emissions from on-road and non-road sources Traffic: Traffic density in vehicle-kilometers per hour per road length, within 150 meters of the census tract boundary

    # General format
     ggplot(???, aes(x = ???, y = ???)) +
       ??? +
    @@ -209,7 +198,7 @@ 

    1.2

    1.3

    -

    Use the scale_x_continuous() function to plot the x axis with the following breaks c(1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019).

    +

    Use the scale_x_continuous() function to plot the x axis with the following breaks c(250, 750, 1250, 1750, 2250).

    # General format
     my_plot <- my_plot +
       scale_x_continuous(breaks = ???)
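The `breaks` argument takes a plain numeric vector, so evenly spaced breaks can be typed out or generated. A minimal sketch, assuming the same spacing as the breaks requested in 1.3 (the name `x_breaks` is introduced here for illustration only):

```r
# Evenly spaced breaks: start at 250, stop at 2250, step by 500
x_breaks <- seq(250, 2250, by = 500)
x_breaks

# Same values written out by hand
manual_breaks <- c(250, 750, 1250, 1750, 2250)
```

Either vector could then be passed as `scale_x_continuous(breaks = x_breaks)`.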
    @@ -225,39 +214,16 @@

    1.4

    Practice on Your Own!

    P.1

    -

    Create a boxplot (with the geom_boxplot() function) using the nitrate data, where quarter is plotted on the x axis and pop_on_sampled_PWS is plotted on the y axis.

    +

    Create a boxplot (with the geom_boxplot() function) using the ces_sub data, where CaliforniaCounty is plotted on the x axis and DrinkingWater is plotted on the y axis.

    +

    DrinkingWater: Drinking water contaminant index for selected contaminants. A higher value means drinking water contains a greater volume of contaminants.

    Part 2

    2.1

    -

    Use the provided code to compute a data frame nitrate_agg_2 with aggregate summary of WA Nitrate data: population exposed to less than 10 ug/L of nitrate in the water (sum of pop_0-3ug/L, pop_>3-5ug/L, and pop_>5-10ug/L) – separately for each year (year) and for each quarter (quarter.

    -
    nitrate_agg_2 <- nitrate %>%
    -  group_by(year, quarter) %>%
    -  summarise(pop_less_than_10ug_perL = sum(`pop_0-3ug/L`, `pop_>3-5ug/L`, `pop_>5-10ug/L`))
    -
    ## `summarise()` has grouped output by 'year'. You can override using the
    -## `.groups` argument.
    -
    nitrate_agg_2
    -
    ## # A tibble: 88 × 3
    -## # Groups:   year [22]
    -##     year quarter pop_less_than_10ug_perL
    -##    <dbl> <chr>                     <dbl>
    -##  1  1999 Q1                        67807
    -##  2  1999 Q2                        55688
    -##  3  1999 Q3                       550650
    -##  4  1999 Q4                        26389
    -##  5  2000 Q1                         5996
    -##  6  2000 Q2                       157428
    -##  7  2000 Q3                        20752
    -##  8  2000 Q4                       360235
    -##  9  2001 Q1                        49702
    -## 10  2001 Q2                        46259
    -## # ℹ 78 more rows
    -
    -
    -

    2.2

    -

    Use ggplot2 package to make a plot showing trajectories of total population exposed to less than 10 ug/L of nitrate (pop_less_than_10ug_perL; y-axis) over year (year; x-axis), where each quarter type has a different color (hint: use color = type in mapping).

    +

    Let’s look at the plot of traffic density and diesel particulate matter again,

    +

    Use ggplot2 package make plot of how diesel particulate concentration (DieselPM; y-axis) is associated with traffic density values (Traffic; x-axis), where each county (CaliforniaCounty) has a different color (hint: use color = type in mapping).

    # General format
     ggplot(???, aes(
       x = ???,
    @@ -267,13 +233,13 @@ 

    2.2

    geom_line() + geom_point()
    -
    -

    2.3

    -

    Redo the above plot by adding a faceting (+ facet_wrap( ~ quarter, ncol = 2)) to have data for quarter in a separate plot panel.

    +
    +

    2.2

    +

    Redo the above plot by adding a faceting (+ facet_wrap( ~ CaliforniaCounty, ncol = 3)) to have data for quarter in a separate plot panel.

    Assign the new plot as an object called facet_plot.

    -
    -

    2.4

    +
    +

    2.3

    Observe what happens when you remove either geom_line() OR geom_point() from one of your plots above.

    @@ -281,7 +247,7 @@

    2.4

    Practice on Your Own!

    P.2

    -

    Modify facet_plot to remove the legend (hint use theme() and the legend.position argument) and change the names of the axis titles to be “Population exposed to less than 10 ug/L of nitrate in water” for the y axis and “Year” for the x axis.

    +

    Modify facet_plot to remove the legend (hint use theme() and the legend.position argument) and change the names of the axis titles to be “Diesel particulate matter” for the y axis and “Traffic density” for the x axis.

    P.3

    diff --git a/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.Rmd b/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.Rmd index f0df3550..d0cbd429 100644 --- a/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.Rmd +++ b/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.Rmd @@ -17,30 +17,24 @@ Load the libraries library(readr) library(ggplot2) library(dplyr) -library(dasehr) ``` -Open the Nitrate exposure via WA public waterways data from the `dasehr` package. - -(You can also access it at the link www.daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv) - -Then, use the provided code to compute a data frame `nitrate` with aggregate summary of exposure level: average exposed population (`pop_exposed_to_exceedances`) for each year (`year`). +Load the CalEnviroScreen data from the link www.daseh.org/data/CalEnviroScreen_data.csv) and subset it so that you only have data from Fresno, Merced, Placer, Sonoma, and Yolo counties. ```{r} - -nitrate_agg <- nitrate %>% - group_by(year) %>% - summarise(exposed_pop_avg = mean(pop_exposed_to_exceedances)) - -nitrate_agg +ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") +ces_sub <- ces %>% filter(CaliforniaCounty == c("Fresno", "Merced", "Placer", "Sonoma", "Yolo")) ``` ### 1.1 -Use `ggplot2` package make plot of average exposed population (`exposed_pop_avg`; y-axis) for each year (`year`; x-axis). You can use lines layer (`+ geom_line()`) or points layer (`+ geom_point()`), or both! +Use the `ggplot2` package to make a plot of how diesel particulate concentration (`DieselPM`; y-axis) is associated with traffic density values (`Traffic`; x-axis). You can use lines layer (`+ geom_line()`) or points layer (`+ geom_point()`), or both! Assign the plot to variable `my_plot`. Type `my_plot` in the console to have it displayed. 
+`DieselPM`: Diesel PM emissions from on-road and non-road sources +`Traffic`: Traffic density in vehicle-kilometers per hour per road length, within 150 meters of the census tract boundary + ``` # General format ggplot(???, aes(x = ???, y = ???)) + @@ -51,7 +45,7 @@ ggplot(???, aes(x = ???, y = ???)) + ```{r 1.1response} my_plot <- - ggplot(nitrate_agg, aes(x = year, y = exposed_pop_avg)) + + ggplot(ces_sub, aes(x = Traffic, y = DieselPM)) + geom_line() + geom_point() @@ -65,9 +59,9 @@ my_plot ```{r 1.2response} my_plot <- my_plot + labs( - x = "Year", - y = "Average population exposed", - title = "Average population exposed to excess nitrate in public water sources, 1999-2020" + x = "Traffic density index", + y = "Diesel particulate matter", + title = "Relationship between traffic density and diesel particulate matter" ) my_plot @@ -75,7 +69,8 @@ my_plot ### 1.3 -Use the `scale_x_continuous()` function to plot the x axis with the following breaks `c(1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019)`. +Use the `scale_x_continuous()` function to plot the x axis with the following breaks `c(250, 750, 1250, 1750, 2250)`. + ``` # General format @@ -86,7 +81,7 @@ my_plot <- my_plot + ```{r 1.3response} my_plot <- my_plot + scale_x_continuous( - breaks = c(1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019) + breaks = c(250, 750, 1250, 1750, 2250) ) my_plot @@ -114,11 +109,14 @@ my_plot + theme_void() ### P.1 -Create a boxplot (with the `geom_boxplot()` function) using the `nitrate` data, where `quarter` is plotted on the x axis and `pop_on_sampled_PWS` is plotted on the y axis. +Create a boxplot (with the `geom_boxplot()` function) using the `ces_sub` data, where `CaliforniaCounty` is plotted on the x axis and `DrinkingWater` is plotted on the y axis. + +`DrinkingWater`: Drinking water contaminant index for selected contaminants. A higher value means drinking water contains a greater volume of contaminants. 
+ ```{r P1response} -nitrate %>% - ggplot(aes(x = quarter, y = pop_on_sampled_PWS)) + +ces_sub %>% + ggplot(aes(x = CaliforniaCounty, y = DrinkingWater)) + geom_boxplot() ``` @@ -126,21 +124,10 @@ nitrate %>% # Part 2 ### 2.1 +Let's look at the plot of traffic density and diesel particulate matter again, -Use the provided code to compute a data frame `nitrate_agg_2` with aggregate summary of WA Nitrate data: population exposed to less than 10 ug/L of nitrate in the water (sum of `pop_0-3ug/L`, `pop_>3-5ug/L`, and `pop_>5-10ug/L`) -- separately for each year (`year`) and for each quarter (`quarter`. - -```{r} - -nitrate_agg_2 <- nitrate %>% - group_by(year, quarter) %>% - summarise(pop_less_than_10ug_perL = sum(`pop_0-3ug/L`, `pop_>3-5ug/L`, `pop_>5-10ug/L`)) - -nitrate_agg_2 -``` - -### 2.2 +Use `ggplot2` package make plot of how diesel particulate concentration (`DieselPM`; y-axis) is associated with traffic density values (`Traffic`; x-axis), where each county (`CaliforniaCounty`) has a different color (hint: use `color = type` in mapping). -Use `ggplot2` package to make a plot showing trajectories of total population exposed to less than 10 ug/L of nitrate (`pop_less_than_10ug_perL`; y-axis) over year (`year`; x-axis), where each quarter type has a different color (hint: use `color = type` in mapping). ``` # General format @@ -153,41 +140,42 @@ ggplot(???, aes( geom_point() ``` -```{r 2.2response} -ggplot(nitrate_agg_2, aes( - x = year, - y = pop_less_than_10ug_perL, - color = quarter +```{r 2.1response} +ggplot(ces_sub, aes( + x = Traffic, + y = DieselPM, + color = CaliforniaCounty )) + geom_line() + geom_point() ``` -### 2.3 +### 2.2 + +Redo the above plot by adding a faceting (`+ facet_wrap( ~ CaliforniaCounty, ncol = 3)`) to have data for quarter in a separate plot panel. -Redo the above plot by adding a faceting (`+ facet_wrap( ~ quarter, ncol = 2)`) to have data for quarter in a separate plot panel. Assign the new plot as an object called `facet_plot`. 
-```{r 2.3response} +```{r 2.2response} -facet_plot <- ggplot(nitrate_agg_2, aes( - x = year, - y = pop_less_than_10ug_perL, - color = quarter +facet_plot <- ggplot(ces_sub, aes( + x = Traffic, + y = DieselPM, + color = CaliforniaCounty )) + geom_line() + geom_point() + - facet_wrap(~quarter, ncol = 2) + facet_wrap(~CaliforniaCounty, ncol = 3) facet_plot ``` -### 2.4 +### 2.3 Observe what happens when you remove either `geom_line()` OR `geom_point()` from one of your plots above. -```{r 2.4response} +```{r 2.3response} # These elements are removed from the plot, like layers ``` @@ -196,14 +184,15 @@ Observe what happens when you remove either `geom_line()` OR `geom_point()` from ### P.2 -Modify `facet_plot` to remove the legend (hint use `theme()` and the `legend.position` argument) and change the names of the axis titles to be "Population exposed to less than 10 ug/L of nitrate in water" for the y axis and "Year" for the x axis. +Modify `facet_plot` to remove the legend (hint use `theme()` and the `legend.position` argument) and change the names of the axis titles to be "Diesel particulate matter" for the y axis and "Traffic density" for the x axis. + ```{r P.2response} facet_plot <- facet_plot + theme(legend.position = "none") + labs( - y = "Population exposed to less than 10 ug/L of nitrate in water", - x = "Year" + y = "Diesel particulate matter", + x = "Traffic density" ) facet_plot diff --git a/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.html b/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.html index b355285f..967b9e5a 100644 --- a/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.html +++ b/modules/Data_Visualization/lab/Data_Visualization_Lab_Key.html @@ -170,72 +170,61 @@

    Part 1

    Load the libraries

    library(readr)
     library(ggplot2)
    -library(dplyr)
    -library(dasehr)
    -

    Open the Nitrate exposure via WA public waterways data from the dasehr package.

    -

    (You can also access it at the link www.daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv)

    -

    Then, use the provided code to compute a data frame nitrate with aggregate summary of exposure level: average exposed population (pop_exposed_to_exceedances) for each year (year).

    -
    nitrate_agg <- nitrate %>%
    -  group_by(year) %>%
    -  summarise(exposed_pop_avg = mean(pop_exposed_to_exceedances))
    -
    -nitrate_agg
    -
    ## # A tibble: 22 × 2
    -##     year exposed_pop_avg
    -##    <dbl>           <dbl>
    -##  1  1999            15  
    -##  2  2000             0  
    -##  3  2001            76.8
    -##  4  2002          3976. 
    -##  5  2003          3636. 
    -##  6  2004          3138. 
    -##  7  2005          2968. 
    -##  8  2006          2618. 
    -##  9  2007          1916. 
    -## 10  2008          1728. 
    -## # ℹ 12 more rows
    +library(dplyr)
    +

    Load the CalEnviroScreen data from the link www.daseh.org/data/CalEnviroScreen_data.csv) and subset it so that you only have data from Fresno, Merced, Placer, Sonoma, and Yolo counties.

    +
    ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
    +
    ## Rows: 8035 Columns: 67
    +## ── Column specification ────────────────────────────────────────────────────────
    +## Delimiter: ","
    +## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
    +## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
    +## 
    +## ℹ Use `spec()` to retrieve the full column specification for this data.
    +## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    +
    ces_sub <- ces %>% filter(CaliforniaCounty == c("Fresno", "Merced", "Placer", "Sonoma", "Yolo"))

    1.1

    -

    Use ggplot2 package make plot of average exposed population (exposed_pop_avg; y-axis) for each year (year; x-axis). You can use lines layer (+ geom_line()) or points layer (+ geom_point()), or both!

    +

    Use the ggplot2 package to make a plot of how diesel particulate concentration (DieselPM; y-axis) is associated with traffic density values (Traffic; x-axis). You can use lines layer (+ geom_line()) or points layer (+ geom_point()), or both!

    Assign the plot to variable my_plot. Type my_plot in the console to have it displayed.

    +

    DieselPM: Diesel PM emissions from on-road and non-road sources Traffic: Traffic density in vehicle-kilometers per hour per road length, within 150 meters of the census tract boundary

    # General format
     ggplot(???, aes(x = ???, y = ???)) +
       ??? +
       ???
    my_plot <-
    -  ggplot(nitrate_agg, aes(x = year, y = exposed_pop_avg)) +
    +  ggplot(ces_sub, aes(x = Traffic, y = DieselPM)) +
       geom_line() +
       geom_point()
     
     my_plot
    -

    +

    1.2

    “Update” your plot by adding a title and changing the x and y axis titles.

    my_plot <- my_plot +
       labs(
    -    x = "Year",
    -    y = "Average population exposed",
    -    title = "Average population exposed to excess nitrate in public water sources, 1999-2020"
    +    x = "Traffic density index",
    +    y = "Diesel particulate matter",
    +    title = "Relationship between traffic density and diesel particulate matter"
       )
     
     my_plot
    -

    +

    1.3

    -

    Use the scale_x_continuous() function to plot the x axis with the following breaks c(1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019).

    +

    Use the scale_x_continuous() function to plot the x axis with the following breaks c(250, 750, 1250, 1750, 2250).

    # General format
     my_plot <- my_plot +
       scale_x_continuous(breaks = ???)
    my_plot <- my_plot +
       scale_x_continuous(
    -    breaks = c(1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019)
    +    breaks = c(250, 750, 1250, 1750, 2250)
       )
     
     my_plot
    -

    +

    1.4

    @@ -243,58 +232,37 @@

    1.4

    # General format
     my_plot + theme_bw()
    my_plot + theme_bw()
    -

    +

    my_plot + theme_classic()
    -

    +

    my_plot + theme_dark()
    -

    +

    my_plot + theme_gray()
    -

    +

    my_plot + theme_void()
    -

    +

    Practice on Your Own!

    P.1

    -

    Create a boxplot (with the geom_boxplot() function) using the nitrate data, where quarter is plotted on the x axis and pop_on_sampled_PWS is plotted on the y axis.

    -
    nitrate %>%
    -  ggplot(aes(x = quarter, y = pop_on_sampled_PWS)) +
    +

    Create a boxplot (with the geom_boxplot() function) using the ces_sub data, where CaliforniaCounty is plotted on the x axis and DrinkingWater is plotted on the y axis.

    +

    DrinkingWater: Drinking water contaminant index for selected contaminants. A higher value means drinking water contains a greater volume of contaminants.

    +
    ces_sub %>%
    +  ggplot(aes(x = CaliforniaCounty, y = DrinkingWater)) +
       geom_boxplot()
    -

    +
    ## Warning: Removed 1 row containing non-finite outside the scale range
    +## (`stat_boxplot()`).
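The warning above reports that one value was dropped before the boxplot was drawn because it was not finite (`NA`, `NaN`, `Inf`, or `-Inf`). A small base R sketch of how such values could be counted ahead of plotting; the `drinking_water` vector is a hypothetical stand-in, not data from the lab file:

```r
# Toy vector standing in for a numeric column (hypothetical values)
drinking_water <- c(410.2, NA, 388.7, 512.0)

# Count the values a boxplot would drop: NA, NaN, Inf, -Inf
n_dropped <- sum(!is.finite(drinking_water))
n_dropped

# Keep only the finite values, e.g. to inspect what will actually be plotted
finite_only <- drinking_water[is.finite(drinking_water)]
```

Applied to a real column, the same check would be `sum(!is.finite(ces_sub$DrinkingWater))`, assuming `ces_sub` has been created as in the chunk at the top of the lab.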
    +

    Part 2

    2.1

    -

    Use the provided code to compute a data frame nitrate_agg_2 with aggregate summary of WA Nitrate data: population exposed to less than 10 ug/L of nitrate in the water (sum of pop_0-3ug/L, pop_>3-5ug/L, and pop_>5-10ug/L) – separately for each year (year) and for each quarter (quarter.

    -
    nitrate_agg_2 <- nitrate %>%
    -  group_by(year, quarter) %>%
    -  summarise(pop_less_than_10ug_perL = sum(`pop_0-3ug/L`, `pop_>3-5ug/L`, `pop_>5-10ug/L`))
    -
    ## `summarise()` has grouped output by 'year'. You can override using the
    -## `.groups` argument.
    -
    nitrate_agg_2
    -
    ## # A tibble: 88 × 3
    -## # Groups:   year [22]
    -##     year quarter pop_less_than_10ug_perL
    -##    <dbl> <chr>                     <dbl>
    -##  1  1999 Q1                        67807
    -##  2  1999 Q2                        55688
    -##  3  1999 Q3                       550650
    -##  4  1999 Q4                        26389
    -##  5  2000 Q1                         5996
    -##  6  2000 Q2                       157428
    -##  7  2000 Q3                        20752
    -##  8  2000 Q4                       360235
    -##  9  2001 Q1                        49702
    -## 10  2001 Q2                        46259
    -## # ℹ 78 more rows
    -
    -
    -

    2.2

    -

    Use ggplot2 package to make a plot showing trajectories of total population exposed to less than 10 ug/L of nitrate (pop_less_than_10ug_perL; y-axis) over year (year; x-axis), where each quarter type has a different color (hint: use color = type in mapping).

    +

    Let’s look at the plot of traffic density and diesel particulate matter again,

    +

    Use ggplot2 package make plot of how diesel particulate concentration (DieselPM; y-axis) is associated with traffic density values (Traffic; x-axis), where each county (CaliforniaCounty) has a different color (hint: use color = type in mapping).

    # General format
     ggplot(???, aes(
       x = ???,
    @@ -303,33 +271,33 @@ 

    2.2

    )) + geom_line() + geom_point()
    -
    ggplot(nitrate_agg_2, aes(
    -  x = year,
    -  y = pop_less_than_10ug_perL,
    -  color = quarter
    +
    ggplot(ces_sub, aes(
    +  x = Traffic,
    +  y = DieselPM,
    +  color = CaliforniaCounty
     )) +
       geom_line() +
       geom_point()
    -

    +

    -
    -

    2.3

    -

    Redo the above plot by adding a faceting (+ facet_wrap( ~ quarter, ncol = 2)) to have data for quarter in a separate plot panel.

    +
    +

    2.2

    +

    Redo the above plot by adding a faceting (+ facet_wrap( ~ CaliforniaCounty, ncol = 3)) to have data for quarter in a separate plot panel.

    Assign the new plot as an object called facet_plot.

    -
    facet_plot <- ggplot(nitrate_agg_2, aes(
    -  x = year,
    -  y = pop_less_than_10ug_perL,
    -  color = quarter
    +
    facet_plot <- ggplot(ces_sub, aes(
    +  x = Traffic,
    +  y = DieselPM,
    +  color = CaliforniaCounty
     )) +
       geom_line() +
       geom_point() +
    -  facet_wrap(~quarter, ncol = 2)
    +  facet_wrap(~CaliforniaCounty, ncol = 3)
     
     facet_plot
    -

    +

    -
    -

    2.4

    +
    +

    2.3

    Observe what happens when you remove either geom_line() OR geom_point() from one of your plots above.

    # These elements are removed from the plot, like layers
    @@ -338,16 +306,16 @@

    2.4

    Practice on Your Own!

    P.2

    -

    Modify facet_plot to remove the legend (hint use theme() and the legend.position argument) and change the names of the axis titles to be “Population exposed to less than 10 ug/L of nitrate in water” for the y axis and “Year” for the x axis.

    +

    Modify facet_plot to remove the legend (hint use theme() and the legend.position argument) and change the names of the axis titles to be “Diesel particulate matter” for the y axis and “Traffic density” for the x axis.

    facet_plot <- facet_plot +
       theme(legend.position = "none") +
       labs(
    -    y = "Population exposed to less than 10 ug/L of nitrate in water",
    -    x = "Year"
    +    y = "Diesel particulate matter",
    +    x = "Traffic density"
       )
     
     facet_plot
    -

    +

    P.3

    @@ -356,7 +324,7 @@

    P.3

    library(ThemePark) facet_plot + theme_grand_budapest()
    -

    +

    diff --git a/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.Rmd b/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.Rmd index ba9cddb7..cb1bc273 100644 --- a/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.Rmd +++ b/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.Rmd @@ -10,6 +10,7 @@ output: ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(here) +library(tidyverse) ``` ## Esquisse Package @@ -30,12 +31,23 @@ It's super **nifty**! knitr::include_graphics("https://c.tenor.com/DNUSO9MjrTEAAAAC/bob-ross.gif") ``` +## First, get some data.. + +We can use the CO heat-related ER visits dataset. This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age. + +```{r message=FALSE} +er <- + read_csv("https://daseh.org/data/CO_ER_heat_visits.csv") + +head(er) +``` + ## Starting a plot Using the `esquisser()` function you can start creating a plot for a `data.frame` or `tibble`. That's it! ```{r, eval = FALSE} -esquisser(mtcars) +esquisser(er) ``` ```{r, fig.alt="starting a plot", out.width = "90%", echo = FALSE, fig.align='center'} @@ -45,14 +57,14 @@ knitr::include_graphics("images/start_a_plot.png") ## Show the plot in the browser ```{r, eval = FALSE} -esquisse::esquisser(iris, viewer = "browser") +esquisse::esquisser(er, viewer = "browser") ``` ## Select Variables To select variables you can drag and drop variables to the respective axis that you would like the variable to be plotted on. 
-```{r, fig.alt="select variables", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="select variables", out.width = "70%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/variables.gif") ``` @@ -60,7 +72,7 @@ knitr::include_graphics("images/variables.gif") To select variables you can drag and drop variables to the respective axis that you would like the variable to be plotted on. -```{r, fig.alt="select variables", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="select variables", out.width = "70%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/get_code.gif") ``` @@ -68,7 +80,7 @@ knitr::include_graphics("images/get_code.gif") `esquisse` automatically assumes a plot type, but you might want to change this. -```{r, fig.alt="change plot type", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="change plot type", out.width = "70%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/change_type_short.gif") ``` @@ -76,7 +88,7 @@ knitr::include_graphics("images/change_type_short.gif") Facets create multiple plots based on the different values of a variable. -```{r, fig.alt="add facets", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="add facets", out.width = "70%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/facet.gif") ``` @@ -84,7 +96,7 @@ knitr::include_graphics("images/facet.gif") Sometimes it is useful to change the way points are plotted so that size represents a variable. This can especially be helpful if you need your plot to be black and white. -```{r, fig.alt="add color", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="add color", out.width = "70%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/size.gif") ``` @@ -93,32 +105,24 @@ knitr::include_graphics("images/size.gif") For plots with points use the color region to change coloring according to a variable. 
(use "fill" for bar plots) -```{r, fig.alt="add color", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="add color", out.width = "70%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/color.gif") ``` ## Appearance -You can change the overall appearance with the appearance tab. +You can change the overall appearance with "Geometries" and "Theme". -```{r, fig.alt="change overall appearance", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="change overall appearance", out.width = "70%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/appearance.gif") ``` -## Smooth Lines - -Especially when you have a scatter plot, it can be helpful to add a smooth/trend line. - -```{r, fig.alt="add smooth line", out.width = "100%", echo = FALSE, fig.align='center'} -knitr::include_graphics("images/smooth.gif") -``` - ## Change titles -To change titles on your plot, use the titles tab. +To change titles on your plot, use the "Labels & Titles" tab. -```{r, fig.alt="change titles", out.width = "100%", echo = FALSE, fig.align='center'} -knitr::include_graphics("images/titles.gif") +```{r, fig.alt="change titles", out.width = "70%", echo = FALSE, fig.align='center'} +knitr::include_graphics("images/title.gif") ``` ## View data @@ -137,19 +141,21 @@ Use the stop button or press ctrl+c to stop the Esquisse app. _If you don't see the stop button, you need to resize your window._ -```{r, fig.alt="Click the stop button to interrupt the Esquisse app.", out.width = "100%", echo = FALSE, fig.align='center'} +```{r, fig.alt="Click the stop button to interrupt the Esquisse app.", out.width = "50%", echo = FALSE, fig.align='center'} knitr::include_graphics("images/stop.png") ``` -## Wide & Long Data Example +## Wide & Long Data ? {.codesmall} -Let's look at the CO heat-related ER visits dataset again. This time we want to look at only Boulder and Denver counties, and only the visit and year data. 
+Let's look at why we might want long data using Esquisse. ```{r message=FALSE} -library(dplyr) +library(tidyverse) er <- read_csv(file = "https://daseh.org/data/CO_ER_heat_visits.csv") -long_er <- er %>% filter(county == c("Denver", "Boulder")) %>% select(c("county", "year", "visits")) +long_er <- er %>% + filter(county == c("Denver", "Boulder")) %>% + select(c("county", "year", "visits")) glimpse(long_er) ``` @@ -177,6 +183,16 @@ esquisser(wide_er) # county as x...? Tricky! esquisser(long_er) #county as x, visit rate as y, year as fill ``` +## GUT CHECK! + +Why use Esquisse? + +A. Explore your data + +B. Get a "head start" on your code + +C. Both of these! + ## Some Alternatives to `esquisse` * `ggquickeda`: https://smouksassi.github.io/ggquickeda/ @@ -196,6 +212,8 @@ esquisser(long_er) #county as x, visit rate as y, year as fill πŸ’» [Lab](https://daseh.org/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.Rmd) +πŸ“ƒ [Day 6 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-6.pdf) + ```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'} knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg")) ``` diff --git a/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.html b/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.html index 7ee6eee8..b9ce2052 100644 --- a/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.html +++ b/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.html @@ -32,3217 +32,129 @@ }; - - - - - - - - - - - - - - + code span.al { color: #ff0000; font-weight: bold; } /* Alert */ + code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ + code span.at { color: #7d9029; } /* Attribute */ + code span.bn { color: #40a070; } /* BaseN */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ + code span.ch { color: #4070a0; } /* Char */ + code span.cn { 
color: #880000; } /* Constant */ + code span.co { color: #60a0b0; font-style: italic; } /* Comment */ + code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ + code span.do { color: #ba2121; font-style: italic; } /* Documentation */ + code span.dt { color: #902000; } /* DataType */ + code span.dv { color: #40a070; } /* DecVal */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.fl { color: #40a070; } /* Float */ + code span.fu { color: #06287e; } /* Function */ + code span.im { } /* Import */ + code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ + code span.kw { color: #007020; font-weight: bold; } /* Keyword */ + code span.op { color: #666666; } /* Operator */ + code span.ot { color: #007020; } /* Other */ + code span.pp { color: #bc7a00; } /* Preprocessor */ + code span.sc { color: #4070a0; } /* SpecialChar */ + code span.ss { color: #bb6688; } /* SpecialString */ + code span.st { color: #4070a0; } /* String */ + code span.va { color: #19177c; } /* Variable */ + code span.vs { color: #4070a0; } /* VerbatimString */ + code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ + + + + @@ -3272,77 +184,90 @@

    It's super nifty! starting a plot

    +

    First, get some data..

    + +

    We can use the CO heat-related ER visits dataset. This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.

    + +
    er <-
    +  read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
    +
    +head(er)
    + +
    ## # A tibble: 6 × 6
    +##   county  rate lower95cl upper95cl visits  year
    +##   <chr>  <dbl>     <dbl>     <dbl>  <dbl> <dbl>
    +## 1 Adams   6.73     NA         9.24     29  2011
    +## 2 Adams   4.84      2.85     NA        23  2012
    +## 3 Adams   6.84      4.36      9.31     31  2013
    +## 4 Adams   3.08      1.71      4.85     15  2014
    +## 5 Adams   3.36      1.89      5.23     16  2015
    +## 6 Adams   8.85      6.12     11.6      42  2016
    +

    Starting a plot

    Using the esquisser() function you can start creating a plot for a data.frame or tibble. That's it!

    -
    esquisser(mtcars)
    +
    esquisser(er)
    -

    starting a plot

    +

    starting a plot

    Show the plot in the browser

    -
    esquisse::esquisser(iris, viewer = "browser")
    +
    esquisse::esquisser(er, viewer = "browser")

    Select Variables

    To select variables you can drag and drop variables to the respective axis that you would like the variable to be plotted on.

    -

    select variables

    +

    select variables

    Find code

    To find the ggplot2 code behind the plot you built, open the "Code" section of the app. You can copy that code into your own script.

    -

    select variables

    +

    select variables

    Change plot type

    esquisse automatically assumes a plot type, but you might want to change this.

    -

    change plot type

    +

    change plot type

    Add Facets

    Facets create multiple plots based on the different values of a variable.

    -

    add facets

    +

    add facets
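In `ggplot2` code, a facet like this is typically written with `facet_wrap()`. The following is a minimal sketch, assuming a small made-up table called `toy_er` (hypothetical values, not the course data):

```r
library(ggplot2)

# Hypothetical toy data: two counties, three years each (values are made up)
toy_er <- data.frame(
  county = rep(c("Boulder", "Denver"), each = 3),
  year   = rep(2011:2013, times = 2),
  visits = c(10, 12, 9, 30, 28, 35)
)

# facet_wrap() draws one panel for each value of `county`
facet_plot <- ggplot(toy_er, aes(x = year, y = visits)) +
  geom_point() +
  facet_wrap(~county)

facet_plot
```

Each panel shares the same axes, which makes it easier to compare the counties side by side.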

    Add size

    Sometimes it is useful to change the way points are plotted so that size represents a variable. This can especially be helpful if you need your plot to be black and white.

    -

    add color

    +

    add color

    Add color

    For plots with points use the color region to change coloring according to a variable. (use "fill" for bar plots)

    -

    add color

    +

    add color
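The size, color, and fill regions correspond to aesthetics inside `aes()` in `ggplot2`. A minimal sketch, assuming a made-up table `toy_er` (hypothetical values, not the course data):

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy_er <- data.frame(
  county = rep(c("Boulder", "Denver"), each = 3),
  year   = rep(2011:2013, times = 2),
  visits = c(10, 12, 9, 30, 28, 35)
)

# Points: `size` and `color` are mapped to variables inside aes()
point_plot <- ggplot(toy_er, aes(x = year, y = visits,
                                 size = visits, color = county)) +
  geom_point()

# Bars: use `fill` (not `color`) to color the inside of each bar
bar_plot <- ggplot(toy_er, aes(x = factor(year), y = visits, fill = county)) +
  geom_col(position = "dodge")

point_plot
bar_plot
```

For bars, `color` only changes the outline, which is why `fill` is usually the one you want.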

    Appearance

    -

    You can change the overall appearance with the appearance tab.

    - -

    change overall appearance

    - -

    Smooth Lines

    +

    You can change the overall appearance with "Geometries" and "Theme".

    -

    Especially when you have a scatter plot, it can be helpful to add a smooth/trend line.

    - -

    add smooth line

    +

    change overall appearance

    Change titles

    -

    To change titles on your plot, use the titles tab.

    +

    To change titles on your plot, use the "Labels & Titles" tab.

    -

    change titles

    +

    change titles

    View data

    You can also easily view data

    -

    Click on the table button to view a table of your data.

    +

    Click on the table button to view a table of your data.

    Interrupting Esquisse

    @@ -3352,58 +277,79 @@

    If you don't see the stop button, you need to resize your window.

    -

    Click the stop button to interrupt the Esquisse app.

    +

    Click the stop button to interrupt the Esquisse app.

    + +

    Wide & Long Data ?

    -

    Wide & Long Data Example

    +

    Let's look at why we might want long data using Esquisse.

    -

    Let's examine a subset of the dataset for heat-related ER visits in Colorado, showing only data for Boulder and Denver counties.

    +
    library(tidyverse)
    +er <- read_csv(file =
    +    "https://daseh.org/data/CO_ER_heat_visits.csv")
    +long_er <- er %>% 
    +  filter(county %in% c("Denver", "Boulder")) %>% 
    +  select(c("county", "year", "visits"))
    +glimpse(long_er)
    -
    library(dasehr)
    -library(dplyr)
    +
    ## Rows: 12
    +## Columns: 3
    +## $ county <chr> "Boulder", "Boulder", "Boulder", "Boulder", "Boulder", "Boulder…
    +## $ year   <dbl> 2012, 2014, 2016, 2018, 2020, 2022, 2011, 2013, 2015, 2017, 201…
    +## $ visits <dbl> 13, 19, 18, 18, 12, 19, 42, 19, 25, 24, 34, 28
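When filtering for more than one value, `==` and `%in%` behave differently. `==` compares row by row and recycles the shorter vector, while `%in%` checks whether each value is a member of the set. A minimal sketch on a made-up tibble (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: three counties, four years each
toy <- tibble(
  county = rep(c("Boulder", "Denver", "Adams"), each = 4),
  year   = rep(2011:2014, times = 3)
)

# `==` recycles c("Denver", "Boulder") down the rows, so a row is kept only
# when its county happens to line up with the recycled value
rows_equal <- toy %>% filter(county == c("Denver", "Boulder"))

# `%in%` keeps every row whose county is one of the listed values
rows_in <- toy %>% filter(county %in% c("Denver", "Boulder"))

nrow(rows_equal)
nrow(rows_in)
```

In this sketch the `==` version keeps only every other Boulder and Denver row, which is the same alternating-year pattern you can see when a real dataset is filtered that way.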
    -wide_heat <- CO_heat_ER_wide -glimpse(wide_heat)
    +

    Wide Data

    -
    ## Rows: 2
    -## Columns: 13
    -## $ county <chr> "Boulder", "Denver"
    -## $ `2011` <dbl> 4.034535, 7.114236
    -## $ `2012` <dbl> 4.079101, 6.793702
    -## $ `2013` <dbl> 3.792548, 2.945863
    -## $ `2014` <dbl> 6.290258, 3.556912
    -## $ `2015` <dbl> 4.755544, 3.843781
    -## $ `2016` <dbl> 5.676678, 6.182937
    -## $ `2017` <dbl> 3.509453, 3.315021
    -## $ `2018` <dbl> 5.07285, 5.80526
    -## $ `2019` <dbl> 3.706147, 4.537266
    -## $ `2020` <dbl> 3.641105, 4.422049
    -## $ `2021` <dbl> 5.512484, 3.847478
    -## $ `2022` <dbl> 5.484899, 6.475107
    +

    As a comparison, let’s also load a wide version of this dataset.

    -

    Long Data

    +
    wide_er <- read_csv(file =
    +    "https://daseh.org/data/CO_heat_er_visits_DenverBoulder_wide.csv")
    -
    library(tidyr)
    -long_heat <- wide_heat %>%
    -  pivot_longer(
    -    cols = starts_with("20"),
    -    names_to = "year",
    -    values_to = "visit_rate"
    -  )
    +
    ## Rows: 2 Columns: 13
    +## ── Column specification ────────────────────────────────────────────────────────
    +## Delimiter: ","
    +## chr  (1): county
    +## dbl (12): 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, ...
    +## 
    +## ℹ Use `spec()` to retrieve the full column specification for this data.
    +## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    -

    Long Data

    +

    Wide vs Long Data

    -
    glimpse(long_heat)
    +
    head(long_er)
    -
    ## Rows: 24
    -## Columns: 3
    -## $ county     <chr> "Boulder", "Boulder", "Boulder", "Boulder", "Boulder", "Bou…
    -## $ year       <chr> "2011", "2012", "2013", "2014", "2015", "2016", "2017", "20…
    -## $ visit_rate <dbl> 4.034535, 4.079101, 3.792548, 6.290258, 4.755544, 5.676678,…
    +
    ## # A tibble: 6 × 3
    +##   county   year visits
    +##   <chr>   <dbl>  <dbl>
    +## 1 Boulder  2012     13
    +## 2 Boulder  2014     19
    +## 3 Boulder  2016     18
    +## 4 Boulder  2018     18
    +## 5 Boulder  2020     12
    +## 6 Boulder  2022     19
    + +
    head(wide_er)
    + +
    ## # A tibble: 2 × 13
    +##   county  `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
    +##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
    +## 1 Boulder   4.03   4.08   3.79   6.29   4.76   5.68   3.51   5.07   3.71   3.64
    +## 2 Denver    7.11   6.79   2.95   3.56   3.84   6.18   3.32   5.81   4.54   4.42
    +## # ℹ 2 more variables: `2021` <dbl>, `2022` <dbl>
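If you only have the wide version, `pivot_longer()` from `tidyr` can reshape it into long format. A minimal sketch, assuming a cut-down table `toy_wide` that copies just the 2011 and 2012 columns shown above; the column name `rate` is an assumption made for illustration:

```r
library(dplyr)
library(tidyr)

# Two year columns copied from the wide table above; other years are left out
toy_wide <- tibble(
  county = c("Boulder", "Denver"),
  `2011` = c(4.03, 7.11),
  `2012` = c(4.08, 6.79)
)

# Every column except `county` becomes a (year, rate) pair in its own row
toy_long <- toy_wide %>%
  pivot_longer(
    cols = -county,
    names_to = "year",
    values_to = "rate"
  )

toy_long
```

Note that `year` comes out as a character column because it was built from column names. Convert it with `as.numeric()` if you need a numeric axis.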

    Make a plot of visit rates by year for different counties

    -
    esquisser(wide_heat) # county as x...? Tricky!
    -esquisser(long_heat) #county as x, visit rate as y, year as fill
    +
    esquisser(wide_er) # county as x...? Tricky!
    esquisser(long_er) # county as x, visits as y, year as fill
    + +

    GUT CHECK!

    + +

    Why use Esquisse?

    + +

    A. Explore your data

    + +

    B. Get a "head start" on your code

    + +

    C. Both of these!

    Some Alternatives to esquisse

    diff --git a/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.pdf b/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.pdf index 49fea8cb..b3cf3835 100644 Binary files a/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.pdf and b/modules/Esquisse_Data_Visualization/Esquisse_Data_Visualization.pdf differ diff --git a/modules/Esquisse_Data_Visualization/images/appearance.gif b/modules/Esquisse_Data_Visualization/images/appearance.gif index 25de4642..e065a846 100644 Binary files a/modules/Esquisse_Data_Visualization/images/appearance.gif and b/modules/Esquisse_Data_Visualization/images/appearance.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/change_type_short.gif b/modules/Esquisse_Data_Visualization/images/change_type_short.gif index 4abf2846..d892418d 100644 Binary files a/modules/Esquisse_Data_Visualization/images/change_type_short.gif and b/modules/Esquisse_Data_Visualization/images/change_type_short.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/color.gif b/modules/Esquisse_Data_Visualization/images/color.gif index c78cc6e5..7067ba76 100644 Binary files a/modules/Esquisse_Data_Visualization/images/color.gif and b/modules/Esquisse_Data_Visualization/images/color.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/facet.gif b/modules/Esquisse_Data_Visualization/images/facet.gif index ea509c97..b1c3059c 100644 Binary files a/modules/Esquisse_Data_Visualization/images/facet.gif and b/modules/Esquisse_Data_Visualization/images/facet.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/get_code.gif b/modules/Esquisse_Data_Visualization/images/get_code.gif index 310012d3..02f722b0 100644 Binary files a/modules/Esquisse_Data_Visualization/images/get_code.gif and b/modules/Esquisse_Data_Visualization/images/get_code.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/size.gif b/modules/Esquisse_Data_Visualization/images/size.gif index 
cbb8b81c..5d55704e 100644 Binary files a/modules/Esquisse_Data_Visualization/images/size.gif and b/modules/Esquisse_Data_Visualization/images/size.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/start_a_plot.png b/modules/Esquisse_Data_Visualization/images/start_a_plot.png index 3f21062f..3c428260 100644 Binary files a/modules/Esquisse_Data_Visualization/images/start_a_plot.png and b/modules/Esquisse_Data_Visualization/images/start_a_plot.png differ diff --git a/modules/Esquisse_Data_Visualization/images/stop.png b/modules/Esquisse_Data_Visualization/images/stop.png index 9cdfed95..c0c64cfd 100644 Binary files a/modules/Esquisse_Data_Visualization/images/stop.png and b/modules/Esquisse_Data_Visualization/images/stop.png differ diff --git a/modules/Esquisse_Data_Visualization/images/title.gif b/modules/Esquisse_Data_Visualization/images/title.gif new file mode 100644 index 00000000..da1cf29e Binary files /dev/null and b/modules/Esquisse_Data_Visualization/images/title.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/titles.gif b/modules/Esquisse_Data_Visualization/images/titles.gif deleted file mode 100644 index ab660306..00000000 Binary files a/modules/Esquisse_Data_Visualization/images/titles.gif and /dev/null differ diff --git a/modules/Esquisse_Data_Visualization/images/variables.gif b/modules/Esquisse_Data_Visualization/images/variables.gif index 4180924c..63bdd297 100644 Binary files a/modules/Esquisse_Data_Visualization/images/variables.gif and b/modules/Esquisse_Data_Visualization/images/variables.gif differ diff --git a/modules/Esquisse_Data_Visualization/images/view_data.png b/modules/Esquisse_Data_Visualization/images/view_data.png index 462c551b..3627c784 100644 Binary files a/modules/Esquisse_Data_Visualization/images/view_data.png and b/modules/Esquisse_Data_Visualization/images/view_data.png differ diff --git a/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.Rmd 
b/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.Rmd index 8aaa0421..035c330d 100644 --- a/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.Rmd +++ b/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.Rmd @@ -13,16 +13,25 @@ install.packages("ggplot2") ```{r, comment = FALSE} library(esquisse) library(ggplot2) -library(dplyr) -library(dasehr) +library(tidyverse) ``` ### 1.1 -Try creating a plot in `esquisse` using the `calenviroscreen` data from the `dasehr` packaged. This dataset has a lot of variables, so first run the below code to subset it so that you're only working with these variables: `CES4.0Percentile`, `Asthma`, and `ChildrenPercLess10`. We will also categorize `CES4.0Percentile` into three categories (high, middle, and low) to make visualization a little easier! +Let's look at the relationship between exposure to pollution and visits to the ER for asthma issues. + +Try creating a plot in `esquisse` using the `calenviroscreen` data. This dataset has a lot of variables, so first run the below code to subset it so that you're only working with these variables: `CES4.0Percentile`, `Asthma`, and `ChildrenPercLess10`. We will also categorize `CES4.0Percentile` into three categories (high, middle, and low) to make visualization a little easier! 
+ +`CES4.0Percentile`: a measure of how much pollution people in a census tract experience, relative to the other census tracts in California + +`Asthma`: Age-adjusted rate of emergency department visits for asthma + +`ChildrenPercLess10`: estimates of the percent per census tract of children under 10 years old ```{r} -ces_sub <- select(calenviroscreen, c("CES4.0Percentile", "Asthma", "ChildrenPercLess10")) +ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv") + +ces_sub <- select(ces, c("CES4.0Percentile", "Asthma", "ChildrenPercLess10")) ces_sub <- ces_sub %>% mutate(CES4.0Perc_cat = @@ -57,13 +66,12 @@ Click where it says "point" (may say "auto" depending on how you did the last qu Launch Esquisse on any selection of the following datasets we have worked with before and explore! ```{r} -covid_wastewater -CO_heat_ER -CO_heat_ER_byage -CO_heat_ER_bygender -yearly_co2_emissions -nitrate -haa5 +co2 <- read_csv("https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv") + +cc <- read_csv("https://daseh.org/data/Yearly_CC_Disasters.csv") + +nitrate <- read_csv(file = "https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv") + ``` ```{r P.1response} diff --git a/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.html b/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.html index 5a0e5bc9..16b12021 100644 --- a/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.html +++ b/modules/Esquisse_Data_Visualization/lab/Esquisse_Data_Visualization_Lab.html @@ -13,228 +13,34 @@ Esquisse Data Visualization Lab - - + + - - - - + + + + - - + h1.title {font-size: 38px;} + h2 {font-size: 30px;} + h3 {font-size: 24px;} + h4 {font-size: 18px;} + h5 {font-size: 16px;} + h6 {font-size: 12px;} + code {color: inherit; background-color: rgba(0, 0, 0, 0.04);} + pre:not([class]) { background-color: white } + + + code{white-space: pre-wrap;} + span.smallcaps{font-variant: 
small-caps;} + span.underline{text-decoration: underline;} + div.column{display: inline-block; vertical-align: top; width: 50%;} + div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} + ul.task-list{list-style: none;} + - + + - - - - + + + + - - + h1.title {font-size: 38px;} + h2 {font-size: 30px;} + h3 {font-size: 24px;} + h4 {font-size: 18px;} + h5 {font-size: 16px;} + h6 {font-size: 12px;} + code {color: inherit; background-color: rgba(0, 0, 0, 0.04);} + pre:not([class]) { background-color: white } + + + code{white-space: pre-wrap;} + span.smallcaps{font-variant: small-caps;} + span.underline{text-decoration: underline;} + div.column{display: inline-block; vertical-align: top; width: 50%;} + div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} + ul.task-list{list-style: none;} + - - - - - - - - - - - - - - + code span.al { color: #ff0000; font-weight: bold; } /* Alert */ + code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ + code span.at { color: #7d9029; } /* Attribute */ + code span.bn { color: #40a070; } /* BaseN */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ + code span.ch { color: #4070a0; } /* Char */ + code span.cn { color: #880000; } /* Constant */ + code span.co { color: #60a0b0; font-style: italic; } /* Comment */ + code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ + code span.do { color: #ba2121; font-style: italic; } /* Documentation */ + code span.dt { color: #902000; } /* DataType */ + code span.dv { color: #40a070; } /* DecVal */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.fl { color: #40a070; } /* Float */ + code span.fu { color: #06287e; } /* Function */ + code span.im { } /* Import */ + code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ + code span.kw { color: #007020; font-weight: bold; 
} /* Keyword */ + code span.op { color: #666666; } /* Operator */ + code span.ot { color: #007020; } /* Other */ + code span.pp { color: #bc7a00; } /* Preprocessor */ + code span.sc { color: #4070a0; } /* SpecialChar */ + code span.ss { color: #bb6688; } /* SpecialString */ + code span.st { color: #4070a0; } /* String */ + code span.va { color: #19177c; } /* Variable */ + code span.vs { color: #4070a0; } /* VerbatimString */ + code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ + + + + @@ -3311,7 +223,7 @@

    Both can change a variable to be of class factor.

      -
    • factor() will order alphabetically unless told otherwise.
    • +
    • factor() will order alphanumerically unless told otherwise.
    • as_factor() will order by first appearance unless told otherwise.
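The difference between the two defaults is easiest to see on a short vector. A minimal sketch, assuming a made-up vector `ages`:

```r
library(forcats)

# Hypothetical labels, deliberately not in sorted order
ages <- c("5-14 years", "0-4 years", "15-34 years")

# factor(): levels are sorted alphanumerically
levels_sorted <- levels(factor(ages))

# as_factor(): levels follow the order of first appearance
levels_appearance <- levels(as_factor(ages))

levels_sorted
levels_appearance
```

Neither default gives the natural age order here, which is why the `levels` argument is often set by hand.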
    @@ -3334,41 +246,40 @@

    A Factor Example

    -

    We will use data on heat-related visits to the ER from the State of Colorado, separated by age category, for 2011-2022. More on this data can be found here: https://coepht.colorado.gov/heat-related-illness

    - -

    You can download the data from the DaSEH website here: https://daseh.org/data/CO_ER_heat_visits_by_age_data.csv

    - -

    This dataset is also available in the dasehr package.

    - -

    We will limit the data to only one of the gender categories - we will choose “Both genders” because of data missingness.

    +

    We will use a slightly different version of the data on heat-related visits to the ER from the State of Colorado.

    -
    library(dasehr)
    -er_visits_age <- CO_heat_ER_byage
    +

    For today, we are looking at data that reports ER visits by age category.

    -#er_visits_age <- read_csv("https://daseh.org/data/CO_ER_heat_visits_by_age_data.csv") +
    er_visits_age <- read_csv("https://daseh.org/data/CO_ER_heat_visits_by_age.csv")
    -er_visits_age <- er_visits_age %>% - filter(str_detect(GENDER, "Both genders"))
    +
    ## Rows: 60 Columns: 6
    +## ── Column specification ────────────────────────────────────────────────────────
    +## Delimiter: ","
    +## chr (1): age
    +## dbl (5): year, rate, lower95cl, upper95cl, visits
    +## 
    +## ℹ Use `spec()` to retrieve the full column specification for this data.
    +## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

    The data

    head(er_visits_age)
    -
    ## # A tibble: 6 × 7
    -##    YEAR GENDER       AGE              RATE L95CL U95CL VISITS
    -##   <dbl> <chr>        <chr>           <dbl> <dbl> <dbl>  <dbl>
    -## 1  2011 Both genders 0-4 years old    3.52  1.82  6.16     12
    -## 2  2011 Both genders 15-34 years old  7.34  5.95  8.74    106
    -## 3  2011 Both genders 35-64 years old  5.84  4.80  6.88    121
    -## 4  2011 Both genders 5-14 years old   5.20  3.50  6.90     36
    -## 5  2011 Both genders 65+ years old    8.34  5.98 10.7      48
    -## 6  2011 Both genders All ages         6.30  5.62  6.99    323
    +
    ## # A tibble: 6 × 6
    +##    year age          rate lower95cl upper95cl visits
    +##   <dbl> <chr>       <dbl>     <dbl>     <dbl>  <dbl>
    +## 1  2011 0-4 years    3.52      1.82      6.16     12
    +## 2  2011 15-34 years  7.34      5.95      8.74    106
    +## 3  2011 35-64 years  5.84      4.80      6.88    121
    +## 4  2011 5-14 years   5.20      3.50      6.90     36
    +## 5  2011 65+ years    8.34      5.98     10.7      48
    +## 6  2012 0-4 years    3.58      1.85      6.25     12
    -

    Notice that AGE is a chr variable. This indicates that the values are character strings.

    +

    Notice that age is a chr variable. This indicates that the values are character strings.

    -

    R does not realize that there is any order related to the AGE values. It will assume that it is alphabetical (for numbers, this means ascending order).

    +

    R does not know that the age values have a natural order. It sorts them alphanumerically, character by character, so "15-34 years" comes before "5-14 years".

    -

    However, we know that the order is: 0-4 years old, 5-14 years old, 15-34 years old, 35-64 years old, 65+ years old, and All ages.

    +

    However, we know that the order is: 0-4 years old, 5-14 years old, 15-34 years old, 35-64 years old, and 65+ years old.

    For the next steps, let's take a subset of data.

    @@ -3382,78 +293,78 @@

    Let's make a plot first.

    er_visits_age_subset %>%
    -  ggplot(mapping = aes(x = AGE, y = RATE)) +
    +  ggplot(mapping = aes(x = age, y = rate)) +
       geom_boxplot() +
       theme_bw(base_size = 12) # make all labels size 12
    -

    +

    OK this is very useful, but it is a bit difficult to read. We expect the values to be plotted by the order that we know, not by alphabetical order.

    Change to factor

    -

    Currently AGE is class character but let's change that to class factor which allows us to specify the levels or order of the values.

    +

    Currently age is class character, but let's change that to class factor, which allows us to specify the levels (the order of the values).

    er_visits_age_fct <-
       er_visits_age_subset %>%
    -  mutate(AGE = factor(AGE,
    -    levels = c("0-4 years old", "5-14 years old", "15-34 years old", "35-64 years old", "65+ years old", "All ages")
    +  mutate(age = factor(age,
    +    levels = c("0-4 years", "5-14 years", "15-34 years", "35-64 years", "65+ years")
       ))
     
     er_visits_age_fct %>%
    -  pull(AGE) %>%
    +  pull(age) %>%
       levels()
    ## [1] "0-4 years old"   "5-14 years old"  "15-34 years old" "35-64 years old"
    -## [5] "65+ years old"   "All ages"
    +## [5] "65+ years old"
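One thing to watch with `factor()`: a value is matched to a level only when the text is identical. Any value that is not listed in `levels` becomes `NA`. A minimal sketch, assuming a made-up vector `ages`:

```r
# Hypothetical labels
ages <- c("0-4 years", "5-14 years", "65+ years")

# These levels carry an extra " old", so none of the values match
mismatched <- factor(
  ages,
  levels = c("0-4 years old", "5-14 years old", "65+ years old")
)

# These levels match the values exactly
matched <- factor(
  ages,
  levels = c("0-4 years", "5-14 years", "65+ years")
)

is.na(mismatched)
is.na(matched)
```

If a factor column shows up as all `<NA>` after conversion, comparing `unique()` of the original column with the `levels` you typed is a quick first check.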

    Change to a factor

    head(er_visits_age_fct)
    -
    ## # A tibble: 6 × 7
    -##    YEAR GENDER       AGE              RATE L95CL U95CL VISITS
    -##   <dbl> <chr>        <fct>           <dbl> <dbl> <dbl>  <dbl>
    -## 1  2016 Both genders 0-4 years old    4.19  2.29  7.03     14
    -## 2  2019 Both genders 35-64 years old  7.19  6.07  8.30    159
    -## 3  2013 Both genders 15-34 years old  8.13  6.69  9.58    121
    -## 4  2022 Both genders 0-4 years old   NA    NA    NA        NA
    -## 5  2017 Both genders All ages         5.77  5.14  6.40    323
    -## 6  2019 Both genders 15-34 years old  8.34  6.94  9.73    137
    +
    ## # A tibble: 6 × 6
    +##    year age    rate lower95cl upper95cl visits
    +##   <dbl> <fct> <dbl>     <dbl>     <dbl>  <dbl>
    +## 1  2017 <NA>   3.29      1.64      5.89     11
    +## 2  2013 <NA>   4.50      2.86      6.14     29
    +## 3  2021 <NA>  NA        NA        NA        NA
    +## 4  2013 <NA>   5.51      3.78      7.23     39
    +## 5  2011 <NA>   5.84      4.80      6.88    121
    +## 6  2019 <NA>   8.34      6.94      9.73    137

    Plot again

    Now let's make our plot again:

    er_visits_age_fct %>%
    -  ggplot(mapping = aes(x = AGE, y = RATE)) +
    +  ggplot(mapping = aes(x = age, y = rate)) +
       geom_boxplot() +
       theme_bw(base_size = 12)
    -

    +

    Now that's more like it! Notice how the data is automatically plotted in the order we would like.

    -

    What about if we arrange() the data by grade ?

    +

    What about if we arrange() the data by age?

    Character data is arranged alphabetically (if letters) or by ascending first number (if numbers).

    er_visits_age_subset %>%
    -  arrange(AGE)
    - -
    ## # A tibble: 32 × 7
    -##     YEAR GENDER       AGE              RATE L95CL U95CL VISITS
    -##    <dbl> <chr>        <chr>           <dbl> <dbl> <dbl>  <dbl>
    -##  1  2016 Both genders 0-4 years old    4.19  2.29  7.03     14
    -##  2  2022 Both genders 0-4 years old   NA    NA    NA        NA
    -##  3  2018 Both genders 0-4 years old    3.91  2.08  6.68     13
    -##  4  2015 Both genders 0-4 years old   NA    NA    NA        NA
    -##  5  2021 Both genders 0-4 years old   NA    NA    NA        NA
    -##  6  2012 Both genders 0-4 years old    3.58  1.85  6.25     12
    -##  7  2020 Both genders 0-4 years old   NA    NA    NA        NA
    -##  8  2014 Both genders 0-4 years old   NA    NA    NA        NA
    -##  9  2013 Both genders 15-34 years old  8.13  6.69  9.58    121
    -## 10  2019 Both genders 15-34 years old  8.34  6.94  9.73    137
    +  arrange(age)
    + +
    ## # A tibble: 32 × 6
    +##     year age          rate lower95cl upper95cl visits
    +##    <dbl> <chr>       <dbl>     <dbl>     <dbl>  <dbl>
    +##  1  2017 0-4 years    3.29      1.64      5.89     11
    +##  2  2021 0-4 years   NA        NA        NA        NA
    +##  3  2016 0-4 years    4.19      2.29      7.03     14
    +##  4  2018 0-4 years    3.91      2.08      6.68     13
    +##  5  2019 15-34 years  8.34      6.94      9.73    137
    +##  6  2018 15-34 years 10.1       8.60     11.7     165
    +##  7  2022 15-34 years 10.0       8.52     11.6     167
    +##  8  2016 15-34 years 10.9       9.23     12.5     171
    +##  9  2012 15-34 years  8.88      7.36     10.4     130
    +## 10  2014 15-34 years  6.28      5.02      7.54     95
     ## # ℹ 22 more rows

    Notice that the order is not what we would hope for!

    @@ -3463,21 +374,21 @@

    Factor data is arranged by level.

    er_visits_age_fct %>%
    -  arrange(AGE)
    - -
    ## # A tibble: 32 × 7
    -##     YEAR GENDER       AGE             RATE L95CL U95CL VISITS
    -##    <dbl> <chr>        <fct>          <dbl> <dbl> <dbl>  <dbl>
    -##  1  2016 Both genders 0-4 years old   4.19  2.29  7.03     14
    -##  2  2022 Both genders 0-4 years old  NA    NA    NA        NA
    -##  3  2018 Both genders 0-4 years old   3.91  2.08  6.68     13
    -##  4  2015 Both genders 0-4 years old  NA    NA    NA        NA
    -##  5  2021 Both genders 0-4 years old  NA    NA    NA        NA
    -##  6  2012 Both genders 0-4 years old   3.58  1.85  6.25     12
    -##  7  2020 Both genders 0-4 years old  NA    NA    NA        NA
    -##  8  2014 Both genders 0-4 years old  NA    NA    NA        NA
    -##  9  2022 Both genders 5-14 years old  3.75  2.31  5.19     26
    -## 10  2015 Both genders 5-14 years old  5.03  3.38  6.67     36
    +  arrange(age)
    + +
    ## # A tibble: 32 × 6
    +##     year age    rate lower95cl upper95cl visits
    +##    <dbl> <fct> <dbl>     <dbl>     <dbl>  <dbl>
    +##  1  2017 <NA>   3.29      1.64      5.89     11
    +##  2  2013 <NA>   4.50      2.86      6.14     29
    +##  3  2021 <NA>  NA        NA        NA        NA
    +##  4  2013 <NA>   5.51      3.78      7.23     39
    +##  5  2011 <NA>   5.84      4.80      6.88    121
    +##  6  2019 <NA>   8.34      6.94      9.73    137
    +##  7  2020 <NA>   8.02      6.14      9.90     70
    +##  8  2019 <NA>   7.19      6.07      8.30    159
    +##  9  2018 <NA>  10.1       8.60     11.7     165
    +## 10  2022 <NA>  10.0       8.52     11.6     167
     ## # ℹ 22 more rows

    Nice! Now this is what we would want!

    @@ -3487,49 +398,43 @@

    Tables grouped by a character are arranged alphabetically (if letters) or by ascending first number (if numbers).

    er_visits_age_subset %>%
    -  group_by(AGE) %>%
    -  summarize(total_visits = sum(VISITS, na.rm = T))
    - -
    ## # A tibble: 6 × 2
    -##   AGE             total_visits
    -##   <chr>                  <dbl>
    -## 1 0-4 years old             39
    -## 2 15-34 years old          831
    -## 3 35-64 years old          649
    -## 4 5-14 years old            62
    -## 5 65+ years old            389
    -## 6 All ages                1943
    +  group_by(age) %>%
    +  summarize(total_visits = sum(visits, na.rm = T))
    + +
    ## # A tibble: 5 × 2
    +##   age         total_visits
    +##   <chr>              <dbl>
    +## 1 0-4 years             38
    +## 2 15-34 years          986
    +## 3 35-64 years          983
    +## 4 5-14 years           215
    +## 5 65+ years            296

    Making tables with factors

    Tables grouped by a factor are arranged by level.

    er_visits_age_fct %>%
    -  group_by(AGE) %>%
    -  summarize(total_visits = sum(VISITS, na.rm = T))
    - -
    ## # A tibble: 6 × 2
    -##   AGE             total_visits
    -##   <fct>                  <dbl>
    -## 1 0-4 years old             39
    -## 2 5-14 years old            62
    -## 3 15-34 years old          831
    -## 4 35-64 years old          649
    -## 5 65+ years old            389
    -## 6 All ages                1943
    +  group_by(age) %>%
    +  summarize(total_visits = sum(visits, na.rm = T))
    + +
    ## # A tibble: 1 × 2
    +##   age   total_visits
    +##   <fct>        <dbl>
    +## 1 <NA>          2518
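
The single <NA> row above is what factor() returns when the supplied levels do not match the stored values. That seems a likely cause here, since the character table lists labels such as "0-4 years" while the factor levels shown later end in "years old". A minimal sketch of the behavior, using hypothetical toy values (age_chr, mismatched, and matched are illustrative names, not part of the course data):

```r
# Hypothetical toy data: the stored labels have no " old" suffix.
age_chr <- c("0-4 years", "5-14 years", "15-34 years")

# Levels that match none of the stored labels turn every value into NA.
mismatched <- factor(
  age_chr,
  levels = c("0-4 years old", "5-14 years old", "15-34 years old")
)
sum(is.na(mismatched))

# Levels spelled exactly like the stored labels keep the data intact.
matched <- factor(
  age_chr,
  levels = c("0-4 years", "5-14 years", "15-34 years")
)
levels(matched)
```

Checking sum(is.na(...)) right after creating a factor is a quick way to catch this kind of label mismatch before grouping or plotting.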

    forcats for ordering

    -

    What if we wanted to order AGE by increasing RATE?

    +

    What if we wanted to order age by increasing rate?

    library(forcats)
     
     er_visits_age_fct %>%
    -  ggplot(mapping = aes(x = AGE, y = RATE)) +
    +  ggplot(mapping = aes(x = age, y = rate)) +
       geom_boxplot() +
       theme_bw(base_size = 12)
    -

    +

    This would make it easy to identify which age group to focus on.

    @@ -3547,59 +452,71 @@

    library(forcats)
     
     er_visits_age_fct %>%
    -  ggplot(mapping = aes(x = fct_reorder(AGE, RATE, mean), y = RATE)) +
    +  ggplot(mapping = aes(x = fct_reorder(age, rate, mean), y = rate)) +
       geom_boxplot() +
       labs(x = "Age Category") +
       theme_bw(base_size = 12)
    -

    +

    forcats for ordering.. with .desc = argument

    library(forcats)
     
     er_visits_age_fct %>%
    -  ggplot(mapping = aes(x = fct_reorder(AGE, RATE, mean, .desc = TRUE), y = RATE)) +
    +  ggplot(mapping = aes(x = fct_reorder(age, rate, mean, .desc = TRUE), y = rate)) +
       geom_boxplot() +
       labs(x = "Age Category") +
       theme_bw(base_size = 12)
    -

    +

    -

    forcats for ordering.. can be used to sort datasets

    +

    forcats for ordering… can be used to sort datasets

    -
    er_visits_age_fct %>% pull(AGE) %>% levels() # By year order
    +
    er_visits_age_fct %>% pull(age) %>% levels() # By year order
    ## [1] "0-4 years old"   "5-14 years old"  "15-34 years old" "35-64 years old"
    -## [5] "65+ years old"   "All ages"
    +## [5] "65+ years old"
    er_visits_age_fct <- er_visits_age_fct %>%
       mutate(
    -    AGE = fct_reorder(AGE, RATE, mean)
    +    age = fct_reorder(age, rate, mean)
       )
     
    -er_visits_age_fct %>% pull(AGE) %>% levels() # by increasing mean dropouts
    +er_visits_age_fct %>% pull(age) %>% levels() # by increasing mean rate
    -
    ## [1] "0-4 years old"   "5-14 years old"  "35-64 years old" "All ages"       
    -## [5] "65+ years old"   "15-34 years old"
    +
    ## [1] "0-4 years old"   "5-14 years old"  "15-34 years old" "35-64 years old"
    +## [5] "65+ years old"

    Checking Proportions with fct_count()

    The fct_count() function of the forcats package is helpful for checking whether each level of a factor makes up a similar proportion of the data. Use the prop = TRUE argument to get proportions; otherwise only counts are reported.

    er_visits_age_fct %>%
    -  pull(AGE) %>%
    +  pull(age) %>%
       fct_count(prop = TRUE)
    ## # A tibble: 6 × 3
    -##   f                   n      p
    -##   <fct>           <int>  <dbl>
    -## 1 0-4 years old       8 0.25  
    -## 2 5-14 years old      2 0.0625
    -## 3 35-64 years old     5 0.156 
    -## 4 All ages            5 0.156 
    -## 5 65+ years old       6 0.188 
    -## 6 15-34 years old     6 0.188
    +##   f                   n     p
    +##   <fct>           <int> <dbl>
    +## 1 0-4 years old       0     0
    +## 2 5-14 years old      0     0
    +## 3 15-34 years old     0     0
    +## 4 35-64 years old     0     0
    +## 5 65+ years old       0     0
    +## 6 <NA>               32     1
    + +

    GUT CHECK: Why is it useful to have the factor class as an option?

    + +

    A. It helps us check the factual accuracy of our datasets.

    + +

    B. It helps us change the order of variables in case the order has meaning.

    + +

    GUT CHECK: What does the fct_reorder() function do?

    + +

    A. It helps us reorder a factor based on the values of another variable.

    + +

    B. It helps us reorder a factor based on a random change in the order.

    Summary

    @@ -3607,7 +524,7 @@

  • the factor class allows us to have a different order from alphanumeric for categorical data
  • we can change data to be a factor variable using mutate and a factor-creating function like factor() or as_factor()
  • as_factor() is from the forcats package (first appearance order by default)
  • -
  • the factor() base R function (alphabetical order by default)
  • +
  • the factor() base R function (alphanumeric order by default)
  • with factor() we can specify the levels with the levels argument if we want a specific order
  • the fct_reorder({variable_to_reorder}, {variable_to_order_by}, {summary function}) helps us reorder a variable by the values of another variable
  • arranging, tabulating, and plotting the data will reflect the new order
    @@ -3615,7 +532,7 @@
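
The ordering rules in the summary can be compared side by side. This is a minimal sketch on hypothetical toy vectors (age_chr and rate are illustrative, not the course data), assuming the forcats package is installed:

```r
library(forcats)

# Hypothetical toy values, listed out of their natural order.
age_chr <- c("5-14 years", "0-4 years", "15-34 years", "5-14 years")
rate <- c(5, 9, 2, 7)

# factor() sorts levels as text, so "15-34 years" lands before "5-14 years".
levels(factor(age_chr))

# as_factor() keeps the order in which values first appear.
levels(as_factor(age_chr))

# The levels argument sets an explicit order.
age_fct <- factor(age_chr, levels = c("0-4 years", "5-14 years", "15-34 years"))
levels(age_fct)

# fct_reorder() orders the levels by a summary (here the mean) of another variable.
levels(fct_reorder(age_fct, rate, mean))
```

With these toy values the last call puts "15-34 years" first, because its mean rate (2) is the lowest, followed by "5-14 years" (6) and "0-4 years" (9).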

    Lab

    -

    🏠 Class Website
    💻 Lab

    +

    🏠 Class Website
    💻 Lab. 📃Day 6 Cheatsheet 📃Posit’s forcats cheatsheet

    The End

    diff --git a/modules/Factors/Factors.pdf b/modules/Factors/Factors.pdf
    index 031407ce..d9bc3fc4 100644
    Binary files a/modules/Factors/Factors.pdf and b/modules/Factors/Factors.pdf differ
    diff --git a/modules/Factors/lab/Factors_Lab.Rmd b/modules/Factors/lab/Factors_Lab.Rmd
    index 5d301e82..eb02060c 100644
    --- a/modules/Factors/lab/Factors_Lab.Rmd
    +++ b/modules/Factors/lab/Factors_Lab.Rmd
    @@ -13,17 +13,20 @@ library(tidyverse)
     
     ### 1.0
     
    -Load the Youth Tobacco Survey data and `select` "Sample_Size", "Education", and "LocationAbbr". Name this data "yts".
    +Load the CalEnviroScreen dataset and use `select` to choose the `CaliforniaCounty`, `ImpWaterBodies`, and `ZIP` variables. Then subset this data using `filter` to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data "ces".
    +
    +`ImpWaterBodies`: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.
     
     ```{r}
    -yts <-
    -  read_csv("https://daseh.org/data/Youth_Tobacco_Survey_YTS_Data.csv") %>%
    -  select(Sample_Size, Education, LocationAbbr)
    +ces <-
    +  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>%
    +  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
    +  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
     ```
     
     ### 1.1
     
    -Create a boxplot showing the difference in "Sample_Size" between Middle School and High School "Education". **Hint**: Use `aes(x = Education, y = Sample_Size)` and `geom_boxplot()`.
    +Create a boxplot showing the difference in groundwater contamination threats (`ImpWaterBodies`) among Amador, Napa, San Francisco, and Ventura counties (`CaliforniaCounty`). **Hint**: Use `aes(x = CaliforniaCounty, y = ImpWaterBodies)` and `geom_boxplot()`.
     
     ```{r 1.1response}
     
    @@ -31,7 +34,7 @@ Create a boxplot showing the difference in "Sample_Size" between Middle School a
     
     ### 1.2
     
    -Use `count` to count up the number of observations of data for each "Education" group.
    +Use `count` to count up the number of observations of data for each `CaliforniaCounty` group.
     
     ```{r 1.2response}
     
    @@ -39,7 +42,7 @@ Use `count` to count up the number of observations of data for each "Education"
     
     ### 1.3
     
    -Make "Education" a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder "Education". Reorder this variable so that "Middle School" comes before "High School". Assign the output the name "yts_fct".
    +Make `CaliforniaCounty` a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder `CaliforniaCounty`. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name "ces_fct".
     
     ```{r 1.3response}
     
    @@ -47,7 +50,7 @@ Make "Education" a factor using the `mutate` and `factor` functions. Use the `le
     
     ### 1.4
     
    -Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different ordering in the plot and `count` table.
    +Repeat question 1.1 and 1.2 using the "ces_fct" data. You should see different ordering in the plot and `count` table.
     
     ```{r 1.4response}
     
    @@ -57,8 +60,7 @@ Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different o
     # Practice on Your Own!
     
     ### P.1
    -
    -Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
    +Subset `ces_fct` so that it only includes data from Ventura county. Then convert `ZIP` (zip code) into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
     
     ```{r P.1response}
     
    @@ -66,11 +68,11 @@ Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and
     
     ### P.2
     
    -We want to create a new column that contains the group-level median sample size.
    +We want to create a new column that contains the group-level median values for `ImpWaterBodies`.
     
    -- Using the "yts_fct" data, `group_by` "LocationAbbr".
-- Then, use `mutate` to create a new column "med_sample_size" that is the median "Sample_Size". -- **Hint**: Since you have already done `group_by`, a median "Sample_Size" will automatically be created for each unique level in "LocationAbbr". Use the `median` function with `na.rm = TRUE`. +- Using the "ces_Ventura" data, group the data by `ZIP` using `group_by` +- Then, use `mutate` to create a new column `med_ImpWaterBodies` that is the median of `ImpWaterBodies`. +- **Hint**: Since you have already done `group_by`, a median `ImpWaterBodies` will automatically be created for each unique level in `ZIP`. Use the `median` function with `na.rm = TRUE`. ```{r P.2response} @@ -78,18 +80,18 @@ We want to create a new column that contains the group-level median sample size. ### P.3 -We want to plot the "LocationAbbr" (state) by the "med_sample_size" column we created above. Using the `forcats` package, create a plot that: +We want to make a plot of the `med_ImpWaterBodies` column we created above in the `ces_Ventura`, separated by `ZIP`. Using the `forcats` package, create a plot that: -- Has "LocationAbbr" on the x-axis -- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by "med_sample_size" -- Has "Sample_Size" on the y-axis +- Has `ZIP` on the x-axis +- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by `med_ImpWaterBodies` +- Has `med_ImpWaterBodies` on the y-axis - Is a boxplot (`geom_boxplot`) -- Has the x axis label of `State` +- Has the x axis label of "Zipcode" (Don't worry if you get a warning about not being able to plot `NA` values.) Save your plot using `ggsave()` with a width of 10 and height of 3. -Which state has the largest median sample size? +Which zipcode has the largest median measure of water pollution? 
```{r P.3response} diff --git a/modules/Factors/lab/Factors_Lab.html b/modules/Factors/lab/Factors_Lab.html index aa5aedc5..5fc5e9da 100644 --- a/modules/Factors/lab/Factors_Lab.html +++ b/modules/Factors/lab/Factors_Lab.html @@ -13,228 +13,34 @@ Factors Lab - - + + - - - - + + + + - - + h1.title {font-size: 38px;} + h2 {font-size: 30px;} + h3 {font-size: 24px;} + h4 {font-size: 18px;} + h5 {font-size: 16px;} + h6 {font-size: 12px;} + code {color: inherit; background-color: rgba(0, 0, 0, 0.04);} + pre:not([class]) { background-color: white } + + + code{white-space: pre-wrap;} + span.smallcaps{font-variant: small-caps;} + span.underline{text-decoration: underline;} + div.column{display: inline-block; vertical-align: top; width: 50%;} + div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} + ul.task-list{list-style: none;} + - + + - - - - + + + + - - + h1.title {font-size: 38px;} + h2 {font-size: 30px;} + h3 {font-size: 24px;} + h4 {font-size: 18px;} + h5 {font-size: 16px;} + h6 {font-size: 12px;} + code {color: inherit; background-color: rgba(0, 0, 0, 0.04);} + pre:not([class]) { background-color: white } + + + code{white-space: pre-wrap;} + span.smallcaps{font-variant: small-caps;} + span.underline{text-decoration: underline;} + div.column{display: inline-block; vertical-align: top; width: 50%;} + div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} + ul.task-list{list-style: none;} +