diff --git a/data/slide_tidyverse/tibble_tweet.jpg b/data/slide_tidyverse/tibble_tweet.jpg
new file mode 100644
index 00000000..fe4b1ede
Binary files /dev/null and b/data/slide_tidyverse/tibble_tweet.jpg differ
diff --git a/lab_tidyverse.Rmd b/lab_tidyverse.Rmd
index f0021ddf..08e4735e 100644
--- a/lab_tidyverse.Rmd
+++ b/lab_tidyverse.Rmd
@@ -171,7 +171,7 @@ flights %>% select(carrier, tailnum, origin)
 flights %>% select(-(day:carrier))
 ```
 
-- Select all columns that have to do with `arr`ival (hint: `?tidyselect`)
+- Select all columns that have to do with `arr_`ival (hint: `?tidyselect`)
 
 ```{r,accordion=TRUE}
 flights %>% select(contains('arr_'))
 ```
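+
+- The same columns can also be picked with a regular expression through the `matches()` helper, for example:
+
+```{r,accordion=TRUE}
+flights %>% select(matches('arr_'))
+```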
diff --git a/slide_tidyverse.Rmd b/slide_tidyverse.Rmd
index 5ff04b71..b691de7a 100644
--- a/slide_tidyverse.Rmd
+++ b/slide_tidyverse.Rmd
@@ -2,7 +2,7 @@ title: "Tidy work in Tidyverse"
 subtitle: "R Foundation for Life Scientists"
 author: "Marcin Kierczak"
-keywords: r, r programming, markdown, tidyverse
+keywords: r, rstats, r programming, markdown, tidyverse
 output:
   xaringan::moon_reader:
     encoding: 'UTF-8'
@@ -45,16 +45,36 @@ library(tidyverse)
 library(ggplot2) # static graphics
 library(kableExtra)
 library(magrittr)
+library(emo)
 ```
 
+---
+name: learning_outcomes
+# Learning Outcomes
+
+<br>
+
+Upon completing this module, you will:
+
+* know what `tidyverse` is and a bit about its history
+
+* be aware of useful packages within `tidyverse`
+
+* be able to use basic pipes (including the native R pipe)
+
+* know whether the data you are working with are tidy
+
+* be able to do basic tidying of your data
+
+---
+name: tidyverse_overview
 # Tidyverse -- What is it all About?
 
-* [Tidyverse](http://www.tidyverse.org) is a collection of packages.
-* Created by [Hadley Wickham](http://hadley.nz).
-* Gains popularity, on the way to become a *de facto* standard in data analyses.
-* Knowing how to use it can increase your salary :-)
-* A philosophy of programming or a programing paradigm.
-* Everything is about the flow of *tidy data*.
+* [tidyverse](http://www.tidyverse.org) is a collection of   `r emo::ji('package')` `r emo::ji('package')`
+* created by [Hadley Wickham](http://hadley.nz)
+* has become a *de facto* standard in data analyses
+* a philosophy of programming or a **programming paradigm**: everything is about the  `r emo::ji('water_wave')`   flow of   `r emo::ji('broom')`   tidy data
+
 .center[
@@ -63,15 +83,14 @@ library(magrittr)
 .vsmall[sources of images: www.tidyverse.org, Wikipedia, www.tidyverse.org]
 
 ---
-name: tidyverse_workflow
-
-# Typical Tidyverse Workflow
+name: tidyverse_curse
+# ?(Tidyverse OR !Tidyverse)
 
-The tidyverse curse?
+> `r emo::ji('skull_and_crossbones')`  There are still some people out there talking about the tidyverse curse though...  `r emo::ji('skull_and_crossbones')`<br>
 
 --
 
-> Navigating the balance between base R and the tidyverse is a challenge to learn. [-Robert A. Muenchen](http://r4stats.com/articles/why-r-is-hard-to-learn/)
+> Navigating the balance between base R and the tidyverse is a challenge to learn.<br>
+[-Robert A. Muenchen](http://r4stats.com/articles/why-r-is-hard-to-learn/)
 
 --
 
@@ -81,8 +100,7 @@ The tidyverse curse?
 ---
 name: intro_to_pipes
-
-# Introduction to Pipes
+# Pipes or Let my Data Flow   `r emo::ji('water_wave')`
 
 .pull-left-50[
@@ -131,6 +149,21 @@ iris %>% head(n=3)
 ]
+
+---
+name: native_r_pipe
+# Native R Pipe
+
+From R 4.1.0, we have a native pipe operator `|>` that is a bit faster than the `magrittr` pipe `%>%`.
+It differs from the `magrittr` pipe in some respects, though: e.g., it does not allow the dot `.` as a placeholder (since R 4.2.0 it offers a simple `_` placeholder instead, which has to be passed to a named argument).
+
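+For example, a minimal sketch of the `_` placeholder (requires R >= 4.2.0):
+
+```{r native_pipe_placeholder, eval=FALSE}
+# pipe mtcars into the named data argument of lm()
+mtcars |> lm(mpg ~ disp, data = _)
+```
+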
+```{r native_pipe1}
+c(1:5) |> mean()
+```
+
+```{r native_pipe2}
+c(1:5) %>% mean()
+```
+
 ---
 name: tibble_intro
 
@@ -139,12 +172,7 @@ name: tibble_intro
 .pull-left-50[
 
-.center[]
-
-```{r}
-head(as_tibble(iris))
-```
-
+.center[]
 ]
 
 .pull-right-50[
@@ -152,21 +180,41 @@ head(as_tibble(iris))
 * `tibble` is one of the unifying features of tidyverse,
 * it is a *better* `data.frame` realization,
 * objects `data.frame` can be coerced to `tibble` using `as_tibble()`
+]
+
+---
+name: convert_to_tibble
+# Convert `data.frame` to `tibble`
+```{r}
+as_tibble(iris)
+```
 
-```{r tibble_from_scratch}
+---
+name: tibble_from_scratch
+# Tibbles from scratch with `tibble()`
+
+```{r tibble_from_scratch, eval=FALSE}
 tibble(
   x = 1, # recycling
-  y = runif(8),
+  y = runif(4),
   z = x + y^2,
-  outcome = rnorm(8)
+  outcome = rnorm(4)
 )
 ```
-]
+--
 
----
-name: tibble2
 
+```{r tibble_from_scratch_eval, echo = F, eval=TRUE}
+tibble(
+  x = 1, # recycling
+  y = runif(4),
+  z = x + y^2,
+  outcome = rnorm(4)
+)
+```
+---
+name: more_on_tibbles
 # More on Tibbles
 
 * When you print a `tibble`:
   + only the rows that fit to the screen are printed,
   + only the columns that fit to the screen are printed,
   + data type for each column is shown.
 
 ```{r tibble_printing}
-as_tibble(cars) %>% print(n = 5)
+as_tibble(cars)
 ```
 
+---
+name: tibble_printing_options
+# Tibble Printing Options
+
 * `my_tibble %>% print(n = 50, width = Inf)`,
 * `options(tibble.print_min = 15, tibble.print_max = 25)`,
 * `options(dplyr.print_min = Inf)`,
 * `options(tibble.width = Inf)`
 
 ---
-name: tibble2
-
+name: subsetting_tibbles
 # Subsetting Tibbles
 
 ```{r tibble_subs}
 vehicles <- as_tibble(cars[1:5,])
+vehicles %>% print(n = 5)
+```
+
+--
+
+We can subset tibbles in a number of ways:
 
-vehicles[['speed']]
+```{r tibble_subs1}
+vehicles[['speed']] # try also vehicles['speed']
 vehicles[[1]]
 vehicles$speed
-
-# Using placeholders
-
-vehicles %>% .$dist
-vehicles %>% .[['dist']]
-vehicles %>% .[[2]]
 ```
-
+
+
 --
 
-**Note!** Not all old R functions work with tibbles, than you have to use `as.data.frame(my_tibble)`.
+> **Note!** Not all old R functions work with tibbles; in that case, use `as.data.frame(my_tibble)`.
 
 ---
 name: tibbles_partial_matching
@@ -244,78 +298,13 @@ In `tidyverse` you import data using `readr` package that provides a number of u
 * `read_log()` for reading Apache-style logs.
 
 --
-The most commonly used `read_csv()` has some familiar arguments like:
+
+> The most commonly used `read_csv()` has some familiar arguments like:
 
 * `skip` -- to specify the number of rows to skip (headers),
 * `col_names` -- to supply a vector of column names,
 * `comment` -- to specify what character designates a comment,
 * `na` -- to specify how missing values are represented.
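+
+For instance, assuming a hypothetical messy file `counts.csv`, these arguments could be combined like this (a sketch only, not evaluated here):
+
+```{r read_csv_args_sketch, eval=FALSE}
+read_csv('counts.csv',
+         skip = 1,                                 # drop a free-text header line
+         col_names = c('sample', 'gene', 'count'), # supply our own column names
+         comment = '#',                            # ignore lines starting with #
+         na = c('', 'NA', '-'))                    # treat these values as missing
+```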
 
----
-name: readr
-
-# Importing Data Using `readr`
-
-When reading and parsing a file, `readr` attempts to guess proper parser for each column by looking at the 1000 first rows. <br>
-
-```{r tricky_dataset, echo=TRUE, message=TRUE, warning=T}
-tricky_dataset <- read_csv(readr_example('challenge.csv'))
-```
-
-OK, so there are some parsing failures. We can examine them more closely using `problems()` as suggested in the above output.
-
----
-name: readr_problems
-
-# Looking at Problematic Columns
-
-```{r tricky_dataset_problems}
-(p <- problems(tricky_dataset))
-```
-
-OK, let's see which columns cause trouble:
-
-```{r problems_table}
-p %$% table(col)
-```
-
-Looks like the problem occurs only in the `x` column.
-
----
-name: readr_problems_fixing
-
-# Fixing Problematic Columns
-
-So, how can we fix the problematic columns?
-
-1. We can explicitely tell what parser to use:
-
-```{r fix_problematic_explicite_parser, echo=TRUE, message=TRUE, warning=T}
-tricky_dataset <- read_csv(readr_example('challenge.csv'),
-                           col_types = cols(x = col_double(),
-                                            y = col_character()))
-tricky_dataset %>% tail(n = 5)
-```
-
-As you can see, we can still do better by parsing the `y` column as *date*, not as *character*.
-
----
-name: readr_problems_fixing2
-
-# Fixing Problematic Columns cted.
-
-But knowing that the parser is guessed based on the first 1000 lines, we can see what sits past the 1000-th line in the data:
-
-```{r}
-tricky_dataset %>% head(n = 1002) %>% tail(n = 4)
-```
-
-It seems, we were very unlucky, because up till 1000-th line there are only integers in the x column and `NA`s in the y column so the parser cannot be guessed correctly. To fix this:
-
-```{r guess_max_fix, echo=TRUE, message=TRUE, warning=T}
-tricky_dataset <- read_csv(readr_example('challenge.csv'),
-                           guess_max = 1001)
-```
-
 ---
 name: readr_writing
 
@@ -345,10 +334,11 @@ name: basic_data_transformations
 Let us create a tibble:
 
 ```{r}
-(bijou <- as_tibble(diamonds) %>% head(n = 10))
+bijou <- as_tibble(diamonds) %>% head()
+bijou[1:5, ]
 ```
 
-.center[]
+.center[ ]
 
 ---
 name: filter
 
@@ -356,25 +346,37 @@ name: filter
 # Picking Observations using `filter()`
 
 ```{r}
-bijou %>% filter(cut == 'Ideal' | cut == 'Premium', carat >= 0.23) %>% head(n = 5)
+bijou %>% filter(cut == 'Ideal' | cut == 'Premium', carat >= 0.23) %>% head(n = 4)
 ```
+
 
-Be careful with floating point comparisons! Also, rows with comparison resulting in `NA` are skipped by default!
+--
 
-```{r}
-bijou %>% filter(near(0.23, carat) | is.na(carat)) %>% head(n = 5)
-```
+> `r emo::ji('boat')`   Be careful with floating point comparisons!<br>
+`r emo::ji('pirate')`   Also, rows with comparison resulting in `NA` are skipped by default!
 
+```{r, echo=T, eval=F}
+bijou %>% filter(near(0.23, carat) | is.na(carat)) %>% head(n = 4)
+```
+
 ---
 name: arrange
 
 # Rearranging Observations using `arrange()`
 
-```{r}
+```{r, echo=T, eval=FALSE}
 bijou %>% arrange(cut, carat, desc(price))
 ```
+
+--
 
-The `NA`s always end up at the end of the rearranged tibble.
+```{r, echo=FALSE, eval=TRUE}
+bijou %>% arrange(cut, carat, desc(price))
+```
+
+--
+
+> The `NA`s always end up at the end of the rearranged `tibble`!
 
 ---
 name: select
 
 # Selecting Variables with `select()`
 
@@ -395,21 +397,34 @@ bijou %>% select(-(x:z)) %>% head(n = 4)
 ```
 
 ---
-name: select2
+name: rename
+# Renaming Variables
 
-# Selecting Variables with `select()` cted.
+> `rename` is a variant of `select`, here used with `everything()` to move `x` to the beginning and rename it to `var_x`.
 
-`rename` is a variant of `select`, here used with `everything()` to move `x` to the beginning and rename it to `var_x`
+```{r, eval=FALSE, echo=TRUE}
+bijou %>% rename(var_x = x) %>% head(n = 5)
+```
+
+--
 
-```{r}
+```{r, eval=T, echo=F}
 bijou %>% rename(var_x = x) %>% head(n = 5)
 ```
+
+---
+name: bring_to_front
+# Bring Columns to the Front
---
+> Use `everything()` to bring some columns to the front:
 
-use `everything()` to bring some columns to the front:
+```{r, echo=TRUE, eval=FALSE}
+bijou %>% select(x:z, everything()) %>% head(n = 4)
+```
+
+--
 
-```{r}
+```{r, echo=FALSE, eval=TRUE}
 bijou %>% select(x:z, everything()) %>% head(n = 4)
 ```
 
 ---
 name: mutate
 
@@ -418,21 +433,34 @@ name: mutate
 # Create/alter new Variables with `mutate`
 
-```{r}
-bijou %>% mutate(p = x + z, q = p + y) %>% select(-(depth:price)) %>% head(n = 5)
+```{r, echo=T, eval=F}
+bijou %>% mutate(p = x + z, q = p + y) %>%
+  select(-(depth:price)) %>%
+  head(n = 5)
 ```
-
+
+
 --
 
-or with `transmute` (only the transformed variables will be retained)
+```{r, echo=F, eval=T}
+bijou %>% mutate(p = x + z, q = p + y) %>%
+  select(-(depth:price)) %>%
+  head(n = 5)
+```
+
+---
+name: transmute
+# Create/alter new Variables with `transmute` `r emo::ji('wizard')`
+
+> Only the transformed variables will be retained.
 
 ```{r}
 bijou %>% transmute(carat, cut, sum = x + y + z) %>% head(n = 5)
 ```
+
 ---
 name: grouped_summaries
-
 # Group and Summarize
 
 ```{r}
@@ -440,6 +468,7 @@ bijou %>% group_by(cut) %>% summarize(max_price = max(price),
                                       mean_price = mean(price),
                                       min_price = min(price))
 ```
+
 
 --
 
@@ -611,8 +640,7 @@ bijou4 %>%
 ```
 
 ---
-name: tidying_data_separate
-
+name: tidying_data_unite
 # Tidying Data with `unite`
 
 If some of your columns contain more than one value, use `separate`:
 
@@ -627,8 +655,6 @@ bijou5
 bijou5 %>% unite(clarity, clarity_prefix, clarity_suffix, sep='')
 ```
 
-**Note:** that `sep` is here interpreted as the position to split on. It can also be a *regular expression* or a delimiting string/character. Pretty flexible approach!
-
 ---
 name: missing_complete
 
@@ -644,7 +670,7 @@ bijou %>%
 bijou %>% head(n = 10) %>% select(cut, clarity, price) %>%
                            mutate(continent = sample(c('AusOce', 'Eur'),
-                                                     size = 10,
+                                                     size = 6,
                                                      replace = T)) -> missing_stones
 ```
 ```{r}